close inspection of RVV as an example, the basic arithmetic
operations are massively duplicated: scalar-scalar from the base
is joined by both scalar-vector and vector-vector *and* predicate
-mask management, and transfer instructions between all the sane,
+mask management, and transfer instructions between all the same,
which goes a long way towards explaining why there are twice as many
-Vector instructions in RISC-V as there are in the RV64GC base.
+Vector instructions in RISC-V as there are in the RV64GC Scalar base.
The question then becomes: with all the duplication of arithmetic
operations just to make the registers scalar or vector, why not
Remarkably this is not a new idea. Intel's x86 `REP` instruction
gives the base concept, but in 1994 it was Peter Hsu, the designer
-of the MIPS R8000, who first came up with the idea of Vector
-prefixing. Relying on a multi-issue Out-of-Order Execution Engine,
+of the MIPS R8000, who first came up with the idea of Vector-augmented
+prefixing of an existing Scalar ISA. Relying on a multi-issue Out-of-Order Execution Engine,
the prefix would mark which of the registers were to be treated as
Scalar and which as Vector, then perform a `REP`-like loop that
jammed multiple scalar operations into the Multi-Issue Execution
with primarily Load and Store being able to handle 8/16/32/64
and sometimes 128-bit (quad-word), where Vector ISAs need to
go as low as 8-bit arithmetic, even 8-bit Floating-Point for
- high-performance AI.
+ high-performance AI. Rather than waste opcode space adding all
+ such operations at different bitwidths, let the prefix
+ *redefine* the element width.
* "Reordering" of the assumption of linear sequential element
access, for Matrices, rotations, transposition, Convolutions,
DCT, FFT, Parallel Prefix-Sum and other common transformations
that require significant programming effort in other ISAs.
+From there, several more "Modes" can be added, including saturation
+(needed for Audio and Video applications), "Reverse Gear", which runs
+the Element Loop in reverse order (needed for Prefix Sum), and more.
+
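The prefixing idea described here can be illustrated with a minimal Python sketch. It is not the actual specification: the function name `prefixed_add`, its parameters, and the simplification of one element per register slot are all assumptions made for illustration. It models a `REP`-like loop around a scalar add, with per-operand scalar/vector marking, an element-width override, saturation, and a reverse-order loop.

```python
def prefixed_add(regs, rt, ra, rb, vl, ra_vec=True, rb_vec=True,
                 elwidth=64, saturate=False, reverse=False):
    """Hypothetical model: loop a scalar unsigned add over `vl` elements.

    Assumptions (not from the source): the destination is always a
    vector starting at regs[rt], and each element occupies one slot.
    """
    limit = (1 << elwidth) - 1          # largest unsigned value at this width
    order = range(vl - 1, -1, -1) if reverse else range(vl)
    for i in order:
        # A vector operand steps through registers; a scalar one stays put.
        a = regs[ra + i] if ra_vec else regs[ra]
        b = regs[rb + i] if rb_vec else regs[rb]
        total = a + b
        # Saturation clamps at the maximum; otherwise wrap to elwidth bits.
        regs[rt + i] = min(total, limit) if saturate else total & limit
    return regs


regs = [0] * 16
regs[4:8] = [1, 2, 3, 4]
regs[8] = 10
prefixed_add(regs, rt=0, ra=4, rb=8, vl=4, rb_vec=False)
print(regs[0:4])  # [11, 12, 13, 14]
```

In this sketch a single scalar `add` becomes vector-vector or vector-scalar purely through the flags, which is the duplication that a prefix is meant to avoid encoding as separate opcodes.
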
**What is missing from Power Scalar ISA that a Vector ISA needs?**
-Remarkably, very little.
+Remarkably, very little, though the devil is in the details.
* The traditional `iota` instruction may be
synthesised with an overlapping add, that stacks up incrementally
* The Condition Register Fields of the Power ISA make a great candidate
for use as Predicate Masks, particularly when combined with
Vectorised `cmp` and Vectorised `crand`, `crxor` etc.
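
The `iota` point in the list above can be sketched in Python. This is only an illustration under stated assumptions: `vector_addi` is a hypothetical helper, the register file is a plain list, and elements are assumed to execute strictly in order so that each one sees the result written by the previous one.

```python
def vector_addi(regs, rt, ra, imm, vl):
    """Hypothetical model of an add-immediate looped over `vl` elements."""
    for i in range(vl):
        # Sequential element order: element i reads what element i-1 wrote
        # whenever the destination overlaps the source.
        regs[rt + i] = regs[ra + i] + imm
    return regs


regs = [0] * 8
# Destination starts one register after the source, so the ranges overlap
# and each result feeds the next add, stacking up incrementally.
vector_addi(regs, rt=1, ra=0, imm=1, vl=7)
print(regs)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The overlap of source and destination by one register is what turns a plain add into an incrementing sequence, so no dedicated `iota` opcode is needed in this model.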
+
+Only on looking slightly deeper into the Power ISA do certain things
+turn out to be missing, in part because of IBM's primary focus on
+the 750 Packed SIMD opcodes at the expense of the 250 or so Scalar ones.