+Vector instructions in RISC-V as there are in the RV64GC Scalar base.
+
+The question then becomes: with all the duplication of arithmetic
+operations just to make the registers scalar or vector, why not
+leverage the *existing* Scalar ISA with some sort of "context"
+or prefix that augments its behaviour? Then, the Instruction Decode
+phase is greatly simplified, reducing design complexity and leaving
+plenty of headroom for further expansion.
+
+Remarkably, this is not a new idea. Intel's x86 `REP` instruction
+gives the base concept, but it was Peter Hsu, the designer
+of the MIPS R8000, who in 1994 first came up with the idea of Vector-augmented
+prefixing of an existing Scalar ISA. Relying on a multi-issue Out-of-Order Execution Engine,
+the prefix would mark which of the registers were to be treated as
+Scalar and which as Vector. Then, treating the Scalar "suffix" instruction
+as a guide and making "scalar instruction" synonymous with "Vector element",
+the hardware would perform a `REP`-like loop that
+jammed multiple scalar operations into the Multi-Issue Execution
+Engine. The only reason that the team did not take this forward
+into a commercial product
+was that they could not work out how to cleanly do OoO
+multi-issue at the time.
+
+In its simplest form, then, this "prefixing" idea is a matter
+of:
+
+* Defining the format of the prefix
+* Adding a `setvl` instruction
+* Adding Vector-context SPRs and working out how to do
+ context-switches with them
+* Writing an awful lot of Specification Documentation
+ (4 years and counting)
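+
+As a concrete illustration, here is a minimal Python sketch of the
+prefixing concept. It is illustrative only: the `sv_op` helper, the
+operand-marking flags and the register numbering are invented for this
+sketch, not taken from the actual SVP64 encoding.
+
+```python
+# Conceptual model: the prefix marks operands as Scalar or Vector, and
+# Instruction Decode expands the unmodified scalar "suffix" into a
+# REP-like loop. GPR models the expanded 128-entry register file.
+GPR = [0] * 128
+VL = 0                  # Vector Length "context", set by setvl
+
+def setvl(vl):
+    """Set the Vector Length used by subsequent prefixed operations."""
+    global VL
+    VL = vl
+
+def add(rt, ra, rb):
+    """The scalar suffix instruction: completely unmodified."""
+    GPR[rt] = (GPR[ra] + GPR[rb]) & 0xFFFFFFFFFFFFFFFF
+
+def sv_op(scalar_op, rt, ra, rb, rt_vec=True, ra_vec=True, rb_vec=True):
+    """Re-issue the scalar op VL times, stepping only the
+    Vector-marked operands: "scalar instruction" becomes
+    synonymous with "Vector element"."""
+    for i in range(VL):
+        scalar_op(rt + (i if rt_vec else 0),
+                  ra + (i if ra_vec else 0),
+                  rb + (i if rb_vec else 0))
+
+setvl(4)
+GPR[16:20] = [1, 2, 3, 4]
+GPR[24] = 100
+sv_op(add, 8, 16, 24, rb_vec=False)     # Vector += Scalar
+assert GPR[8:12] == [101, 102, 103, 104]
+```
+
+Note that the scalar `add` itself is untouched: all of the Vector
+behaviour lives in the loop around it, which is the entire point.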
+
+Once the basics of this concept have sunk in, early
+advancements follow naturally from analysis
+of the problem-space:
+
+* Expanding the size of GPR, FPR and CR register files to
+ provide 128 entries in each. This is a bare minimum for GPUs
+ in order to keep processing workloads as close to a LOAD-COMPUTE-STORE
+ batching as possible.
+* Predication (an absolutely critical component of any Vector ISA);
+ the next logical advancement is to allow separate predication masks
+ to be applied to *both* the source *and* the destination,
+ independently ("twin predication": a sketch follows this list).
+* Element-width overrides: most Scalar ISAs today are 64-bit only,
+ with primarily Load and Store being able to handle 8/16/32/64
+ and sometimes 128-bit (quad-word), whereas Vector ISAs need to
+ go as low as 8-bit arithmetic, even 8-bit Floating-Point for
+ high-performance AI. Rather than waste opcode space adding all
+ such operations at different bitwidths, let the prefix
+ *redefine* the element width.
+* "Reordering" of the assumption of linear sequential element
+ access, for Matrices, rotations, transposition, Convolutions,
+ DCT, FFT, Parallel Prefix-Sum and other common transformations
+ that require significant programming effort in other ISAs.
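+
+As a sketch of how two of these augmentations compose, the following
+shows a twin-predicated move with an element-width override, as an
+illustrative Python model (the `sv_mv_twin` helper is invented; this is
+not the normative SVP64 pseudocode). The source index advances only
+over set bits of the source mask, and the destination index only over
+set bits of the destination mask, which is how VCOMPRESS and VEXPAND
+fall out as special cases. Element-width is modelled here as simple
+truncation, a simplification of the real sub-register packing.
+
+```python
+GPR = [0] * 128
+
+def sv_mv_twin(dst, src, vl, dstmask, srcmask, elwidth_bytes=8):
+    """Twin-predicated Vector move: independent masks on source
+    and destination, each stepped separately."""
+    elmask = (1 << (elwidth_bytes * 8)) - 1
+    i = j = 0               # i: destination step, j: source step
+    while i < vl and j < vl:
+        while j < vl and not (srcmask >> j) & 1:
+            j += 1          # skip masked-out source elements
+        while i < vl and not (dstmask >> i) & 1:
+            i += 1          # skip masked-out destination elements
+        if i < vl and j < vl:
+            GPR[dst + i] = GPR[src + j] & elmask
+            i += 1
+            j += 1
+
+GPR[0:4] = [0x111, 0x222, 0x333, 0x444]
+sv_mv_twin(8, 0, 4, dstmask=0b1111, srcmask=0b1010, elwidth_bytes=1)
+assert GPR[8:10] == [0x22, 0x44]    # compressed, truncated to 8-bit
+```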
+
+All of these things come entirely from "Augmentation" of the Scalar operation
+being prefixed: at no time is the Scalar operation significantly
+altered.
+From there, several more "Modes" can be added, including saturation,
+which is needed for Audio and Video applications, "Reverse Gear",
+which runs the Element Loop in reverse order (needed for Prefix
+Sum), and more.
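+
+A short sketch of why loop direction is architecturally visible when
+destination and source registers overlap (a Python model; the assembler
+in the comments is conceptual, not actual SVP64 syntax):
+
+```python
+GPR = [0] * 8
+
+def add(rt, ra, rb):
+    GPR[rt] = GPR[ra] + GPR[rb]
+
+# Forward element loop, destination overlapping one above the source:
+# each iteration consumes the previous result, yielding a Prefix-Sum.
+GPR[0:4] = [1, 2, 3, 4]
+for i in range(3):                 # conceptually: sv.add r1, r0, r1, VL=3
+    add(1 + i, 0 + i, 1 + i)
+assert GPR[0:4] == [1, 3, 6, 10]   # running totals
+
+# The mirrored overlap only works with the Element Loop run *backwards*
+# ("Reverse Gear"): run forwards, premature overwrites break the chain.
+GPR[0:4] = [1, 2, 3, 4]
+for i in reversed(range(3)):       # the same add, loop reversed
+    add(0 + i, 0 + i, 1 + i)
+assert GPR[0:4] == [10, 9, 7, 4]   # totals accumulated right-to-left
+```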
+
+**What is missing from the Power Scalar ISA that a Vector ISA needs?**
+
+Remarkably, very little: the devil is in the details though.
+
+* The traditional `iota` instruction may be
+ synthesised with an overlapping add, which stacks up incrementally
+ and sequentially. Although it requires two instructions (one to
+ start the sum-chain), the technique has the advantage of allowing
+ increments by arbitrary amounts, and is not limited to addition,
+ either (a sketch follows this list).
+* Big-integer addition (arbitrary-precision arithmetic) is an
+ emergent characteristic of the carry-in, carry-out capability of
+ the Power ISA `adde` instruction. `sv.adde` as a BigNum add
+ naturally emerges from the
+ sequential chaining of these scalar instructions (also sketched below).
+* The Condition Register Fields of the Power ISA make a great candidate
+ for use as Predicate Masks, particularly when combined with
+ Vectorised `cmp` and Vectorised `crand`, `crxor` etc.
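+
+Sketches of the first two techniques, again as plain Python models of
+the element loops involved (register numbering and operand choices are
+illustrative):
+
+```python
+GPR = [0] * 16
+
+# iota synthesised from an overlapping add chain: one instruction to
+# zero the start of the chain, then a single overlapping vector add.
+GPR[0] = 0                          # start of the sum-chain
+for i in range(4):                  # conceptually: sv.addi r1, r0, 1, VL=4
+    GPR[1 + i] = GPR[0 + i] + 1     # any increment (or operation) works
+assert GPR[1:5] == [1, 2, 3, 4]
+
+# BigNum add: a Vectorised adde chains XER.CA carry-out into carry-in.
+# Here two 256-bit numbers are held as 4x 64-bit limbs.
+a = [0xFFFFFFFFFFFFFFFF, 0, 0, 0]   # limbs, least-significant first
+b = [1, 0, 0, 0]
+result, ca = [], 0                  # ca models the XER.CA flag
+for i in range(4):                  # conceptually: sv.adde r8, r16, r24
+    s = a[i] + b[i] + ca
+    result.append(s & 0xFFFFFFFFFFFFFFFF)
+    ca = s >> 64
+assert result == [0, 1, 0, 0]       # carry rippled into the next limb
+```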
+
+It is only when looking slightly deeper into the Power ISA that
+certain things turn out to be missing, and this is down in part to IBM's
+primary focus on the 750 Packed SIMD opcodes at the expense of the 250 or
+so Scalar ones. One example: transfer operations between the
+Integer and Floating-Point Scalar register files were dropped approximately
+a decade ago, after the Packed SIMD variants were considered to make them
+duplicates. With it being completely inappropriate to attempt to Vectorise
+a Packed SIMD ISA designed 20 years ago with no Predication of any kind,
+the Scalar ISA, a much better all-round candidate for Vectorisation, is
+left anaemic.
+
+One key instruction that is missing is `MV.X`, which is
+illustrated as `GPR(dest) = GPR(GPR(src))`. This horrendously
+expensive instruction, causing a huge swathe of Register Hazards
+in one single hit, is almost never added to a Scalar ISA but
+is almost always added to a Vector one. When `MV.X` is
+Vectorised it allows for arbitrary
+remapping of elements within a Vector to positions specified
+by another Vector. A typical Scalar ISA will use Memory to
+achieve this task, but with Vector ISAs the Vector Register Files are
+usually so enormous, and so far away from Memory, that it is easier and
+more efficient, architecturally, to provide these Indexing instructions.
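+
+A sketch of the Vectorised `MV.X` semantics (Python model, illustrative
+register numbers), showing both the arbitrary in-register permutation
+and why the hardware sees a swathe of hazards: every element requires
+two dependent register reads.
+
+```python
+GPR = [0] * 32
+GPR[8:12] = [40, 30, 20, 10]        # data to be permuted (r8..r11)
+GPR[16:20] = [11, 9, 10, 8]         # the index Vector (r16..r19)
+
+for i in range(4):                  # Vectorised GPR(dest) = GPR(GPR(src))
+    GPR[24 + i] = GPR[GPR[16 + i]]  # two dependent reads per element
+
+assert GPR[24:28] == [10, 30, 20, 40]   # remapped without touching Memory
+```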
+
+Fortunately, with the ISA Working Group being willing
+to consider RFCs (Requests For Change), these omissions have the potential
+to be corrected.
+
+One deliberate decision in SVP64 involves Predication. Typical Vector
+ISAs have quite comprehensive arithmetic and logical operations on
+Predicate Masks, and if CR Fields were the only predicates in SVP64
+there would be pressure to duplicate, for CR Fields, the exact same
+arithmetic and logical operations that already exist in the Integer opcodes.
+Instead of taking that route, the decision was made to allow *both*
+Integer *and* CR Fields to be Predicate Masks, and to create Draft
+instructions that provide better transfer capability between the CR Field
+and Integer Register files.
+
+Beyond that, further extensions to the Power ISA become much more
+domain-specific, such as adding bitmanipulation for Audio, Video
+and Cryptographic use-cases, or adding Transcendentals (`LOG1P`,
+`ATAN2` etc) for 3D and other GPU workloads. The huge advantage here
+of the SVP64 "Prefix" approach is that anything added to the Scalar ISA
+*automatically* is inherently added to the Vector one as well, and
+because these GPU and Video opcodes have been added to the CPU ISA,
+Software Driver development and debugging is dramatically simplified.
+
+Which brings us to the next important question: how are any of these
+CPU-centric, Vector-centric improvements relevant to power efficiency
+and to making more effective use of resources?
+
+# Simpler, more compact programs save power
+
+The first and most obvious saving is that, just as with any Vector
+ISA, the amount of data processing requested
+and controlled by each instruction is enormous, and leaves the
+Decode and Issue Engines idle, as well as the L1 I-Cache. With
+programs being smaller, chances are higher that they fit into
+L1 Cache, or that the L1 Cache may be made smaller.
+
+Even a Packed SIMD ISA can take advantage of a higher
+bang-per-buck on certain specific workloads, as long as the
+stripmining setup and teardown is not required. However a
+2-wide Packed SIMD instruction offers nowhere near the bang-per-buck
+ratio of a 64-wide Vector Length.
+
+Realistically, however, for general use-cases the Packed SIMD setup
+and teardown is extremely common. `strncpy` for VSX is an
+astounding 240 hand-coded assembler instructions, whereas it is around
+12 to 14 for both RVV and SVP64. Worst case (full algorithm unrolling
+for Massive FFTs) the L1 I-Cache becomes completely ineffective, and in
+the case of the IBM POWER9 with a little-known design flaw not
+normally otherwise encountered this results in
+contention between the L1 D and I Caches at the L2 Bus, slowing down
+execution even further. Power ISA 3.1 MMA (Matrix-Multiply-Assist)
+requires loop-unrolling to contend with non-power-of-two Matrix
+sizes: SVP64 does not, as hinted at below.
+
+Additional savings come in the form of `SVREMAP`. This is a hardware
+index transformation system where the normally sequentially-linear
+element access may be "Re-Mapped" to limited but algorithmic-tailored
+commonly-used deterministic schedules, for example Matrix Multiply,
+DCT, or FFT. A full in-register-file 5x7 Matrix Multiply, or a 3x4 or
+2x6, may be performed in as few as 4 instructions, one of which
+is to zero-initialise the accumulator Vector used to store the result.
+If accumulation into an existing Matrix is required instead then it is only
+three instructions, because the zero-initialisation is no longer
+needed. Not only that, but because the "Schedule" is an abstract
+concept separated from the mathematical operation, there is no reason
+why Matrix Multiplication Schedules may not be applied to Integer
+Mul-and-Accumulate, Galois Field Mul-and-Accumulate, Logical
+AND-and-OR, or any other future instruction such as Complex-Number
+Multiply-and-Accumulate that a future version of the Power ISA might
+support. The flexibility is not only enormous, but the compactness
+unprecedented. RADIX2 in-place DCT Triple-loop Schedules may be created in
+around 11 instructions. The only other processors well-known to have
+this type of compact capability are both VLIW DSPs: TI's TMS320 Series
+and Qualcomm's Hexagon, and both are targeted at FFTs only.
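+
+To illustrate the separation of Schedule from operation, here is a
+Python sketch of a triple-loop Matrix-Multiply REMAP driving a single
+multiply-accumulate element loop. The index generator mirrors the
+hardware triple-nested for-loop; the dimensions are deliberately
+non-power-of-two, and the helper names are invented for the sketch.
+
+```python
+def matrix_remap(rows, inner, cols):
+    """Yield (dest, srcA, srcB) element indices for
+    C[r][c] += A[r][k] * B[k][c]: the deterministic sequence a
+    hardware triple-loop REMAP would produce."""
+    for r in range(rows):
+        for k in range(inner):
+            for c in range(cols):
+                yield r * cols + c, r * inner + k, k * cols + c
+
+A = [1, 2, 3,
+     4, 5, 6]                  # 2x3, flattened as it sits in registers
+B = [7, 8,
+     9, 10,
+     11, 12]                   # 3x2
+C = [0] * 4                    # zero-initialised accumulator
+
+for d, sa, sb in matrix_remap(2, 3, 2):
+    C[d] += A[sa] * B[sb]      # conceptually one sv.maddld under REMAP
+
+assert C == [58, 64, 139, 154]     # the standard 2x3 * 3x2 result
+```
+
+Swap the add/multiply pair for Galois-Field or logical equivalents and
+the Schedule is entirely unchanged, which is precisely the point.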
+
+There is no reason at all why future algorithmic schedules should not
+be proposed as extensions to SVP64 (sorting algorithms,
+compression algorithms, Sparse Data Sets, Graph Node walking
+for example). Bear in mind that
+the submission process will be
+entirely at the discretion of the OpenPOWER Foundation ISA WG,
+something that is both encouraged and welcomed by the OPF.
+
+One of SVP64's current limitations is that it was initially designed
+for 3D and Video workloads as a hybrid GPU-VPU-CPU. This resulted in
+a heavy focus on adding hardware-for-loops onto the *Registers*.
+After more than three years of development the realisation hit that
+the SVP64 concept could be expanded to Coherent Distributed Memory.
+This astoundingly powerful concept is explored in the next section.
+
+# Coherent Deterministic Hybrid Distributed Memory-Processing
+
+It is not often that a heading in an article can legitimately
+contain quite so many comically-chained buzzwords, but in this section
+they are justified. As hinted at in the first section, the last time
+that memory was the same speed as processors was the Pentium III
+and Motorola 88100 era: 133 and 166 MHz SDRAM was available, and
+CPUs ran at about the same rate. DRAM bitcells *simply cannot exceed
+these rates*, yet the pressure from Software Engineers is to
+make *sequential* algorithm processing faster and faster because
+parallelising of algorithms is simply too difficult to master, and always
+has been. Thus whilst DRAM has to go parallel (like RAID Striping) to
+keep up, CPUs are now at 8-way Multi-Issue 5 GHz clock rates and
+have an astonishing four levels of cache (L1 to L4).
+
+It should therefore come as no surprise that attempts are being made
+to move (distribute) processing closer to the DRAM Memory, firmly
+on the *opposite* side of the main CPU's L1/2/3/4 Caches. However
+the alarm bells ring here at the keyword "distributed", because by
+moving the processing down next to the Memory, the speed of each
+of the parallel Processing Elements (PEs) drops
+by almost two orders of magnitude (5 GHz down to 100 MHz), and
+the complexity of each PE has, for pure pragmatic reasons,
+to drop by several
+orders of magnitude as well.
+Things that the average "sequential algorithm"
+programmer
+takes for granted, such as SMP, Cache Coherency, Virtual Memory and
+spinlocks (atomic locking), are either outright gone
+or become something the programmer must explicitly contend with
+(even if that programmer is the Compiler Developer).
+
+To give an extreme example: Aspex's Array-String Processor, which
+was 4096 2-bit SIMD PEs each with 256 bytes of Content Addressable
+Memory, was capable of literally a hundred-fold improvement in
+performance over Scalar CPUs such as the Pentium III of its era,
+all on a 3 watt budget at only 250 MHz in 130 nm. Yet to take
+proper advantage of its capability required an astounding 5-10
+*days* per line of assembly code because multiple versions of
+an algorithm had to be hand-crafted then compared, and
+the best one selected, all others discarded. 20 lines of optimised
+Assembler taking six months to write can in no way be termed
+"productive", yet this extreme level of unproductivity is an inherent
+side-effect of going down the parallel-processing rabbithole.
+
+**In short, we are in "Programmer's nightmare" territory**
+
+Having dug a proverbial hole that rivals the Grand Canyon, and
+jumped in it feet-first, the next
+task is to piece together a strategy to climb back out and show
+how falling back in can be avoided. This takes some explaining,
+and first requires some background on various research efforts and
+commercial designs. Once the context is clear, their synthesis
+can be proposed. These are:
+
+* [ZOLC: Zero-Overhead Loop Control](https://ieeexplore.ieee.org/abstract/document/1692906/)
+* [OpenCAPI and Extra-V](https://dl.acm.org/doi/abs/10.14778/3137765.3137776)
+* [Snitch](https://arxiv.org/abs/2002.10143)
+
+**ZOLC: Zero-Overhead Loop Control**
+
+The simplest and longest-running commercially successful deployment of
+Zero-Overhead Looping has been in Texas Instruments' TMS320 DSPs.
+Up to fourteen sub-instructions
+within the VLIW word may be repeatedly deployed on successive clock
+cycles until a countdown reaches zero. This extraordinarily simple
+concept needs no branches, and has no complex Register Hazard
+Management in the hardware
+because it is down to the programmer (or the compiler)
+to ensure data overlaps do not occur.
+
+The key aspect of these
+very simplistic countdown loops is: *they are deterministic*.
+
+Zero-Overhead Loop Control takes this basic "single loop" concept
+way further: both nested loops and conditional exit are included,
+but also arbitrary control-jumping from the current inner loop
+out to an entirely different loop, all based on conditions determined
+dynamically at runtime.
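+
+A toy Python model of the basic mechanism (the descriptor format and
+the early-exit rule here are invented for illustration; real ZOLC
+supports arbitrary inter-loop jumps, not merely inner-loop exit):
+
+```python
+def zolc_run(loop_counts, body):
+    """Nested countdown loops with no branch instructions in the
+    instruction stream: a deterministic counter block produces the
+    index tuples, and the body may request early exit from the
+    innermost loop, decided dynamically at runtime."""
+    indices = [0] * len(loop_counts)
+    inner = len(loop_counts) - 1
+    while True:
+        if body(tuple(indices)):                    # runtime condition
+            indices[inner] = loop_counts[inner] - 1 # abandon inner loop
+        level = inner                   # countdown/carry logic: the
+        while level >= 0:               # "zero overhead" part
+            indices[level] += 1
+            if indices[level] < loop_counts[level]:
+                break
+            indices[level] = 0
+            level -= 1
+        if level < 0:
+            return
+
+visited = []
+# a 2x3x4 nest whose inner loop is abandoned once j + k exceeds 3
+zolc_run((2, 3, 4), lambda idx: visited.append(idx) or idx[1] + idx[2] > 3)
+print(len(visited), "iterations instead of", 2 * 3 * 4)   # 22 instead of 24
+```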
+
+Even when deployed on as basic a CPU as a single-issue in-order RISC
+core, the performance and power-savings were astonishing: reductions
+of between 20 and **80%** in algorithm completion times were achieved compared
+to a more traditional branch-speculative in-order RISC CPU. MPEG
+Decode, the target algorithm specifically picked by the researcher
+due to its high complexity, with 6-deep nested loops and conditional
+execution that frequently jumped in and out of at least 2 loops,
+came out with a remarkable 43% improvement in completion time. 43%
+fewer instructions executed is an almost unheard-of level of optimisation:
+most ISA designers are elated if they can achieve 5 to 10%. The reduction
+was so compelling that ST Microelectronics put it into commercial
+production in one of their embedded CPUs.
+
+The kicker: when implementing SVP64's Matrix REMAP Schedule, the VLSI
+design of its triple-nested for-loop system
+turned out to be remarkably similar to the
+core nested for-loop engine of ZOLC. In hindsight this should not
+have come as a surprise, because both are basically nested for-loops
+that do not need branches to issue instructions.
+
+The important insight is, however, that if ZOLC can be general-purpose
+and apply deterministic nested looped instruction
+schedules to more than just registers
+(unlike SVP64 in its current incarnation) then so can SVP64.
+
+**OpenCAPI and Extra-V**
+
+OpenCAPI is a deterministic, high-performance, high-bandwidth, low-latency,
+cache-coherent Memory-access Protocol that is integrated into IBM's
+Supercomputing-class POWER9 and POWER10 processors. POWER10 *only*
+has OpenCAPI Memory interfaces, and requires an OpenCAPI-to-DDR4/5 Bridge PHY
+to connect to standard DIMMs.
+
+Extra-V appears to be a remarkable research project that, by leveraging
+OpenCAPI, and assuming that the map of edges in any given arbitrary data graph
+could be kept in-memory by the main CPU, could distribute and delegate
+a limited-capability, deterministic, but most importantly *data-dependent*
+node-walking schedule right down into the memory itself (on the
+other side of that L1-4 cache barrier). A miniature processor analysed
+the data it had read (at the Memory) and determined whether it should
+notify the main processor that this "Node" was worth investigating,
+or whether the Graph node-walk should split off in a different direction.
+Thanks to the OpenCAPI Standard, which takes care of Virtual Memory
+abstraction, locking, and cache-coherency, many of the nightmare problems
+of other, more explicit parallel processing paradigms disappear.
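+
+A purely conceptual Python sketch of that delegation (Extra-V's actual
+interfaces look nothing like this; the graph, the `pe_walk` helper and
+the predicate are all invented to show only the division of labour):
+the host hands a predicate to a tiny Processing Element living beside
+the memory, and only the node *identifiers* judged interesting cross
+back over the cache hierarchy, never the bulk graph data.
+
+```python
+graph = {0: [1, 2], 1: [3], 2: [3, 4], 3: [], 4: [0]}   # edges, in DRAM
+node_data = {0: 5, 1: 42, 2: 7, 3: 99, 4: 1}            # values, in DRAM
+
+def pe_walk(start, interesting):
+    """Runs 'at the Memory': one conditional loop over edges."""
+    seen, frontier, notify = {start}, [start], []
+    while frontier:
+        node = frontier.pop()
+        if interesting(node_data[node]):
+            notify.append(node)         # tiny message to the main CPU
+        for nxt in graph[node]:         # data-dependent walk direction
+            if nxt not in seen:
+                seen.add(nxt)
+                frontier.append(nxt)
+    return notify
+
+print(pe_walk(0, lambda v: v > 10))     # -> [3, 1]: only these cross the bus
+```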
+
+The similarity to ZOLC should not have gone unnoticed: where ZOLC
+has nested conditional for-loops, Extra-V appears to have just the
+one conditional for-loop. The key, strategically-crucial
+part of this multi-faceted puzzle is that, due to the deterministic and
+coherent nature of Extra-V, the processing of the loops, which
+requires only a tiny processor, is not
+done close to the CPU at all: it is
+*embedded right next to the memory*.
+
+The similarity to D-Matrix's Systolic Array Processing, Aspex
+Microelectronics' Array-String Processing, and Elixent's 2D Array
+Processing should also not have gone unnoticed. All of these solutions
+utilised, or utilise,
+a more comprehensive Turing-complete von-Neumann "Management Core"
+to coordinate data passed in and out of PEs: none of them had, or
+have, anything
+as powerful as OpenCAPI as part of that picture.
+
+**Snitch**