Clean up remap matrix instruction sections

[libreriscv.git] / openpower / sv / SimpleV_rationale.mdwn
diff --git a/openpower/sv/SimpleV_rationale.mdwn b/openpower/sv/SimpleV_rationale.mdwn

index aecdc58efbfb177ed0fa3fe81e51b7ecc5ccbb60..ea1bf2b9e6f0b9652eee9f54e00dec0ecc21525f 100644
--- a/openpower/sv/SimpleV_rationale.mdwn
+++ b/openpower/sv/SimpleV_rationale.mdwn
@@ -2,8 +2,12 @@
  
  **Revision History**
  
-* v0.00 05may2021 first created
-* v0.01 06may2021 initial first draft
+* v0.00 05may2022 first created
+* v0.01 06may2022 initial first draft
+* v0.02 08may2022 add scenarios / use-cases
+* v0.03 09may2022 add draft image for scenario
+* v0.04 14may2022 add appendix with other research
+* v0.05 14jun2022 update images (thanks to Veera)
  
  **Table of Contents**
  
@@ -15,8 +19,9 @@
  
  Inventing a new Scalar ISA from scratch is over a decade-long task
  including simulators and compilers: OpenRISC 1200 took 12 years to
-mature.  A Vector or Packed SIMD ISA to reach stable *general-purpose*
-auto-vectorisation compiler support has never been achieved in the
+mature.  Stable Open ISAs require Standards and Compliance Suites that
+take longer still. A Vector or Packed SIMD ISA to reach stable *general-purpose*
+auto-vectorization compiler support has never been achieved in the
  history of computing, not with the combined resources of ARM, Intel,
  AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. (*Hand-crafted
  assembler and direct use of intrinsics is the Industry-standard norm
@@ -33,15 +38,22 @@ this task, and what, in Computer Science, actually needs solving?
  
  First hints are that whilst memory bitcells have not increased in speed
  since the 90s (around 150 mhz), increasing the bank width, striping, and
-datapath widths and speeds to the same has allowed
-significant apparent speed increases: 3200 mhz DDR4 and even faster DDR5,
-and other advanced Memory interfaces such as HBM, Gen-Z, and OpenCAPI,
+datapath widths and speeds to the same has, with significant relative
+latency penalties, allowed
+apparent speed increases: 3200 mhz DDR4 and even faster DDR5,
+and other advanced Memory interfaces such as HBM, Gen-Z, and OpenCAPI's OMI,
  all make an effort (all simply increasing the parallel deployment of
  the underlying 150 mhz bitcells), but these efforts are dwarfed by the
  two nearly three orders of magnitude increase in CPU horsepower
  over the same timeframe. Seymour
  Cray, from his amazing in-depth knowledge, predicted that the mismatch
-would become a serious limitation, over two decades ago.  Some systems
+would become a serious limitation, over two decades ago.
+
+The latency gap between that bitcell speed and the CPU speed means that nothing can be done to help Random Access (unpredictable reads/writes). Caching helps only so
+much, and not at all with some types of workloads (FFTs are one of the worst)
+even though
+they are fully deterministic.
+Some systems
  at the time of writing are now approaching a *Gigabyte* of L4 Cache,
  by way of compensation, and as we know from experience even that will
  be considered inadequate in future.
@@ -87,7 +99,8 @@ it.  ARM's SVE/SVE2 is critically flawed (lacking the Cray `setvl`
  instruction that makes a truly ubiquitous Vector ISA) in ways that
  will become apparent over time as adoption increases. In the meantime
  programmers are, in direct violation of ARM's advice on how to use SVE2,
-trying desperately to use it as if it was Packed SIMD NEON.  The advice
+trying desperately to understand it by applying their experience
+of Packed SIMD NEON.  The advice from ARM
  not to create SVE2 assembler that is hardcoded to fixed widths is being
  disregarded, in favour of writing *multiple identical implementations*
  of a function, each with a different hardware width, and compelling
@@ -116,7 +129,7 @@ performance is concerned.
  
  Slowly, at this point, a realisation should be sinking in that, actually,
  there aren't as many really truly viable Vector ISAs out there, as the
-ones that are evolving in the general direction of Vectorisation are,
+ones that are evolving in the general direction of Vectorization are,
  in various completely different ways, flawed.
  
  **Successfully identifying a limitation marks the beginning of an
@@ -157,7 +170,8 @@ Software Ecosystem? Debian supports most of these including s390:
    Andes in Audio DSPs, WD in HDDs and SSDs. These are all
    astoundingly commercially successful
    multi-billion-unit mass volume markets that almost nobody
-  knows anything about. Included for completeness.
+  knows anything about, outside their specialised proprietary
+  niche. Included for completeness.
  
  In order of least controlled to most controlled, the viable
  candidates for further advancement are:
@@ -170,7 +184,8 @@ candidates for further advancement are:
    (Agreements between RISC-V *Members* to not engage in patent litigation
     does nothing to stop third party patents that *legitimately pre-date*
     the newly-created RISC-V ISA)
-* MIPS, SPARC, ARC, and others, simply have no viable ecosystem.
+* MIPS, SPARC, ARC, and others, simply have no viable publicly
+  managed ecosystem. They work well within their niche markets.
  * Power ISA: protected by IBM's extensive patent portfolio for Members
    of the OpenPOWER Foundation, covered by Trademarks, permitting
    and encouraging contributions, and having software support for over
@@ -237,11 +252,13 @@ of magnitude increase in the number of hand-written lines of assembler
  compared to a well-designed Cray-style Vector ISA with a `setvl`
  instruction.
  
-*Packed SIMD looped algorithms actually have to
+*<blockquote>
+Packed SIMD looped algorithms actually have to
  contain multiple implementations processing fragments of data at
  different SIMD widths: Cray-style Vectors have just the one, covering not
  just current architectural implementations but future ones with
-wider back-end ALUs as well.*
+wider back-end ALUs as well.
+</blockquote>*
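
Annotation: the point of the quoted passage can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not Power ISA or SVP64 code: `setvl` and `vector_add` are hypothetical names, and `maxvl` stands in for whatever vector length a given implementation supports. The single loop is written once and works regardless of hardware width.

```python
def setvl(remaining: int, maxvl: int) -> int:
    """Model of a Cray-style setvl: the hardware grants at most maxvl elements."""
    return min(remaining, maxvl)


def vector_add(a, b, maxvl):
    """Add two equal-length lists in strips of whatever length setvl grants."""
    assert len(a) == len(b)
    result = []
    i = 0
    while i < len(a):
        vl = setvl(len(a) - i, maxvl)  # strip length chosen by the "hardware"
        result.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return result


a = list(range(10))
b = [10 * x for x in a]
# The same loop runs unchanged on a narrow and on a wide implementation.
narrow = vector_add(a, b, maxvl=4)
wide = vector_add(a, b, maxvl=64)
```

A Packed SIMD equivalent would need a separate main loop per hardware width plus clean-up code for the leftover elements, which is the duplication the passage describes.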
  
  Assuming then that variable-length Vectors are obviously desirable,
  it becomes a matter of how, not if.  Both Cray and NEC SX Aurora
@@ -260,7 +277,9 @@ Vector instructions in RISC-V as there are in the RV64GC Scalar base.
  The question then becomes: with all the duplication of arithmetic
  operations just to make the registers scalar or vector, why not
  leverage the *existing* Scalar ISA with some sort of "context"
-or prefix that augments its behaviour? Make "Scalar instruction"
+or prefix that augments its behaviour? Separate out the
+"looping" from "thing being looped on" (the elements),
+make "Scalar instruction"
  synonymous with "Vector Element instruction" and through nothing
  more than contextual
  augmentation the Scalar ISA *becomes* the Vector ISA.
@@ -269,8 +288,11 @@ the Instruction Decode
  phase is greatly simplified, reducing design complexity and leaving
  plenty of headroom for further expansion.
  
+[[!img "svp64-primer/img/power_pipelines.svg" ]]
+
  Remarkably this is not a new idea.  Intel's x86 `REP` instruction
-gives the base concept, but in 1994 it was Peter Hsu, the designer
+gives the base concept, and the Z80 had something similar.
+But in 1994 it was Peter Hsu, the designer
  of the MIPS R8000, who first came up with the idea of Vector-augmented
  prefixing of an existing Scalar ISA.  Relying on a multi-issue Out-of-Order Execution Engine,
  the prefix would mark which of the registers were to be treated as
@@ -281,7 +303,8 @@ jammed multiple scalar operations into the Multi-Issue Execution
  Engine.  The only reason that the team did not take this forward
  into a commercial product
  was because they could not work out how to cleanly do OoO
-multi-issue at the time.
+multi-issue at the time (leveraging Multi-Issue is the most logical
+way to exploit the Vector-Prefix concept).
  
  In its simplest form, then, this "prefixing" idea is a matter
  of:
@@ -320,8 +343,8 @@ of the problem-space:
    that require significant programming effort in other ISAs.
  
  All of these things come entirely from "Augmentation" of the Scalar operation
-being prefixed: at no time is the Scalar operation significantly
-altered.
+being prefixed: at no time is the Scalar operation's binary pattern decoded
+differently compared to when it is used as a Scalar operation.
  From there, several more "Modes" can be added, including
  
  * saturation,
@@ -337,6 +360,10 @@ Sum)
    Boolean Logic in a Vector context, on top of an already-powerful
    Scalar Branch-Conditional/Counter instruction
  
+All of these features are added as "Augmentations", to create on
+the order of 1.5 *million* instructions, none of which decode the
+32-bit scalar suffix any differently.
+
  **What is missing from Power Scalar ISA that a Vector ISA needs?**
  
  Remarkably, very little: the devil is in the details though.
@@ -354,7 +381,7 @@ Remarkably, very little: the devil is in the details though.
    sequential carry-flag chaining of these scalar instructions.
  * The Condition Register Fields of the Power ISA make a great candidate
    for use as Predicate Masks, particularly when combined with
-  Vectorised `cmp` and Vectorised `crand`, `crxor` etc.
+  Vectorized `cmp` and Vectorized `crand`, `crxor` etc.
  
  It is only when looking slightly deeper into the Power ISA that
  certain things turn out to be missing, and this is down in part to IBM's
@@ -362,17 +389,17 @@ primary focus on the 750 Packed SIMD opcodes at the expense of the 250 or
  so Scalar ones.  Examples include that transfer operations between the
  Integer and Floating-point Scalar register files were dropped approximately
  a decade ago after the Packed SIMD variants were considered to be
-duplicates.  With it being completely inappropriate to attempt to Vectorise
+duplicates.  With it being completely inappropriate to attempt to Vectorize
  a Packed SIMD ISA designed 20 years ago with no Predication of any kind,
-the Scalar ISA, a much better all-round candidate for Vectorisation is
-left anaemic.
+the Scalar ISA, a much better all-round candidate for Vectorization 
+(the Scalar parts of Power ISA) is left anaemic.
  
  A particular key instruction that is missing is `MV.X` which is
  illustrated as `GPR(dest) = GPR(GPR(src))`. This horrendously
  expensive instruction causing a huge swathe of Register Hazards
  in one single hit is almost never added to a Scalar ISA but
  is almost always added to a Vector one. When `MV.X` is
-Vectorised it allows for arbitrary
+Vectorized it allows for arbitrary
  remapping of elements within a Vector to positions specified
  by another Vector. A typical Scalar ISA will use Memory to
  achieve this task, but with Vector ISAs the Vector Register Files are
@@ -452,31 +479,34 @@ on any source or destination Matrix
  may be performed in as little as 4 instructions, one of which
  is to zero-initialise the accumulator Vector used to store the result.
  If addition to another Matrix is also required then it is only three
-instructions. Not only that, but because the "Schedule" is an abstract
+instructions.
+
+Not only that, but because the "Schedule" is an abstract
  concept separated from the mathematical operation, there is no reason
  why Matrix Multiplication Schedules may not be applied to Integer
  Mul-and-Accumulate, Galois Field Mul-and-Accumulate, Logical
  AND-and-OR, or any other future instruction such as Complex-Number
-Multiply-and-Accumulate that a future version of the Power ISA might
+Multiply-and-Accumulate or Abs-Diff-and-Accumulate
+that a future version of the Power ISA might
  support.  The flexibility is not only enormous, but the compactness
-unprecedented.  RADIX2 in-place DCT Triple-loop Schedules may be created in
-around 11 instructions. The only other processors well-known to have
+unprecedented.  RADIX2 in-place DCT may be created in
+around 11 instructions using the Triple-loop DCT Schedule. The only other processors well-known to have
  this type of compact capability are both VLIW DSPs: TI's TMS320 Series
  and Qualcom's Hexagon, and both are targetted at FFTs only.
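
Annotation: the separation of "Schedule" from mathematical operation described above can be illustrated with a small Python sketch. The names `matrix_schedule` and `run_schedule` are hypothetical, and the plain i/j/k loop order is an assumption chosen for readability: it is not claimed to be the order that the SVP64 Matrix Schedule actually generates.

```python
def matrix_schedule(rows, inner, cols):
    """Yield (dest, left, right) flat row-major indices for a matrix product."""
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                yield i * cols + j, i * inner + k, k * cols + j


def run_schedule(a, b, rows, inner, cols, mul, acc, zero):
    """Apply any multiply-and-accumulate style operation over the schedule."""
    result = [zero] * (rows * cols)
    for d, l, r in matrix_schedule(rows, inner, cols):
        result[d] = acc(result[d], mul(a[l], b[r]))
    return result


# Integer Mul-and-Accumulate on two 2x2 row-major matrices.
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
int_product = run_schedule(a, b, 2, 2, 2,
                           lambda x, y: x * y, lambda s, p: s + p, 0)

# Logical AND-and-OR using the very same schedule.
p = [True, False, False, True]
q = [False, True, True, False]
bool_product = run_schedule(p, q, 2, 2, 2,
                            lambda x, y: x and y, lambda s, t: s or t, False)
```

Only the `mul`, `acc` and `zero` arguments change between the two uses: the index generator is untouched, which is the property the paragraph relies on.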
  
  There is no reason at all why future algorithmic schedules should not
  be proposed as extensions to SVP64 (sorting algorithms,
  compression algorithms, Sparse Data Sets, Graph Node walking
-for example). Bear in mind that
+for example). (*Bear in mind that
  the submission process will be
  entirely at the discretion of the OpenPOWER Foundation ISA WG,
-something that is both encouraged and welcomed by the OPF.
+something that is both encouraged and welcomed by the OPF.*)
  
  One of SVP64's current limitations is that it was initially designed
  for 3D and Video workloads as a hybrid GPU-VPU-CPU. This resulted in
  a heavy focus on adding hardware-for-loops onto the *Registers*.
  After more than three years of development the realisation hit that
-the SVP64 concept could be expanded to Coherent Distributed Memory,
+the SVP64 concept could be expanded to Coherent Distributed Memory.
  This astoundingly powerful concept is explored in the next section.
  
  # Coherent Deterministic Hybrid Distributed In-Memory Processing
@@ -527,7 +557,17 @@ an algorithm had to be hand-crafted then compared, and only
  the best one selected: all others discarded. 20 lines of optimised
  Assembler taking three to six months to write can in no way be termed
  "productive", yet this extreme level of unproductivity is an inherent
-side-effect of going down the parallel-processing rabbithole.
+side-effect of going down the parallel-processing rabbithole where
+the cost of providing "Traditional" programmability (Virtual Memory,
+SMP) is worse than counter-productive: it is often outright impossible.
+
+*<blockquote>
+Similar to how GPUs achieve astounding task-dedicated
+performance by giving
+ALUs 30% of total silicon area and sacrificing the ability to run
+General-Purpose programs, Aspex, Google's Tensor Processor and D-Matrix
+likewise took this route and made the same compromise.
+</blockquote>*
  
  **In short, we are in "Programmer's nightmare" territory**
  
@@ -540,15 +580,17 @@ commercial designs.  Once the context is clear, their synthesis
  can be proposed.  These are:
  
  * [ZOLC: Zero-Overhead Loop Control](https://ieeexplore.ieee.org/abstract/document/1692906/)
+ available [no paywall](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.301.4646&rep=rep1&type=pdf)
  * [OpenCAPI and Extra-V](https://dl.acm.org/doi/abs/10.14778/3137765.3137776)
  * [Snitch](https://arxiv.org/abs/2002.10143)
  
  **ZOLC: Zero-Overhead Loop Control**
  
  Zero-Overhead Looping is the concept of automatically running a set sequence
-of instructions for a predetermined number of times, without requiring
-a branch. This is slightly different from using Power ISA `bc` in `CTR`
-(Counter) Mode, because in ZOLC the branch-back is automatic.
+of instructions a predetermined number of times, without requiring
+a branch. This is conceptually similar to, but
+slightly different from, using Power ISA `bc` in `CTR`
+(Counter) Mode to create loops, because in ZOLC the branch-back is automatic.
  
  The simplest longest commercially successful deployment of Zero-overhead looping
  has been in Texas Instruments TMS320 DSPs. Up to fourteen sub-instructions
@@ -561,11 +603,14 @@ to ensure data overlaps do not occur.  Careful crafting of those
  14 instructions can keep the ALUs 100% occupied for sustained periods,
  and the iconic example for which the TI DSPs are renowned
  is that an entire inner loop for large FFTs
-can be done with that one VLIW word: no stalls, no stopping, no fuss.
+can be done with that one VLIW word: no stalls, no stopping, no fuss,
+an entire 1024 or 4096 wide FFT Layer in one instruction.
  
+<blockquote>
  The key aspect of these
  very simplistic countdown loops as far as we are concerned:
  is: *they are deterministic*.
+</blockquote>
  
  Zero-Overhead Loop Control takes this basic "single loop" concept
  way further: both nested loops and conditional exit are included,
@@ -574,10 +619,10 @@ out to an entirely different loop, all based on conditions determined
  dynamically at runtime.
  
  Even when deployed on as basic a CPU as a single-issue in-order RISC
-core, the performance and power-savings were astonishing: between 20
-and **80%** reduction in algorithm completion times were achieved compared
+core, the performance and power-savings were astonishing: between 27
+and **75%** reduction in algorithm completion times were achieved compared
  to a more traditional branch-speculative in-order RISC CPU.  MPEG
-Decode, the target algorithm specifically picked by the researcher
+Encode, the target algorithm specifically picked by the researcher
  due to its high complexity with 6-deep nested loops and conditional
  execution that frequently jumped in and out of at least 2 loops,
  came out with an astonishing 43% improvement in completion time. 43%
@@ -601,9 +646,15 @@ schedules to more than just registers
  **OpenCAPI and Extra-V**
  
  OpenCAPI is a deterministic high-performance, high-bandwidth, low-latency
-cache-coherent Memory-access Protocol that is integrated into IBM's Supercomputing-class POWER9 and POWER10 processors. POWER10 *only*
-has OpenCAPI Memory interfaces, and requires an OMI-to-DDR4/5 Bridge PHY
-to connect to standard DIMMs.
+cache-coherent Memory-access Protocol that is integrated into IBM's Supercomputing-class POWER9 and POWER10 processors.
+
+<blockquote>(Side note:
+POWER10 *only*
+has OpenCAPI Memory interfaces: an astounding number of them,
+with overall bandwidth so high it's actually difficult to conceptualise.
+An OMI-to-DDR4/5 Bridge PHY is therefore required
+to connect to standard Memory DIMMs.)
+</blockquote>
  
  Extra-V appears to be a remarkable research project based on OpenCAPI that,
  by assuming that the map of edges (excluding the actual data)
@@ -652,22 +703,26 @@ the efficiency and effectiveness
  of these Load-Store-with-Increment instructions has been
  forgotten until Snitch.
  
-What the designers did however was not to add new Load-Store
-or Arithmetic instructions to RISC-V, but instead to "mark"
-registers with a tag.  These tags tell the CPU: when you are asked to
+What the designers did however was not to add any new Load-Store
+or Arithmetic instructions to the underlying RISC-V at all, but instead to "mark"
+registers with a tag which *augmented* (altered) the behaviour
+of *existing* instructions.  These tags tell the CPU: when you are asked to
  carry out
  an add instruction on r6 and r7, do not take r6 or r7 from the register
  file, instead please perform a Cache-coherent Load-with-Increment
-on each, using special Address Registers for each.  Each new use
+on each, using special (hidden, implicit)
+Address Registers for each.  Each new use
  of r6 therefore brings in an entirely new value *directly from
  memory*. Likewise on the second operand, r7, and likewise on
  the destination result which can be an automatic Coherent
  Store-and-increment
-directly into Memory. In essence:
+directly into Memory. 
  
+<blockquote>
  *The act of "reading" or "writing" a register has been decoupled
  and intercepted, then connected transparently to a completely
  separate Coherent Memory Subsystem*
+</blockquote>
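
Annotation: a minimal Python model of the register-tagging idea described above. Everything here is a hypothetical sketch, not Snitch's actual design: the class and method names are invented, memory is a word-addressed list, and the increment is fixed at one word. It shows only how an unmodified `add` turns into a streaming memory operation once its registers are tagged.

```python
class StreamingRegisterFile:
    """Register file in which tagged registers stream to and from memory."""

    def __init__(self, memory, num_regs=32):
        self.memory = memory
        self.regs = [0] * num_regs
        self.stream_addr = {}  # register number -> hidden address register

    def tag(self, reg, start_addr):
        """Mark a register so that accesses to it are redirected to memory."""
        self.stream_addr[reg] = start_addr

    def read(self, reg):
        if reg in self.stream_addr:  # Load-with-Increment
            addr = self.stream_addr[reg]
            self.stream_addr[reg] = addr + 1
            return self.memory[addr]
        return self.regs[reg]

    def write(self, reg, value):
        if reg in self.stream_addr:  # Store-with-Increment
            addr = self.stream_addr[reg]
            self.stream_addr[reg] = addr + 1
            self.memory[addr] = value
        else:
            self.regs[reg] = value


def add(rf, rd, ra, rb):
    """An ordinary scalar add: it has no knowledge of the tags."""
    rf.write(rd, rf.read(ra) + rf.read(rb))


memory = [1, 2, 3, 10, 20, 30, 0, 0, 0]
rf = StreamingRegisterFile(memory)
rf.tag(6, 0)  # r6 streams from memory[0...]
rf.tag(7, 3)  # r7 streams from memory[3...]
rf.tag(5, 6)  # r5 streams results to memory[6...]
for _ in range(3):
    add(rf, 5, 6, 7)
```

After the loop, the three sums sit in `memory[6:9]` and the untagged registers have not been touched.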
  
  On top of a barrel-architecture the slowness of Memory access
  was not a problem because the Deterministic nature of classic
@@ -776,10 +831,30 @@ execute the exact same ISA (or a subset of it). If however the
  concept of Hybrid PE-Memory Processing were to become a JEDEC Standard,
  which would increase adoption and reduce cost, a bit more thought
  is required here because ARM or Intel or MIPS might not necessarily
-be happy that a Processing Element has to execute Power ISA binaries.
+be happy that a Processing Element (PE) has to execute Power ISA binaries.
  At least the Power ISA is much richer, more powerful, still RISC,
  and is an Open Standard, as discussed in a earlier sections.
  
+A reasonable compromise as a JEDEC Standard is illustrated with
+the following diagram: a 3-way Bridge PHY that allows for full
+direct interaction between DRAM ICs, PEs, and one or more main CPUs
+(*a variant of the Northbridge and/or IBM POWER10 OMI-to-DDR5 PHY concept*).
+It is also the ideal location for a "Management Core".
+If the 3-way Bridge (4-way if connectivity to other Bridge PHYs
+is also included) does not itself have PEs built-in then the ISA
+utilised on any PE or CPU is non-critical.  The only concern regarding
+mixed ISAs is that the PHY should be capable of transferring any and
+all types of "Management" packets, particularly PE Virtual Memory Management
+and Register File Control (Context-switch Management given that the PEs
+are expected to be ALU-heavy and not capable of running a full SMP Operating
+System).
+
+There is also no reason why this type of arrangement should not be deployed
+in Multi-Chip-Module (aka "Chiplet") form, giving all the advantages of
+the performance boost that goes with smaller line-drivers.
+
+<img src="/openpower/sv/bridge_phy.svg" width=600 />
+
  # Transparently-Distributed Vector Processing
  
  It is very strange to the author to be describing what amounts to a
@@ -797,13 +872,230 @@ not that straightforward: programs
  have to be "massaged" by tools that insert intrinsics into the
  source code, in order to identify the Basic Blocks that the Zero-Overhead
  Loops can run. Can this be merged into standard gcc and llvm
-compilers? As intrinsics: of course. Can it become part of auto-vectorisation? Probably,
+compilers? As intrinsics: of course. Can it become part of auto-vectorization? Probably,
  if an infinite supply of money and engineering time is thrown at it.
  Is a half-way-house solution of compiler intrinsics good enough?
  Intel, ARM, MIPS, Power ISA and RISC-V have all already said "yes" on that,
  for several decades, and advanced programmers are comfortable with the
  practice.
  
+Additional questions remain as to whether OpenCAPI or its use for this
+particular scenario requires that the PEs, even quite basic ones,
+implement a full RADIX MMU and associated TLB lookup. In order to ensure
+that programs may be cleanly and seamlessly transferred between PEs
+and CPU the answer is quite likely to be "yes", which is interesting
+in and of itself.  Fortunately, the associated L1 Cache with TLB
+Translation does not have to be large, and the actual RADIX Tree Walk
+need not explicitly be done by the PEs: it can be handled by the main
+CPU as a software-extension. PEs generate a TLB Miss notification
+to the main CPU over OpenCAPI, and the main CPU feeds back the new
+TLB entries to the PE in response.
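
Annotation: the TLB-miss hand-off described above can be sketched as follows. This is a hypothetical model only: the class names, the FIFO eviction policy, the 4096-byte page size and the two-entry TLB are all assumptions, and the OpenCAPI transport is reduced to a direct method call.

```python
class MainCPU:
    """Owns the page table and services TLB misses on behalf of the PEs."""

    def __init__(self, page_table):
        self.page_table = page_table  # virtual page -> physical page
        self.misses_served = 0

    def handle_tlb_miss(self, vpage):
        self.misses_served += 1
        return self.page_table[vpage]  # stands in for the RADIX Tree Walk


class ProcessingElement:
    """A PE with a tiny TLB and no page-table walker of its own."""

    def __init__(self, cpu, tlb_entries=2, page_size=4096):
        self.cpu = cpu
        self.tlb = {}
        self.tlb_entries = tlb_entries
        self.page_size = page_size

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, self.page_size)
        if vpage not in self.tlb:
            if len(self.tlb) >= self.tlb_entries:
                # Evict the oldest entry (dicts preserve insertion order).
                self.tlb.pop(next(iter(self.tlb)))
            # Miss notification goes to the main CPU, which replies with the entry.
            self.tlb[vpage] = self.cpu.handle_tlb_miss(vpage)
        return self.tlb[vpage] * self.page_size + offset


cpu = MainCPU({0: 7, 1: 3, 2: 9})
pe = ProcessingElement(cpu)
physical = [pe.translate(v) for v in (0x10, 0x20, 4096 + 1)]
```

The second access hits the entry installed by the first, so only two misses reach the main CPU for the three translations.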
+
+Also in practical terms, with the PEs anticipated to be so small as to
+make running a full SMP-aware OS impractical, it will not just be their TLB
+pages that need remote management: their entire register file, including
+the Program Counter, will need to be set up, and the ZOLC Context as
+well. With OpenCAPI packet formats being quite large, a concern is that
+the context management increases latency to the point where the premise
+of this paper is invalidated. Research is needed here as to whether a
+bare-bones microkernel
+would be viable, or a Management Core closer to the PEs (on the same
+die or Multi-Chip-Module as the PEs) would allow better bandwidth and
+reduce Management Overhead on the main CPUs.  However if
+the same level of power saving as Snitch (1/6th) and
+the same sort of reduction in algorithm runtime as ZOLC (27 to 75%) are not
+unreasonable to expect, this is
+definitely compelling enough to warrant in-depth investigation.
+
+**Use-case: Matrix and Convolutions**
+
+<img src="/openpower/sv/sv_horizontal_vs_vertical.svg" />
+
+First, some important definitions, because there are two different
+Vectorization Modes in SVP64:
+
+* **Horizontal-First**: (aka standard Cray Vectors) walk
+  through **elements** first before moving to next **instruction**
+* **Vertical-First**: walk through **instructions** before
+  moving to next **element**.  Currently managed by `svstep`;
+  ZOLC may be deployed to manage the stepping, in a Deterministic manner.
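
Annotation: the two definitions above differ only in loop nesting order, which a short Python sketch makes concrete. The function names and the string "instructions" are illustrative assumptions: each function returns the order in which (instruction, element) pairs would be issued.

```python
def horizontal_first(instructions, vl):
    """All elements of one instruction, then move to the next instruction."""
    return [(insn, el) for insn in instructions for el in range(vl)]


def vertical_first(instructions, vl):
    """All instructions for one element, then step to the next element."""
    return [(insn, el) for el in range(vl) for insn in instructions]


program = ["ld", "mul"]
h_trace = horizontal_first(program, vl=2)
v_trace = vertical_first(program, vl=2)
```

Both traces contain the same four pairs. In the Vertical-First trace the outer loop over elements is the part that `svstep` currently manages.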
+
+Second:
+SVP64 Draft Matrix Multiply is currently set up to arrange a Schedule
+of Multiply-and-Accumulates, suitable for pipelining, that will,
+ultimately, result in a Matrix Multiply. Normal processors are forced
+to perform "loop-unrolling" in order to achieve this same Schedule.
+SIMD processors are further forced into a situation of pre-arranging rotated
+copies of data if the Matrices are not exactly on a power-of-two boundary.
+
+The current limitation of SVP64 however is (when Horizontal-First
+is deployed, at least, which is the least number of instructions)
+that both source and destination Matrices have to be in-registers,
+in full.  Vertical-First may be used to perform a LD/ST within
+the loop, covered by `svstep`, but it is still not ideal.  This
+is where the Snitch and Extra-V concepts kick in.
+
+<img src="/openpower/sv/matrix_svremap.svg" />
+
+Imagine a large Matrix scenario, with several values close to zero that
+could be skipped: no need to include zero-multiplications, but a
+traditional CPU can in no way help: only by loading the data through
+the L1-L4 Cache and Virtual Memory Barriers is it possible to
+ascertain, retrospectively, that time and power had just been wasted.
+
+SVP64 is able to do what is termed "Vertical-First" Vectorization,
+combined with SVREMAP Matrix Schedules.  Imagine that SVREMAP has been
+extended, Snitch-style, to perform a deterministic memory-array walk of
+a large Matrix.
+
+Let us also imagine that the Matrices are stored in Memory with PEs
+attached, and that the PEs are fully functioning Power ISA with Draft
+SVP64, but their Multiply capability is not as good as the main CPU.
+Therefore:
+we want the PEs to conditionally
+feed sparse data to the main CPU, a la "Extra-V".
+
+* The ZOLC SVREMAP System running on the main CPU generates a Matrix
+  Memory-Load Schedule.
+* The Schedule is sent to the PEs, next to the Memory, via OpenCAPI
+* The PEs are also sent the Basic Block to be executed on each
+  Memory Load (each element of the Matrices to be multiplied)
+* The PEs execute the Basic Block and **exclude**, in a deterministic
+  fashion, any elements containing Zero values
+* Non-zero elements are sent, via OpenCAPI, to the main CPU, which
+  queues sequences of Multiply-and-Accumulate, and feeds the results
+  back to Memory, again via OpenCAPI, to the PEs.
+* The PEs, which are tracking the Sparse Conditions, know where
+  to store the results received
+
+In essence this is near-identical to the original Snitch concept
+except that there are, like Extra-V, PEs able to perform
+conditional testing of the data as it goes both to and from the
+main CPU.  In this way a large Sparse Matrix Multiply or Convolution
+may be achieved without having to pass unnecessary data through
+L1/L2/L3 Caches only to find, at the CPU, that it is zero.
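
Annotation: a minimal Python sketch of the data flow listed above, under heavy simplification. `pe_filter` and `sparse_matmul` are hypothetical names, the PE and CPU are ordinary functions in one process, and OpenCAPI traffic is modelled only as a counter of operand pairs forwarded to the CPU.

```python
def pe_filter(a_row, b_col):
    """PE side: forward only operand pairs in which neither value is zero."""
    for x, y in zip(a_row, b_col):
        if x != 0 and y != 0:
            yield x, y


def sparse_matmul(a, b):
    """CPU side: Multiply-and-Accumulate only what the PEs forward."""
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[0] * cols for _ in range(rows)]
    transfers = 0
    for i in range(rows):
        for j in range(cols):
            b_col = [b[k][j] for k in range(inner)]
            for x, y in pe_filter(a[i], b_col):
                transfers += 1  # one operand pair crosses to the CPU
                result[i][j] += x * y
    return result, transfers


a = [[1, 0], [0, 2]]
b = [[0, 3], [4, 0]]
product, transfers = sparse_matmul(a, b)
```

A dense 2x2 multiply would move eight operand pairs. Here only the pairs that can contribute to the result are forwarded, and the product is unchanged.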
+
+The reason in this case for the use of Vertical-First Mode is the
+conditional execution of the Multiply-and-Accumulate.
+Horizontal-First Mode is the standard Cray-Style Vectorization:
+loop on all *elements* with the same instruction before moving
+on to the next instruction. Horizontal-First
+Predication needs to be pre-calculated
+for the entire Vector in order to exclude certain elements from
+the computation. In this case, that's an expensive inconvenience 
+(remarkably similar to the problems associated with Memory-to-Memory
+Vector Machines such as the CDC Star-100).
+
+Vertical-First allows *scalar* instructions and
+*scalar* temporary registers to be utilised
+in the assessment as to whether a particular Vector element should
+be skipped, utilising a straight Branch instruction *(or ZOLC
+Conditions)*.  The Vertical Vector technique
+was pioneered by Mitch Alsup and is a key feature of his VVM Extension
+to MyISA 66000.  Careful analysis of the registers within the
+Vertical-First Loop allows a Multi-Issue Out-of-Order Engine to
+*amortise in-flight scalar looped operations into SIMD batches*
+as long as the loop is kept small enough to entirely fit into
+in-flight Reservation Stations in the first place.
+
+*<blockquote>
+(With thanks and gratitude to Mitch Alsup on comp.arch for
+spending considerable time explaining VVM, how its Loop
+Construct explicitly identifies loop-invariant registers,
+and how that helps Register Hazards and SIMD amortisation
+on a GB-OoO Micro-architecture)
+</blockquote>*
+
+Draft Image (placeholder):
+
+<img src="/openpower/sv/zolc_svp64_extrav.svg" width=800 />
+
+The program being executed is a simple loop with a conditional
+test that ignores the multiply if the input is zero.
+
+* In the CPU-only case (top) the data goes through L1/L2
+  Cache before reaching the CPU.
+* However the PE version does not send zero-data to the CPU,
+  and even when it does it goes into a Coherent FIFO: no real
+  compelling need to enter L1/L2 Cache or even the CPU Register
+  File (one of the key reasons why Snitch saves so much power).
+* In the PE-only version (see next use-case) the CPU is mostly
+  idle, serving RADIX MMU TLB requests for PEs, and OpenCAPI
+  requests.
+
+**Use-case variant: More powerful in-memory PEs**
+
+An obvious variant of the above is that, if there is inherently
+more parallelism in the data set, then the PEs get their own
+Multiply-and-Accumulate instruction, and rather than send the
+data to the CPU over OpenCAPI, perform the Matrix-Multiply
+directly themselves.
+
+However the source code and binary would be near-identical if
+not identical in every respect, with the PEs implementing the full
+ZOLC capability in order to compact binary size to the bare minimum.
+The main CPU's role would be to coordinate and manage the PEs
+over OpenCAPI.
+
+One key strategic question does remain: do the PEs need to have
+a RADIX MMU and associated TLB-aware minimal L1 Cache, in order
+to support OpenCAPI properly? The answer is very likely to be yes.
+The saving grace here is that with
+the expectation of running only hot-loops with ZOLC-driven
+binaries, the size of each PE's TLB-aware
+L1 Cache needed would be minuscule compared
+to the average high-end CPU.
+
+**Comparison of PE-CPU to GPU-CPU interaction**
+
+The informed reader will have noted the remarkable similarity between how
+a CPU communicates with a GPU to schedule tasks, and the proposed
+architecture.  CPUs schedule tasks with GPUs as follows:
+
+* User-space program encounters an OpenGL function, in the
+  CPU's ISA.
+* Proprietary GPU Driver, still in the CPU's ISA, prepares a
+  Shader Binary written in the GPU's ISA.
+* GPU Driver wishes to transfer both the data and the Shader Binary
+  to the GPU. It may only do so via Shared Memory, usually
+  DMA over PCIe (assuming a PCIe Graphics Card)
+* GPU Driver which has been running in CPU userspace notifies CPU
+  kernelspace of the desire to transfer data and GPU Shader Binary
+  to the GPU. A context-switch occurs...
+
+It is almost unfair to burden the reader with further details.
+The extraordinarily convoluted procedure is as bad as it sounds. Hundreds
+of thousands of tasks per second are scheduled this way, with hundreds
+of megabytes of data per second being exchanged as well.
+
+Yet, the process is not that different from how things would work
+with the proposed microarchitecture: the differences however are key.
+
+* Both PEs and CPU run the exact same ISA.  A major complexity of 3D GPU
+  and CUDA workloads (JIT compilation etc) is eliminated, and, crucially,
+  the CPU may directly execute the PE's tasks, if needed. This simply
+  is not even remotely possible on GPU Architectures.
+* Where GPU Drivers use PCIe Shared Memory, the proposed architecture
+  deploys OpenCAPI.
+* Where GPUs are a foreign architecture and a foreign ISA, the proposed
+  architecture only narrowly misses being defined as big/LITTLE Symmetric
+  Multi-Processing (SMP) by virtue of the massively-parallel PEs
+  being a bit light on L1 Cache, in favour of large ALUs and proximity
+  to Memory, and requiring a modest amount of "helper" assistance with
+  their Virtual Memory Management.
+* The proposed architecture has the markup points embedded into the
+  binary programs
+  where PEs may take over from the CPU, and there is accompanying
+  (planned) hardware-level assistance at the ISA level.  GPUs, which have to
+  work with a wide range of commodity CPUs, cannot in any way expect
+  ARM or Intel to add support for GPU Task Scheduling directly into
+  the ARM or x86 ISAs!
+
+On this last point it is crucial to note that SVP64 drew its inspiration
+from a Hybrid CPU-GPU-VPU paradigm (like ICubeCorp's IC3128) and
+consequently has versatility that the separate specialisation of both
+GPU and CPU architectures lacks.
+
  **Roadmap summary of Advanced SVP64**
  
  The future direction for SVP64, then, is:
@@ -823,6 +1115,10 @@ The future direction for SVP64, then, is:
    exploiting both the Deterministic nature of ZOLC / SVREMAP
    combined with the Cache-Coherent nature of OpenCAPI,
    to the maximum extent possible.
+* To explore "Remote Management" of PE RADIX MMU, TLB, and
+  Context-Switching (register file transfer) by proxy,
+  over OpenCAPI, to ensure that the distributed PEs are as
+  close to a Standard SMP model as possible, for programmers.
  * To make the exploitation of this powerful solution as simple
    and straightforward as possible for Software Engineers to use,
    in standard common-usage compilers, gcc and llvm.
@@ -836,8 +1132,88 @@ expand SVP64's capability for Matrices, currently limited to
  around 5x7 to 6x6 Matrices and constrained by the size of
  the register files (128 64-bit entries), to arbitrary (massive) sizes.
  
+**Summary**
+
+There are historical and current efforts that step away from both a
+general-purpose architecture and from the practice of using compiler
+intrinsics in general-purpose compute to make programmers' lives easier.
+A classic example is the Cell Processor (Sony PS3), which required
+programmers to use DMA to schedule processing tasks. These specialist
+high-performance architectures are only tolerated for
+as long as there is no equivalent performant alternative that is
+easier to program.
+
+Combining SVP64 with ZOLC and OpenCAPI can produce an extremely powerful
+architectural base that fits well with intrinsics embedded into standard
+general-purpose compilers (gcc, llvm) as a pragmatic compromise which makes
+it useful right out of the gate. Further R&D may target compiler technology
+that brings it on-par with NVIDIA, Graphcore, AMDGPU, but with intrinsics
+there is no critical product launch dependence on having such
+advanced compilers.
+
  Bottom line is that there is a clear roadmap towards solving a long
  standing problem facing Computer Science and doing so in a way that
  reduces power consumption reduces algorithm completion time and reduces
  the need for complex hardware microarchitectures in favour of much
-smaller distributed coherent Processing Elements.
+smaller distributed coherent Processing Elements with a Heterogeneous ISA
+across the board.
+
+# Appendix
+
+**Samsung PIM**
+
+Samsung's
+[Processing-in-Memory](https://semiconductor.samsung.com/emea/newsroom/news/samsung-brings-in-memory-processing-power-to-wider-range-of-applications/)
+seems to be ready to launch as a
+[commercial product](https://semiconductor.samsung.com/insights/technology/pim/)
+that uses HBM as its Memory Standard,
+has "some logic suitable for AI", has parallel processing elements,
+and offers 70% reduction
+in power consumption and a 2x performance increase in speech
+recognition. Details beyond that as to its internal workings
+or programmability are minimal; however, given the similarity
+to D-Matrix and Google TPU it is reasonable to place it in the
+same category.
+
+* [Samsung PIM IEEE Article](https://spectrum.ieee.org/samsung-ai-memory-chips)
+  explains that there are 9 instructions, mostly FP16 arithmetic,
+  and that it is designed to "complement" AI rather than compete.
+  With only 9 instructions, 2 of which will be LOAD and STORE,
+  conditional code execution seems unlikely.
+  Silicon area in DRAM is increased by 5% for a much greater reduction
+  in power. The article notes, pointedly, that programmability will
+  be a key deciding factor.  The article also notes that Samsung has
+  proposed its architecture as a JEDEC Standard.
+
+**PIM-HBM Research**
+
+[Presentation](https://ieeexplore.ieee.org/document/9073325/) by Seongguk Kim
+and associated [video](https://www.youtube.com/watch?v=e4zU6u0YIRU)
+showing 3D-stacked DRAM connected to GPUs. It notes that even HBM, due to
+large GPU size, is less advantageous than it should be.  Processing-in-Memory
+is therefore logically proposed. The PE (named a Streaming Multiprocessor)
+is much more sophisticated, comprising Register File, L1 Cache, FP32, FP64
+and a Tensor Unit.
+
+<img src="/openpower/sv/2022-05-14_11-55.jpg" width=500 />
+
+**etp4hpc.eu**
+
+[ETP 4 HPC](https://etp4hpc.eu) is a European Joint Initiative for HPC,
+with an eye towards
+[Processing in Memory](https://www.etp4hpc.eu/pujades/files/ETP4HPC_WP_Processing-In-Memory_FINAL.pdf).
+
+**Salient Labs**
+
+[Research paper](https://arxiv.org/abs/2002.00281) explaining
+that they can exceed a 14 GHz clock rate Multiply-and-Accumulate
+using Photonics.
+
+**SparseLNR**
+
+[SparseLNR](https://arxiv.org/abs/2205.11622) restructures sparse
+tensor computations using loop-nest restructuring.
+
+**Additional ZOLC Resources**
+
+* <https://www.researchgate.net/publication/3351728_Zero-overhead_loop_controller_that_implements_multimedia_algorithms>