# Why in the 2020s would you invent a new Vector ISA
*(The short answer: you don't. Extend existing technology: on the shoulders of giants)*
Inventing a new Scalar ISA from scratch is over a decade-long task
including simulators and compilers: OpenRISC 1200 took 12 years to
AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. (*Hand-crafted
assembler and direct use of intrinsics are the Industry-standard norm
to achieve high-performance optimisation where it matters*).
GPUs fill this void both in hardware and software terms by having
ultra-specialist compilers (CUDA) that are designed from the ground up
to support Vector/SIMD parallelism, and associated standards
(SPIR-V, Vulkan, OpenCL) managed by
ARM, Intel, AMD, since the 90s (over 3 decades) whereby SIMD, an
Order(N^6) opcode proliferation nightmare, with its mantra "make it
easy for hardware engineers, let software sort out the mess" literally
overwhelming programmers with thousands of instructions. Specialists charging
clients for assembly-code Optimisation Services are finding that AVX-512,
to take an
example, is anything but optimal: overall performance of AVX-512 actually
* RISC-V, touted as "Open" but actually strictly controlled under
Trademark License: too new to have adequate patent pool protection,
as evidenced by multiple adopters having been hit by patent lawsuits.
  (Agreements between RISC-V *Members* to not engage in patent litigation
  do nothing to stop third party patents that *legitimately pre-date*
  the newly-created RISC-V ISA)
* MIPS, SPARC, ARC, and others, simply have no viable ecosystem.
* Power ISA: protected by IBM's extensive patent portfolio for Members
of the OpenPOWER Foundation, covered by Trademarks, permitting
* x86: anyone wishing to use the x86 ISA need only look at Transmeta, SiS, the Vortex x86,
and VIA EDEN processors, and see how they fared.
* s390, IBM's mainframe ISA. Nowhere near as well-known as x86 lawsuits,
but the 800lb "Corporate Gorilla Syndrome" seems not to have deterred one
particularly disingenuous group from performing illegal
Reverse-Engineering.
*Packed SIMD looped algorithms actually have to
contain multiple implementations processing fragments of data at
different SIMD widths: Cray-style Vectors have just the one, covering not
just current architectural implementations but future ones with
wider back-end ALUs as well.*
All of these things come entirely from "Augmentation" of the Scalar operation
being prefixed: at no time is the Scalar operation significantly
altered.
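To make the "prefix means loop" principle concrete, here is a minimal Python model (all names are invented for illustration; this is a conceptual sketch, not the actual SVP64 encoding or semantics): a hypothetical `sv.add` is nothing more than the unmodified scalar add, repeated VL times over sequential registers.

```python
# Conceptual sketch only: an SVP64 prefix acts as a hardware for-loop
# wrapped around an unmodified scalar operation.

def scalar_add(a, b):
    # the scalar operation itself is never altered by the prefix
    return (a + b) & 0xFFFFFFFFFFFFFFFF  # 64-bit wrap-around

def sv_add(regs, rt, ra, rb, vl):
    # the "Augmentation": repeat the scalar op VL times over
    # sequential registers, exactly as a Cray-style Vector would
    for i in range(vl):
        regs[rt + i] = scalar_add(regs[ra + i], regs[rb + i])

regfile = list(range(32))          # toy register file
sv_add(regfile, 16, 0, 8, vl=4)    # regs 16..19 = regs 0..3 + regs 8..11
```

Note that the scalar helper is untouched by the Vector loop: this is the entire point of prefixing.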
From there, several more "Modes" can be added, including:

* saturation, which is needed for Audio and Video applications
* "Reverse Gear", which runs the Element Loop in reverse order
  (needed for Prefix Sum)
* Data-dependent Fail-First, which emerged from asking the simple
  question, "If modern Vector ISAs have Load/Store Fail-First,
  and the Power ISA has Condition Codes, why not make Conditional
  early-exit from Arithmetic operation looping?"
* over 500 Branch-Conditional Modes, which emerge from application of
  Boolean Logic in a Vector context, on top of an already-powerful
  Scalar Branch-Conditional/Counter instruction
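A toy Python model shows how three of these Modes layer onto the very same element loop (illustrative only, assuming 16-bit unsigned saturation and invented condition functions; not the actual SVP64 semantics):

```python
# Illustrative model of SVP64-style Modes on one element loop.

def clamp16(x):
    # saturation Mode: clamp to the 16-bit range instead of wrapping
    return max(0, min(x, 0xFFFF))

def sv_loop(op, dst, a, b, vl, saturate=False, reverse=False,
            failfirst=None):
    # "Reverse Gear" Mode runs the Element Loop backwards
    order = range(vl - 1, -1, -1) if reverse else range(vl)
    for i in order:
        result = op(a[i], b[i])
        if saturate:
            result = clamp16(result)
        # Data-dependent Fail-First: stop at the first element whose
        # result fails the condition, truncating VL
        if failfirst is not None and not failfirst(result):
            return i
        dst[i] = result
    return vl

# saturating add: 90000 clamps to 0xFFFF rather than wrapping
sat = [0] * 4
sv_loop(lambda x, y: x + y, sat, [1, 2, 90000, 4], [0] * 4, 4,
        saturate=True)

# fail-first subtract: stops as soon as a result goes non-positive
ff = [0] * 4
new_vl = sv_loop(lambda x, y: x - y, ff, [5, 3, 1, 7], [2] * 4, 4,
                 failfirst=lambda r: r > 0)
```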
**What is missing from Power Scalar ISA that a Vector ISA needs?**
emergent characteristic from the carry-in, carry-out capability of
Power ISA `adde` instruction. `sv.adde` as a BigNum add
naturally emerges from the
sequential carry-flag chaining of these scalar instructions.
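The carry-chaining can be sketched in a few lines of Python (a conceptual model, not the Power ISA pseudocode): each 64-bit limb addition consumes the carry-out of the previous one, which is exactly what sequential element execution of `adde` provides.

```python
# Sketch: big-integer add emerging from carry-chained scalar adde.

MASK = (1 << 64) - 1

def adde(ra, rb, carry_in):
    # scalar adde: RT = RA + RB + CA, producing a carry-out
    total = ra + rb + carry_in
    return total & MASK, total >> 64

def sv_adde(a_limbs, b_limbs):
    # sequential chaining: one Vectorised instruction, VL limb-adds
    out, ca = [], 0
    for ra, rb in zip(a_limbs, b_limbs):
        limb, ca = adde(ra, rb, ca)
        out.append(limb)
    return out, ca

# 128-bit add, little-endian limbs: (2^64 - 1) + 1 carries into limb 1
limbs, carry = sv_adde([MASK, 0], [1, 0])
```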
* The Condition Register Fields of the Power ISA make a great candidate
for use as Predicate Masks, particularly when combined with
Vectorised `cmp` and Vectorised `crand`, `crxor` etc.
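The effect of a per-element Predicate Mask can be sketched as follows (a simplified model with invented names: one mask bit per element, with the merging/zeroing distinction hedged as an assumption rather than the exact SVP64 behaviour):

```python
# Sketch of a predicated element loop: one mask bit per element.

def sv_add_masked(dst, a, b, vl, mask, zeroing=False):
    for i in range(vl):
        if (mask >> i) & 1:
            dst[i] = a[i] + b[i]
        elif zeroing:
            dst[i] = 0   # zeroing: masked-out elements are cleared
        # otherwise masked-out elements are left untouched (merging)

d = [99, 99, 99, 99]
sv_add_masked(d, [1, 2, 3, 4], [10, 20, 30, 40], 4, mask=0b0101)
```

Whether the mask bits come from an Integer register or from CR Fields, the loop itself is identical, which is why allowing both costs so little.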
One deliberate decision in SVP64 involves Predication. Typical Vector
ISAs have quite comprehensive arithmetic and logical operations on
Predicate Masks, and it turns out, unsurprisingly, that the Scalar Integer
side of the Power ISA already has most of them.
If CR Fields were the only predicates in SVP64,
it would create pressure to add the exact same arithmetic and logical
operations that already exist in the Integer opcodes, which is less
than desirable.
Instead of taking that route the decision was made to allow *both*
Integer *and* CR Fields to be Predicate Masks, and to create Draft
instructions that provide better transfer capability between CR Fields
and controlled by each instruction is enormous, and leaves the
Decode and Issue Engines idle, as well as the L1 I-Cache. With
programs being smaller, chances are higher that they fit into
L1 Cache, or that the L1 Cache may be made smaller: either way
is a considerable O(N^2) power-saving.
Even a Packed SIMD ISA could take limited advantage of a higher
bang-per-buck for limited specific workloads, as long as the
Additional savings come in the form of `SVREMAP`. This is a hardware
index transformation system where the normally sequentially-linear
Vector element access may be "Re-Mapped" to limited but algorithmically-tailored
commonly-used deterministic schedules, for example Matrix Multiply,
DCT, or FFT. A full in-register-file 5x7 Matrix Multiply or a 3x4 or
2x6 with optional *in-place* transpose, mirroring or rotation
on any source or destination Matrix
may be performed in as little as 4 instructions, one of which
is to zero-initialise the accumulator Vector used to store the result.
If addition to another Matrix is also required then it is only three
instructions. Not only that, but because the "Schedule" is an abstract
the SVP64 concept could be expanded to Coherent Distributed Memory,
This astoundingly powerful concept is explored in the next section.
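The idea behind REMAP can be modelled in Python (an illustrative sketch: the index-generation details here are invented for clarity, not the actual SVSHAPE encoding): the element indices fed to a single Vectorised multiply-add are produced by a deterministic triple-nested-loop Schedule, so one instruction walks an entire Matrix Multiply.

```python
# Conceptual model of a REMAP Matrix-Multiply Schedule.

def matmul_schedule(rows, size, cols):
    # the deterministic triple nested loop that REMAP hardware walks
    for i in range(rows):
        for j in range(cols):
            for k in range(size):
                # (dest index, source A index, source B index)
                yield i * cols + j, i * size + k, k * cols + j

def sv_madd_remap(dst, a, b, rows, size, cols):
    # one "instruction": multiply-accumulate over the whole Schedule
    for d, x, y in matmul_schedule(rows, size, cols):
        dst[d] += a[x] * b[y]

A = [1, 2, 3, 4]        # 2x2 Matrix, flattened row-major
B = [5, 6, 7, 8]
C = [0] * 4             # accumulator, pre-zeroed (the extra instruction)
sv_madd_remap(C, A, B, 2, 2, 2)
```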
# Coherent Deterministic Hybrid Distributed In-Memory Processing
It is not often that a heading in an article can legitimately
contain quite so many comically-chained buzzwords, but in this section
It should therefore come as no surprise that attempts are being made
to move (distribute) processing closer to the DRAM Memory, firmly
on the *opposite* side of the main CPU's L1/2/3/4 Caches,
where a simple `LOAD-COMPUTE-STORE-LOOP` workload easily illustrates
why this approach is compelling. However
the alarm bells ring here at the keyword "distributed", because by
moving the processing down next to the Memory, even onto
the same die as the DRAM, the speed of any
**ZOLC: Zero-Overhead Loop Control**
Zero-Overhead Looping is the concept of automatically running a set sequence
of instructions a predetermined number of times, without requiring
a branch. This is slightly different from using Power ISA `bc` in `CTR`
(Counter) Mode, because in ZOLC the branch-back is automatic.

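A toy comparison makes the distinction clear (illustrative only: instruction counts are modelled, not measured, and the loop bodies are stand-ins):

```python
# Counter-based branch loop vs Zero-Overhead Looping.

def ctr_loop(body, n):
    # Power-ISA-style `bc` in CTR Mode: an explicit branch instruction
    # must be fetched, decoded and executed on every iteration
    executed, ctr = 0, n
    while True:
        body()
        executed += 1       # the loop body instruction
        executed += 1       # the branch-back (bdnz) itself
        ctr -= 1
        if ctr == 0:
            break
    return executed

def zolc_loop(body, n):
    # ZOLC: hardware repeats the marked sequence n times; no branch
    # instruction occupies fetch/decode slots
    executed = 0
    for _ in range(n):
        body()
        executed += 1       # only the loop body is "executed"
    return executed

counts = (ctr_loop(lambda: None, 100), zolc_loop(lambda: None, 100))
```

For a one-instruction loop body the branch is half of all instructions executed, which is why eliminating it matters so much.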
The simplest and longest-lived commercially successful deployment of Zero-Overhead Looping
has been in Texas Instruments TMS320 DSPs. Up to fourteen sub-instructions
within the VLIW word may be repeatedly deployed on successive clock
less instructions executed is an almost unheard-of level of optimisation:
most ISA designers are elated if they can achieve 5 to 10%. The reduction
was so compelling that ST Microelectronics put it into commercial
production in one of their embedded CPUs, the ST120 DSP-MCU.
The kicker: when implementing SVP64's Matrix REMAP Schedule, the VLSI
design of its triple-nested for-loop system
OpenCAPI is a deterministic high-performance, high-bandwidth, low-latency
cache-coherent Memory-access Protocol that is integrated into IBM's Supercomputing-class POWER9 and POWER10 processors. POWER10 *only*
has OpenCAPI Memory interfaces, and requires an OMI-to-DDR4/5 Bridge PHY
to connect to standard DIMMs.
Extra-V appears to be a remarkable research project based on OpenCAPI that,
by assuming that the map of edges (excluding the actual data)
in any given arbitrary data graph
could be kept by the main CPU in-memory, could distribute and delegate
a limited-capability deterministic but most importantly *data-dependent*
node-walking schedule actually right down into the memory itself (on the other side of that L1-4 cache barrier). A miniature processor
currently occupied. Plus, the twin lessons that inventing ISAs, even
a small one, is hard (mostly in compiler writing) and how complex
GPU Task Scheduling is, are being heard loud and clear.
Put another way: if the PEs run a foreign ISA, then the Basic Blocks
embedded inside the ZOLC Loops must be in that ISA and therefore:

* In order that the main CPU can execute the same sequence if necessary,
the CPU must support dual ISAs: Power and PE **OR**
* There must be a JIT binary-translator which either turns PE code
It is very strange to the author to be describing what amounts to a
"Holy Grail" solution to a decades-long intractable problem that
mitigates the anticipated end of Moore's Law: how to make it easy for
well-defined workloads, expressed as a perfectly normal
sequential program, compiled to a standard well-known ISA, to have
the potential of being offloaded transparently to Parallel Compute Engines,
all without the Software Developer being excessively burdened with
a Parallel-Processing Paradigm that is alien to all their experience
and training, as well as Industry-wide common knowledge.
Will it be that easy? ZOLC is, honestly, in its current incarnation,
not that straightforward: programs