Clean up remap matrix instruction sections

[libreriscv.git] / openpower / sv / SimpleV_rationale.mdwn
diff --git a/openpower/sv/SimpleV_rationale.mdwn b/openpower/sv/SimpleV_rationale.mdwn

index d18551ea01753cfb0145409193c76cd55f9e68bb..ea1bf2b9e6f0b9652eee9f54e00dec0ecc21525f 100644 (file)
--- a/openpower/sv/SimpleV_rationale.mdwn
+++ b/openpower/sv/SimpleV_rationale.mdwn
@@ -2,11 +2,12 @@
  
  **Revision History**
  
-* v0.00 05may2021 first created
-* v0.01 06may2021 initial first draft
-* v0.02 08may2021 add scenarios / use-cases
-* v0.03 09may2021 add draft image for scenario
-* v0.04 14may2021 add appendix with other research
+* v0.00 05may2022 first created
+* v0.01 06may2022 initial first draft
+* v0.02 08may2022 add scenarios / use-cases
+* v0.03 09may2022 add draft image for scenario
+* v0.04 14may2022 add appendix with other research
+* v0.05 14jun2022 update images (thanks to Veera)
  
  **Table of Contents**
  
@@ -20,7 +21,7 @@ Inventing a new Scalar ISA from scratch is over a decade-long task
  including simulators and compilers: OpenRISC 1200 took 12 years to
  mature.  Stable Open ISAs require Standards and Compliance Suites that
  take more. A Vector or Packed SIMD ISA to reach stable *general-purpose*
-auto-vectorisation compiler support has never been achieved in the
+auto-vectorization compiler support has never been achieved in the
  history of computing, not with the combined resources of ARM, Intel,
  AMD, MIPS, Sun Microsystems, SGI, Cray, and many more. (*Hand-crafted
  assembler and direct use of intrinsics is the Industry-standard norm
@@ -128,7 +129,7 @@ performance is concerned.
  
  Slowly, at this point, a realisation should be sinking in that, actually,
  there aren't as many really truly viable Vector ISAs out there, as the
-ones that are evolving in the general direction of Vectorisation are,
+ones that are evolving in the general direction of Vectorization are,
  in various completely different ways, flawed.
  
  **Successfully identifying a limitation marks the beginning of an
@@ -276,7 +277,9 @@ Vector instructions in RISC-V as there are in the RV64GC Scalar base.
  The question then becomes: with all the duplication of arithmetic
  operations just to make the registers scalar or vector, why not
  leverage the *existing* Scalar ISA with some sort of "context"
-or prefix that augments its behaviour? Make "Scalar instruction"
+or prefix that augments its behaviour? Separate out the
+"looping" from "thing being looped on" (the elements),
+make "Scalar instruction"
  synonymous with "Vector Element instruction" and through nothing
  more than contextual
  augmentation the Scalar ISA *becomes* the Vector ISA.
@@ -285,8 +288,11 @@ the Instruction Decode
  phase is greatly simplified, reducing design complexity and leaving
  plenty of headroom for further expansion.
  
+[[!img "svp64-primer/img/power_pipelines.svg" ]]
+
  Remarkably this is not a new idea.  Intel's x86 `REP` instruction
-gives the base concept, but in 1994 it was Peter Hsu, the designer
+gives the base concept, and the Z80 had something similar.
+But in 1994 it was Peter Hsu, the designer
  of the MIPS R8000, who first came up with the idea of Vector-augmented
  prefixing of an existing Scalar ISA.  Relying on a multi-issue Out-of-Order Execution Engine,
  the prefix would mark which of the registers were to be treated as
@@ -297,7 +303,8 @@ jammed multiple scalar operations into the Multi-Issue Execution
  Engine.  The only reason that the team did not take this forward
  into a commercial product
  was because they could not work out how to cleanly do OoO
-multi-issue at the time.
+multi-issue at the time (leveraging Multi-Issue is the most logical
+way to exploit the Vector-Prefix concept)
  
  In its simplest form, then, this "prefixing" idea is a matter
  of:
@@ -374,7 +381,7 @@ Remarkably, very little: the devil is in the details though.
    sequential carry-flag chaining of these scalar instructions.
  * The Condition Register Fields of the Power ISA make a great candidate
    for use as Predicate Masks, particularly when combined with
-  Vectorised `cmp` and Vectorised `crand`, `crxor` etc.
+  Vectorized `cmp` and Vectorized `crand`, `crxor` etc.
  
  It is only when looking slightly deeper into the Power ISA that
  certain things turn out to be missing, and this is down in part to IBM's
@@ -382,9 +389,9 @@ primary focus on the 750 Packed SIMD opcodes at the expense of the 250 or
  so Scalar ones.  Examples include that transfer operations between the
  Integer and Floating-point Scalar register files were dropped approximately
  a decade ago after the Packed SIMD variants were considered to be
-duplicates.  With it being completely inappropriate to attempt to Vectorise
+duplicates.  With it being completely inappropriate to attempt to Vectorize
  a Packed SIMD ISA designed 20 years ago with no Predication of any kind,
-the Scalar ISA, a much better all-round candidate for Vectorisation 
+the Scalar ISA, a much better all-round candidate for Vectorization 
  (the Scalar parts of Power ISA) is left anaemic.
  
  A particular key instruction that is missing is `MV.X` which is
@@ -392,7 +399,7 @@ illustrated as `GPR(dest) = GPR(GPR(src))`. This horrendously
  expensive instruction causing a huge swathe of Register Hazards
  in one single hit is almost never added to a Scalar ISA but
  is almost always added to a Vector one. When `MV.X` is
-Vectorised it allows for arbitrary
+Vectorized it allows for arbitrary
  remapping of elements within a Vector to positions specified
  by another Vector. A typical Scalar ISA will use Memory to
  achieve this task, but with Vector ISAs the Vector Register Files are
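
As an illustration only (not part of the patch), the `GPR(dest) = GPR(GPR(src))` behaviour described in this hunk can be sketched in Python. The function names `mv_x` and `mv_x_vector`, the list-based register file, and the simple element-by-element Vectorized semantics are all assumptions made for this sketch; the actual SVP64 definition may differ.

```python
def mv_x(gpr, dest, src):
    """Scalar MV.X: the register *named by* GPR(src) is copied to GPR(dest)."""
    gpr[dest] = gpr[gpr[src]]


def mv_x_vector(gpr, dest, src, vl):
    """Assumed Vectorized form: apply the scalar operation to vl elements.

    Element i of the index Vector (starting at src) selects which register
    is copied into element i of the destination Vector (starting at dest).
    """
    for i in range(vl):
        gpr[dest + i] = gpr[gpr[src + i]]


# Hypothetical register file: data in r4..r7, indices in r8..r11.
regs = [0] * 16
regs[4:8] = [10, 11, 12, 13]
regs[8:12] = [7, 6, 5, 4]        # index Vector: reverse the data order
mv_x_vector(regs, dest=12, src=8, vl=4)
```

After the call, `regs[12:16]` holds the data Vector in reversed order, with no round trip through Memory.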
@@ -846,10 +853,7 @@ There is also no reason why this type of arrangement should not be deployed
  in Multi-Chip-Module (aka "Chiplet") form, giving all the advantages of
  the performance boost that goes with smaller line-drivers.
  
-
-Draft Image (placeholder):
-
-<img src="/openpower/sv/bridge_phy.jpg" width=800 />
+<img src="/openpower/sv/bridge_phy.svg" width=600 />
  
  # Transparently-Distributed Vector Processing
  
@@ -868,7 +872,7 @@ not that straightforward: programs
  have to be "massaged" by tools that insert intrinsics into the
  source code, in order to identify the Basic Blocks that the Zero-Overhead
  Loops can run. Can this be merged into standard gcc and llvm
-compilers? As intrinsics: of course. Can it become part of auto-vectorisation? Probably,
+compilers? As intrinsics: of course. Can it become part of auto-vectorization? Probably,
  if an infinite supply of money and engineering time is thrown at it.
  Is a half-way-house solution of compiler intrinsics good enough?
  Intel, ARM, MIPS, Power ISA and RISC-V have all already said "yes" on that,
@@ -905,8 +909,10 @@ definitely compelling enough to warrant in-depth investigation.
  
  **Use-case: Matrix and Convolutions**
  
+<img src="/openpower/sv/sv_horizontal_vs_vertical.svg" />
+
  First, some important definitions, because there are two different
-Vectorisation Modes in SVP64:
+Vectorization Modes in SVP64:
  
  * **Horizontal-First**: (aka standard Cray Vectors) walk
    through **elements** first before moving to next **instruction**
@@ -914,13 +920,30 @@ Vectorisation Modes in SVP64:
    moving to next **element**.  Currently managed by `svstep`,
    ZOLC may be deployed to manage the stepping, in a Deterministic manner.
  
+Second:
+SVP64 Draft Matrix Multiply is currently set up to arrange a Schedule
+of Multiply-and-Accumulates, suitable for pipelining, that will,
+ultimately, result in a Matrix Multiply. Normal processors are forced
+to perform "loop-unrolling" in order to achieve this same Schedule.
+SIMD processors are further forced into a situation of pre-arranging rotated
+copies of data if the Matrices are not exactly on a power-of-two boundary.
+
+The current limitation of SVP64 however is (when Horizontal-First
+is deployed, at least, which is the least number of instructions)
+that both source and destination Matrices have to be in-registers,
+in full.  Vertical-First may be used to perform a LD/ST within
+the loop, covered by `svstep`, but it is still not ideal.  This
+is where the Snitch and EXTRA-V concepts kick in.
+
+<img src="/openpower/sv/matrix_svremap.svg" />
+
  Imagine a large Matrix scenario, with several values close to zero that
  could be skipped: no need to include zero-multiplications, but a
  traditional CPU in no way can help: only by loading the data through
  the L1-L4 Cache and Virtual Memory Barriers is it possible to
  ascertain, retrospectively, that time and power had just been wasted.
  
-SVP64 is able to do what is termed "Vertical-First" Vectorisation,
+SVP64 is able to do what is termed "Vertical-First" Vectorization,
  combined with SVREMAP Matrix Schedules.  Imagine that SVREMAP has been
  extended, Snitch-style, to perform a deterministic memory-array walk of
  a large Matrix.
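
As an illustration only (not part of the patch), the idea of a "Schedule of Multiply-and-Accumulates" that ultimately produces a Matrix Multiply can be sketched in Python. The names `matmul_schedule` and `run_schedule` are hypothetical, and the index order below is a plain row-major triple loop chosen for clarity; it is not claimed to be the order that SVREMAP actually generates.

```python
def matmul_schedule(rows, inner, cols):
    """Yield (a_idx, b_idx, c_idx) flat-index triples, one per multiply-accumulate.

    A is rows x inner, B is inner x cols, C is rows x cols, all row-major.
    """
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                yield (i * inner + k, k * cols + j, i * cols + j)


def run_schedule(a, b, rows, inner, cols):
    """Apply one and the same multiply-accumulate to every scheduled triple."""
    c = [0] * (rows * cols)
    for a_idx, b_idx, c_idx in matmul_schedule(rows, inner, cols):
        c[c_idx] += a[a_idx] * b[b_idx]
    return c


# 2x2 example: [[1, 2], [3, 4]] times [[5, 6], [7, 8]]
result = run_schedule([1, 2, 3, 4], [5, 6, 7, 8], rows=2, inner=2, cols=2)
```

The point of the sketch is that the loop body is a single operation: all of the Matrix structure lives in the deterministic index schedule, which is why no loop-unrolling is needed.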
@@ -954,7 +977,7 @@ L1/L2/L3 Caches only to find, at the CPU, that it is zero.
  
  The reason in this case for the use of Vertical-First Mode is the
  conditional execution of the Multiply-and-Accumulate.
-Horizontal-First Mode is the standard Cray-Style Vectorisation:
+Horizontal-First Mode is the standard Cray-Style Vectorization:
  loop on all *elements* with the same instruction before moving
  on to the next instruction. Horizontal-First
  Predication needs to be pre-calculated
@@ -985,7 +1008,7 @@ on a GB-OoO Micro-architecture)
  
  Draft Image (placeholder):
  
-<img src="/openpower/sv/zolc_svp64_extrav.jpg" width=800 />
+<img src="/openpower/sv/zolc_svp64_extrav.svg" width=800 />
  
  The program being executed is a simple loop with a conditional
  test that ignores the multiply if the input is zero.
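
As an illustration only (not part of the patch), the control flow of such a loop might look like the Python sketch below. The name `sparse_mac` and the returned count of executed multiplies are assumptions added for this sketch. It models only the per-element conditional test, not where the test is executed (the surrounding text is about avoiding the transfer of zero values through the Caches in the first place).

```python
def sparse_mac(xs, ys):
    """Multiply-and-accumulate, skipping the multiply when the input is zero.

    Returns (accumulated sum, number of multiplies actually performed).
    """
    acc = 0
    macs = 0
    for x, y in zip(xs, ys):
        if x == 0:
            continue          # conditional test: ignore the multiply
        acc += x * y
        macs += 1
    return acc, macs


total, executed = sparse_mac([0, 2, 0, 3], [5, 6, 7, 8])
```

Here only two of the four element pairs reach the multiply; the per-element decision is what Vertical-First Mode is being used for in this scenario.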
@@ -1132,7 +1155,8 @@ Bottom line is that there is a clear roadmap towards solving a long
  standing problem facing Computer Science and doing so in a way that
  reduces power consumption reduces algorithm completion time and reduces
  the need for complex hardware microarchitectures in favour of much
-smaller distributed coherent Processing Elements.
+smaller distributed coherent Processing Elements with a Heterogenous ISA
+across the board.
  
  # Appendix