(no commit message)

author lkcl <lkcl@web>

Sun, 8 May 2022 13:57:09 +0000 (14:57 +0100)

committer IkiWiki <ikiwiki.info>

Sun, 8 May 2022 13:57:09 +0000 (14:57 +0100)
diff --git a/openpower/sv/SimpleV_rationale.mdwn b/openpower/sv/SimpleV_rationale.mdwn
index ae8ab2398b45e9126b59a8314a94f6889c369dc9..f0b6b86646988a62f9c10a872ff9cec35520d6c0 100644
--- a/openpower/sv/SimpleV_rationale.mdwn
+++ b/openpower/sv/SimpleV_rationale.mdwn
@@ -237,11 +237,13 @@ of magnitude increase in the number of hand-written lines of assembler
  compared to a well-designed Cray-style Vector ISA with a `setvl`
  instruction.
  
+<blockquote>
  *Packed SIMD looped algorithms actually have to
  contain multiple implementations processing fragments of data at
  different SIMD widths: Cray-style Vectors have just the one, covering not
  just current architectural implementations but future ones with
  wider back-end ALUs as well.*
+</blockquote>
  
  Assuming then that variable-length Vectors are obviously desirable,
  it becomes a matter of how, not if.  Both Cray and NEC SX Aurora
@@ -452,7 +454,9 @@ on any source or destination Matrix
  may be performed in as little as 4 instructions, one of which
  is to zero-initialise the accumulator Vector used to store the result.
  If addition to another Matrix is also required then it is only three
-instructions. Not only that, but because the "Schedule" is an abstract
+instructions.
+
+Not only that, but because the "Schedule" is an abstract
  concept separated from the mathematical operation, there is no reason
  why Matrix Multiplication Schedules may not be applied to Integer
  Mul-and-Accumulate, Galois Field Mul-and-Accumulate, Logical
@@ -467,16 +471,16 @@ and Qualcom's Hexagon, and both are targetted at FFTs only.
  There is no reason at all why future algorithmic schedules should not
  be proposed as extensions to SVP64 (sorting algorithms,
  compression algorithms, Sparse Data Sets, Graph Node walking
-for example). Bear in mind that
+for example). (*Bear in mind that
  the submission process will be
  entirely at the discretion of the OpenPOWER Foundation ISA WG,
-something that is both encouraged and welcomed by the OPF.
+something that is both encouraged and welcomed by the OPF.*)
  
  One of SVP64's current limitations is that it was initially designed
  for 3D and Video workloads as a hybrid GPU-VPU-CPU. This resulted in
  a heavy focus on adding hardware-for-loops onto the *Registers*.
  After more than three years of development the realisation hit that
-the SVP64 concept could be expanded to Coherent Distributed Memory,
+the SVP64 concept could be expanded to Coherent Distributed Memory.
  This astoundingly powerful concept is explored in the next section.
  
  # Coherent Deterministic Hybrid Distributed In-Memory Processing
@@ -527,7 +531,9 @@ an algorithm had to be hand-crafted then compared, and only
  the best one selected: all others discarded. 20 lines of optimised
  Assembler taking three to six months to write can in no way be termed
  "productive", yet this extreme level of unproductivity is an inherent
-side-effect of going down the parallel-processing rabbithole.
+side-effect of going down the parallel-processing rabbithole where
+the cost of providing "Traditional" programmabilility (Virtual Memory,
+SMP) is worse than counter-productive, it's often outright impossible.
  
  **In short, we are in "Programmer's nightmare" territory**
  
@@ -546,9 +552,10 @@ can be proposed.  These are:
  **ZOLC: Zero-Overhead Loop Control**
  
  Zero-Overhead Looping is the concept of automatically running a set sequence
-of instructions for a predetermined number of times, without requiring
-a branch. This is slightly different from using Power ISA `bc` in `CTR`
-(Counter) Mode, because in ZOLC the branch-back is automatic.
+of instructions a predetermined number of times, without requiring
+a branch. This is conceptually similar but
+slightly different from using Power ISA `bc` in `CTR`
+(Counter) Mode to create loops, because in ZOLC the branch-back is automatic.
  
  The simplest longest commercially successful deployment of Zero-overhead looping
  has been in Texas Instruments TMS320 DSPs. Up to fourteen sub-instructions
@@ -561,11 +568,14 @@ to ensure data overlaps do not occur.  Careful crafting of those
  14 instructions can keep the ALUs 100% occupied for sustained periods,
  and the iconic example for which the TI DSPs are renowned
  is that an entire inner loop for large FFTs
-can be done with that one VLIW word: no stalls, no stopping, no fuss.
+can be done with that one VLIW word: no stalls, no stopping, no fuss,
+an entire 1024 or 4096 wide FFT Layer in one instruction.
  
+<blockquote>
  The key aspect of these
  very simplistic countdown loops as far as we are concerned:
  is: *they are deterministic*.
+</blockquote>
  
  Zero-Overhead Loop Control takes this basic "single loop" concept
  way further: both nested loops and conditional exit are included,
@@ -601,9 +611,15 @@ schedules to more than just registers
  **OpenCAPI and Extra-V**
  
  OpenCAPI is a deterministic high-performance, high-bandwidth, low-latency
-cache-coherent Memory-access Protocol that is integrated into IBM's Supercomputing-class POWER9 and POWER10 processors. POWER10 *only*
-has OpenCAPI Memory interfaces, and requires an OMI-to-DDR4/5 Bridge PHY
-to connect to standard DIMMs.
+cache-coherent Memory-access Protocol that is integrated into IBM's Supercomputing-class POWER9 and POWER10 processors.
+
+<blockquote>(Side note:
+POWER10 *only*
+has OpenCAPI Memory interfaces: an astounding number of them,
+with overall bandwidth so high it's actually difficult to conceptualise.
+An OMI-to-DDR4/5 Bridge PHY is therefore required
+to connect to standard Memory DIMMs.)
+</blockquote>
  
  Extra-V appears to be a remarkable research project based on OpenCAPI that,
  by assuming that the map of edges (excluding the actual data)
@@ -652,22 +668,26 @@ the efficiency and effectiveness
  of these Load-Store-with-Increment instructions has been
  forgotten until Snitch.
  
-What the designers did however was not to add new Load-Store
-or Arithmetic instructions to RISC-V, but instead to "mark"
-registers with a tag.  These tags tell the CPU: when you are asked to
+What the designers did however was not to add any new Load-Store
+or Arithmetic instructions to the underlying RISC-V at all, but instead to "mark"
+registers with a tag which *augmented* (altered) the behaviour
+of *existing* instructions.  These tags tell the CPU: when you are asked to
  carry out
  an add instruction on r6 and r7, do not take r6 or r7 from the register
  file, instead please perform a Cache-coherent Load-with-Increment
-on each, using special Address Registers for each.  Each new use
+on each, using special (hidden, implicit)
+Address Registers for each.  Each new use
  of r6 therefore brings in an entirely new value *directly from
  memory*. Likewise on the second operand, r7, and likewise on
  the destination result which can be an automatic Coherent
  Store-and-increment
-directly into Memory. In essence:
+directly into Memory. 
  
+<blockquote>
  *The act of "reading" or "writing" a register has been decoupled
  and intercepted, then connected transparently to a completely
  separate Coherent Memory Subsystem*
+</blockquote>
  
  On top of a barrel-architecture the slowness of Memory access
  was not a problem because the Deterministic nature of classic