(no commit message)

diff --git a/openpower/sv/SimpleV_rationale.mdwn b/openpower/sv/SimpleV_rationale.mdwn

index 4eee357e4b92e1726738d3522bdce947bdd92721..b94578c40113a2ba4876f98d2166667848a11fc3 100644
--- a/openpower/sv/SimpleV_rationale.mdwn
+++ b/openpower/sv/SimpleV_rationale.mdwn
@@ -438,16 +438,143 @@ This astoundingly powerful concept is explored in the next section.
  # Coherent Deterministic Hybrid Distributed Memory-Processing
  
  It is not often that a heading in an article can legitimately
-contain quite so many comically-chain buzzwords, but in this section
-it is justified.  As hinted at in the first section, the last time
+contain quite so many comically-chained buzzwords, but in this section
+they are justified.  As hinted at in the first section, the last time
  that memory was the same speed as processors was the Pentium III
  and Motorola 88100 era: 133 and 166 mhz SDRAM was available, and
  CPUs were about the same rate.  DRAM bitcells *simply cannot exceed
  these rates*, yet the pressure from Software Engineers is to
  make *sequential* algorithm processing faster and faster because
-parallelising of algorithms is simply too difficult to master and always
+parallelising of algorithms is simply too difficult to master, and always
  has been.  Thus whilst DRAM has to go parallel (like RAID Striping) to
  keep up, CPUs are now at 8-way Multi-Issue 5 ghz clock rates and
-are at an astonishing four levels of cache (L1 to L4).  The amount
-of wiring inside such CPUs is now measured in miles.
+are at an astonishing four levels of cache (L1 to L4).
+
+It should therefore come as no surprise that attempts are being made
+to move (distribute) processing closer to the DRAM Memory, firmly
+on the *opposite* side of the main CPU's L1/2/3/4 Caches.  However
+the alarm bells ring here at the keyword "distributed", because by
+moving the processing down next to the Memory, not only has the speed
+of each of the parallel Processing Elements (PEs) dropped
+by almost two orders of magnitude (5 ghz down to 100 mhz),
+but the complexity of each PE has, for purely pragmatic reasons,
+to drop by several
+orders of magnitude as well.
+Things that the average "sequential algorithm"
+programmer
+takes for granted, such as SMP, Cache Coherency, Virtual Memory and
+spinlocks (atomic locking), are either outright gone
+or become something that the programmer is expected to contend with explicitly
+(even if that programmer is the Compiler Developer).
+
+To give an extreme example: Aspex's Array-String Processor, which
+was 4096 2-bit SIMD PEs each with 256 bytes of Content Addressable
+Memory, was capable of literally a hundred-fold improvement in
+performance over Scalar CPUs such as the Pentium III of its era,
+all on a 3 watt budget at only 250 mhz in 130 nm.  Yet to take
+proper advantage of its capability required an astounding 5-10
+*days* per line of assembly code because multiple versions of
+an algorithm had to be hand-crafted then compared, and
+the best one selected, all others discarded. 20 lines of optimised
+Assembler taking six months to write can in no way be termed
+"productive", yet this extreme level of unproductivity is an inherent
+side-effect of going down the parallel-processing rabbithole.
+
+**In short, we are in "Programmer's nightmare" territory**
+
+Having dug a proverbial hole that rivals the Grand Canyon, and
+jumped in it feet-first, the next
+task is to piece together a strategy to climb back out and show
+how falling back in can be avoided. This takes some explaining,
+and first requires some background on various research efforts and
+commercial designs.  Once the context is clear, their synthesis
+can be proposed.  The efforts in question are:
+
+* [ZOLC: Zero-Overhead Loop Control](https://ieeexplore.ieee.org/abstract/document/1692906/)
+* [OpenCAPI and Extra-V](https://dl.acm.org/doi/abs/10.14778/3137765.3137776)
+* [Snitch](https://arxiv.org/abs/2002.10143)
+
+**ZOLC: Zero-Overhead Loop Control**
+
+The simplest and longest-running commercially successful deployment of Zero-overhead looping
+has been in Texas Instruments TMS320 DSPs. Up to fourteen sub-instructions
+within the VLIW word may be repeatedly deployed on successive clock
+cycles until a countdown reaches zero. This extraordinarily simple
+concept needs no branches, and has no complex Register Hazard
+Management in the hardware
+because it is down to the programmer (or the compiler)
+to ensure data overlaps do not occur.
+
+The key aspect of these
+very simplistic countdown loops is: *they are deterministic*.
+
+Zero-Overhead Loop Control takes this basic "single loop" concept
+much further: not only are nested loops and conditional exit included,
+but also arbitrary control-jumping from the current inner loop
+out to an entirely different loop, all based on conditions determined
+dynamically at runtime.
+
+Even when deployed on as basic a CPU as a single-issue in-order RISC
+core, the performance and power-savings were astonishing: reductions of
+between 20 and **80%** in algorithm completion times were achieved compared
+to a more traditional branch-speculative in-order RISC CPU.  MPEG
+Decode, the target algorithm specifically picked by the researcher
+due to its high complexity with 6-deep nested loops and conditional
+execution that frequently jumped in and out of at least 2 loops,
+came out with an astonishing 43% improvement in completion time. 43%
+fewer instructions executed is an almost unheard-of level of optimisation:
+most ISA designers are elated if they can achieve 5 to 10%. The reduction
+was so compelling that ST Microelectronics put it into commercial
+production in one of their embedded CPUs.
+
+The kicker: when implementing SVP64's Matrix REMAP Schedule, the VLSI
+design of its triple-nested for-loop system
+turned out to be remarkably similar to the
+core nested for-loop engine of ZOLC. In hindsight this should not
+have come as a surprise, because both are basically nested for-loops
+that do not need branches to issue instructions.
+
+The important insight is, however, that if ZOLC can be general-purpose
+and apply deterministic nested looped instruction
+schedules to more than just registers
+(unlike SVP64 in its current incarnation) then so can SVP64.
+
+**OpenCAPI and Extra-V**
+
+OpenCAPI is a deterministic high-performance, high-bandwidth, low-latency
+cache-coherent Memory-access Protocol that is integrated into IBM's Supercomputing-class POWER9 and POWER10 processors. POWER10 *only*
+has OpenCAPI Memory interfaces, and requires an OpenCAPI-to-DDR4/5 Bridge PHY
+to connect to standard DIMMs.
+
+Extra-V appears to be a remarkable research project that, by leveraging
+OpenCAPI, and assuming that the map of edges in any given arbitrary data graph
+could be kept by the main CPU in-memory, could distribute and delegate
+a limited-capability deterministic but most importantly *data-dependent*
+node-walking schedule right down into the memory itself (on the other side of that L1-4 cache barrier).  A miniature processor analysed
+the data it had read (at the Memory), and determined whether it should
+notify the main processor that this "Node" is worth investigating,
+or whether the Graph node-walk should split in a different direction.
+Thanks to the OpenCAPI Standard, which takes care of Virtual Memory
+abstraction, locking, and cache-coherency, many of the nightmare problems
+of other more explicit parallel processing paradigms disappear.
+
+The similarity to ZOLC should not have gone unnoticed: where ZOLC
+has nested conditional for-loops, Extra-V appears to have just the
+one conditional for-loop, but the strategically crucial
+part of this multi-faceted puzzle is that due to the deterministic and
+coherent nature of Extra-V, the processing of the loops, which
+requires a tiny processor, is not
+done close to the CPU at all: it is
+*embedded right next to the memory*.
+
+The similarity to the D-Matrix Systolic Array Processing, Aspex Microelectronics
+Array-String Processing, and Elixent 2D Array Processing, should
+also not have gone unnoticed.  All of these solutions utilised
+or utilise
+a more comprehensive Turing-complete von-Neumann "Management Core"
+to coordinate data passed in and out of PEs: none of them had or
+have anything
+as powerful as OpenCAPI as part of that picture.
+
+**Snitch**