(no commit message)

author lkcl <lkcl@web>

Fri, 6 May 2022 13:29:04 +0000 (14:29 +0100)

committer IkiWiki <ikiwiki.info>

Fri, 6 May 2022 13:29:04 +0000 (14:29 +0100)
diff --git a/openpower/sv/SimpleV_rationale.mdwn b/openpower/sv/SimpleV_rationale.mdwn

index 8732847a41fcce7139bc405f9ce16b360186efba..b94578c40113a2ba4876f98d2166667848a11fc3 100644 (file)
--- a/openpower/sv/SimpleV_rationale.mdwn
+++ b/openpower/sv/SimpleV_rationale.mdwn
@@ -445,7 +445,7 @@ and Motorola 88100 era: 133 and 166 mhz SDRAM was available, and
  CPUs were about the same rate.  DRAM bitcells *simply cannot exceed
  these rates*, yet the pressure from Software Engineers is to
  make *sequential* algorithm processing faster and faster because
-parallelising of algorithms is simply too difficult to master and always
+parallelising of algorithms is simply too difficult to master, and always
  has been.  Thus whilst DRAM has to go parallel (like RAID Striping) to
  keep up, CPUs are now at 8-way Multi-Issue 5 ghz clock rates and
  are at an astonishing four levels of cache (L1 to L4).
@@ -456,9 +456,11 @@ on the *opposite* side of the main CPU's L1/2/3/4 Caches.  However
  the alarm bells ring here at the keyword "distributed", because by
  moving the processing down next to the Memory, the speed of any
  of the parallel Processing Elements (PEs) has dropped
-by almost two orders of magnitude,
-the simplicity has for pure pragmatic reasons to drop by several
-orders of magnitude. Things that the average "sequential algorithm"
+by almost two orders of magnitude (5 ghz down to 100 mhz),
+the simplicity of each PE has, for pure pragmatic reasons,
+to drop by several
+orders of magnitude as well.
+Things that the average "sequential algorithm"
  programmer
  takes for granted such as SMP, Cache Coherency, Virtual Memory,
  spinlocks (atomic locking), all of these are either outright gone
@@ -467,11 +469,13 @@ or expected that the programmer shall explicitly contend with
  
  To give an extreme example: Aspex's Array-String Processor, which
  was 4096 2-bit SIMD PEs each with 256 bytes of Content Addressable
-Memory was capable of literally a hundred-fold improvement in
+Memory, was capable of literally a hundred-fold improvement in
  performance over Scalar CPUs such as the Pentium III of its era,
-all on a 3.5 watt budget at only 250 mhz in 130 nm.  Yet to take
+all on a 3 watt budget at only 250 mhz in 130 nm.  Yet to take
  proper advantage of its capability required an astounding 5-10
-*days* per line of assembly code. 20 lines of optimised
+*days* per line of assembly code because multiple versions of
+an algorithm had to be hand-crafted then compared, and
+the best one selected, all others discarded. 20 lines of optimised
  Assembler taking six months to write can in no way be termed
  "productive", yet this extreme level of unproductivity is an inherent
  side-effect of going down the parallel-processing rabbithole.
@@ -496,8 +500,9 @@ The simplest longest commercially successful deployment of Zero-overhead looping
  has been in Texas Instruments TMS320 DSPs. Up to fourteen sub-instructions
  within the VLIW word may be repeatedly deployed on successive clock
  cycles until a countdown reaches zero. This extraordinarily simple
-concept needs no branches, no complex Register Hazard
-Management because it is down to the programmer (or, the compiler),
+concept needs no branches, and has no complex Register Hazard
+Management in the hardware
+because it is down to the programmer (or, the compiler),
  to ensure data overlaps do not occur.
  
  The key aspect of these
@@ -526,10 +531,12 @@ The kicker: when implementing SVP64's Matrix REMAP Schedule, the VLSI
  design of its triple-nested for-loop system
  turned out to be remarkably similar to the
  core nested for-loop engine of ZOLC. In hindsight this should not
-have come as a surprise, because both are basically nested for-loops.
+have come as a surprise, because both are basically nested for-loops
+that do not need branches to issue instructions.
  
  The important insight is, however, that if ZOLC can be general-purpose
-and apply deterministic nested loop schedules to more than just registers
+and apply deterministic nested looped instruction
+schedules to more than just registers
  (unlike SVP64 in its current incarnation) then so can SVP64.
  
  **OpenCAPI and Extra-V**
@@ -543,7 +550,10 @@ Extra-V appears to be a remarkable research project that, by leveraging
  OpenCAPI, assuming that the map of edges in any given arbitrary data graph
  could be kept by the main CPU in-memory, could distribute and delegate
  a limited-capability deterministic but most importantly *data-dependent*
-node-walking schedule actually right down into the memory itself (on the other side of that L1-4 cache barrier).
+node-walking schedule actually right down into the memory itself (on the other side of that L1-4 cache barrier).  A miniature processor analysed
+the data it had read (at the Memory), and determined if it should
+notify the main processor that this "Node" is worth investigating,
+or if the Graph node-walk should split in a different direction.
  Thanks to the OpenCAPI Standard, which takes care of Virtual Memory
  abstraction, locking, and cache-coherency, many of the nightmare problems
  of other more explicit parallel processing paradigms disappear.
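
The hunk at line 526 notes that both ZOLC and the Matrix REMAP Schedule are "basically nested for-loops that do not need branches to issue instructions". A minimal sketch of that idea, in Python, is given here. It is an illustration only, not the SVP64 REMAP algorithm or the ZOLC design: the names `matrix_schedule` and `matmul_flat` are hypothetical, and a row-major flat layout is assumed. The schedule generator stands in for a hardware counter, and the consumer applies one multiply-accumulate per schedule entry with no loop-control decisions of its own.

```python
from itertools import product


def matrix_schedule(rows, inner, cols):
    """Yield (a_idx, b_idx, c_idx) flat offsets for C += A * B.

    Hypothetical helper: models a deterministic triple-nested loop
    counter, assuming row-major storage of all three matrices.
    """
    for i, j, k in product(range(rows), range(cols), range(inner)):
        yield (i * inner + k, k * cols + j, i * cols + j)


def matmul_flat(a, b, rows, inner, cols):
    """Apply one multiply-accumulate per schedule entry."""
    c = [0] * (rows * cols)
    for a_idx, b_idx, c_idx in matrix_schedule(rows, inner, cols):
        c[c_idx] += a[a_idx] * b[b_idx]
    return c


# 2x2 example: [[1, 2], [3, 4]] times [[5, 6], [7, 8]]
print(matmul_flat([1, 2, 3, 4], [5, 6, 7, 8], 2, 2, 2))  # [19, 22, 43, 50]
```

The schedule is fully determined by the three dimensions, which is what allows it to be precomputed or produced by a simple counter rather than by branch instructions.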