one conditional for-loop, but the key strategically-crucial
part of this multi-faceted puzzle is that due to the deterministic and
coherent nature of Extra-V, the processing of the loops, which
-requires a tiny processor, is not
-done close to the CPU at all: it is
+requires a tiny non-Turing-Complete processor, is not
+done close to or by the main CPU at all: it is
*embedded right next to the memory*.
The similarity to the D-Matrix Systolic Array Processing, Aspex Microelectronics
also not have gone unnoticed. All of these solutions utilised
or utilise
a more comprehensive Turing-complete von-Neumann "Management Core"
-to coordinate data passed in and out of PEs: none of them had or
+to coordinate data passed in and out of PEs: none of them have or
had something
as powerful as OpenCAPI as part of that picture.
Snitch is an elegant Memory-Coherent Barrel-Processor where registers
become "tagged" with a Memory-access Mode that went out of fashion
-over forty years ago: Load-and-Increment. Expressed in c as
-`src = *x++`, and requiring special Address Registers (PDP-11, 68000)
-the efficiency of these Load-Store-with-Increment instructions has been
+over forty years ago: Load-then-Auto-Increment. Expressed in C as
+`src = *x++`, and requiring special Address Registers (PDP-11, 68000),
+thanks to the RISC paradigm having gone too far,
+the efficiency and effectiveness
+of these Load-Store-with-Increment instructions have been
forgotten until Snitch.
What the designers did however was not to add new Load-Store
or Arithmetic instructions to RISC-V, but instead to "mark"
-registers with a tag. These tags tell the CPU: when you perform
-an add on r6 and r7, please perform a Cache-coherent Load-with-Increment
-on each, using special Address Registers for each. Each reference
-to r6 therefore brings in an entirely new value *directly from
+registers with a tag. These tags tell the CPU: when you are asked to
+carry out
+an add instruction on r6 and r7, do not take r6 or r7 from the register
+file, instead please perform a Cache-coherent Load-with-Increment
+on each, using special Address Registers for each. Each new use
+of r6 therefore brings in an entirely new value *directly from
memory*. Likewise on the second operand, r7, and likewise on
-the destination which can be automatic Store-and-increment.
+the destination result which can be an automatic Coherent
+Store-and-increment
+directly into Memory.
On top of a barrel-architecture the slowness of Memory access
was not a problem because the Deterministic nature of classic
Load-Store-Increment can be compensated for by having 8 Memory
accesses scheduled underway and interleaved in a time-sliced
fashion with an FPU that is correspondingly 8 times faster than
-Memory accesses.
+the Coherent Memory accesses.
This design is almost identical to the early Vector Processors
-of the late 1950s and early 1960s. The barrel-archutecture neatly
+of the late 1950s and early 1960s, which also critically relied
+on implicit auto-increment addressing. The barrel-architecture neatly
solves one of the inherent problems with those designs (memory
speed) and the presence of a full register file caters for a
second limitation of pure Memory-based Vector Processors: temporary
-variables needed in the computation of
+variables needed in the computation of intermediate results put
+an awfully high artificial load on Memory bandwidth.
+
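As a rough illustration of the register-tagging idea described above, the following C sketch models a register that is either read from the register file or, when tagged, replaced by a load-with-increment (or store-with-increment) through its own address register. The names `stream_reg`, `reg_read`, `reg_write`, `exec_add` and `vector_add` are hypothetical and chosen only for this example; they are not taken from Snitch or RISC-V, and the sketch ignores coherence, the barrel scheduling and the FPU timing entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a "tagged" register (names are illustrative only). */
typedef struct {
    int     tagged; /* nonzero: bypass the register file and stream from memory */
    double  value;  /* register-file contents, used only when not tagged */
    double *addr;   /* assumed per-register address register, post-incremented */
} stream_reg;

/* Reading a tagged register performs the equivalent of `src = *x++`. */
static double reg_read(stream_reg *r)
{
    if (r->tagged) {
        assert(r->addr != NULL);
        return *r->addr++;
    }
    return r->value;
}

/* Writing a tagged register performs the equivalent of `*x++ = result`. */
static void reg_write(stream_reg *r, double v)
{
    if (r->tagged) {
        assert(r->addr != NULL);
        *r->addr++ = v;
    } else {
        r->value = v;
    }
}

/* One ordinary-looking "add rd, ra, rb": the instruction itself is unchanged,
 * only the meaning of the register operands depends on their tags. */
static void exec_add(stream_reg *rd, stream_reg *ra, stream_reg *rb)
{
    double a = reg_read(ra);
    double b = reg_read(rb);
    reg_write(rd, a + b);
}

/* Issuing the same add n times walks three arrays with no explicit
 * load, store or pointer-increment instructions in the loop body. */
void vector_add(double *x, double *y, double *out, size_t n)
{
    stream_reg r6 = { 1, 0.0, x };   /* first source operand, streamed */
    stream_reg r7 = { 1, 0.0, y };   /* second source operand, streamed */
    stream_reg r5 = { 1, 0.0, out }; /* destination, store-and-increment */

    for (size_t i = 0; i < n; i++) {
        exec_add(&r5, &r6, &r7);
    }
}
```

Calling `vector_add(x, y, out, n)` fills `out[i]` with `x[i] + y[i]`. An untagged `stream_reg` in the same model behaves like a normal register, which is how temporaries could stay in the register file instead of costing Memory bandwidth.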