CPU-centric Vector-centric improvements relevant to power efficiency
and making more effective use of resources?
-# Simpler more compact programs
+# Simpler more compact programs save power
The first and most obvious saving is that, just as with any Vector
ISA, the amount of data processing requested
astounding 240 hand-coded assembler instructions where it is around
12 to 14 for both RVV and SVP64. Worst case (full algorithm unrolling
for Massive FFTs) the L1 I-Cache becomes completely ineffective, and in
-the case of the IBM POWER9 a little-known design flaw this results in
+the case of the IBM POWER9, with a little-known design flaw not
+normally otherwise encountered, this results in
contention between the L1 D and I Caches at the L2 Bus, slowing down
execution even further. Power ISA 3.1 MMA (Matrix-Multiply-Assist)
requires loop-unrolling to contend with non-power-of-two Matrix
the submission process will be
entirely at the discretion of the OpenPOWER Foundation ISA WG,
something that is both encouraged and welcomed by the OPF.
+
+One of SVP64's current limitations is that it was initially designed
+for 3D and Video workloads as a hybrid GPU-VPU-CPU. This resulted in
+a heavy focus on adding hardware-for-loops onto the *Registers*.
+After more than three years of development the realisation hit that
+the SVP64 concept could be expanded to Coherent Distributed Memory.
+This astoundingly powerful concept is explored in the next section.
+
+# Coherent Deterministic Hybrid Distributed Memory-Processing
+
+It is not often that a heading for an article can legitimately
+contain quite so many buzzwords, but in this section it is justified.