From: lkcl <lkcl@web>
Date: Sun, 8 May 2022 21:50:16 +0000 (+0100)
Subject: (no commit message)
X-Git-Tag: opf_rfc_ls005_v1~2296
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=7f19ece3d423f88e1352626112f3abb0cd6e2be9;p=libreriscv.git

---

diff --git a/openpower/sv/SimpleV_rationale.mdwn b/openpower/sv/SimpleV_rationale.mdwn
index d38504f3a..161c5511e 100644
--- a/openpower/sv/SimpleV_rationale.mdwn
+++ b/openpower/sv/SimpleV_rationale.mdwn
@@ -850,7 +850,7 @@ ascertain, retrospectively, that time and power had just been wasted.
 
 SVP64 is able to do what is termed "Vertical-First" Vectorisation
 *(walk first through a batch of instructions before explicitly
-moving to the next element, and repeating the batch)*,
+moving to the next element with `svstep`, and repeating the batch)*,
 combined with SVREMAP Matrix Schedules.  Imagine that SVREMAP has been
 extended, Snitch-style, to perform a deterministic memory-array walk of
 a large Matrix.
@@ -881,6 +881,31 @@ main CPU.  In this way a large Sparse Matrix Multiply or Convolution
 may be achieved without having to pass unnecessary data through
 L1/L2/L3 Caches only to find, at the CPU, that it is zero.
 
+Vertical-First Mode is used in this case because the
+Multiply-and-Accumulate is executed conditionally.
+Horizontal-First Mode is the standard Cray-Style Vectorisation:
+loop on all elements with the same instruction before moving
+on to the next instruction. In that mode a Predicate Mask must be
+pre-calculated for the entire Vector in order to exclude certain
+elements from the computation. In this case, that is an expensive inconvenience.
+
+Vertical-First allows *scalar* temporary registers to be utilised
+in the assessment as to whether a particular Vector element should
+be skipped, utilising a straight Branch instruction.  This technique
+was pioneered by Mitch Alsup and is a key feature of his VVM Engine
+in the My 66000 ISA.  Careful analysis of the registers within the
+Vertical-First Loop allows a Multi-Issue Out-of-Order Engine to
+*amortise in-flight scalar looped operations into SIMD batches*
+as long as the loop is kept small enough to entirely fit into
+in-flight Reservation Stations.
+
+*<blockquote>
+(With thanks and gratitude to Mitch Alsup on comp.arch for
+spending considerable time explaining VVM, how its Loop
+Construct explicitly identifies loop-invariant registers,
+and how to exploit GB-OoO Micro-architectures)
+</blockquote>*
+
 **Use-case: More powerful in-memory PEs**
 
 An obvious variant of the above is that, if there is inherently