SVP64 is able to do what is termed "Vertical-First" Vectorisation
*(walk first through a batch of instructions before explicitly
-moving to the next element, and repeating the batch)*,
+moving to the next element with `svstep`, and repeating the batch)*,
combined with SVREMAP Matrix Schedules. Imagine that SVREMAP has been
extended, Snitch-style, to perform a deterministic memory-array walk of
a large Matrix.
may be achieved without having to pass unnecessary data through
L1/L2/L3 Caches only to find, at the CPU, that it is zero.
+Vertical-First Mode is used in this case because the
+Multiply-and-Accumulate is executed conditionally.
+Horizontal-First Mode is the standard Cray-Style Vectorisation:
+loop on all elements with the same instruction before moving
+on to the next instruction. Predication needs to be pre-calculated
+for the entire Vector in order to exclude certain elements from
+the computation. In this case, that's an expensive inconvenience.
+
+Vertical-First allows *scalar* temporary registers to be utilised
+in the assessment as to whether a particular Vector element should
+be skipped, utilising a straight Branch instruction. This technique
+was pioneered by Mitch Alsup and is a key feature of the VVM Engine
+in his My 66000 ISA. Careful analysis of the registers within the
+Vertical-First Loop allows a Multi-Issue Out-of-Order Engine to
+*amortise in-flight scalar looped operations into SIMD batches*
+as long as the loop is kept small enough to entirely fit into
+in-flight Reservation Stations.
+
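The difference between the two modes can be sketched in plain Python. This is an illustrative model only, not SVP64 assembler: the function names, the zero test used as the skip condition, and the `i += 1` standing in for `svstep` are all assumptions made for the sketch.

```python
def horizontal_first_mac(a, b):
    """Cray-style model: each 'instruction' sweeps the whole vector
    before the next instruction starts."""
    n = len(a)
    # Instruction 1: pre-calculate the predicate mask for the entire vector.
    mask = [a[i] != 0 and b[i] != 0 for i in range(n)]
    # Instruction 2: predicated multiply across all elements.
    prod = [a[i] * b[i] if mask[i] else 0 for i in range(n)]
    # Instruction 3: predicated accumulate across all elements.
    acc = 0
    for i in range(n):
        if mask[i]:
            acc += prod[i]
    return acc, mask


def vertical_first_mac(a, b):
    """Vertical-First model: the whole batch of instructions runs on one
    element, then the loop explicitly steps to the next element."""
    acc = 0
    executed = 0  # counts how many multiply-accumulates were actually run
    i = 0
    while i < len(a):
        x = a[i]  # scalar temporaries, no vector mask required
        y = b[i]
        if x != 0 and y != 0:  # plain scalar test plus branch skips the MAC
            acc += x * y
            executed += 1
        i += 1  # hypothetical stand-in for svstep
    return acc, executed


a = [1, 0, 3, 4]
b = [2, 5, 0, 6]
print(horizontal_first_mac(a, b))
print(vertical_first_mac(a, b))
```

Both functions compute the same sum. The horizontal model has to materialise a full-length mask before any multiply is issued, whereas the vertical model decides per element using scalar temporaries only.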
+*<blockquote>
+(With thanks and gratitude to Mitch Alsup on comp.arch for
+spending considerable time explaining VVM, how its Loop
+Construct explicitly identifies loop-invariant registers,
+and how to exploit GB-OoO Micro-architectures)
+</blockquote>*
+
**Use-case: More powerful in-memory PEs**
An obvious variant of the above is that, if there is inherently