From: lkcl
Date: Sun, 8 May 2022 21:50:16 +0000 (+0100)
Subject: (no commit message)
X-Git-Tag: opf_rfc_ls005_v1~2296
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=7f19ece3d423f88e1352626112f3abb0cd6e2be9;p=libreriscv.git

---

diff --git a/openpower/sv/SimpleV_rationale.mdwn b/openpower/sv/SimpleV_rationale.mdwn
index d38504f3a..161c5511e 100644
--- a/openpower/sv/SimpleV_rationale.mdwn
+++ b/openpower/sv/SimpleV_rationale.mdwn
@@ -850,7 +850,7 @@ ascertain, retrospectively, that time and power had just been wasted.
 
 SVP64 is able to do what is termed "Vertical-First" Vectorisation
 *(walk first through a batch of instructions before explicitly
-moving to the next element, and repeating the batch)*,
+moving to the next element with `svstep`, and repeating the batch)*,
 combined with SVREMAP Matrix Schedules. Imagine that SVREMAP has been
 extended, Snitch-style, to perform a deterministic memory-array walk of a
 large Matrix.
@@ -881,6 +881,31 @@ main CPU. In this way a large Sparse Matrix Multiply or Convolution
 may be achieved without having to pass unnecessary data through
 L1/L2/L3 Caches only to find, at the CPU, that it is zero.
 
+The reason in this case for the use of Vertical-First Mode is the
+conditional execution of the Multiply-and-Accumulate.
+Horizontal-First Mode is the standard Cray-Style Vectorisation:
+loop on all elements with the same instruction before moving
+on to the next instruction. Predication needs to be pre-calculated
+for the entire Vector in order to exclude certain elements from
+the computation. In this case, that's an expensive inconvenience.
+
+Vertical-First allows *scalar* temporary registers to be utilised
+in the assessment as to whether a particular Vector element should
+be skipped, utilising a straight Branch instruction. This technique
+is pioneered by Mitch Alsup and is a key feature of his VVM Engine
+in MyISA 66000. Careful analysis of the registers within the
+Vertical-First Loop allows a Multi-Issue Out-of-Order Engine to
+*amortise in-flight scalar looped operations into SIMD batches*
+as long as the loop is kept small enough to entirely fit into
+in-flight Reservation Stations.
+
+*
+(With thanks and gratitude to Mitch Alsup on comp.arch for
+spending considerable time explaining VVM, how its Loop
+Construct explicitly identifies loop-invariant registers,
+and how to exploit GB-OoO Micro-architectures)
+*
+
 **Use-case: More powerful in-memory PEs**
 
 An obvious variant of the above is that, if there is inherently