From: lkcl
Date: Sun, 23 Apr 2023 10:18:04 +0000 (+0100)
Subject: (no commit message)
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=730c10193aa2a25d15d367b7edb4fa9789907340;p=libreriscv.git
---
diff --git a/openpower/sv/sprs.mdwn b/openpower/sv/sprs.mdwn
index 9d4d6e516..c81c59a27 100644
--- a/openpower/sv/sprs.mdwn
+++ b/openpower/sv/sprs.mdwn
@@ -153,22 +153,13 @@ See `svstep` instruction for how to set Pack and Unpack Modes.
 A problem exists for hardware where it may not be able to detect that a programmer (or compiler) knows of opportunities for parallelism
-and lack of overlap between loops.
-
-For hphint, the number chosen must be consistently
-executed **every time**. Hardware is not permitted to execute five
-computations for one instruction then three on the next.
-hphint is a hint from the compiler to hardware that exactly this
-many elements may be safely executed in parallel, without hazards
-(including Memory accesses).
-Interestingly, when hphint is set equal to VL, it is in effect
-as if Vertical First mode were not set, because the hardware is
-given the option to run through all elements in an instruction.
-This is exactly what Horizontal-First is: a for-loop from 0 to VL-1
-except that the hardware may *choose* the number of elements.
+and lack of overlap between loops, despite these being easy for a compiler
+to statically detect and potentially express.
+`hphint` is such an expression, declaring that elements within a batch are
+independent of each other (no Register *or Memory* Hazards).
 Elements are considered to be in the same source batch if they have
-the same `FLOOR(srcstep/hphint)`. Likewise in the same destination batch.
+the same value of `FLOOR(srcstep/hphint)`. Likewise in the same destination batch.
 Three key observations here:
 1. predication is **not** involved here. the number of actual elements
@@ -178,10 +169,12 @@ batches 3.
batch evaluation is done *before* REMAP, making Hazard elimination easier for Multi-Issue systems.
-*Hard2are architectural note: each element within the same group may be treated as
+*Hardware Architect note: each element within the same group may be treated as
 100% independent from any other element within that group, and therefore
-neither Register Hazards nor Memory Hazards inter-element exist. This makes
-implementation far easier on resources.*
+neither Register Hazards nor Memory Hazards inter-element exist
+(but inter-group definitely does). This makes
+implementation far easier on resources because the Hazard Dependencies are
+effectively at a much coarser granularity than a single register.*
 `hphint` may legitimately be set greater than `MAXVL`. This indicates to Multi-Issue hardware that even though MAXVL is relatively small the batches are *still independent*
@@ -198,6 +191,32 @@ also requires care to correctly declare in `hphint` how many elements are independent. In the case of most Reduction use-cases the answer is almost certainly "none".
+`hphint` must definitely not be set on Atomic Memory operations, Cache-Inhibited
+Memory operations, or Load-Reservation Store-Conditional. Also if Load-with-Update
+Data-Dependent Fail-First is ever used for linked-list pointer-chasing, `hphint`
+should again definitely be disabled.
+
+`hphint` may only be ignored by Hardware Implementors as long as full element-level
+Register and Memory Hazards are implemented *in full* (including right down to individual
+bytes of each register for when elwidth=8/16/32). In other words if `hphint` is to
+be ignored then implementations must be made as if `hphint=0`.
+
+**Horizontal Parallelism in Vertical-First Mode**
+
+Setting `hphint` with Vertical-First is perfectly legitimate.
Under these circumstances
+the single-element strict Program Execution Order must be preserved at all times, but
+should there be a small enough program loop, then Out-of-Order Hardware may *merge*
+consecutive element-based instructions into the *same Reservation Stations*, for
+multiple operations to be passed to massive-wide back-end SIMD ALUs or Vector-Chaining ALUs.
+**Only** elements within the same `hphint` group (across multiple such looped instructions)
+may be treated such.
+
+Note that if the loop of Vertical-First instructions cannot fit entirely into Reservation
+Stations then Hardware clearly cannot exploit the above optimisation opportunity, but at
+least there is no harm done: the loop is still correctly executed as Scalar instructions.
+Programmers do need to be aware though that short loops on some Hardware Implementations
+can be made considerably faster than on other Implementations.
+
 ## SVLR

 SV Link Register, exactly analogous to LR (Link Register) may