From: lkcl <lkcl@web>
Date: Sun, 23 Apr 2023 10:18:04 +0000 (+0100)
Subject: (no commit message)
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=730c10193aa2a25d15d367b7edb4fa9789907340;p=libreriscv.git

---

diff --git a/openpower/sv/sprs.mdwn b/openpower/sv/sprs.mdwn
index 9d4d6e516..c81c59a27 100644
--- a/openpower/sv/sprs.mdwn
+++ b/openpower/sv/sprs.mdwn
@@ -153,22 +153,13 @@ See `svstep` instruction for how to set Pack and Unpack Modes.
 
 A problem exists for hardware where it may not be able to detect
 that a programmer (or compiler) knows of opportunities for parallelism
-and lack of overlap between loops.
-
-For hphint, the number chosen must be consistently
-executed **every time**. Hardware is not permitted to execute five
-computations for one instruction then three on the next.
-hphint is a hint from the compiler to hardware that exactly this
-many elements may be safely executed in parallel, without hazards
-(including Memory accesses).
-Interestingly, when hphint is set equal to VL, it is in effect
-as if Vertical First mode were not set, because the hardware is
-given the option to run through all elements in an instruction.
-This is exactly what Horizontal-First is: a for-loop from 0 to VL-1
-except that the hardware may *choose* the number of elements.
+and lack of overlap between loops, despite these being easy for a compiler
+to statically detect and potentially express.
+`hphint` is such an expression, declaring that elements within a batch are
+independent of each other (no Register *or Memory* Hazards).
 
 Elements are considered to be in the same source batch if they have
-the same `FLOOR(srcstep/hphint)`. Likewise in the same destination batch.
+the same value of `FLOOR(srcstep/hphint)`. Likewise in the same destination batch.
 Three key observations here:
 
 1. predication is **not** involved here.  the number of actual elements
@@ -178,10 +169,12 @@ batches
 3. batch evaluation is done *before* REMAP, making Hazard elimination easier
    for Multi-Issue systems.
 
-*Hard2are architectural note: each element within the same group may be treated as
+*Hardware Architect note: each element within the same group may be treated as
 100% independent from any other element within that group, and therefore
-neither Register Hazards nor Memory Hazards inter-element exist.  This makes
-implementation far easier on resources.*
+neither inter-element Register Hazards nor inter-element Memory Hazards exist
+(but inter-group Hazards definitely do).  This makes
+implementation far easier on resources because the Hazard Dependencies are
+effectively at a much coarser granularity than a single register.*
 
 `hphint` may legitimately be set greater than `MAXVL`. This indicates to Multi-Issue
 hardware that even though MAXVL is relatively small the batches are *still independent*
@@ -198,6 +191,32 @@ also requires care to correctly declare in `hphint` how many elements are
 independent. In the case of most Reduction use-cases the answer is almost certainly
 "none".
 
+`hphint` must definitely not be set on Atomic Memory operations, Cache-Inhibited
+Memory operations, or Load-Reservation Store-Conditional. Also if Load-with-Update
+Data-Dependent Fail-First is ever used for linked-list pointer-chasing, `hphint`
+should again definitely be disabled.
+
+`hphint` may only be ignored by Hardware Implementors as long as full element-level
+Register and Memory Hazards are implemented *in full* (including right down to individual
+bytes of each register for when elwidth=8/16/32). In other words if `hphint` is to
+be ignored then implementations must be made as if `hphint=0`.
+
+**Horizontal Parallelism in Vertical-First Mode**
+
+Setting `hphint` with Vertical-First is perfectly legitimate.  Under these circumstances
+the single-element strict Program Execution Order must be preserved at all times, but
+should there be a small enough program loop, then Out-of-Order Hardware may *merge*
+consecutive element-based instructions into the *same Reservation Stations*, for
+multiple operations to be passed to massively-wide back-end SIMD ALUs or Vector-Chaining ALUs.
+**Only** elements within the same `hphint` group (across multiple such looped instructions)
+may be treated as such.
+
+Note that if the loop of Vertical-First instructions cannot fit entirely into Reservation
+Stations then Hardware clearly cannot exploit the above optimisation opportunity, but at
+least there is no harm done: the loop is still correctly executed as Scalar instructions.
+Programmers do need to be aware though that short loops on some Hardware Implementations
+can be made considerably faster than on other Implementations.
+
 ## SVLR
 
 SV Link Register, exactly analogous to LR (Link Register) may
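
Reviewer sketch (not part of the patch): the batching rule in the first hunk, "same value of `FLOOR(srcstep/hphint)`", can be illustrated with a few lines of Python. The function names are hypothetical, and treating `hphint=0` as "every element is its own batch" is an assumption drawn from the patch's remark that ignoring `hphint` is equivalent to `hphint=0`. Predication is deliberately absent, since the patch states that it plays no part in batch membership.

```python
from typing import Dict, List


def batch_of(step: int, hphint: int) -> int:
    """Batch index FLOOR(step / hphint) for a single element step.

    Assumption: hphint == 0 means no parallelism hint was given, so each
    element is placed in a batch of its own.
    """
    if hphint <= 0:
        return step
    return step // hphint


def group_into_batches(vl: int, hphint: int) -> List[List[int]]:
    """Group element steps 0..vl-1 by batch index, in ascending batch order."""
    batches: Dict[int, List[int]] = {}
    for srcstep in range(vl):
        batches.setdefault(batch_of(srcstep, hphint), []).append(srcstep)
    return [batches[key] for key in sorted(batches)]


if __name__ == "__main__":
    # Elements sharing a batch are declared hazard-free with respect to
    # each other; hazards between different batches must still be tracked.
    for batch in group_into_batches(vl=8, hphint=3):
        print(batch)
```

With `vl=8` and `hphint=3` the steps fall into three batches, the last one partial. Setting `hphint` larger than the vector length places every step in a single batch, which matches the patch's note that `hphint` may legitimately exceed `MAXVL`.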