From: lkcl <lkcl@web>
Date: Tue, 6 Sep 2022 07:15:33 +0000 (+0100)
Subject: (no commit message)
X-Git-Tag: opf_rfc_ls005_v1~662
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=5b0d27734236e39eaa553e710353aa5df31e59b7;p=libreriscv.git

---

diff --git a/openpower/sv/remap.mdwn b/openpower/sv/remap.mdwn
index 9dab85398..e4ad1c888 100644
--- a/openpower/sv/remap.mdwn
+++ b/openpower/sv/remap.mdwn
@@ -278,28 +278,29 @@ created.  No special action is taken: the result and its CR Field
 are stored "as usual" exactly as all other SVP64 Rc=1 operations.
 
 Note that the Schedule only makes sense on top of certain instructions:
-X-Form with a Register Profile of `RT,RA,RB` is fine.  Like Scalar
+X-Form with a Register Profile of `RT,RA,RB` is fine because both sources
+and the destination are all the same type.  Like Scalar
 Reduction, nothing is prohibited:
 the results of execution on an unsuitable instruction may simply
-not make sense. Many 3-input instructions (madd, fmadd) unlike Scalar
-Reduction in particular do not make sense, but `ternlogi`, if used
-with care, would.
+not make sense. With care, even 3-input instructions (madd, fmadd, ternlogi)
+may be used.
 
 Critical to note regarding use of Parallel-Reduction REMAP is that,
 exactly as with Matrix Mode, the `svshape` instruction *requests*
 a certain Vector Length (number of elements to reduce) and then
 sets VL and MAXVL at the number of **operations** needed to be
-carried out.  Thus, equally as importantly, the total number of operations
+carried out.  Thus, equally importantly, and as with Matrix REMAP,
+the total number of operations
 is restricted to 127.  Any Parallel-Reduction requiring more operations
 will need to be done manually in batches.
 
 Also important to note is that the Deterministic Schedule is arranged
-so that some implementations *may* parallelise it, as long as doing so
-respects Program Order and Register Hazards.  Performance (speed)
+so that some implementations *may* parallelise it (as long as doing so
+respects Program Order and Register Hazards).  Performance (speed)
 of any given
 implementation is neither strictly defined or guaranteed.  As with
 the Vulkan(tm) Specification, strict compliance is paramount whilst
-performance is left to Implementors.
+performance is at the discretion of Implementors.
 
 **Parallel-Reduction with Predication**
 
@@ -338,7 +339,7 @@ vector-to-scalar MV operation is needed.
 The simplest usage is to perform an overwrite, specifying all three
 register operands the same.
 
-    setvl VL=6
+    svshape parallelreduce, 6
     sv.add *8, *8, *8
 
 The Reduction Schedule will issue the Parallel Tree Reduction spanning
@@ -350,7 +351,7 @@ version, only those destination elements necessary for storing
 intermediary computations will be written to: the remaining elements
 will **not** be overwritten and will **not** be zero'd.
 
-    setvl VL=4
+    svshape parallelreduce, 6
     sv.add *0, *8, *8
 
 However it is critical to note that if the source and destination are
@@ -393,6 +394,37 @@ and Parallel Reduction is applied per row, then if `SVM` is **clear**
 the Matrix is transposed (like Pack/Unpack)
 before still applying the Parallel Reduction to the **row**.
 
+# Determining Register Hazards
+
+For high-performance (Multi-Issue, Out-of-Order) systems it is critical
+to be able to statically determine the extent of Vectors in order to
+allocate pre-emptive Hazard protection.  The next task is to eliminate
+masked-out elements using predicate bits, freeing up the associated
+Hazards.
+
+For non-REMAP situations `VL` is sufficient to ascertain early
+Hazard coverage, and with SVSTATE being a high-priority cached
+quantity at the same level as MSR and PC this is not a problem.
+
+The problems come when REMAP is enabled.  Indexed REMAP must instead
+use `MAXVL` as the earliest (simplest)
+batch-level Hazard Reservation indicator,
+but Matrix, FFT and Parallel Reduction must all use completely different
+schemes.  The reason is that VL is used to step through the total
+number of *operations*, not the number of registers.  The "Saving Grace"
+is that all of the REMAP Schedules are Deterministic.
+
+Advance-notice Parallel computation and subsequent caching
+of all of these complex Deterministic REMAP Schedules is
+*strongly recommended*, thus allowing clear and precise multi-issue
+batched Hazard coverage to be deployed, *even for Indexed Mode*.
+This is only possible for Indexed due to the strict guidelines
+given to Programmers.
+
+In short, there exist solutions to the problem of Hazard Management,
+with varying degrees of refinement possible at correspondingly
+increasing levels of complexity in hardware.
+
 # REMAP area of SVSTATE
 
 The following bits of the SVSTATE SPR are used for REMAP: