(no commit message)

author lkcl <lkcl@web>

Tue, 6 Sep 2022 07:15:33 +0000 (08:15 +0100)

committer IkiWiki <ikiwiki.info>

Tue, 6 Sep 2022 07:15:33 +0000 (08:15 +0100)
diff --git a/openpower/sv/remap.mdwn b/openpower/sv/remap.mdwn

index 9dab853989d04ca2a0a224430dbd82de6f45e44d..e4ad1c88880f7367dd145aa2acad674a2b997bb8 100644 (file)
--- a/openpower/sv/remap.mdwn
+++ b/openpower/sv/remap.mdwn
@@ -278,28 +278,29 @@ created.  No special action is taken: the result and its CR Field
  are stored "as usual" exactly as all other SVP64 Rc=1 operations.
  
  Note that the Schedule only makes sense on top of certain instructions:
-X-Form with a Register Profile of `RT,RA,RB` is fine.  Like Scalar
+X-Form with a Register Profile of `RT,RA,RB` is fine because two sources
+and the destination are all the same type.  Like Scalar
  Reduction, nothing is prohibited:
  the results of execution on an unsuitable instruction may simply
-not make sense. Many 3-input instructions (madd, fmadd) unlike Scalar
-Reduction in particular do not make sense, but `ternlogi`, if used
-with care, would.
+not make sense. With care, even 3-input instructions (madd, fmadd, ternlogi) 
+may be used.
  
  Critical to note regarding use of Parallel-Reduction REMAP is that,
  exactly as with Matrix Mode, the `svshape` instruction *requests*
  a certain Vector Length (number of elements to reduce) and then
  sets VL and MAXVL at the number of **operations** needed to be
-carried out.  Thus, equally as importantly, the total number of operations
+carried out.  Thus, equally importantly, as with Matrix REMAP,
+the total number of operations
  is restricted to 127.  Any Parallel-Reduction requiring more operations
  will need to be done manually in batches.
  
  Also important to note is that the Deterministic Schedule is arranged
-so that some implementations *may* parallelise it, as long as doing so
-respects Program Order and Register Hazards.  Performance (speed)
+so that some implementations *may* parallelise it (as long as doing so
+respects Program Order and Register Hazards).  Performance (speed)
  of any given
  implementation is neither strictly defined or guaranteed.  As with
  the Vulkan(tm) Specification, strict compliance is paramount whilst
-performance is left to Implementors.
+performance is at the discretion of Implementors.
  
  **Parallel-Reduction with Predication**
  
@@ -338,7 +339,7 @@ vector-to-scalar MV operation is needed.
  The simplest usage is to perform an overwrite, specifying all three
  register operands the same.
  
-    setvl VL=6
+    svshape parallelreduce, 6
      sv.add *8, *8, *8
  
  The Reduction Schedule will issue the Parallel Tree Reduction spanning
@@ -350,7 +351,7 @@ version, only those destination elements necessary for storing
  intermediary computations will be written to: the remaining elements
  will **not** be overwritten and will **not** be zero'd.
  
-    setvl VL=4
+    svshape parallelreduce, 6
      sv.add *0, *8, *8
  
  However it is critical to note that if the source and destination are
@@ -393,6 +394,37 @@ and Parallel Reduction is applied per row, then if `SVM` is **clear**
  the Matrix is transposed (like Pack/Unpack)
  before still applying the Parallel Reduction to the **row**.
  
+# Determining Register Hazards
+
+For high-performance (Multi-Issue, Out-of-Order) systems it is critical
+to be able to statically determine the extent of Vectors in order to
+allocate pre-emptive Hazard protection.  The next task is to eliminate
+masked-out elements using predicate bits, freeing up the associated
+Hazards.
+
+For non-REMAP situations `VL` is sufficient to ascertain early
+Hazard coverage, and with SVSTATE being a high-priority cached
+quantity at the same level as MSR and PC, this is not a problem.
+
+The problems come when REMAP is enabled.  Indexed REMAP must instead
+use `MAXVL` as the earliest (simplest)
+batch-level Hazard Reservation indicator,
+but Matrix, FFT and Parallel Reduction must all use completely different
+schemes.  The reason is that VL is used to step through the total
+number of *operations*, not the number of registers.  The "Saving Grace"
+is that all of the REMAP Schedules are Deterministic.
+
+Advance-notice Parallel computation and subsequent caching
+of all of these complex Deterministic REMAP Schedules is
+*strongly recommended*, thus allowing clear and precise multi-issue
+batched Hazard coverage to be deployed, *even for Indexed Mode*.
+This is only possible for Indexed due to the strict guidelines
+given to Programmers.
+
+In short, there exist solutions to the problem of Hazard Management,
+with varying degrees of refinement possible at correspondingly
+increasing levels of complexity in hardware.
+
  # REMAP area of SVSTATE
  
  The following bits of the SVSTATE SPR are used for REMAP:
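
The diff above states that `svshape` sets VL to the number of *operations* rather than the number of elements, and that a non-overwriting reduction writes only those destination elements needed for intermediate results. The following is a minimal Python sketch of that idea, assuming a plain unpredicated binary-tree schedule. The function names `reduce_schedule` and `apply_schedule` are hypothetical, and the ordering shown is an illustration rather than the normative SVP64 Schedule.

```python
def reduce_schedule(n):
    """Return (dest, src) index pairs for an unpredicated tree reduction.

    Assumption: a simple binary tree in which each pass doubles the stride.
    This illustrates the concept only; it is not taken from the specification.
    """
    ops = []
    step = 1
    while step < n:
        step *= 2
        for left in range(0, n, step):
            right = left + step // 2
            if right < n:  # skip pairs that would fall off the end of the vector
                ops.append((left, right))
    return ops


def apply_schedule(values, ops):
    """Apply an 'add' at each scheduled step, accumulating into the dest index."""
    vec = list(values)
    for dst, src in ops:
        vec[dst] = vec[dst] + vec[src]
    return vec


ops = reduce_schedule(6)
result = apply_schedule([1, 2, 3, 4, 5, 6], ops)
print(len(ops), ops)   # number of operations, then the schedule itself
print(result[0])       # the reduced value ends up in element 0
```

Under this sketch a reduction of `n` elements takes `n - 1` operations, which is the quantity that the operation-count limit mentioned in the diff would constrain. Elements that never appear as a destination (indices 1, 3 and 5 in the 6-element case) are left unmodified.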