From d03ee41d9f91be0c26c03a233365f23177f726b6 Mon Sep 17 00:00:00 2001
From: lkcl
Date: Fri, 19 May 2023 19:45:59 +0100
Subject: [PATCH]

---
 openpower/sv/remap.mdwn | 292 ++++++++++++++++++++--------------------
 1 file changed, 143 insertions(+), 149 deletions(-)

diff --git a/openpower/sv/remap.mdwn b/openpower/sv/remap.mdwn
index a9d56ff2d..7eb0328e2 100644
--- a/openpower/sv/remap.mdwn
+++ b/openpower/sv/remap.mdwn
@@ -117,155 +117,6 @@ and briefly goes over their characteristics and limitations.
 Further details on the Deterministic Precise-Interruptible
 algorithms used in these Schedules is found in the
 [[sv/remap/appendix]].
-
-
-### Parallel Reduction
-
-Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
-(Power ISA v3.0B) operation is leveraged, unmodified, to give the
-*appearance* and *effect* of Reduction. Parallel Reduction is not limited
-to Power-of-two but is limited as usual by the total number of
-element operations (127) as well as available register file size.
-
-In Horizontal-First Mode, Vector-result reduction **requires**
-the destination to be a Vector, which will be used to store
-intermediary results, in order to achieve a correct final
-result.
-
-Given that the tree-reduction schedule is deterministic,
-Interrupts and exceptions
-can therefore also be precise. The final result will be in the first
-non-predicate-masked-out destination element, but due again to
-the deterministic schedule programmers may find uses for the intermediate
-results, even for non-commutative Defined Word operations.
-
-Parallel Reduction is unusual in that it requires a full vector array
-of results (not a scalar) and uses the rest of the result Vector for
-the purposes of storing intermediary calculations. As these intermediary
-results are Deterministically computed they may be useful.
-Additionally, because the intermediate results are always written out
-it is possible to service Precise Interrupts without affecting latency
-(a common limitation of Vector ISAs implementing explicit
-Parallel Reduction instructions, because their Architectural State cannot
-hold the partial results).
-
-When Rc=1 a corresponding Vector of co-resultant CRs is also
-created. No special action is taken: the result *and its CR Field*
-are stored "as usual" exactly as all other SVP64 Rc=1 operations.
-
-Note that the Schedule only makes sense on top of certain instructions:
-X-Form with a Register Profile of `RT,RA,RB` is fine because two sources
-and the destination are all the same type. Like Scalar
-Reduction, nothing is prohibited:
-the results of execution on an unsuitable instruction may simply
-not make sense. With care, even 3-input instructions (madd, fmadd, ternlogi)
-may be used, and whilst it is down to the Programmer to walk through the
-process the Programmer can be confident that the Parallel-Reduction is
-guaranteed 100% Deterministic.
-
-Critical to note regarding use of Parallel-Reduction REMAP is that,
-exactly as with all REMAP Modes, the `svshape` instruction *requests*
-a certain Vector Length (number of elements to reduce) and then
-sets VL and MAXVL at the number of **operations** needed to be
-carried out. Thus, equally as importantly, like Matrix REMAP
-the total number of operations
-is restricted to 127. Any Parallel-Reduction requiring more operations
-will need to be done manually in batches (hierarchical
-recursive Reduction).
-
-Also important to note is that the Deterministic Schedule is arranged
-so that some implementations *may* parallelise it (as long as doing so
-respects Program Order and Register Hazards). Performance (speed)
-of any given
-implementation is neither strictly defined or guaranteed. As with
-the Vulkan(tm) Specification, strict compliance is paramount whilst
-performance is at the discretion of Implementors.
-
-**Parallel-Reduction with Predication**
-
-To avoid breaking the strict RISC-paradigm, keeping the Issue-Schedule
-completely separate from the actual element-level (scalar) operations,
-Move operations are **not** included in the Schedule. This means that
-the Schedule leaves the final (scalar) result in the first-non-masked
-element of the Vector used. With the predicate mask being dynamic
-(but deterministic) at a superficial glance it seems this result
-could be anywhere.
-
-If that result is needed to be moved to a (single) scalar register
-then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be
-needed to get it, where the predicate is the exact same predicate used
-in the prior Parallel-Reduction instruction.
-
-* If there was only a single
-  bit in the predicate then the result will not have moved or been altered
-  from the source vector prior to the Reduction
-* If there was more than one bit the result will be in the
-  first element with a predicate bit set.
-
-In either case the result is in the element with the first bit set in
-the predicate mask. Thus, no move/copy *within the Reduction itself* was needed.
-
-Programmer's Note: For *some* hardware implementations
-the vector-to-scalar copy may be a slow operation, as may the Predicated
-Parallel Reduction itself.
-It may be better to perform a pre-copy
-of the values, compressing them (VREDUCE-style) into a contiguous block,
-which will guarantee that the result goes into the very first element
-of the destination vector, in which case clearly no follow-up
-predicated vector-to-scalar MV operation is needed. A VREDUCE effect
-is achieved by setting just a source predicate mask on Twin-Predicated
-operations.
-
-**Usage conditions**
-
-The simplest usage is to perform an overwrite, specifying all three
-register operands the same.
-
-```
-    svshape parallelreduce, 6
-    sv.add *8, *8, *8
-```
-
-The Reduction Schedule will issue the Parallel Tree Reduction spanning
-registers 8 through 13, by adjusting the offsets to RT, RA and RB as
-necessary (see "Parallel Reduction algorithm" in a later section).
-
-A non-overwrite is possible as well but just as with the overwrite
-version, only those destination elements necessary for storing
-intermediary computations will be written to: the remaining elements
-will **not** be overwritten and will **not** be zero'd.
-
-```
-    svshape parallelreduce, 6
-    sv.add *0, *8, *8
-```
-
-However it is critical to note that if the source and destination are
-not the same then the trick of using a follow-up vector-scalar MV will
-not work.
-
-### Sub-Vector Horizontal Reduction
-
-To achieve Sub-Vector Horizontal Reduction, Pack/Unpack should be enabled,
-which will turn the Schedule around such that issuing of the Scalar
-Defined Words is done with SUBVL looping as the inner loop not the
-outer loop. Rc=1 with Sub-Vectors (SUBVL=2,3,4) is `UNDEFINED` behaviour.
-
-*Programmer's Note: Overwrite Parallel Reduction with Sub-Vectors
-will clearly result in data corruption. It may be best to perform
-a Pack/Unpack Transposing copy of the data first*
-
-### Parallel Prefix Sum
-
-This is a work-efficient Parallel Schedule that for example produces Trangular
-or Factorial number sequences. Half of the Prefix Sum Schedule is near-identical
-to Parallel Reduction. Whilst the Arithmetic mapreduce Mode (`/mr`) may achieve the same
-end-result, implementations may only implement Mapreduce in serial form (or give
-the appearance to Programmers of the same). The Parallel Prefix Schedule is
-*required* to be implemented in such a way that its Deterministic Schedule may be
-parallelised. Like the Reduction Schedule it is 100% Deterministic and consequently
-may be used with non-commutative operations.
-
 ## Determining Register Hazards
 
 For high-performance (Multi-Issue, Out-of-Order) systems it is critical
@@ -446,6 +297,149 @@ Creates the Schedules for Parallel Tree Reduction and Prefix-Sum
   When clear the step will begin at 2 and double on each
   inner loop.
 
+**Parallel Prefix Sum**
+
+This is a work-efficient Parallel Schedule that for example produces Triangular
+or Factorial number sequences. Half of the Prefix Sum Schedule is near-identical
+to Parallel Reduction. Whilst the Arithmetic mapreduce Mode (`/mr`) may achieve the same
+end-result, implementations may only implement Mapreduce in serial form (or give
+the appearance to Programmers of the same). The Parallel Prefix Schedule is
+*required* to be implemented in such a way that its Deterministic Schedule may be
+parallelised. Like the Reduction Schedule, it is 100% Deterministic and consequently
+may be used with non-commutative operations.
+The Schedule Algorithm may be found in the [[sv/remap/appendix]].
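+
+*Note (illustrative, non-normative): the Python sketch below shows one
+possible work-efficient pair-wise schedule of the kind described above,
+an up-sweep of the same tree shape as a Parallel Reduction followed by
+a down-sweep that propagates the partial sums. The normative Schedule
+Algorithm is the one in the [[sv/remap/appendix]]; the function name
+here is purely illustrative.*
+
+```
+def prefix_sum_schedule(n):
+    """Illustrative (RA, RT) pair schedule for an in-place,
+       work-efficient inclusive prefix-sum over n elements."""
+    ops = []
+    stride = 1
+    while stride < n:                    # up-sweep: tree-reduction shape
+        for i in range(stride * 2 - 1, n, stride * 2):
+            ops.append((i - stride, i))  # element[i] += element[i-stride]
+        stride *= 2
+    stride //= 2
+    while stride >= 1:                   # down-sweep: propagate partials
+        for i in range(stride * 3 - 1, n, stride * 2):
+            ops.append((i - stride, i))
+        stride //= 2
+    return ops
+
+vec = [1, 2, 3, 4]
+for ra, rt in prefix_sum_schedule(len(vec)):
+    vec[rt] += vec[ra]
+print(vec)                               # [1, 3, 6, 10]: Triangular numbers
+```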
+
+**Parallel Reduction**
+
+Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture. Like Scalar Reduction, the "Scalar Base"
+(Power ISA v3.0B) operation is leveraged, unmodified, to give the
+*appearance* and *effect* of Reduction. Parallel Reduction is not limited
+to Power-of-two sizes but is limited as usual by the total number of
+element operations (127) as well as by available register file size.
+
+In Horizontal-First Mode, Vector-result reduction **requires**
+the destination to be a Vector, which will be used to store
+intermediary results, in order to achieve a correct final
+result.
+
+Given that the tree-reduction schedule is deterministic,
+Interrupts and exceptions
+can also be precise. The final result will be in the first
+non-predicate-masked-out destination element, but due again to
+the deterministic schedule, programmers may find uses for the intermediate
+results, even for non-commutative Defined Word operations.
+Additionally, because the intermediate results are always written out,
+it is possible to service Precise Interrupts without affecting latency
+(a common limitation of Vector ISAs implementing explicit
+Parallel Reduction instructions, because their Architectural State cannot
+hold the partial results).
+
+When Rc=1 a corresponding Vector of co-resultant CRs is also
+created. No special action is taken: the result *and its CR Field*
+are stored "as usual", exactly as with all other SVP64 Rc=1 operations.
+
+Note that the Schedule only makes sense on top of certain instructions:
+X-Form with a Register Profile of `RT,RA,RB` is fine because the two sources
+and the destination are all the same type. Like Scalar
+Reduction, nothing is prohibited:
+the results of execution on an unsuitable instruction may simply
+not make sense. With care, even 3-input instructions (madd, fmadd, ternlogi)
+may be used, and whilst it is down to the Programmer to walk through the
+process, the Programmer can be confident that the Parallel-Reduction is
+guaranteed 100% Deterministic.
+
+Critical to note regarding use of Parallel-Reduction REMAP is that,
+exactly as with all REMAP Modes, the `svshape` instruction *requests*
+a certain Vector Length (the number of elements to reduce) and then
+sets VL and MAXVL to the number of **operations** needed to be
+carried out. Thus, equally importantly, as with Matrix REMAP
+the total number of operations
+is restricted to 127. Any Parallel-Reduction requiring more operations
+will need to be done manually in batches (hierarchical
+recursive Reduction).
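+
+*Note (illustrative, non-normative): the batching arithmetic for a
+hierarchical recursive Reduction can be sketched in a few lines of
+Python. The helper `reduce_batch` below is hypothetical and simply
+stands for one `svshape parallelreduce` plus the reduction instruction
+issued over a single batch; a batch of n elements requires n-1
+operations, so batches of up to 127 elements stay within the limit.*
+
+```
+from operator import add
+
+def reduce_batch(elements, op):
+    # hypothetical stand-in for: svshape parallelreduce, len(elements)
+    # followed by the reduction instruction over that batch
+    acc = elements[0]
+    for x in elements[1:]:
+        acc = op(acc, x)
+    return acc
+
+def hierarchical_reduce(vec, op, max_batch=127):
+    """Reduce an arbitrarily long vector by reducing batches of at
+       most max_batch elements, then reducing the partial results,
+       repeating until a single value remains."""
+    while len(vec) > 1:
+        vec = [reduce_batch(vec[i:i + max_batch], op)
+               for i in range(0, len(vec), max_batch)]
+    return vec[0]
+
+print(hierarchical_reduce(list(range(1, 1001)), add))  # 500500
+```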
+
+Also important to note is that the Deterministic Schedule is arranged
+so that some implementations *may* parallelise it (as long as doing so
+respects Program Order and Register Hazards). Performance (speed)
+of any given
+implementation is neither strictly defined nor guaranteed. As with
+the Vulkan(tm) Specification, strict compliance is paramount whilst
+performance is at the discretion of Implementors.
+
+**Parallel-Reduction with Predication**
+
+To avoid breaking the strict RISC-paradigm, keeping the Issue-Schedule
+completely separate from the actual element-level (scalar) operations,
+Move operations are **not** included in the Schedule. This means that
+the Schedule leaves the final (scalar) result in the first non-masked
+element of the Vector used. With the predicate mask being dynamic
+(but deterministic), at a superficial glance it seems this result
+could be anywhere.
+
+If that result needs to be moved to a (single) scalar register
+then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be
+needed to get it, where the predicate is the exact same predicate used
+in the prior Parallel-Reduction instruction.
+
+* If there was only a single
+  bit set in the predicate then the result will not have moved or been altered
+  from the source vector prior to the Reduction.
+* If there was more than one bit set, the result will be in the
+  first element with a predicate bit set.
+
+In either case the result is in the element with the first bit set in
+the predicate mask. Thus, no move/copy *within the Reduction itself* was needed.
+
+Programmer's Note: For *some* hardware implementations
+the vector-to-scalar copy may be a slow operation, as may the Predicated
+Parallel Reduction itself.
+It may be better to perform a pre-copy
+of the values, compressing them (VREDUCE-style) into a contiguous block,
+which will guarantee that the result goes into the very first element
+of the destination vector, in which case clearly no follow-up
+predicated vector-to-scalar MV operation is needed. A VREDUCE effect
+is achieved by setting just a source predicate mask on Twin-Predicated
+operations.
+
+**Usage conditions**
+
+The simplest usage is to perform an overwrite, specifying all three
+register operands the same.
+
+```
+    svshape parallelreduce, 6
+    sv.add *8, *8, *8
+```
+
+The Reduction Schedule will issue the Parallel Tree Reduction spanning
+registers 8 through 13, by adjusting the offsets to RT, RA and RB as
+necessary (see "Parallel Reduction algorithm" in a later section).
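+
+*Note (illustrative, non-normative): the sketch below prints one
+possible set of (RT, RA, RB) triples for the six-element example above,
+spanning registers 8 to 13. The pairing order shown is that of a simple
+binary tree and is not necessarily the exact issue order of the
+normative Schedule; it is intended only to show how the register
+offsets are adjusted and that five operations result from six elements.*
+
+```
+def tree_reduce_triples(base, n):
+    """One possible (RT, RA, RB) schedule for reducing n elements
+       starting at register base, leaving the result in the first."""
+    ops = []
+    stride = 1
+    while stride < n:
+        for i in range(0, n - stride, stride * 2):
+            ops.append((base + i, base + i, base + i + stride))
+        stride *= 2
+    return ops
+
+for rt, ra, rb in tree_reduce_triples(8, 6):
+    print(f"add r{rt}, r{ra}, r{rb}")
+# add r8, r8, r9 / add r10, r10, r11 / add r12, r12, r13
+# add r8, r8, r10 / add r8, r8, r12  (5 operations for 6 elements)
+```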
+
+A non-overwrite is possible as well but, just as with the overwrite
+version, only those destination elements necessary for storing
+intermediary computations will be written to: the remaining elements
+will **not** be overwritten and will **not** be zero'd.
+
+```
+    svshape parallelreduce, 6
+    sv.add *0, *8, *8
+```
+
+However, it is critical to note that if the source and destination are
+not the same then the trick of using a follow-up vector-scalar MV will
+not work.
+
+**Sub-Vector Horizontal Reduction**
+
+To achieve Sub-Vector Horizontal Reduction, Pack/Unpack should be enabled,
+which will turn the Schedule around such that issuing of the Scalar
+Defined Words is done with SUBVL looping as the inner loop, not the
+outer loop. Rc=1 with Sub-Vectors (SUBVL=2,3,4) is `UNDEFINED` behaviour.
+
+*Programmer's Note: Overwrite Parallel Reduction with Sub-Vectors
+will clearly result in data corruption. It may be best to perform
+a Pack/Unpack Transposing copy of the data first.*
+
 ## FFT/DCT mode
 
 submode2=0 is for FFT. For FFT submode the following schedules may be
-- 
2.30.2