From: lkcl
Date: Mon, 5 Sep 2022 15:04:50 +0000 (+0100)
Subject: (no commit message)
X-Git-Tag: opf_rfc_ls005_v1~679
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=bc86ed27fd98147a0d2f5c67ce42b5db5b33deae;p=libreriscv.git
---

diff --git a/openpower/sv/svp64/appendix.mdwn b/openpower/sv/svp64/appendix.mdwn
index 24b370738..2084ae9d4 100644
--- a/openpower/sv/svp64/appendix.mdwn
+++ b/openpower/sv/svp64/appendix.mdwn
@@ -518,122 +518,6 @@ parallel optimisation of the scalar reduce operation:
 it's just that as far as the user is concerned,
 all exceptions and interrupts **MUST** be precise.
 
-## Vector result reduce mode
-
-Vector Reduce Mode issues a deterministic tree-reduction schedule to the
-underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
-(Power ISA v3.0B) operation is leveraged, unmodified, to give the
-*appearance* and *effect* of Reduction.
-
-In Horizontal-First Mode, Vector-result reduction **requires**
-the destination to be a Vector, which will be used to store
-intermediary results.
-
-Given that the tree-reduction schedule is deterministic, Interrupts
-and exceptions can therefore also be precise. The final result will
-be in the first non-predicate-masked-out destination element but,
-again due to the deterministic schedule, programmers may find uses
-for the intermediate results.
-
-When Rc=1 a corresponding Vector of co-resultant CRs is also
-created. No special action is taken: the result and its CR Field
-are stored "as usual" exactly as for all other SVP64 Rc=1 operations.
-
-Note that the Schedule only makes sense on top of certain instructions:
-X-Form with a Register Profile of `RT,RA,RB` is fine. As with Scalar
-Reduction, nothing is prohibited: the results of execution on an
-unsuitable instruction may simply not make sense. Unlike Scalar
-Reduction, many 3-input instructions (madd, fmadd) in particular do
-not make sense, but `ternlogi`, if used with care, would.
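The deterministic tree-reduction schedule removed above can be sketched as follows. This is an illustrative model only, not the normative SVP64 pseudocode: the name `tree_reduce`, the pairing strategy, and the overwriting of the lower element of each pair are assumptions of the sketch. It shows the two properties the text relies on: intermediate results remain visible in the vector, and the final result lands in the first non-masked element.

```python
# Hypothetical sketch of a deterministic binary-tree reduction schedule.
# Pairs of active (predicated-in) elements are combined at exponentially
# growing strides; masked-out elements are never read or written.
def tree_reduce(vec, pred, op):
    vl = len(vec)
    # indices of elements whose predicate bit is set, in element order
    active = [i for i in range(vl) if pred & (1 << i)]
    step = 1
    while step < len(active):
        for j in range(0, len(active) - step, step * 2):
            lo, hi = active[j], active[j + step]
            # the unmodified scalar "base" operation, result overwriting
            # the lower element of the pair (an intermediate result)
            vec[lo] = op(vec[lo], vec[hi])
        step *= 2
    return vec

v = tree_reduce([1, 2, 3, 4, 5, 6], 0b111111, lambda a, b: a + b)
# final result in the first non-masked element: v[0] == 21
```

Note that with a single-bit predicate the `while` loop never runs, so the sole active element is left untouched, matching the single-bit case described later in the text.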
-
-**Parallel-Reduction with Predication**
-
-To avoid breaking the strict RISC-paradigm, keeping the Issue-Schedule
-completely separate from the actual element-level (scalar) operations,
-Move operations are **not** included in the Schedule. This means that
-the Schedule leaves the final (scalar) result in the first non-masked
-element of the Vector used. With the predicate mask being dynamic
-(but deterministic) this result could be anywhere.
-
-If that result needs to be moved to a (single) scalar register
-then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be
-needed to get it, where the predicate is the exact same predicate used
-in the prior Parallel-Reduction instruction.
-
-* If there was only a single bit set in the predicate then the result
-  will not have moved from, or been altered in, the source vector prior
-  to the Reduction.
-* If there was more than one bit set then the result will be in the
-  first element with a predicate bit set.
-
-In either case the result is in the element corresponding to the first
-bit set in the predicate mask.
-
-For *some* implementations the vector-to-scalar copy may be a slow
-operation, as may the Predicated Parallel Reduction itself. It may be
-better to perform a pre-copy of the values, compressing them
-(VREDUCE-style) into a contiguous block, which will guarantee that the
-result goes into the very first element of the destination vector, in
-which case clearly no follow-up vector-to-scalar MV operation is needed.
-
-**Usage conditions**
-
-The simplest usage is to perform an overwrite, specifying all three
-register operands the same.
-
-    setvl VL=6
-    sv.add *8, *8, *8
-
-The Reduction Schedule will issue the Parallel Tree Reduction spanning
-registers 8 through 13, by adjusting the offsets to RT, RA and RB as
-necessary (see "Parallel Reduction algorithm" in a later section).
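The two result-recovery strategies removed above can be modelled in a few lines. This is a hedged illustration: `extract_scalar` and `vreduce_compress` are hypothetical helper names, standing in respectively for the follow-up `sv.mv/sm=predicate rt, *ra` move and for a VREDUCE-style compressing pre-copy; neither is an instruction name.

```python
# Hypothetical model of the follow-up predicated move: the scalar result
# sits at the first set predicate bit, wherever that happens to be.
def extract_scalar(vec, pred):
    for i in range(len(vec)):
        if pred & (1 << i):
            return vec[i]        # first non-masked element holds the result
    return None                  # all-zero predicate: nothing was reduced

# Hypothetical model of the VREDUCE-style pre-copy: packing the active
# elements into a contiguous block guarantees that a subsequent reduction
# leaves its result in element 0, so no follow-up move is needed.
def vreduce_compress(vec, pred):
    return [vec[i] for i in range(len(vec)) if pred & (1 << i)]
```

For example, with predicate `0b0110` over `[9, 8, 7, 6]`, `extract_scalar` returns `8` (element 1, the first set bit) and `vreduce_compress` returns `[8, 7]`.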
-
-A non-overwrite is possible as well but, just as with the overwrite
-version, only those destination elements necessary for storing
-intermediary computations will be written to: the remaining elements
-will **not** be overwritten and will **not** be zero'd.
-
-    setvl VL=4
-    sv.add *0, *8, *8
-
-However it is critical to note that if the source and destination are
-not the same then the trick of using a follow-up vector-to-scalar MV
-will not work.
-
-## Sub-Vector Horizontal Reduction
-
-Note that when SVM is clear and SUBVL!=1 a Parallel Reduction is performed
-on all first Subvector elements, followed by another separate independent
-Parallel Reduction on all the second Subvector elements, and so on.
-
-By contrast, when SVM is set and SUBVL!=1, a Horizontal
-Subvector mode is enabled, applying the Parallel Reduction
-Algorithm to the Subvector Elements. The Parallel Reduction
-is independently applied VL times, to each group of Subvector
-elements. Bear in mind that predication is never applied down
-into individual Subvector elements, but will be applied
-to select whether the *entire* Parallel Reduction on each
-group is performed or not.
-
-    for (i = 0; i < VL; i++)
-        if (predval & 1<<i)
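The Horizontal Subvector loop in the removed text can be expanded into a runnable sketch. For brevity a simple left-to-right reduce stands in for the full Parallel Reduction Algorithm within each group; the name `subvector_hreduce` and the placement of each group's result in its first element are assumptions of this sketch, not normative SVP64 behaviour. What the sketch does preserve from the text is that the predicate selects *whole* groups, never individual subvector elements.

```python
# Hypothetical sketch of Horizontal Subvector mode (SVM set, SUBVL != 1):
# a reduction is applied independently VL times, once per group of SUBVL
# subvector elements, gated by a per-group predicate bit.
def subvector_hreduce(vec, VL, SUBVL, predval, op):
    result = list(vec)
    for i in range(VL):
        if predval & (1 << i):            # whole-group predicate test
            group = vec[i * SUBVL:(i + 1) * SUBVL]
            acc = group[0]
            for x in group[1:]:           # stand-in for the Parallel
                acc = op(acc, x)          # Reduction within this group
            result[i * SUBVL] = acc       # result in group's first element
    return result
```

With `VL=2`, `SUBVL=3` and both predicate bits set, `[1, 2, 3, 4, 5, 6]` reduces group-wise under addition to `6` and `15`, stored in elements 0 and 3; clearing a predicate bit leaves that entire group untouched.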