From: lkcl <lkcl@web>
Date: Sun, 26 Jun 2022 10:46:41 +0000 (+0100)
Subject: (no commit message)
X-Git-Tag: opf_rfc_ls005_v1~1519
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=e506bbfe7a44a0baa63ecd557ee776936b3d2b56;p=libreriscv.git

---

diff --git a/openpower/sv/svp64/appendix.mdwn b/openpower/sv/svp64/appendix.mdwn
index 3ff57f151..75a08fa8c 100644
--- a/openpower/sv/svp64/appendix.mdwn
+++ b/openpower/sv/svp64/appendix.mdwn
@@ -537,6 +537,10 @@ Vector Reduce Mode issues a deterministic tree-reduction schedule to the underly
 (Power ISA v3.0B) operation is leveraged, unmodified, to give the
 *appearance* and *effect* of Reduction.
 
+Vector-result reduction **requires**
+the destination to be a Vector, which will be used to store
+intermediary results.
+
 Given that the tree-reduction schedule is deterministic,
 Interrupts and exceptions
 can therefore also be precise.  The final result will be in the first
@@ -556,6 +560,44 @@ not make sense. Many 3-input instructions (madd, fmadd) unlike Scalar
 Reduction in particular do not make sense, but `ternlogi`, if used
 with care, would.
 
+**Parallel-Reduction with Predication**
+
+To avoid breaking the strict RISC paradigm, which keeps the Issue-Schedule
+completely separate from the actual element-level (scalar) operations,
+Move operations are **not** included in the Schedule.  This means that
+the Schedule leaves the final (scalar) result in the first non-masked
+element of the Vector used.  With the predicate mask being dynamic
+(but deterministic) this result could be in any element of that Vector.
+
+If that result needs to be moved to a (single) scalar register
+then a follow-up `sv.mv/sm=predicate rt, ra.v` instruction will be
+needed to get it, where the predicate is the exact same predicate used
+in the prior Parallel-Reduction instruction. For *some* implementations
+this may be a slow operation.  It may be better to perform a pre-copy
+of the values, compressing them (VREDUCE-style) into a contiguous block,
+which will guarantee that the result goes into the very first element
+of the destination vector.
+
+**Usage conditions**
+
+The simplest usage is to perform an overwrite, specifying the same
+register for all three operands.
+
+    setvl VL=6
+    sv.add/vr 8.v, 8.v, 8.v
+
+The Reduction Schedule will issue the Parallel Tree Reduction spanning
+registers 8 through 13, by adjusting the offsets to RT, RA and RB as
+necessary (see "Parallel Reduction algorithm" in a later section).
+
+A non-overwrite is possible as well but, just as with the overwrite
+version, only those destination elements necessary for storing
+intermediary computations will be written to: the remaining elements
+will **not** be overwritten and will **not** be zeroed.
+
+    setvl VL=4
+    sv.add/vr 0.v, 8.v, 8.v
+
 ## Sub-Vector Horizontal Reduction
 
 Note that when SVM is clear and SUBVL!=1 the sub-elements are