From e506bbfe7a44a0baa63ecd557ee776936b3d2b56 Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Sun, 26 Jun 2022 11:46:41 +0100
Subject: [PATCH]

---
 openpower/sv/svp64/appendix.mdwn | 42 ++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/openpower/sv/svp64/appendix.mdwn b/openpower/sv/svp64/appendix.mdwn
index 3ff57f151..75a08fa8c 100644
--- a/openpower/sv/svp64/appendix.mdwn
+++ b/openpower/sv/svp64/appendix.mdwn
@@ -537,6 +537,10 @@ Vector Reduce Mode issues a deterministic tree-reduction schedule to the underly
 (Power ISA v3.0B) operation is leveraged, unmodified, to give the
 *appearance* and *effect* of Reduction.
 
+Vector-result reduction **requires**
+the destination to be a Vector, which will be used to store
+intermediary results.
+
 Given that the tree-reduction schedule is deterministic,
 Interrupts and exceptions
 can therefore also be precise.  The final result will be in the first
@@ -556,6 +560,44 @@ not make sense. Many 3-input instructions (madd, fmadd) unlike Scalar
 Reduction in particular do not make sense, but `ternlogi`, if used
 with care, would.
 
+**Parallel-Reduction with Predication**
+
+To avoid breaking the strict RISC paradigm, in which the Issue-Schedule
+is kept completely separate from the actual element-level (scalar)
+operations, Move operations are **not** included in the Schedule.  This
+means that the Schedule leaves the final (scalar) result in the first
+non-masked element of the Vector used.  With the predicate mask being
+dynamic (but deterministic) this result could be in any element.
+
+If that result needs to be moved to a (single) scalar register
+then a follow-up `sv.mv/sm=predicate rt, ra.v` instruction will be
+needed to fetch it, where the predicate is the exact same predicate used
+in the prior Parallel-Reduction instruction. For *some* implementations
+this may be a slow operation.  It may be better to perform a pre-copy
+of the values, compressing them (VREDUCE-style) into a contiguous block,
+which will guarantee that the result goes into the very first element
+of the destination vector.
+
+**Usage conditions**
+
+The simplest usage is to perform an overwrite, specifying the same
+register for all three operands.
+
+    setvl VL=6
+    sv.add/vr 8.v, 8.v, 8.v
+
+The Reduction Schedule will issue the Parallel Tree Reduction spanning
+registers 8 through 13, by adjusting the offsets to RT, RA and RB as
+necessary (see "Parallel Reduction algorithm" in a later section).
+
+A non-overwrite is possible as well, but just as with the overwrite
+version, only those destination elements necessary for storing
+intermediary computations will be written to: the remaining elements
+will **not** be overwritten and will **not** be zeroed.
+
+    setvl VL=4
+    sv.add/vr 0.v, 8.v, 8.v
+
 ## Sub-Vector Horizontal Reduction
 
 Note that when SVM is clear and SUBVL!=1 the sub-elements are
-- 
2.30.2