From 3249157a0015a7f606849dff63f6ba2bdd8c72c5 Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Mon, 5 Sep 2022 15:57:54 +0100
Subject: [PATCH]

---
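Reviewer note (not part of the commit): the claim added below, that the
result always ends up in the element with the first predicate bit set,
can be checked with a small model. The following is a minimal Python
sketch of a predicated pairwise tree reduction, written from the prose
in this patch and not taken from the normative pseudocode. The function
name `predicated_tree_reduce` and the index-redirection list `ix` are
assumptions made for illustration only.

```python
def predicated_tree_reduce(values, pred, op):
    """Illustrative model only: reduce, in place, the elements of
    `values` whose predicate bit is set, and return the index of the
    element holding the final result."""
    vl = len(values)
    # ix[i] records which element currently holds the partial result
    # for tree slot i (an assumption of this sketch).
    ix = list(range(vl))
    step = 1
    while step < vl:
        step *= 2
        for i in range(0, vl, step):
            other = i + step // 2
            if other >= vl:
                continue
            ci, oi = ix[i], ix[other]
            if pred[ci] and pred[oi]:
                # Both sides live: combine into the left-hand element.
                values[ci] = op(values[ci], values[oi])
            elif pred[oi]:
                # Only the right-hand side is live: redirect the slot,
                # no data is moved or altered.
                ix[i] = oi
    # With no predicate bits set this is 0 and carries no meaning.
    return ix[0]


vals = [1, 2, 3, 4, 5, 6]
where = predicated_tree_reduce(vals, [0, 0, 1, 0, 1, 1], lambda a, b: a + b)
print(where, vals)
```

In this model a single-bit predicate performs no operations at all, and
a multi-bit predicate leaves the result in the first enabled element,
while masked-out elements are never written.
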
 openpower/sv/svp64/appendix.mdwn | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/openpower/sv/svp64/appendix.mdwn b/openpower/sv/svp64/appendix.mdwn
index 13b27ff73..24b370738 100644
--- a/openpower/sv/svp64/appendix.mdwn
+++ b/openpower/sv/svp64/appendix.mdwn
@@ -557,13 +557,27 @@ element of the Vector used.  With the predicate mask being dynamic
 (but deterministic) this result could be anywhere.
 
 If that result is needed to be moved to a (single) scalar register
-then a follow-up `sv.mv/sm=predicate rt, ra.v` instruction will be
+then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be
 needed to get it, where the predicate is the exact same predicate used
-in the prior Parallel-Reduction instruction. For *some* implementations
-this may be a slow operation.  It may be better to perform a pre-copy
+in the prior Parallel-Reduction instruction.
+
+* If only a single bit was set in the predicate then the result
+  will not have been moved or altered: it is still in its original
+  position in the source vector, as it was prior to the Reduction.
+* If more than one bit was set then the result will be in the
+  first element with a predicate bit set.
+
+In either case the result is in the element with the first bit set in
+the predicate mask.
+
+For *some* implementations
+the vector-to-scalar copy may be a slow operation, as may the Predicated
+Parallel Reduction itself.
+It may be better to perform a pre-copy
 of the values, compressing them (VREDUCE-style) into a contiguous block,
 which will guarantee that the result goes into the very first element
-of the destination vector.
+of the destination vector, in which case clearly no follow-up
+vector-to-scalar MV operation is needed.
 
 **Usage conditions**
 
@@ -571,7 +585,7 @@ The simplest usage is to perform an overwrite, specifying all three
 register operands the same.
 
     setvl VL=6
-    sv.add/vr 8.v, 8.v, 8.v
+    sv.add *8, *8, *8
 
 The Reduction Schedule will issue the Parallel Tree Reduction spanning
 registers 8 through 13, by adjusting the offsets to RT, RA and RB as
@@ -583,7 +597,11 @@ intermediary computations will be written to: the remaining elements
 will **not** be overwritten and will **not** be zero'd.
 
     setvl VL=4
-    sv.add/vr 0.v, 8.v, 8.v
+    sv.add *0, *8, *8
+
+However, it is critical to note that if the source and destination
+are not the same then the trick of using a follow-up vector-to-scalar
+MV will not work.
 
 ## Sub-Vector Horizontal Reduction
 
-- 
2.30.2