(no commit message)

author lkcl <lkcl@web>

Mon, 5 Sep 2022 14:57:54 +0000 (15:57 +0100)

committer IkiWiki <ikiwiki.info>

Mon, 5 Sep 2022 14:57:54 +0000 (15:57 +0100)
diff --git a/openpower/sv/svp64/appendix.mdwn b/openpower/sv/svp64/appendix.mdwn

index 13b27ff73e5748a848d4edeae2b47fc9570ac920..24b370738e14646406657bc8bef327a10facc925 100644
--- a/openpower/sv/svp64/appendix.mdwn
+++ b/openpower/sv/svp64/appendix.mdwn
@@ -557,13 +557,27 @@ element of the Vector used.  With the predicate mask being dynamic
  (but deterministic) this result could be anywhere.
  
  If that result is needed to be moved to a (single) scalar register
-then a follow-up `sv.mv/sm=predicate rt, ra.v` instruction will be
+then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be
  needed to get it, where the predicate is the exact same predicate used
-in the prior Parallel-Reduction instruction. For *some* implementations
-this may be a slow operation.  It may be better to perform a pre-copy
+in the prior Parallel-Reduction instruction.
+
+* If there was only a single
+  bit in the predicate then the result will not have moved or been altered
+  from the source vector prior to the Reduction
+* If there was more than one bit the result will be in the
+  first element with a predicate bit set.
+
+In either case the result is in the element with the first bit set in
+the predicate mask.
+
+For *some* implementations
+the vector-to-scalar copy may be a slow operation, as may the Predicated
+Parallel Reduction itself.
+It may be better to perform a pre-copy
  of the values, compressing them (VREDUCE-style) into a contiguous block,
  which will guarantee that the result goes into the very first element
-of the destination vector.
+of the destination vector, in which case clearly no follow-up
+vector-to-scalar MV operation is needed.
  
  **Usage conditions**
  
@@ -571,7 +585,7 @@ The simplest usage is to perform an overwrite, specifying all three
  register operands the same.
  
      setvl VL=6
-    sv.add/vr 8.v, 8.v, 8.v
+    sv.add *8, *8, *8
  
  The Reduction Schedule will issue the Parallel Tree Reduction spanning
  registers 8 through 13, by adjusting the offsets to RT, RA and RB as
@@ -583,7 +597,11 @@ intermediary computations will be written to: the remaining elements
  will **not** be overwritten and will **not** be zero'd.
  
      setvl VL=4
-    sv.add/vr 0.v, 8.v, 8.v
+    sv.add *0, *8, *8
+
+However it is critical to note that if the source and destination are
+not the same then the trick of using a follow-up vector-scalar MV will
+not work.
  
  ## Sub-Vector Horizontal Reduction
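
The first hunk above states that a Predicated Parallel Reduction leaves its result in the element with the first bit set in the predicate mask, and that a single-bit mask leaves the source vector untouched. The sketch below is a simplified Python model of that behaviour, not the normative Reduction Schedule. It assumes that masked-out elements are skipped by remapping indices rather than by moving data; the names `predicated_tree_reduce` and `ix` are hypothetical and chosen for illustration only.

```python
import operator


def predicated_tree_reduce(vec, pred, op=operator.add):
    """Model an in-place predicated tree reduction over `vec`.

    Returns the index of the element holding the result, or None when
    no predicate bit is set. Illustrative model only: it assumes that
    masked-out elements are skipped by index remapping, not by copying.
    """
    vl = len(vec)
    ix = list(range(vl))  # ix[i]: element currently standing in for slot i
    step = 1
    while step < vl:
        step *= 2
        for i in range(0, vl, step):
            other = i + step // 2
            if other >= vl:
                continue  # no partner at this tree level
            ci, oi = ix[i], ix[other]
            if pred[ci] and pred[oi]:
                # both sides active: combine into the left-hand element
                vec[ci] = op(vec[ci], vec[oi])
            elif pred[oi]:
                # only the right side is active: it now represents slot i
                ix[i] = oi
    return ix[0] if vl and pred[ix[0]] else None


# Several bits set: the result lands in the first element whose bit is set.
values = [10, 20, 30, 40, 50, 60]
mask = [0, 1, 0, 1, 1, 0]
where = predicated_tree_reduce(values, mask)

# Single bit set: nothing is combined, so the vector is left unaltered.
single = [1, 2, 3, 4]
single_where = predicated_tree_reduce(single, [0, 0, 1, 0])
```

In this model the follow-up vector-to-scalar move amounts to reading `values[where]`, which is why the same predicate has to be reused for that move. The model only covers the overwrite case where source and destination are the same vector.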