(but deterministic) this result could be anywhere.
If that result needs to be moved to a (single) scalar register
-then a follow-up `sv.mv/sm=predicate rt, ra.v` instruction will be
+then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be
needed to get it, where the predicate is the exact same predicate used
-in the prior Parallel-Reduction instruction. For *some* implementations
-this may be a slow operation. It may be better to perform a pre-copy
+in the prior Parallel-Reduction instruction.
+
+* If only a single bit was set in the predicate then the result will not
+ have moved or been altered from the source vector prior to the Reduction.
+* If more than one bit was set then the result will be in the
+ first element with a predicate bit set.
+
+In either case the result is in the element with the first bit set in
+the predicate mask.
+
+For *some* implementations
+the vector-to-scalar copy may be a slow operation, as may the Predicated
+Parallel Reduction itself.
+It may be better to perform a pre-copy
of the values, compressing them (VREDUCE-style) into a contiguous block,
which will guarantee that the result goes into the very first element
-of the destination vector.
+of the destination vector, in which case clearly no follow-up
+vector-to-scalar MV operation is needed.
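
The behaviour described above can be illustrated with a small software model. This is only a sketch under stated assumptions: it uses a plain binary-tree schedule with an index-tracking array, and the names `predicated_tree_reduce`, `ix`, `vec` and `pred` are hypothetical, chosen for illustration. The Reduction Schedule of a real implementation may differ in detail.

```python
def predicated_tree_reduce(vec, pred):
    """Sum, in place, the elements of vec whose predicate bit is set.

    Returns the index of the element holding the final result,
    or None if no predicate bit is set. Illustrative model only.
    """
    vl = len(vec)
    ix = list(range(vl))  # ix[i]: element currently holding slot i's partial result
    step = 1
    while step < vl:
        step *= 2
        for i in range(0, vl, step):
            other = i + step // 2
            if other >= vl:
                continue  # no partner at this level of the tree
            ci, oi = ix[i], ix[other]
            if pred[ci] and pred[oi]:
                vec[ci] += vec[oi]  # both sides active: combine into the left
            elif pred[oi]:
                ix[i] = oi  # left side masked out: track the right side instead
    return ix[0] if pred[ix[0]] else None


# More than one predicate bit: the result lands in the first enabled element.
vec = [1, 2, 3, 4, 5, 6]
pred = [0, 0, 1, 0, 1, 1]
where = predicated_tree_reduce(vec, pred)

# Pre-copy (compress) the enabled elements into a contiguous block first:
# the result is then in element 0 and no predicated copy-out is needed.
packed = [v for v, p in zip([1, 2, 3, 4, 5, 6], pred) if p]
where_packed = predicated_tree_reduce(packed, [1] * len(packed))
```

In this model `where` is the index that a follow-up predicated vector-to-scalar copy would have to read from, whereas after compressing first the result is always at index 0.
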
**Usage conditions**
register operands the same.
setvl VL=6
- sv.add/vr 8.v, 8.v, 8.v
+ sv.add *8, *8, *8
The Reduction Schedule will issue the Parallel Tree Reduction spanning
registers 8 through 13, by adjusting the offsets to RT, RA and RB as
will **not** be overwritten and will **not** be zero'd.
setvl VL=4
- sv.add/vr 0.v, 8.v, 8.v
+ sv.add *0, *8, *8
+
+However, it is critical to note that if the source and destination are
+not the same then the trick of using a follow-up vector-to-scalar MV will
+not work.
## Sub-Vector Horizontal Reduction