From 3249157a0015a7f606849dff63f6ba2bdd8c72c5 Mon Sep 17 00:00:00 2001
From: lkcl
Date: Mon, 5 Sep 2022 15:57:54 +0100
Subject: [PATCH]

---
 openpower/sv/svp64/appendix.mdwn | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/openpower/sv/svp64/appendix.mdwn b/openpower/sv/svp64/appendix.mdwn
index 13b27ff73..24b370738 100644
--- a/openpower/sv/svp64/appendix.mdwn
+++ b/openpower/sv/svp64/appendix.mdwn
@@ -557,13 +557,27 @@ element of the Vector used. With the predicate mask being dynamic
 (but deterministic) this result could be anywhere.
 
 If that result is needed to be moved to a (single) scalar register
-then a follow-up `sv.mv/sm=predicate rt, ra.v` instruction will be
+then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be
 needed to get it, where the predicate is the exact same predicate used
-in the prior Parallel-Reduction instruction. For *some* implementations
-this may be a slow operation. It may be better to perform a pre-copy
+in the prior Parallel-Reduction instruction.
+
+* If there was only a single
+  bit in the predicate then the result will not have moved or been altered
+  from the source vector prior to the Reduction
+* If there was more than one bit the result will be in the
+  first element with a predicate bit set.
+
+In either case the result is in the element with the first bit set in
+the predicate mask.
+
+For *some* implementations
+the vector-to-scalar copy may be a slow operation, as may the Predicated
+Parallel Reduction itself.
+It may be better to perform a pre-copy
 of the values, compressing them (VREDUCE-style) into a contiguous block,
 which will guarantee that the result goes into the very first element
-of the destination vector.
+of the destination vector, in which case clearly no follow-up
+vector-to-scalar MV operation is needed.
 
 **Usage conditions**
 
@@ -571,7 +585,7 @@ The simplest usage is to perform an overwrite, specifying all three
 register operands the same.
 
     setvl VL=6
-    sv.add/vr 8.v, 8.v, 8.v
+    sv.add *8, *8, *8
 
 The Reduction Schedule will issue the Parallel Tree Reduction spanning
 registers 8 through 13, by adjusting the offsets to RT, RA and RB as
@@ -583,7 +597,11 @@ intermediary computations will be written to: the remaining elements will
 **not** be overwritten and will **not** be zero'd.
 
     setvl VL=4
-    sv.add/vr 0.v, 8.v, 8.v
+    sv.add *0, *8, *8
+
+However it is critical to note that if the source and destination are
+not the same then the trick of using a follow-up vector-scalar MV will
+not work.
 
 ## Sub-Vector Horizontal Reduction
 
-- 
2.30.2