From eacf4b92e2508c129345ba0ac0699376ee9a812c Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Wed, 22 Jun 2022 11:05:55 +0100
Subject: [PATCH] re-add sub-vector horizontal reduction

---
 openpower/sv/svp64/appendix.mdwn | 45 ++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/openpower/sv/svp64/appendix.mdwn b/openpower/sv/svp64/appendix.mdwn
index 208055c08..3ab14e84f 100644
--- a/openpower/sv/svp64/appendix.mdwn
+++ b/openpower/sv/svp64/appendix.mdwn
@@ -505,6 +505,51 @@ not make sense. Many 3-input instructions (madd, fmadd) unlike
 Scalar Reduction in particular do not make sense, but `ternlogi`, if used
 with care, would.
 
+## Sub-Vector Horizontal Reduction
+
+Note that when SVM is clear and SUBVL!=1 the sub-elements are
+*independent*, and as a result they are mapreduced per *sub-element*.
+An illustration with a vec2, assuming RA==RT (e.g. `sv.add/mr/vec2 r4, r4, r16`):
+
+    for i in range(0, VL):
+        # RA==RT in this instruction, but it does not have to be
+        iregs[RT].x = op(iregs[RT].x, iregs[RB+i].x)
+        iregs[RT].y = op(iregs[RT].y, iregs[RB+i].y)
+
+Thus logically there is nothing special or unanticipated about
+`SVM=0`: it is expected behaviour according to standard SVP64
+Sub-Vector rules.
+
+By contrast, when SVM is set and SUBVL!=1, a Horizontal
+Subvector mode is enabled, which behaves much more
+like a traditional Vector Processor Reduction instruction.
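+
+As a rough sketch (not the normative SVP64 pseudocode), the two modes can
+be contrasted with a small Python model. Assumptions: a register holding a
+sub-vector is modelled as a tuple of SUBVL sub-elements; predication, Rc=1
+and interruption are ignored; and the names `mapreduce_svm0` and
+`horizontal_svm1` are hypothetical:
+
+```python
+import operator
+from typing import Callable, List, Sequence, Tuple
+
+# Hypothetical model, not normative SVP64 pseudocode: a register that
+# holds a sub-vector is a tuple of SUBVL sub-elements (x, y, z, w).
+SubVec = Tuple[int, ...]
+BinOp = Callable[[int, int], int]
+
+
+def mapreduce_svm0(op: BinOp, rt: SubVec, rb: Sequence[SubVec]) -> SubVec:
+    """SVM=0: each sub-element lane is reduced independently into the
+    scalar sub-vector destination RT (RA==RT assumed)."""
+    acc = list(rt)
+    for vec in rb:                      # main VL loop over RB+i
+        for j, elem in enumerate(vec):  # lanes .x, .y, ... stay separate
+            acc[j] = op(acc[j], elem)
+    return tuple(acc)
+
+
+def horizontal_svm1(op: BinOp, ra: Sequence[SubVec]) -> List[int]:
+    """SVM=1: each sub-vector RA+i is folded into one result RT+i,
+    with the destination acting as the accumulator."""
+    rt = []
+    for vec in ra:                      # main VL loop
+        if len(vec) < 2:
+            raise ValueError("horizontal mode requires SUBVL != 1")
+        acc = op(vec[0], vec[1])        # .x op .y
+        for elem in vec[2:]:            # then .z (vec3) and .w (vec4)
+            acc = op(acc, elem)
+        rt.append(acc)
+    return rt
+
+
+if __name__ == "__main__":
+    print(mapreduce_svm0(operator.add, (0, 0), [(1, 10), (2, 20), (3, 30)]))
+    print(horizontal_svm1(operator.add, [(1, 2), (3, 4, 5), (1, 2, 3, 4)]))
+```
+
+Run directly, the sketch prints `(6, 60)` and `[3, 12, 10]`: the SVM=0
+model produces one sub-vector (the scalar destination RT), whereas the
+SVM=1 model produces VL scalar results, one per source sub-vector.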
+
+Example for a vec2:
+
+    for i in range(VL):
+        iregs[RT+i] = op(iregs[RA+i].x, iregs[RA+i].y)
+
+Example for a vec3:
+
+    for i in range(VL):
+        iregs[RT+i] = op(iregs[RA+i].x, iregs[RA+i].y)
+        iregs[RT+i] = op(iregs[RT+i] , iregs[RA+i].z)
+
+Example for a vec4:
+
+    for i in range(VL):
+        iregs[RT+i] = op(iregs[RA+i].x, iregs[RA+i].y)
+        iregs[RT+i] = op(iregs[RT+i] , iregs[RA+i].z)
+        iregs[RT+i] = op(iregs[RT+i] , iregs[RA+i].w)
+
+In this mode, when Rc=1 the Vector of CRs behaves as normal: each result
+element creates a corresponding CR element (for the final, reduced, result).
+
+Note that the destination (RT) is automatically used as an "Accumulator"
+register, and consequently the Sub-Vector Loop is interruptible.
+If RT is a Scalar then, as usual, the main VL Loop terminates at the
+first predicated element (or the first element if unpredicated).
+
 # Fail-on-first
 
 Data-dependent fail-on-first has two distinct variants: one for LD/ST
-- 
2.30.2