From d03ee41d9f91be0c26c03a233365f23177f726b6 Mon Sep 17 00:00:00 2001
From: lkcl
Date: Fri, 19 May 2023 19:45:59 +0100
Subject: [PATCH]

---
 openpower/sv/remap.mdwn | 292 ++++++++++++++++++++--------------------
 1 file changed, 143 insertions(+), 149 deletions(-)

diff --git a/openpower/sv/remap.mdwn b/openpower/sv/remap.mdwn
index a9d56ff2d..7eb0328e2 100644
--- a/openpower/sv/remap.mdwn
+++ b/openpower/sv/remap.mdwn
@@ -117,155 +117,6 @@ and briefly goes over their characteristics and limitations.
 Further details on the Deterministic Precise-Interruptible
 algorithms used in these Schedules is found in the
 [[sv/remap/appendix]].
-
-
-### Parallel Reduction
-
-Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
-(Power ISA v3.0B) operation is leveraged, unmodified, to give the
-*appearance* and *effect* of Reduction. Parallel Reduction is not limited
-to Power-of-two but is limited as usual by the total number of
-element operations (127) as well as available register file size.
-
-In Horizontal-First Mode, Vector-result reduction **requires**
-the destination to be a Vector, which will be used to store
-intermediary results, in order to achieve a correct final
-result.
-
-Given that the tree-reduction schedule is deterministic,
-Interrupts and exceptions
-can therefore also be precise. The final result will be in the first
-non-predicate-masked-out destination element, but due again to
-the deterministic schedule programmers may find uses for the intermediate
-results, even for non-commutative Defined Word operations.
-
-Parallel Reduction is unusual in that it requires a full vector array
-of results (not a scalar) and uses the rest of the result Vector for
-the purposes of storing intermediary calculations. As these intermediary
-results are Deterministically computed they may be useful.
-Additionally, because the intermediate results are always written out
-it is possible to service Precise Interrupts without affecting latency
-(a common limitation of Vector ISAs implementing explicit
-Parallel Reduction instructions, because their Architectural State cannot
-hold the partial results).
-
-When Rc=1 a corresponding Vector of co-resultant CRs is also
-created. No special action is taken: the result *and its CR Field*
-are stored "as usual" exactly as all other SVP64 Rc=1 operations.
-
-Note that the Schedule only makes sense on top of certain instructions:
-X-Form with a Register Profile of `RT,RA,RB` is fine because two sources
-and the destination are all the same type. Like Scalar
-Reduction, nothing is prohibited:
-the results of execution on an unsuitable instruction may simply
-not make sense. With care, even 3-input instructions (madd, fmadd, ternlogi)
-may be used, and whilst it is down to the Programmer to walk through the
-process the Programmer can be confident that the Parallel-Reduction is
-guaranteed 100% Deterministic.
-
-Critical to note regarding use of Parallel-Reduction REMAP is that,
-exactly as with all REMAP Modes, the `svshape` instruction *requests*
-a certain Vector Length (number of elements to reduce) and then
-sets VL and MAXVL at the number of **operations** needed to be
-carried out. Thus, equally as importantly, like Matrix REMAP
-the total number of operations
-is restricted to 127. Any Parallel-Reduction requiring more operations
-will need to be done manually in batches (hierarchical
-recursive Reduction).
-
-Also important to note is that the Deterministic Schedule is arranged
-so that some implementations *may* parallelise it (as long as doing so
-respects Program Order and Register Hazards). Performance (speed)
-of any given
-implementation is neither strictly defined or guaranteed. As with
-the Vulkan(tm) Specification, strict compliance is paramount whilst
-performance is at the discretion of Implementors.
-
-**Parallel-Reduction with Predication**
-
-To avoid breaking the strict RISC-paradigm, keeping the Issue-Schedule
-completely separate from the actual element-level (scalar) operations,
-Move operations are **not** included in the Schedule. This means that
-the Schedule leaves the final (scalar) result in the first-non-masked
-element of the Vector used. With the predicate mask being dynamic
-(but deterministic) at a superficial glance it seems this result
-could be anywhere.
-
-If that result is needed to be moved to a (single) scalar register
-then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be
-needed to get it, where the predicate is the exact same predicate used
-in the prior Parallel-Reduction instruction.
-
-* If there was only a single
-  bit in the predicate then the result will not have moved or been altered
-  from the source vector prior to the Reduction
-* If there was more than one bit the result will be in the
-  first element with a predicate bit set.
-
-In either case the result is in the element with the first bit set in
-the predicate mask. Thus, no move/copy *within the Reduction itself* was needed.
-
-Programmer's Note: For *some* hardware implementations
-the vector-to-scalar copy may be a slow operation, as may the Predicated
-Parallel Reduction itself.
-It may be better to perform a pre-copy
-of the values, compressing them (VREDUCE-style) into a contiguous block,
-which will guarantee that the result goes into the very first element
-of the destination vector, in which case clearly no follow-up
-predicated vector-to-scalar MV operation is needed. A VREDUCE effect
-is achieved by setting just a source predicate mask on Twin-Predicated
-operations.
-
-**Usage conditions**
-
-The simplest usage is to perform an overwrite, specifying all three
-register operands the same.
-
-```
-    svshape parallelreduce, 6
-    sv.add *8, *8, *8
-```
-
-The Reduction Schedule will issue the Parallel Tree Reduction spanning
-registers 8 through 13, by adjusting the offsets to RT, RA and RB as
-necessary (see "Parallel Reduction algorithm" in a later section).
-
-A non-overwrite is possible as well but just as with the overwrite
-version, only those destination elements necessary for storing
-intermediary computations will be written to: the remaining elements
-will **not** be overwritten and will **not** be zero'd.
-
-```
-    svshape parallelreduce, 6
-    sv.add *0, *8, *8
-```
-
-However it is critical to note that if the source and destination are
-not the same then the trick of using a follow-up vector-scalar MV will
-not work.
-
-### Sub-Vector Horizontal Reduction
-
-To achieve Sub-Vector Horizontal Reduction, Pack/Unpack should be enabled,
-which will turn the Schedule around such that issuing of the Scalar
-Defined Words is done with SUBVL looping as the inner loop not the
-outer loop. Rc=1 with Sub-Vectors (SUBVL=2,3,4) is `UNDEFINED` behaviour.
-
-*Programmer's Note: Overwrite Parallel Reduction with Sub-Vectors
-will clearly result in data corruption. It may be best to perform
-a Pack/Unpack Transposing copy of the data first*
-
-### Parallel Prefix Sum
-
-This is a work-efficient Parallel Schedule that for example produces Trangular
-or Factorial number sequences. Half of the Prefix Sum Schedule is near-identical
-to Parallel Reduction. Whilst the Arithmetic mapreduce Mode (`/mr`) may achieve the same
-end-result, implementations may only implement Mapreduce in serial form (or give
-the appearance to Programmers of the same). The Parallel Prefix Schedule is
-*required* to be implemented in such a way that its Deterministic Schedule may be
-parallelised. Like the Reduction Schedule it is 100% Deterministic and consequently
-may be used with non-commutative operations.
-
 ## Determining Register Hazards
 
 For high-performance (Multi-Issue, Out-of-Order) systems it is critical
@@ -446,6 +297,149 @@ Creates the Schedules for Parallel Tree Reduction and Prefix-Sum
   When clear the step will begin at 2 and double on each
   inner loop.
 
+**Parallel Prefix Sum**
+
+This is a work-efficient Parallel Schedule that for example produces Triangular
+or Factorial number sequences. Half of the Prefix Sum Schedule is near-identical
+to Parallel Reduction. Whilst the Arithmetic mapreduce Mode (`/mr`) may achieve the same
+end-result, implementations may only implement Mapreduce in serial form (or give
+the appearance to Programmers of the same). The Parallel Prefix Schedule is
+*required* to be implemented in such a way that its Deterministic Schedule may be
+parallelised. Like the Reduction Schedule, it is 100% Deterministic and consequently
+may be used with non-commutative operations.
+The Schedule Algorithm may be found in the [[sv/remap/appendix]].
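+
+*Note (illustrative, non-normative): the Python sketch below shows one
+possible work-efficient pair-wise schedule of the kind described above,
+an up-sweep of the same tree shape as a Parallel Reduction followed by
+a down-sweep that propagates the partial sums. The normative Schedule
+Algorithm is the one in the [[sv/remap/appendix]]; the function name
+here is purely illustrative.*
+
+```
+def prefix_sum_schedule(n):
+    """Illustrative (RA, RT) pair schedule for an in-place,
+       work-efficient inclusive prefix-sum over n elements."""
+    ops = []
+    stride = 1
+    while stride < n:                    # up-sweep: tree-reduction shape
+        for i in range(stride * 2 - 1, n, stride * 2):
+            ops.append((i - stride, i))  # element[i] += element[i-stride]
+        stride *= 2
+    stride //= 2
+    while stride >= 1:                   # down-sweep: propagate partials
+        for i in range(stride * 3 - 1, n, stride * 2):
+            ops.append((i - stride, i))
+        stride //= 2
+    return ops
+
+vec = [1, 2, 3, 4]
+for ra, rt in prefix_sum_schedule(len(vec)):
+    vec[rt] += vec[ra]
+print(vec)                               # [1, 3, 6, 10]: Triangular numbers
+```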
+
+**Parallel Reduction**
+
+Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture. Like Scalar Reduction, the "Scalar Base"
+(Power ISA v3.0B) operation is leveraged, unmodified, to give the
+*appearance* and *effect* of Reduction. Parallel Reduction is not limited
+to Power-of-two sizes but is limited as usual by the total number of
+element operations (127) as well as by available register file size.
+
+In Horizontal-First Mode, Vector-result reduction **requires**
+the destination to be a Vector, which will be used to store
+intermediary results, in order to achieve a correct final
+result.
+
+Given that the tree-reduction schedule is deterministic,
+Interrupts and exceptions
+can also be precise. The final result will be in the first
+non-predicate-masked-out destination element, but due again to
+the deterministic schedule, programmers may find uses for the intermediate
+results, even for non-commutative Defined Word operations.
+Additionally, because the intermediate results are always written out,
+it is possible to service Precise Interrupts without affecting latency
+(a common limitation of Vector ISAs implementing explicit
+Parallel Reduction instructions, because their Architectural State cannot
+hold the partial results).
+
+When Rc=1 a corresponding Vector of co-resultant CRs is also
+created. No special action is taken: the result *and its CR Field*
+are stored "as usual", exactly as with all other SVP64 Rc=1 operations.
+
+Note that the Schedule only makes sense on top of certain instructions:
+X-Form with a Register Profile of `RT,RA,RB` is fine because the two sources
+and the destination are all the same type. Like Scalar
+Reduction, nothing is prohibited:
+the results of execution on an unsuitable instruction may simply
+not make sense. With care, even 3-input instructions (madd, fmadd, ternlogi)
+may be used, and whilst it is down to the Programmer to walk through the
+process, the Programmer can be confident that the Parallel-Reduction is
+guaranteed 100% Deterministic.
+
+Critical to note regarding use of Parallel-Reduction REMAP is that,
+exactly as with all REMAP Modes, the `svshape` instruction *requests*
+a certain Vector Length (the number of elements to reduce) and then
+sets VL and MAXVL to the number of **operations** needed to be
+carried out. Thus, equally importantly, as with Matrix REMAP
+the total number of operations
+is restricted to 127. Any Parallel-Reduction requiring more operations
+will need to be done manually in batches (hierarchical
+recursive Reduction).
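+
+*Note (illustrative, non-normative): the batching arithmetic for a
+hierarchical recursive Reduction can be sketched in a few lines of
+Python. The helper `reduce_batch` below is hypothetical and simply
+stands for one `svshape parallelreduce` plus the reduction instruction
+issued over a single batch; a batch of n elements requires n-1
+operations, so batches of up to 127 elements stay within the limit.*
+
+```
+from operator import add
+
+def reduce_batch(elements, op):
+    # hypothetical stand-in for: svshape parallelreduce, len(elements)
+    # followed by the reduction instruction over that batch
+    acc = elements[0]
+    for x in elements[1:]:
+        acc = op(acc, x)
+    return acc
+
+def hierarchical_reduce(vec, op, max_batch=127):
+    """Reduce an arbitrarily long vector by reducing batches of at
+       most max_batch elements, then reducing the partial results,
+       repeating until a single value remains."""
+    while len(vec) > 1:
+        vec = [reduce_batch(vec[i:i + max_batch], op)
+               for i in range(0, len(vec), max_batch)]
+    return vec[0]
+
+print(hierarchical_reduce(list(range(1, 1001)), add))  # 500500
+```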
+
+Also important to note is that the Deterministic Schedule is arranged
+so that some implementations *may* parallelise it (as long as doing so
+respects Program Order and Register Hazards). Performance (speed)
+of any given
+implementation is neither strictly defined nor guaranteed. As with
+the Vulkan(tm) Specification, strict compliance is paramount whilst
+performance is at the discretion of Implementors.
+
+**Parallel-Reduction with Predication**
+
+To avoid breaking the strict RISC-paradigm, keeping the Issue-Schedule
+completely separate from the actual element-level (scalar) operations,
+Move operations are **not** included in the Schedule. This means that
+the Schedule leaves the final (scalar) result in the first non-masked
+element of the Vector used. With the predicate mask being dynamic
+(but deterministic), at a superficial glance it seems this result
+could be anywhere.
+
+If that result needs to be moved to a (single) scalar register
+then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be
+needed to get it, where the predicate is the exact same predicate used
+in the prior Parallel-Reduction instruction.
+
+* If there was only a single
+  bit set in the predicate then the result will not have moved or been altered
+  from the source vector prior to the Reduction.
+* If there was more than one bit set, the result will be in the
+  first element with a predicate bit set.
+
+In either case the result is in the element with the first bit set in
+the predicate mask. Thus, no move/copy *within the Reduction itself* was needed.
+
+Programmer's Note: For *some* hardware implementations
+the vector-to-scalar copy may be a slow operation, as may the Predicated
+Parallel Reduction itself.
+It may be better to perform a pre-copy
+of the values, compressing them (VREDUCE-style) into a contiguous block,
+which will guarantee that the result goes into the very first element
+of the destination vector, in which case clearly no follow-up
+predicated vector-to-scalar MV operation is needed. A VREDUCE effect
+is achieved by setting just a source predicate mask on Twin-Predicated
+operations.
+
+**Usage conditions**
+
+The simplest usage is to perform an overwrite, specifying all three
+register operands the same.
+
+```
+    svshape parallelreduce, 6
+    sv.add *8, *8, *8
+```
+
+The Reduction Schedule will issue the Parallel Tree Reduction spanning
+registers 8 through 13, by adjusting the offsets to RT, RA and RB as
+necessary (see "Parallel Reduction algorithm" in a later section).
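+
+*Note (illustrative, non-normative): the sketch below prints one
+possible set of (RT, RA, RB) triples for the six-element example above,
+spanning registers 8 to 13. The pairing order shown is that of a simple
+binary tree and is not necessarily the exact issue order of the
+normative Schedule; it is intended only to show how the register
+offsets are adjusted and that five operations result from six elements.*
+
+```
+def tree_reduce_triples(base, n):
+    """One possible (RT, RA, RB) schedule for reducing n elements
+       starting at register base, leaving the result in the first."""
+    ops = []
+    stride = 1
+    while stride < n:
+        for i in range(0, n - stride, stride * 2):
+            ops.append((base + i, base + i, base + i + stride))
+        stride *= 2
+    return ops
+
+for rt, ra, rb in tree_reduce_triples(8, 6):
+    print(f"add r{rt}, r{ra}, r{rb}")
+# add r8, r8, r9 / add r10, r10, r11 / add r12, r12, r13
+# add r8, r8, r10 / add r8, r8, r12  (5 operations for 6 elements)
+```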
+
+A non-overwrite is possible as well but, just as with the overwrite
+version, only those destination elements necessary for storing
+intermediary computations will be written to: the remaining elements
+will **not** be overwritten and will **not** be zero'd.
+
+```
+    svshape parallelreduce, 6
+    sv.add *0, *8, *8
+```
+
+However, it is critical to note that if the source and destination are
+not the same then the trick of using a follow-up vector-scalar MV will
+not work.
+
+**Sub-Vector Horizontal Reduction**
+
+To achieve Sub-Vector Horizontal Reduction, Pack/Unpack should be enabled,
+which will turn the Schedule around such that issuing of the Scalar
+Defined Words is done with SUBVL looping as the inner loop, not the
+outer loop. Rc=1 with Sub-Vectors (SUBVL=2,3,4) is `UNDEFINED` behaviour.
+
+*Programmer's Note: Overwrite Parallel Reduction with Sub-Vectors
+will clearly result in data corruption. It may be best to perform
+a Pack/Unpack Transposing copy of the data first.*
+
 ## FFT/DCT mode
 
 submode2=0 is for FFT. For FFT submode the following schedules may be
-- 
2.30.2