From: lkcl
Date: Mon, 5 Sep 2022 15:04:50 +0000 (+0100)
Subject: (no commit message)
X-Git-Tag: opf_rfc_ls005_v1~679
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=bc86ed27fd98147a0d2f5c67ce42b5db5b33deae;p=libreriscv.git
---

diff --git a/openpower/sv/svp64/appendix.mdwn b/openpower/sv/svp64/appendix.mdwn
index 24b370738..2084ae9d4 100644
--- a/openpower/sv/svp64/appendix.mdwn
+++ b/openpower/sv/svp64/appendix.mdwn
@@ -518,122 +518,6 @@ parallel optimisation of the scalar reduce operation:
 it's just that as far as the user is concerned,
 all exceptions and interrupts **MUST** be precise.
 
-## Vector result reduce mode
-
-Vector Reduce Mode issues a deterministic tree-reduction schedule to the
-underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
-(Power ISA v3.0B) operation is leveraged, unmodified, to give the
-*appearance* and *effect* of Reduction.
-
-In Horizontal-First Mode, Vector-result reduction **requires**
-the destination to be a Vector, which will be used to store
-intermediary results.
-
-Given that the tree-reduction schedule is deterministic, Interrupts
-and exceptions can therefore also be precise. The final result will
-be in the first non-predicate-masked-out destination element but,
-again due to the deterministic schedule, programmers may find uses
-for the intermediate results.
-
-When Rc=1 a corresponding Vector of co-resultant CRs is also
-created. No special action is taken: the result and its CR Field
-are stored "as usual" exactly as for all other SVP64 Rc=1 operations.
-
-Note that the Schedule only makes sense on top of certain instructions:
-X-Form with a Register Profile of `RT,RA,RB` is fine. As with Scalar
-Reduction, nothing is prohibited: the results of execution on an
-unsuitable instruction may simply not make sense. Unlike Scalar
-Reduction, many 3-input instructions (madd, fmadd) in particular do
-not make sense, but `ternlogi`, if used with care, would.
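The deterministic tree-reduction schedule removed above can be sketched as follows. This is an illustrative model only, not the normative SVP64 pseudocode: the name `tree_reduce`, the pairing strategy, and the overwriting of the lower element of each pair are assumptions of the sketch. It shows the two properties the text relies on: intermediate results remain visible in the vector, and the final result lands in the first non-masked element.

```python
# Hypothetical sketch of a deterministic binary-tree reduction schedule.
# Pairs of active (predicated-in) elements are combined at exponentially
# growing strides; masked-out elements are never read or written.
def tree_reduce(vec, pred, op):
    vl = len(vec)
    # indices of elements whose predicate bit is set, in element order
    active = [i for i in range(vl) if pred & (1 << i)]
    step = 1
    while step < len(active):
        for j in range(0, len(active) - step, step * 2):
            lo, hi = active[j], active[j + step]
            # the unmodified scalar "base" operation, result overwriting
            # the lower element of the pair (an intermediate result)
            vec[lo] = op(vec[lo], vec[hi])
        step *= 2
    return vec

v = tree_reduce([1, 2, 3, 4, 5, 6], 0b111111, lambda a, b: a + b)
# final result in the first non-masked element: v[0] == 21
```

Note that with a single-bit predicate the `while` loop never runs, so the sole active element is left untouched, matching the single-bit case described later in the text.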
-
-**Parallel-Reduction with Predication**
-
-To avoid breaking the strict RISC-paradigm, keeping the Issue-Schedule
-completely separate from the actual element-level (scalar) operations,
-Move operations are **not** included in the Schedule. This means that
-the Schedule leaves the final (scalar) result in the first non-masked
-element of the Vector used. With the predicate mask being dynamic
-(but deterministic) this result could be anywhere.
-
-If that result needs to be moved to a (single) scalar register
-then a follow-up `sv.mv/sm=predicate rt, *ra` instruction will be
-needed to get it, where the predicate is the exact same predicate used
-in the prior Parallel-Reduction instruction.
-
-* If there was only a single bit set in the predicate then the result
-  will not have moved from, or been altered in, the source vector prior
-  to the Reduction.
-* If there was more than one bit set then the result will be in the
-  first element with a predicate bit set.
-
-In either case the result is in the element corresponding to the first
-bit set in the predicate mask.
-
-For *some* implementations the vector-to-scalar copy may be a slow
-operation, as may the Predicated Parallel Reduction itself. It may be
-better to perform a pre-copy of the values, compressing them
-(VREDUCE-style) into a contiguous block, which will guarantee that the
-result goes into the very first element of the destination vector, in
-which case clearly no follow-up vector-to-scalar MV operation is needed.
-
-**Usage conditions**
-
-The simplest usage is to perform an overwrite, specifying all three
-register operands the same.
-
-    setvl VL=6
-    sv.add *8, *8, *8
-
-The Reduction Schedule will issue the Parallel Tree Reduction spanning
-registers 8 through 13, by adjusting the offsets to RT, RA and RB as
-necessary (see "Parallel Reduction algorithm" in a later section).
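The two result-recovery strategies removed above can be modelled in a few lines. This is a hedged illustration: `extract_scalar` and `vreduce_compress` are hypothetical helper names, standing in respectively for the follow-up `sv.mv/sm=predicate rt, *ra` move and for a VREDUCE-style compressing pre-copy; neither is an instruction name.

```python
# Hypothetical model of the follow-up predicated move: the scalar result
# sits at the first set predicate bit, wherever that happens to be.
def extract_scalar(vec, pred):
    for i in range(len(vec)):
        if pred & (1 << i):
            return vec[i]        # first non-masked element holds the result
    return None                  # all-zero predicate: nothing was reduced

# Hypothetical model of the VREDUCE-style pre-copy: packing the active
# elements into a contiguous block guarantees that a subsequent reduction
# leaves its result in element 0, so no follow-up move is needed.
def vreduce_compress(vec, pred):
    return [vec[i] for i in range(len(vec)) if pred & (1 << i)]
```

For example, with predicate `0b0110` over `[9, 8, 7, 6]`, `extract_scalar` returns `8` (element 1, the first set bit) and `vreduce_compress` returns `[8, 7]`.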
-
-A non-overwrite is possible as well but, just as with the overwrite
-version, only those destination elements necessary for storing
-intermediary computations will be written to: the remaining elements
-will **not** be overwritten and will **not** be zero'd.
-
-    setvl VL=4
-    sv.add *0, *8, *8
-
-However it is critical to note that if the source and destination are
-not the same then the trick of using a follow-up vector-to-scalar MV
-will not work.
-
-## Sub-Vector Horizontal Reduction
-
-Note that when SVM is clear and SUBVL!=1 a Parallel Reduction is performed
-on all first Subvector elements, followed by another separate independent
-Parallel Reduction on all the second Subvector elements, and so on.
-
-By contrast, when SVM is set and SUBVL!=1, a Horizontal
-Subvector mode is enabled, applying the Parallel Reduction
-Algorithm to the Subvector Elements. The Parallel Reduction
-is independently applied VL times, to each group of Subvector
-elements. Bear in mind that predication is never applied down
-into individual Subvector elements, but will be applied
-to select whether the *entire* Parallel Reduction on each
-group is performed or not.
-
-    for (i = 0; i < VL; i++)
-        if (predval & 1<<i)
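The Horizontal Subvector loop in the removed text can be expanded into a runnable sketch. For brevity a simple left-to-right reduce stands in for the full Parallel Reduction Algorithm within each group; the name `subvector_hreduce` and the placement of each group's result in its first element are assumptions of this sketch, not normative SVP64 behaviour. What the sketch does preserve from the text is that the predicate selects *whole* groups, never individual subvector elements.

```python
# Hypothetical sketch of Horizontal Subvector mode (SVM set, SUBVL != 1):
# a reduction is applied independently VL times, once per group of SUBVL
# subvector elements, gated by a per-group predicate bit.
def subvector_hreduce(vec, VL, SUBVL, predval, op):
    result = list(vec)
    for i in range(VL):
        if predval & (1 << i):            # whole-group predicate test
            group = vec[i * SUBVL:(i + 1) * SUBVL]
            acc = group[0]
            for x in group[1:]:           # stand-in for the Parallel
                acc = op(acc, x)          # Reduction within this group
            result[i * SUBVL] = acc       # result in group's first element
    return result
```

With `VL=2`, `SUBVL=3` and both predicate bits set, `[1, 2, 3, 4, 5, 6]` reduces group-wise under addition to `6` and `15`, stored in elements 0 and 3; clearing a predicate bit leaves that entire group untouched.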