From b70a933162b3fb192c4385822c1e0ad68565209e Mon Sep 17 00:00:00 2001
From: lkcl
Date: Mon, 21 Dec 2020 03:47:40 +0000
Subject: [PATCH]

---
 openpower/sv/svp_rewrite/svp64.mdwn | 287 ++++++++++++++--------------
 1 file changed, 144 insertions(+), 143 deletions(-)

diff --git a/openpower/sv/svp_rewrite/svp64.mdwn b/openpower/sv/svp_rewrite/svp64.mdwn
index 8c5fafd79..616dd2e71 100644
--- a/openpower/sv/svp_rewrite/svp64.mdwn
+++ b/openpower/sv/svp_rewrite/svp64.mdwn
@@ -232,149 +232,6 @@ Fields:
 * **SVM** sets "subvector" reduce mode
 * **N** sets signed/unsigned saturation.
 
-## Rounding, clamp and saturate
-
-One of the issues with vector ops is that in integer DSP ops for example
-in Audio the operation must clamp or saturate rather than overflow or
-ignore the upper bits and become a modulo operation. This for Audio
-is extremely important, also to provide an indicator as to whether
-saturation occurred. see [[av_opcodes]].
-
-To help ensure that audio quality is not compromised by overflow,
-"saturation" is provided, as well as a way to detect when saturation
-occurred (Rc=1). When Rc=1 there will be a *vector* of CRs, one CR per
-element in the result (Note: this is different from VSX which has a
-single CR per block).
-
-When N=0 the result is saturated to within the maximum range of an
-unsigned value. For integer ops this will be 0 to 2^elwidth-1. Similar
-logic applies to FP operations, with the result being saturated to
-maximum rather than returning INF.
-
-When N=1 the same occurs except that the result is saturated to the min
-or max of a signed result.
-
-When Rc=1, the CR "overflow" bit is set on the CR associated with the
-element, to indicate whether saturation occurred. Note that due to
-the hugely detrimental effect it has on parallel processing, XER.SO is
-**ignored** completely and is **not** brought into play here. The CR
-overflow bit is therefore simply set to zero if saturation did not occur,
-and to one if it did.
-
-Post-analysis of the Vector of CRs to find out if any given element hit
-saturation may be done using a mapreduced CR op (cror), or by using the
-new crweird instruction, transferring the relevant CR bits to a scalar
-integer and testing it for nonzero. see [[sv/cr_int_predication]]
-
-
-## Reduce mode
-
-1. limited to single predicated dual src operations (add RT, RA, RB) and
-   to triple source operations where one of the inputs is set to a scalar
-   (these are rare)
-2. limited to operations that make sense. divide is excluded, as is
-   subtract (X - Y - Z produces different answers depending on the order).
-   sane operations: multiply, add, logical bitwise OR, CR operations.
-   operations that do not return the same register type are also excluded
-   (isel, cmp)
-3. the destination is a vector but the result is stored, ultimately,
-   in the first nonzero predicated element. all other nonzero predicated
-   elements are undefined. *this includes the CR vector* when Rc=1
-4. implementations may use any ordering and any algorithm to reduce
-   down to a single result. However it must be equivalent to a straight
-   application of mapreduce. The destination vector (except masked out
-   elements) may be used for storing any intermediate results. these may
-   be left in the vector (undefined).
-5. CRM applies when Rc=1. When CRM is zero, the CR associated with
-   the result is regarded as a "some results met standard CR result
-   criteria". When CRM is one, this changes to "all results met standard
-   CR criteria".
-6. implementations MAY use destoffs as well as srcoffs (see [[sv/sprs]])
-   in order to store sufficient state to resume operation should an
-   interrupt occur. this is also why implementations are permitted to use
-   the destination vector to store intermediary computations
-
-TODO: Rc=1 on Scalar Logical Operations? is this possible? was space
-reserved in Logical Ops?
-
-Pseudocode for the case where RA==RB:
-
-    result = op(iregs[RA], iregs[RA+1])
-    CR = analyse(result)
-    for i in range(2, VL):
-        result = op(result, iregs[RA+i])
-        CRnew = analyse(result)
-        if Rc=1
-            if CRM:
-                CR = CR bitwise or CRnew
-            else:
-                CR = CR bitwise AND CRnew
-
-TODO: case where RA!=RB which involves first a vector of 2-operand
-results followed by a mapreduce on the intermediates.
-
-Note that when SUBVL!=1 the sub-elements are *independent*, i.e. they
-are mapreduced per *sub-element* as a result. illustration with a vec2:
-
-    result.x = op(iregs[RA].x, iregs[RA+1].x)
-    result.y = op(iregs[RA].y, iregs[RA+1].y)
-    for i in range(2, VL):
-        result.x = op(result.x, iregs[RA+i].x)
-        result.y = op(result.y, iregs[RA+i].y)
-
-When SVM is set and SUBVL!=1, another variant is enabled, which switches
-to `RM-2P-2S1D` such that different elwidths may be applied to src
-and dest.
-
-    for i in range(VL):
-        result = op(iregs[RA+i].x, iregs[RA+i].x)
-        result = op(result, iregs[RA+i].z)
-        result = op(result, iregs[RA+i].z)
-        iregs[RT+i] = result
-
-
-
-## Fail-on-first
-
-Data-dependent fail-on-first has two distinct variants: one for LD/ST,
-the other for arithmetic operations (actually, CR-driven). Note in each
-case the assumption is that vector elements are required appear to be
-executed in sequential Program Order, element 0 being the first.
-
-* LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
-  ordinary one. Exceptions occur "as normal". However for elements 1
-  and above, if an exception would occur, then VL is **truncated** to the
-  previous element.
-* Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
-  CR-creating operation produces a result (including cmp). Similar to
-  branch, an analysis of the CR is performed and if the test fails, the
-  vector operation terminates and discards all element operations at and
-  above the current one, and VL is truncated to the *previous* element.
-  Thus the new VL comprises a contiguous vector of results, all of which
-  pass the testing criteria (equal to zero, less than zero).
-
-The CR-based data-driven fail-on-first is new and not found in ARM SVE
-or RVV. It is extremely useful for reducing instruction count, however
-requires speculative execution involving modifications of VL to get high
-performance implementations.
-
-In CR-based data-driven fail-on-first there is only the option to select
-and test one bit of each CR (just as with branch BO). For more complex
-tests this may be insufficient. If that is the case, a vectorised crops
-(crand, cror) may be used, and ffirst applied to the crop instead of to
-the arithmetic vector.
-
-One extremely important aspect of ffirst is:
-
-* LDST ffirst may never set VL equal to zero. This because on the first
-  element an exception must be raised "as normal".
-* CR-based data-dependent ffirst on the other hand **can** set VL equal
-  to zero. This is the only means in the entirety of SV that VL may be set
-  to zero (with the exception of via the SV.STATE SPR). When VL is set
-  zero due to the first element failing the CR bit-test, all subsequent
-  vectorised operations are effectively `nops` which is
-  *precisely the desired and intended behaviour*.
-
 # R\*\_EXTRA2 and R\*\_EXTRA3 Encoding
 
 EXTRA is the means by which two things are achieved:
@@ -626,6 +483,150 @@ Additional unusual capabilities of Twin Predication include a back-to-back
 version of VCOMPRESS-VEXPAND which is effectively the ability to do an
 ordered multiple VINSERT.
 
+## Rounding, clamp and saturate
+
+One of the issues with vector ops is that in integer DSP ops for example
+in Audio the operation must clamp or saturate rather than overflow or
+ignore the upper bits and become a modulo operation. This for Audio
+is extremely important, also to provide an indicator as to whether
+saturation occurred. see [[av_opcodes]].
+
+To help ensure that audio quality is not compromised by overflow,
+"saturation" is provided, as well as a way to detect when saturation
+occurred (Rc=1). When Rc=1 there will be a *vector* of CRs, one CR per
+element in the result (Note: this is different from VSX which has a
+single CR per block).
+
+When N=0 the result is saturated to within the maximum range of an
+unsigned value. For integer ops this will be 0 to 2^elwidth-1. Similar
+logic applies to FP operations, with the result being saturated to
+maximum rather than returning INF.
+
+When N=1 the same occurs except that the result is saturated to the min
+or max of a signed result.
+
+When Rc=1, the CR "overflow" bit is set on the CR associated with the
+element, to indicate whether saturation occurred. Note that due to
+the hugely detrimental effect it has on parallel processing, XER.SO is
+**ignored** completely and is **not** brought into play here. The CR
+overflow bit is therefore simply set to zero if saturation did not occur,
+and to one if it did.
+
+Post-analysis of the Vector of CRs to find out if any given element hit
+saturation may be done using a mapreduced CR op (cror), or by using the
+new crweird instruction, transferring the relevant CR bits to a scalar
+integer and testing it for nonzero. see [[sv/cr_int_predication]]
+
+
+## Reduce mode
+
+1. limited to single predicated dual src operations (add RT, RA, RB) and
+   to triple source operations where one of the inputs is set to a scalar
+   (these are rare)
+2. limited to operations that make sense. divide is excluded, as is
+   subtract (X - Y - Z produces different answers depending on the order).
+   sane operations: multiply, add, logical bitwise OR, CR operations.
+   operations that do not return the same register type are also excluded
+   (isel, cmp)
+3. the destination is a vector but the result is stored, ultimately,
+   in the first nonzero predicated element. all other nonzero predicated
+   elements are undefined. *this includes the CR vector* when Rc=1
+4. implementations may use any ordering and any algorithm to reduce
+   down to a single result. However it must be equivalent to a straight
+   application of mapreduce. The destination vector (except masked out
+   elements) may be used for storing any intermediate results. these may
+   be left in the vector (undefined).
+5. CRM applies when Rc=1. When CRM is zero, the CR associated with
+   the result is regarded as a "some results met standard CR result
+   criteria". When CRM is one, this changes to "all results met standard
+   CR criteria".
+6. implementations MAY use destoffs as well as srcoffs (see [[sv/sprs]])
+   in order to store sufficient state to resume operation should an
+   interrupt occur. this is also why implementations are permitted to use
+   the destination vector to store intermediary computations
+
+TODO: Rc=1 on Scalar Logical Operations? is this possible? was space
+reserved in Logical Ops?
+
+Pseudocode for the case where RA==RB:
+
+    result = op(iregs[RA], iregs[RA+1])
+    CR = analyse(result)
+    for i in range(2, VL):
+        result = op(result, iregs[RA+i])
+        CRnew = analyse(result)
+        if Rc=1
+            if CRM:
+                CR = CR bitwise or CRnew
+            else:
+                CR = CR bitwise AND CRnew
+
+TODO: case where RA!=RB which involves first a vector of 2-operand
+results followed by a mapreduce on the intermediates.
+
+Note that when SUBVL!=1 the sub-elements are *independent*, i.e. they
+are mapreduced per *sub-element* as a result. illustration with a vec2:
+
+    result.x = op(iregs[RA].x, iregs[RA+1].x)
+    result.y = op(iregs[RA].y, iregs[RA+1].y)
+    for i in range(2, VL):
+        result.x = op(result.x, iregs[RA+i].x)
+        result.y = op(result.y, iregs[RA+i].y)
+
+When SVM is set and SUBVL!=1, another variant is enabled, which switches
+to `RM-2P-2S1D` such that different elwidths may be applied to src
+and dest.
+
+    for i in range(VL):
+        result = op(iregs[RA+i].x, iregs[RA+i].x)
+        result = op(result, iregs[RA+i].z)
+        result = op(result, iregs[RA+i].z)
+        iregs[RT+i] = result
+
+
+
+## Fail-on-first
+
+Data-dependent fail-on-first has two distinct variants: one for LD/ST,
+the other for arithmetic operations (actually, CR-driven). Note in each
+case the assumption is that vector elements are required appear to be
+executed in sequential Program Order, element 0 being the first.
+
+* LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
+  ordinary one. Exceptions occur "as normal". However for elements 1
+  and above, if an exception would occur, then VL is **truncated** to the
+  previous element.
+* Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
+  CR-creating operation produces a result (including cmp). Similar to
+  branch, an analysis of the CR is performed and if the test fails, the
+  vector operation terminates and discards all element operations at and
+  above the current one, and VL is truncated to the *previous* element.
+  Thus the new VL comprises a contiguous vector of results, all of which
+  pass the testing criteria (equal to zero, less than zero).
+
+The CR-based data-driven fail-on-first is new and not found in ARM SVE
+or RVV. It is extremely useful for reducing instruction count, however
+requires speculative execution involving modifications of VL to get high
+performance implementations.
+
+In CR-based data-driven fail-on-first there is only the option to select
+and test one bit of each CR (just as with branch BO). For more complex
+tests this may be insufficient. If that is the case, a vectorised crops
+(crand, cror) may be used, and ffirst applied to the crop instead of to
+the arithmetic vector.
+
+One extremely important aspect of ffirst is:
+
+* LDST ffirst may never set VL equal to zero. This because on the first
+  element an exception must be raised "as normal".
+* CR-based data-dependent ffirst on the other hand **can** set VL equal
+  to zero. This is the only means in the entirety of SV that VL may be set
+  to zero (with the exception of via the SV.STATE SPR). When VL is set
+  zero due to the first element failing the CR bit-test, all subsequent
+  vectorised operations are effectively `nops` which is
+  *precisely the desired and intended behaviour*.
+
+
 ## CR Operations
 
 CRs are slightly more involved than INT or FP registers due to the
-- 
2.30.2