* **SVM** sets "subvector" reduce mode
* **N** sets signed/unsigned saturation.
-## Rounding, clamp and saturate
-
-One of the issues with vector ops is that in integer DSP ops for example
-in Audio the operation must clamp or saturate rather than overflow or
-ignore the upper bits and become a modulo operation. This for Audio
-is extremely important, also to provide an indicator as to whether
-saturation occurred. see [[av_opcodes]].
-
-To help ensure that audio quality is not compromised by overflow,
-"saturation" is provided, as well as a way to detect when saturation
-occurred (Rc=1). When Rc=1 there will be a *vector* of CRs, one CR per
-element in the result (Note: this is different from VSX which has a
-single CR per block).
-
-When N=0 the result is saturated to within the maximum range of an
-unsigned value. For integer ops this will be 0 to 2^elwidth-1. Similar
-logic applies to FP operations, with the result being saturated to
-maximum rather than returning INF.
-
-When N=1 the same occurs except that the result is saturated to the min
-or max of a signed result.
-
-When Rc=1, the CR "overflow" bit is set on the CR associated with the
-element, to indicate whether saturation occurred. Note that due to
-the hugely detrimental effect it has on parallel processing, XER.SO is
-**ignored** completely and is **not** brought into play here. The CR
-overflow bit is therefore simply set to zero if saturation did not occur,
-and to one if it did.
-
-Post-analysis of the Vector of CRs to find out if any given element hit
-saturation may be done using a mapreduced CR op (cror), or by using the
-new crweird instruction, transferring the relevant CR bits to a scalar
-integer and testing it for nonzero. see [[sv/cr_int_predication]]
-
-
-## Reduce mode
-
-1. limited to single predicated dual src operations (add RT, RA, RB) and
- to triple source operations where one of the inputs is set to a scalar
- (these are rare)
-2. limited to operations that make sense. divide is excluded, as is
- subtract (X - Y - Z produces different answers depending on the order).
- sane operations: multiply, add, logical bitwise OR, CR operations.
- operations that do not return the same register type are also excluded
- (isel, cmp)
-3. the destination is a vector but the result is stored, ultimately,
- in the first nonzero predicated element. all other nonzero predicated
- elements are undefined. *this includes the CR vector* when Rc=1
-4. implementations may use any ordering and any algorithm to reduce
- down to a single result. However it must be equivalent to a straight
- application of mapreduce. The destination vector (except masked out
- elements) may be used for storing any intermediate results. these may
- be left in the vector (undefined).
-5. CRM applies when Rc=1. When CRM is zero, the CR associated with
- the result is regarded as a "some results met standard CR result
- criteria". When CRM is one, this changes to "all results met standard
- CR criteria".
-6. implementations MAY use destoffs as well as srcoffs (see [[sv/sprs]])
- in order to store sufficient state to resume operation should an
- interrupt occur. this is also why implementations are permitted to use
- the destination vector to store intermediary computations
-
-TODO: Rc=1 on Scalar Logical Operations? is this possible? was space
-reserved in Logical Ops?
-
-Pseudocode for the case where RA==RB:
-
- result = op(iregs[RA], iregs[RA+1])
- CR = analyse(result)
- for i in range(2, VL):
- result = op(result, iregs[RA+i])
- CRnew = analyse(result)
- if Rc=1
- if CRM:
- CR = CR bitwise or CRnew
- else:
- CR = CR bitwise AND CRnew
-
-TODO: case where RA!=RB which involves first a vector of 2-operand
-results followed by a mapreduce on the intermediates.
-
-Note that when SUBVL!=1 the sub-elements are *independent*, i.e. they
-are mapreduced per *sub-element* as a result. illustration with a vec2:
-
- result.x = op(iregs[RA].x, iregs[RA+1].x)
- result.y = op(iregs[RA].y, iregs[RA+1].y)
- for i in range(2, VL):
- result.x = op(result.x, iregs[RA+i].x)
- result.y = op(result.y, iregs[RA+i].y)
-
-When SVM is set and SUBVL!=1, another variant is enabled, which switches
-to `RM-2P-2S1D` such that different elwidths may be applied to src
-and dest.
-
- for i in range(VL):
- result = op(iregs[RA+i].x, iregs[RA+i].x)
- result = op(result, iregs[RA+i].z)
- result = op(result, iregs[RA+i].z)
- iregs[RT+i] = result
-
-
-
-## Fail-on-first
-
-Data-dependent fail-on-first has two distinct variants: one for LD/ST,
-the other for arithmetic operations (actually, CR-driven). Note in each
-case the assumption is that vector elements are required appear to be
-executed in sequential Program Order, element 0 being the first.
-
-* LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
- ordinary one. Exceptions occur "as normal". However for elements 1
- and above, if an exception would occur, then VL is **truncated** to the
- previous element.
-* Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
- CR-creating operation produces a result (including cmp). Similar to
- branch, an analysis of the CR is performed and if the test fails, the
- vector operation terminates and discards all element operations at and
- above the current one, and VL is truncated to the *previous* element.
- Thus the new VL comprises a contiguous vector of results, all of which
- pass the testing criteria (equal to zero, less than zero).
-
-The CR-based data-driven fail-on-first is new and not found in ARM SVE
-or RVV. It is extremely useful for reducing instruction count, however
-requires speculative execution involving modifications of VL to get high
-performance implementations.
-
-In CR-based data-driven fail-on-first there is only the option to select
-and test one bit of each CR (just as with branch BO). For more complex
-tests this may be insufficient. If that is the case, a vectorised crops
-(crand, cror) may be used, and ffirst applied to the crop instead of to
-the arithmetic vector.
-
-One extremely important aspect of ffirst is:
-
-* LDST ffirst may never set VL equal to zero. This because on the first
- element an exception must be raised "as normal".
-* CR-based data-dependent ffirst on the other hand **can** set VL equal
- to zero. This is the only means in the entirety of SV that VL may be set
- to zero (with the exception of via the SV.STATE SPR). When VL is set
- zero due to the first element failing the CR bit-test, all subsequent
- vectorised operations are effectively `nops` which is
- *precisely the desired and intended behaviour*.
-
# R\*\_EXTRA2 and R\*\_EXTRA3 Encoding
EXTRA is the means by which two things are achieved:
version of VCOMPRESS-VEXPAND which is effectively the ability to do an
ordered multiple VINSERT.
+## Rounding, clamp and saturate
+
+One of the issues with vector ops is that integer DSP operations, for
+example in Audio, must clamp or saturate rather than overflow or discard
+the upper bits and become a modulo operation. For Audio this is extremely
+important, as is providing an indicator of whether saturation occurred.
+See [[av_opcodes]].
+
+To help ensure that audio quality is not compromised by overflow,
+"saturation" is provided, as well as a way to detect when saturation
+occurred (Rc=1). When Rc=1 there will be a *vector* of CRs, one CR per
+element in the result (Note: this is different from VSX which has a
+single CR per block).
+
+When N=0 the result is saturated to within the maximum range of an
+unsigned value. For integer ops this will be 0 to 2^elwidth-1. Similar
+logic applies to FP operations, with the result being saturated to
+maximum rather than returning INF.
+
+When N=1 the same occurs except that the result is saturated to the min
+or max of a signed result. For integer ops this will be -2^(elwidth-1)
+to 2^(elwidth-1)-1.
+
+When Rc=1, the CR "overflow" bit is set on the CR associated with the
+element, to indicate whether saturation occurred. Note that due to
+the hugely detrimental effect it has on parallel processing, XER.SO is
+**ignored** completely and is **not** brought into play here. The CR
+overflow bit is therefore simply set to zero if saturation did not occur,
+and to one if it did.
+
+Post-analysis of the Vector of CRs to find out if any given element hit
+saturation may be done using a mapreduced CR op (cror), or by using the
+new crweird instruction to transfer the relevant CR bits to a scalar
+integer, which is then tested for nonzero. See [[sv/cr_int_predication]].
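
The clamping rules above can be sketched in Python. This is a minimal illustration only, not the specification: `saturate` and `sat_add_vector` are hypothetical helper names, and a plain list of 0/1 flags stands in for the overflow bit of the per-element CR vector.

```python
def saturate(value, elwidth, signed):
    """Clamp value to the elwidth-bit range; return (result, overflow_flag)."""
    if signed:   # N=1: signed range
        lo, hi = -(1 << (elwidth - 1)), (1 << (elwidth - 1)) - 1
    else:        # N=0: unsigned range
        lo, hi = 0, (1 << elwidth) - 1
    result = min(max(value, lo), hi)
    # the flag is one if clamping changed the value, zero otherwise
    return result, int(result != value)


def sat_add_vector(ra, rb, elwidth=16, signed=True):
    """Element-wise saturating add: one result and one flag per element."""
    pairs = [saturate(a + b, elwidth, signed) for a, b in zip(ra, rb)]
    results = [r for r, _ in pairs]
    ov_bits = [ov for _, ov in pairs]
    return results, ov_bits


results, ov_bits = sat_add_vector([30000, 100, -30000], [10000, 200, -10000])
# OR-ing the flags together plays the role of a cror mapreduce (assumption)
any_saturated = any(ov_bits)
```

Each flag depends only on its own element, which mirrors the point that no sticky XER.SO state is carried between elements.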
+
+
+## Reduce mode
+
+1. limited to single predicated dual src operations (add RT, RA, RB) and
+ to triple source operations where one of the inputs is set to a scalar
+ (these are rare)
+2. limited to operations that make sense. divide is excluded, as is
+ subtract (X - Y - Z produces different answers depending on the order).
+ sane operations: multiply, add, logical bitwise OR, CR operations.
+ operations that do not return the same register type are also excluded
+ (isel, cmp)
+3. the destination is a vector but the result is stored, ultimately,
+ in the first nonzero predicated element. all other nonzero predicated
+ elements are undefined. *this includes the CR vector* when Rc=1
+4. implementations may use any ordering and any algorithm to reduce
+ down to a single result. However it must be equivalent to a straight
+ application of mapreduce. The destination vector (except masked out
+ elements) may be used for storing any intermediate results. these may
+ be left in the vector (undefined).
+5. CRM applies when Rc=1. When CRM is zero, the CR associated with
+   the result is regarded as "some results met standard CR criteria".
+   When CRM is one, this changes to "all results met standard CR
+   criteria".
+6. implementations MAY use destoffs as well as srcoffs (see [[sv/sprs]])
+ in order to store sufficient state to resume operation should an
+ interrupt occur. this is also why implementations are permitted to use
+ the destination vector to store intermediary computations
+
+TODO: Rc=1 on Scalar Logical Operations? is this possible? was space
+reserved in Logical Ops?
+
+Pseudocode for the case where RA==RB:
+
+    result = op(iregs[RA], iregs[RA+1])
+    CR = analyse(result)
+    for i in range(2, VL):
+        result = op(result, iregs[RA+i])
+        CRnew = analyse(result)
+        if Rc == 1:
+            if CRM:
+                # CRM=1: "all results met standard CR criteria"
+                CR = CR bitwise AND CRnew
+            else:
+                # CRM=0: "some results met standard CR criteria"
+                CR = CR bitwise OR CRnew
+
+TODO: case where RA!=RB which involves first a vector of 2-operand
+results followed by a mapreduce on the intermediates.
+
+Note that when SUBVL!=1 the sub-elements are *independent*, i.e. they
+are mapreduced per *sub-element* as a result. Illustration with a vec2:
+
+    result.x = op(iregs[RA].x, iregs[RA+1].x)
+    result.y = op(iregs[RA].y, iregs[RA+1].y)
+    for i in range(2, VL):
+        result.x = op(result.x, iregs[RA+i].x)
+        result.y = op(result.y, iregs[RA+i].y)
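
As a runnable counterpart to the vec2 illustration, the following Python sketch reduces each sub-element lane independently. It is only an approximation under stated assumptions: `reduce_subvectors` is a hypothetical name, elements are modelled as tuples, predication and CR handling are omitted, and a simple left-to-right order is used even though implementations may choose any equivalent ordering.

```python
from functools import reduce
import operator


def reduce_subvectors(elements, op):
    """Reduce a list of SUBVL-wide tuples to one tuple, lane by lane."""
    # zip(*elements) groups all .x values together, all .y values together, ...
    lanes = zip(*elements)
    # each lane is mapreduced on its own; lanes never interact
    return tuple(reduce(op, lane) for lane in lanes)


vec2_regs = [(1, 2), (3, 4), (5, 6)]          # VL=3, SUBVL=2
sums = reduce_subvectors(vec2_regs, operator.add)
products = reduce_subvectors(vec2_regs, operator.mul)
```

Here `sums` holds the per-lane totals of the `.x` and `.y` values, so the result is itself a vec2.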
+
+When SVM is set and SUBVL!=1, another variant is enabled, which switches
+to `RM-2P-2S1D` such that different elwidths may be applied to src
+and dest.
+
+    for i in range(VL):
+        result = op(iregs[RA+i].x, iregs[RA+i].y)
+        result = op(result, iregs[RA+i].z)
+        iregs[RT+i] = result
+
+
+
+## Fail-on-first
+
+Data-dependent fail-on-first has two distinct variants: one for LD/ST,
+the other for arithmetic operations (actually, CR-driven). Note that in
+each case the assumption is that vector elements are required to appear
+to be executed in sequential Program Order, element 0 being the first.
+
+* LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
+ ordinary one. Exceptions occur "as normal". However for elements 1
+ and above, if an exception would occur, then VL is **truncated** to the
+ previous element.
+* Data-driven (CR-driven) fail-on-first activates when Rc=1 or another
+  CR-creating operation (including cmp) produces a result. Similar to
+  branch, an analysis of the CR is performed and if the test fails, the
+  vector operation terminates and discards all element operations at and
+  above the current one, and VL is truncated to the *previous* element.
+  Thus the new VL comprises a contiguous vector of results, all of which
+  pass the testing criteria (equal to zero, less than zero).
+
+The CR-based data-driven fail-on-first is new and not found in ARM SVE
+or RVV. It is extremely useful for reducing instruction count. However,
+high-performance implementations require speculative execution that
+involves modifications of VL.
+
+In CR-based data-driven fail-on-first there is only the option to select
+and test one bit of each CR (just as with branch BO). For more complex
+tests this may be insufficient. If that is the case, a vectorised crop
+(crand, cror) may be used, and ffirst applied to the crop instead of to
+the arithmetic vector.
+
+One extremely important aspect of ffirst is:
+
+* LD/ST ffirst may never set VL equal to zero. This is because on the
+  first element an exception must be raised "as normal".
+* CR-based data-dependent ffirst on the other hand **can** set VL equal
+  to zero. This is the only means in the entirety of SV by which VL may
+  be set to zero (with the exception of via the SV.STATE SPR). When VL is
+  set to zero due to the first element failing the CR bit-test, all
+  subsequent vectorised operations are effectively `nops`, which is
+  *precisely the desired and intended behaviour*.
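
The CR-driven variant can be modelled with a short Python sketch. It makes several assumptions: `cr_ffirst` is a hypothetical name, the single-bit CR test is represented by an arbitrary predicate `test`, and elements are processed strictly in order, with no attempt to model speculation.

```python
import operator


def cr_ffirst(ra, rb, op, test, vl):
    """Apply op per element; truncate VL at the first result failing test.

    Returns (results, new_vl). Results at and above the failing
    element are discarded.
    """
    results = []
    for i in range(vl):
        value = op(ra[i], rb[i])
        if not test(value):
            # VL is truncated so that only passing elements remain
            return results, i
        results.append(value)
    return results, vl


def nonzero(v):
    return v != 0


# the third element produces zero, so VL is truncated from 4 to 2
partial, new_vl = cr_ffirst([5, 7, 3, 9], [1, 2, 3, 4], operator.sub, nonzero, 4)

# the very first element fails, so VL becomes zero
empty, zero_vl = cr_ffirst([1, 2], [1, 5], operator.sub, nonzero, 2)
```

With `zero_vl` equal to zero, any later loop over `range(zero_vl)` executes no iterations, which corresponds to subsequent vectorised operations behaving as `nops`.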
+
+
## CR Operations
CRs are slightly more involved than INT or FP registers due to the