XER.CA/CA32 on the other hand is expected and required to be implemented
according to standard Power ISA Scalar behaviour. Interestingly, due
to SVP64 being in effect a hardware for-loop around Scalar instructions
executing in precise Program Order, a little thought shows that a Vectorized
Carry-In-Out add is in effect a Big Integer Add, taking a single bit Carry In
and producing, at the end, a single bit Carry out. High performance
implementations may exploit this observation to deploy efficient
| 0 | 0 | both mask[src=0] and mask[dst=0] are 1 |
| 1 | 2 | sz=1 but dz=0: dst skips mask[1], src does not |
| 2 | 3 | mask[src=2] and mask[dst=3] are 1 |
| 3 | end | loop has ended because dst reached VL-1 |
Example 2:
| 0 | 0 | both mask[src=0] and mask[dst=0] are 1 |
| 2 | 1 | sz=0 but dz=1: src skips mask[1], dst does not |
| 3 | 2 | mask[src=3] and mask[dst=2] are 1 |
| end | 3 | loop has ended because src reached VL-1 |
In both these examples it is crucial to note that despite there being
a single predicate mask, with sz and dz being different, srcstep and
| 3 | 3 | mask[src=3] and mask[dst=3] are 1 |
| end | end | loop has ended because src and dst reached VL-1 |
Here, both srcstep and dststep remain in lockstep because sz=dz=0
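The stepping behaviour in the examples above can be sketched in plain Python (an illustrative model only, not the spec's pseudocode; `mask` is an integer bitmask and the function name is invented):

```python
def twin_pred_steps(mask, vl, sz, dz):
    """Yield (srcstep, dststep) pairs for one twin-predicated loop.
    With zeroing on (sz=1 or dz=1) the corresponding step advances
    through every element; with zeroing off, masked-out elements
    are skipped entirely. Illustrative sketch only."""
    srcstep = dststep = 0
    while srcstep < vl and dststep < vl:
        if not sz:  # no source zeroing: skip masked-out source elements
            while srcstep < vl and not (mask >> srcstep) & 1:
                srcstep += 1
        if not dz:  # no dest zeroing: skip masked-out dest elements
            while dststep < vl and not (mask >> dststep) & 1:
                dststep += 1
        if srcstep >= vl or dststep >= vl:
            break  # loop has ended: one side reached VL-1
        yield srcstep, dststep
        srcstep += 1
        dststep += 1
```

With `mask=0b1101` and `vl=4` this reproduces Example 1 (`sz=1, dz=0`) and Example 2 (`sz=0, dz=1`) above, including which side terminates the loop.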
## Twin Predication <a name="2p"> </a>
set a different subvector length for destination, and has a slightly
different pseudocode algorithm for Vertical-First Mode.
Ordering is as follows:

* SVSHAPE srcstep, dststep, ssubstep and dsubstep are advanced sequentially
  depending on PACK/UNPACK.
* srcstep and dststep are pushed through REMAP to compute actual Element offsets.
* Swizzle is independently applied to ssubstep and dsubstep

Pack/Unpack is enabled (set up) through [[sv/svstep]].
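The ordering above can be sketched as a small pipeline (illustrative Python only; the `remap_*` and `swizzle_*` callables are hypothetical stand-ins for REMAP and Swizzle, not spec functions):

```python
# Illustrative only: shows the order in which the four step counters
# are consumed once they have been advanced (PACK/UNPACK determines
# the order of advancement, not shown here).
def next_offsets(srcstep, dststep, ssubstep, dsubstep,
                 remap_src, remap_dst, swizzle_src, swizzle_dst):
    src_el = remap_src(srcstep)      # REMAP: actual source Element offset
    dst_el = remap_dst(dststep)      # REMAP: actual dest Element offset
    src_sub = swizzle_src(ssubstep)  # Swizzle applies to sub-steps only
    dst_sub = swizzle_dst(dsubstep)
    return (src_el, src_sub), (dst_el, dst_sub)
```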
## Reduce modes
as a simple and natural relaxation of the usual restriction on the Vector
Looping which would terminate if the destination was marked as a Scalar.
Scalar Reduction by contrast *keeps issuing Vector Element Operations*
even though the destination register is marked as scalar *and*
the same register is used as a source register. Thus it is
up to the programmer to be aware of this, observe some conventions,
and thus end up achieving the desired outcome of scalar reduction.
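Scalar reduction can be modelled in a few lines (illustrative Python, not the spec's pseudocode; the register numbers and helper name are invented):

```python
# Sketch: sv.add RT, RA, RB.v where RT is scalar and RT == RA (the
# "accumulator" convention). Vector Element Operations keep issuing
# even though the destination is scalar, so the vector at RB is
# accumulated into the one scalar register.
def scalar_reduce_add(gpr, RT, RA, RB, vl):
    for i in range(vl):                  # loop does NOT terminate early
        gpr[RT] = gpr[RA] + gpr[RB + i]  # RA == RT acts as accumulator
    return gpr[RT]
```

For example, with a vector `[1, 2, 3]` at GPR 4 and GPR 0 as both scalar source and destination, the result accumulated into GPR 0 is 6.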
* One of the sources is a Vector
* the destination is a scalar
* optionally but most usefully when one source scalar register is
  also the scalar destination (which may be informally termed by
  convention the "accumulator")
* That the source register type is the same as the destination register
type identified as the "accumulator". Scalar reduction on `cmp`,
`setb` or `isel` makes no sense for example because of the mixture
* LD/ST ffirst (not to be confused with *Data-Dependent* LD/ST ffirst)
treats the first LD/ST in a vector (element 0) as an ordinary one.
  Exceptions occur "as normal" on the first element. However for elements
  1 and above, if an exception would occur, then VL is **truncated**
  to the previous element.
* Data-driven (CR-driven) fail-on-first activates when Rc=1 or another
CR-creating operation produces a result (including cmp). Similar to
branch, an analysis of the CR is performed and if the test fails, VL is
truncated, either including or excluding the failing element, depending
on whether VLi (VL "inclusive") is set.
Thus the new VL comprises a contiguous vector of results, all of which
pass the testing criteria (equal to zero, less than zero). Demonstrated
approximately in pseudocode:

```
for i in range(VL):
    GPR[RT+i], CR[i] = operation(GPR[RA+i]... )
    if test(CR[i]) == failure:
        VL = i+VLi
        break
```
The CR-based data-driven fail-on-first is new and not
found in ARM SVE or RVV. At the same time it is also
In CR-based data-driven fail-on-first there is only the option to select
and test one bit of each CR (just as with branch BO). For more complex
tests this may be insufficient. If that is the case, a vectorized crop
(crand, cror) may be used, and ffirst applied to the crop instead of to
the arithmetic vector.
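A sketch of that approach (illustrative Python; CR fields are modelled as lists of 4 bits and the helper name is invented):

```python
# Sketch: combine two CR bits with a vectorized crand, then apply
# fail-first to the combined result instead of to the arithmetic CRs.
def crop_then_ffirst(crs, vl, bit_a, bit_b):
    """crs: list of CR fields, each a list of 4 bits. Returns new VL."""
    combined = [crs[i][bit_a] & crs[i][bit_b] for i in range(vl)]  # sv.crand
    for i in range(vl):          # ffirst applied to the crop result
        if not combined[i]:      # bit-test fails: truncate VL here
            return i             # assumes VLi=0 (failing element excluded)
    return vl                    # all elements passed
```

Truncating on the `crand` result rather than on a single CR bit allows arbitrarily complex multi-bit tests.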
to zero. This is the only means in the entirety of SV that VL may be set
to zero (except via the SV.STATE SPR). When VL is set to
zero due to the first element failing the CR bit-test, all subsequent
vectorized operations are effectively `nops`, which is
*precisely the desired and intended behaviour*.
Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
More details can be found in [[sv/cr_ops]].
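The LD/ST fail-first truncation described in the bullets above can be sketched as follows (illustrative Python; `load` and `MemoryError` stand in for the memory access and its exception):

```python
# Sketch of LD/ST fail-first: element 0 traps "as normal"; a fault
# on any later element truncates VL to the previous element instead.
def ldst_ffirst(load, addrs, vl):
    results = []
    for i in range(vl):
        try:
            results.append(load(addrs[i]))
        except MemoryError:
            if i == 0:
                raise    # first element: exception occurs as normal
            vl = i       # truncate VL to the previous element
            break
    return results, vl
```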
## CR Operations
CRs are slightly more involved than INT or FP registers due to the
```
if extra3_mode:
spec = EXTRA3
elif EXTRA2[0]: # vector mode
    spec = EXTRA2 << 1 # same as EXTRA3, shifted
else: # scalar mode
    spec = (EXTRA2[0] << 2) | EXTRA2[1]
if spec[0]:
# vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
return ((BA >> 2)<<6) | # hi 3 bits shifted up
### CR fields as inputs/outputs of vector operations
CRs (or, the arithmetic operations associated with them)
may be marked as Vectorized or Scalar. When Rc=1 in arithmetic operations that have no explicit EXTRA to cover the CR, the CR is Vectorized if the destination is Vectorized. Likewise if the destination is scalar then so is the CR.
When vectorized, the CR inputs/outputs are sequentially read/written
to 4-bit CR fields. Vectorized Integer results, when Rc=1, will begin
writing to CR8 (TBD evaluate) and increase sequentially from there.
This is so that:
CR when Rc=1 is written to. This is CR0 for integer operations and CR1
for FP operations.
Note that yes, the CR Fields are genuinely Vectorized. Unlike in SIMD VSX which
has a single CR (CR6) for a given SIMD result, SV Vectorized OpenPOWER
v3.0B scalar operations produce a **tuple** of element results: the
result of the operation as one part of that element *and a corresponding
CR element*. Greatly simplified pseudocode:
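A hedged sketch of that tuple-producing loop (illustrative Python, not the spec's own pseudocode; CR numbering starts at CR8 as noted earlier):

```python
# Illustrative only: each element of a Vectorized Rc=1 operation
# produces a tuple of (arithmetic result, CR element).
def sv_add_rc1(gpr, cr, RT, RA, RB, vl):
    for i in range(vl):
        result = (gpr[RA + i] + gpr[RB + i]) & 0xFFFF_FFFF_FFFF_FFFF
        gpr[RT + i] = result                    # result of the operation...
        cr[8 + i] = {                           # ...and its CR element
            "lt": bool(result >> 63),           # negative (signed)
            "gt": result != 0 and not (result >> 63),
            "eq": result == 0,
        }
```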
the Vector of CRs, using cr ops (crand, crnor) to do so. This provides far
more flexibility in analysing vectors than standard Vector ISAs. Normal
Vector ISAs are typically restricted to "were all results nonzero" and
"were some results nonzero". The application of mapreduce to Vectorized
cr operations allows far more sophisticated analysis, particularly in
conjunction with the new crweird operations; see [[sv/cr_int_predication]].
so FP instructions with Rc=1 write to CR1 (n=1).
CRs are not stored in SPRs: they are registers in their own right.
Therefore context-switching the full set of CRs involves a Vectorized
mfcr or mtcr, using VL=8 to do so. This is exactly how
scalar OpenPOWER context-switches CRs: it is just that there are now
more of them.
* ew={N}: ew=8/16/32 - sets elwidth override
* sw={N}: sw=8/16/32 - sets source elwidth override
* ff={xx}: see fail-first mode
* sat{x}: satu / sats - see saturation mode
* mr: see map-reduce mode
* mrr: map-reduce, reverse-gear (VL-1 downto 0)
For modes:
* fail-first
- ff=lt/gt/le/ge/eq/ne/so/ns
- RC1 mode
multiply into RT and RT+1.
What, then, of `sv.maddedu`? If the destination is hard-coded to RT and
RT+1, the instruction is not useful when Vectorized because the output
will be overwritten on the next element. Solving this is easy: define
the destination registers as RT and RT+MAXVL respectively. This makes
it easy for compilers to statically allocate registers even when VL