required for the [[sv/av_opcodes]]
-signed and unsigned min/max for integer. this is sort-of partly
-synthesiseable in [[sv/svp64]] with pred-result as long as the dest reg
-is one of the sources, but not both signed and unsigned. when the dest
-is also one of the srces and the mv fails due to the CR bittest failing
-this will only overwrite the dest where the src is greater (or less).
+signed and unsigned min/max for integer.
signed/unsigned min/max gives more flexibility.
modes make sense:
* saturation
-* predicate-result would be useful but is lower priority than Data-Dependent Fail-First
* simple (no augmentation)
* fail-first (where Vector Indexed is banned)
* Signed Effective Address computation (Vector Indexed only)
* Compacted operations into registers (normally only provided by SIMD)
* Fail-on-first (introduced in ARM SVE2)
* A new concept: Data-dependent fail-first
-* Condition-Register based *post-result* predication (also new)
* A completely new concept: "Twin Predication"
* vec2/3/4 "Subvectors" and Swizzling (standard fare for 3D)
If it did it would be called [[sv/mv.x]]. Once Vectorised, it's a
VGATHER/VSCATTER.
-# CR predicate result analysis
-
-Power ISA has Condition Registers. These store an analysis of the result
-of an operation to test it for being greater, less than or equal to zero.
-What if a test could be done, similar to branch BO testing, which hooked
-into the predication system?
-
- for i in range(VL):
- # predication test, skip all masked out elements.
- if predicate_masked_out(i): continue # skip
- result = op(iregs[RA+i], iregs[RB+i])
- CRnew = analyse(result) # calculates eq/lt/gt
- # Rc=1 always stores the CR
- if RC1 or Rc=1: crregs[offs+i] = CRnew
- if RC1: continue # RC1 mode skips result store
- # now test CR, similar to branch
- if CRnew[BO[0:1]] == BO[2]:
- # result optionally stored but CR always is
- iregs[RT+i] = result
-
-Note that whilst the Vector of CRs is always written to the CR regfile,
-only those result elements that pass the BO test get written to the
-integer regfile (when RC1 mode is not set). In RC1 mode the CR is always
-stored, but the result never is. This effectively turns every arithmetic
-operation into a type of `cmp` instruction.
-
-Here for example if FP overflow occurred, and the CR testing was carried
-out for that, all valid results would be stored but invalid ones would
-not, but in addition the Vector of CRs would contain the indicators of
-which ones failed. With the invalid results being simply not written
-this could save resources (save on register file writes).
-
-Also expected is, due to the fact that the predicate mask is effectively
-ANDed with the post-result analysis as a secondary type of predication,
-that there would be savings to be had in some types of operations where
-the post-result analysis, if not included in SV, would need a second
-predicate calculation followed by a predicate mask AND operation.
-
-Note, hilariously, that Vectorised Condition Register Operations (crand,
-cror) may also have post-result analysis applied to them. With Vectors
-of CRs being utilised *for* predication, possibilities for compact and
-elegant code begin to emerge from this innocuous-looking addition to SV.
-
# Exception-based Fail-on-first
One of the major issues with Vectorised LD/ST operations is when a
# Data-dependent fail-first
-This is a minor variant on the CR-based predicate-result mode. Where
-pred-result continues with independent element testing (any of which may
-be parallelised), data-dependent fail-first *stops* at the first failure:
+Data-dependent fail-first *stops* at the first failure:
if Rc=0: BO = inv<<2 | 0b00 # test CR.eq bit z/nz
for i in range(VL):
the actual calculation.
The only minor downside here though is the change to VL, which in some
-implementations may cause pipeline stalls. This was one of the reasons
-why CR-based pred-result analysis was added, because that at least is
-entirely paralleliseable.
+implementations may cause pipeline stalls.
# Vertical-First Mode
## Scalar (single) integer as predicate, with one DM row
-This idea has merit in that to perform predicate bitmanip operations the preficate is already in scalar INT reg form and consequently standard scalar INT bitmanip operations can be done straight away. Vectorised mfcr can be used to get CMP results or Vectorised Rc=1 CRs into the scalar INT, easily.
+This idea has merit in that to perform predicate bitmanip operations the predicate is already in scalar INT reg form and consequently standard scalar INT bitmanip operations can be done straight away. Vectorised mfcr can be used to get CMP results or Vectorised Rc=1 CRs into the scalar INT, easily.
This idea has several disadvantages.
truncates VL to that exact point. Useful for implementing algorithms
such as `strcpy` in around 14 high-performance Vector instructions, the
option exists to include or exclude the failing element.
-* **Predicate-result**: a strategic mode that effectively turns all and any
- operations into a type of `cmp`. An `Rc=1 BO test` is performed and if
- failing that element result is **not** written to the regfile. The `Rc=1`
- Vector of co-results **is** always written (subject to usual predication).
- Termed "predicate-result" because the combination of producing then
- testing a result is as if the test was in a follow-up predicated
- copy/mv operation, it reduces regfile pressure and instruction count.
- Also useful on saturated or other overflowing operations, the overflowing
- elements may be excluded from outputting to the regfile then
- post-analysed outside of critical hot-loops.
**RM Modes**
* Predication on both source and destination
* Two different sources of predication: INT and CR Fields
* SV Modes including saturation (for Audio, Video and DSP), mapreduce,
- fail-first and predicate-result mode.
+ and fail-first mode.
Different classes of operations require different formats. The earlier
sections cover the common formats and the four separate modes follow:
More details can be found in [[sv/cr_ops]].
-## pred-result mode
-
-Pred-result mode may not be applied on CR-based operations.
-
-Although CR operations (mtcr, crand, cror) may be Vectorised, predicated,
-pred-result mode applies to operations that have an Rc=1 mode, or make
-sense to add an RC1 option.
-
-Predicate-result merges common CR testing with predication, saving
-on instruction count. In essence, a Condition Register Field test is
-performed, and if it fails it is considered to have been *as if* the
-destination predicate bit was zero. Given that there are no CR-based
-operations that produce Rc=1 co-results, there can be no pred-result
-mode for mtcr and other CR-based instructions
-
-Arithmetic and Logical Pred-result, which does have Rc=1 or for which
-RC1 Mode makes sense, is covered in [[sv/normal]]
-
## CR Operations
CRs are slightly more involved than INT or FP registers due to the
* ew={N}: ew=8/16/32 - sets elwidth override
* sw={N}: sw=8/16/32 - sets source elwidth override
* ff={xx}: see fail-first mode
-* pr={xx}: see predicate-result mode
* sat{x}: satu / sats - see saturation mode
* mr: see map-reduce mode
* mrr: map-reduce, reverse-gear (VL-1 downto 0)
For modes:
-* pred-result:
- - pm=lt/gt/le/ge/eq/ne/so/ns
- - RC1 mode
* fail-first
- ff=lt/gt/le/ge/eq/ne/so/ns
- RC1 mode
**Arithmetic**
Arithmetic (known as "normal" mode) is where Scalar and Parallel
-Reduction can be done: Saturation as well, and two new innovative
-modes for Vector ISAs: data-dependent fail-first and predicate result.
+Reduction can be done: Saturation as well, and a new innovative
+mode for Vector ISAs: data-dependent fail-first.
Reduction and Saturation are common to see in Vector ISAs: it is just
that they are usually added as explicit instructions,
and NEC SX Aurora has even more iterative instructions. In SVP64 these
override field bits can be used for other purposes when Vectorising
CR Field instructions. Moreover, Rc=1 is completely invalid for
CR operations such as `crand`: Rc=1 is for arithmetic operations, producing
-a "co-result" that goes into CR0 or CR1. Thus, the Arithmetic modes
-such as predicate-result make no sense, and neither does Saturation.
+a "co-result" that goes into CR0 or CR1. Thus, Saturation makes no sense.
All of these differences, which require quite a lot of logical
reasoning and deduction, help explain why there is an entirely different
CR ops Vectorisation Category.