something like:
-| 0 1 | 2 3 | 4 5 | 6 | 7 9 | 10 12 | 13 18 | 19 20 |
-| ----- | --- | --- | ---- | ---- | ----- | ----- | ------ |
-| subvl | sew | dew | ptyp | psrc | pdst | vspec | zmode |
+| 0 1 | 2 3 | 4 5 | 6 | 7 9 | 10 12 | 13 18 | 19 23 |
+| ----- | --- | --- | ---- | ---- | ----- | ----- | ----- |
+| subvl | sew | dew | ptyp | psrc | pdst | vspec | mode |
* subvl - 1 to 4 scalar / vec2 / vec3 / vec4
* sew / dew - DEFAULT / 8 / 16 /32 element width
* ptyp - predication INT / CR
* psrc / pdst - predicate mask selector and inversion
* vspec - 3 bit src / dest scalar-vector extension
-* zmode: 2 bit src pred zero mode, dest pred zero mode
-* ffirst: 3 bit. EN and CR index bit.
+* mode: 5 bits
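
As a rough illustration of the revised layout, the sketch below unpacks a prefix into the fields listed above. It assumes a 24-bit prefix and MSB0 numbering (bit 0 is the most significant bit); both assumptions, and the names `FIELDS`, `WIDTH` and `decode_prefix`, are hypothetical and chosen only for this example.

```python
# Hypothetical field map: name -> (first bit, last bit), inclusive,
# using MSB0 numbering (an assumption: bit 0 is the most significant bit).
FIELDS = {
    "subvl": (0, 1),
    "sew":   (2, 3),
    "dew":   (4, 5),
    "ptyp":  (6, 6),
    "psrc":  (7, 9),
    "pdst":  (10, 12),
    "vspec": (13, 18),
    "mode":  (19, 23),
}
WIDTH = 24  # assumed total prefix width in bits


def decode_prefix(prefix):
    """Split an integer prefix into a dict of field values."""
    out = {}
    for name, (start, end) in FIELDS.items():
        nbits = end - start + 1
        shift = WIDTH - 1 - end  # distance of the field's last bit from the LSB
        out[name] = (prefix >> shift) & ((1 << nbits) - 1)
    return out
```

With this numbering the 5-bit `mode` field occupies the five least significant bits of the integer, and `subvl` the top two.
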
## twin predication, CR based.
With different bits being selectable (CR[0..3]) starting from the same CR makes some sense.
-
# standard arith ops (single predication)
these are of the form res = op(src1, src2, ...)
* vspec - 2/3 bit src / dest scalar-vector extension
* mode - 5 bit
-Mode
-
- 0 1 2 3 4 description
- ------------------
- 0 0 0 0 0 nothing
- 0 1 N zero sat mode: N=0/1 u/s
- 1 0 inv CR bit Rc=1: ffirst CR sel
- 1 0 zero Rc=0: pred zero mode
-
-
For 2 op (dest/src1/src2) the tag may be 3 bits: total 9 bits. for 3 op (dest/src1/2/3) the vspec may be 2 bits per reg: total 8 bits.
Note:
255 * 2 = 510 # if we used the smaller width, we'd get 254. Wrong
u16 + u16 = u8
- 256 + 2 = 2 # this is correct whether we use the larger or smaller width - hw can optimize narrowing addition
+ 256 + 2 = 2 # this is correct whether we use the larger or smaller width
+ # aka hw can optimize narrowing addition
+
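The width note above can be checked with plain modular arithmetic. This is a minimal sketch: the helper names are hypothetical, and it assumes the `255 * 2` case is a widening multiply of 8-bit inputs into a 16-bit result, which the note does not state explicitly.

```python
def wrap(value, bits):
    """Truncate value to an unsigned integer of the given bit width."""
    return value & ((1 << bits) - 1)


def mul_at_width(a, b, bits):
    """Multiply with inputs and result all held at `bits` width."""
    return wrap(wrap(a, bits) * wrap(b, bits), bits)


def add_at_width(a, b, bits):
    """Add with inputs and result all held at `bits` width."""
    return wrap(wrap(a, bits) + wrap(b, bits), bits)


# Widening multiply: the larger width must be used, or the carry is lost.
wide_mul = mul_at_width(255, 2, 16)    # 510
narrow_mul = mul_at_width(255, 2, 8)   # 254, the wrong answer

# Narrowing add (u16 + u16 = u8): the low 8 bits are the same either way.
add_then_truncate = wrap(add_at_width(256, 2, 16), 8)
truncate_then_add = add_at_width(256, 2, 8)
```

Both narrowing-add paths give 2, which is why hardware is free to perform the addition at the smaller width.
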
+# Mode
+
+ 0 1 2 3 4 description
+ ------------------
+ 0 0 0 0 0 nothing
+ 0 1 N zero sat mode: N=0/1 u/s
+ 1 0 inv CR bit Rc=1: ffirst CR sel
+ 1 0 zero Rc=0: pred zero mode
+
# Notes about rounding, clamp and saturate
If there are spare bits it would be very good to look at using some of them to specify the mode, because otherwise a SPR has to be used which will need to be set and unset. This can get costly.
+# Fail-on-first
+
+Data-dependent fail-on-first has two distinct variants: one for LD/ST, the other for arithmetic operations (in practice, CR-driven).
+
+* LD/ST ffirst treats the first LD/ST in a vector (element 0) as an ordinary one: exceptions occur "as normal". However, for elements 1 and above, if an exception would occur, VL is instead **truncated** to the previous element.
+* Data-driven (CR-driven) fail-on-first activates when an Rc=1 operation, or any other CR-creating operation (including cmp), produces a result. As with a branch, the CR is tested, and if the test succeeds, the vector operation terminates all element operations at and above the current one, and VL is truncated to the *previous* element.
+
+The CR-based data-driven fail-on-first is new and not found in ARM SVE or RVV. It is extremely useful for reducing instruction count; however, high-performance implementations require speculative execution that involves modifying VL.
+
+
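The two fail-on-first variants described above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: "truncated to the previous element" is read as "the new VL equals the index of the offending element", memory is modelled as a dict, and the names `Fault`, `ldst_ffirst` and `cr_ffirst` are hypothetical.

```python
class Fault(Exception):
    """Stand-in for a memory exception (hypothetical)."""


def ldst_ffirst(memory, addrs, vl):
    """Load addrs[0..vl-1]; return (loaded values, new VL).

    Element 0 faults as normal. A fault on a later element is
    suppressed and VL is truncated so that element is excluded.
    """
    loaded = []
    for i in range(vl):
        if addrs[i] not in memory:
            if i == 0:
                raise Fault(f"bad address {addrs[i]:#x}")
            return loaded, i  # truncate VL, no exception
        loaded.append(memory[addrs[i]])
    return loaded, vl


def cr_ffirst(op, test, src, vl):
    """Apply op per element; stop when test(result) succeeds.

    `test` stands in for the branch-like CR analysis. Elements at
    and above the one that passes the test are not written back.
    """
    results = []
    for i in range(vl):
        r = op(src[i])
        if test(r):
            return results, i  # truncate VL to the previous elements
        results.append(r)
    return results, vl
```

For example, `cr_ffirst(lambda x: x - 1, lambda r: r == 0, [5, 3, 1, 7], 4)` stops at the third element and returns `([4, 2], 2)`. A real implementation would update VL speculatively rather than iterating one element at a time.
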
# Notes about Swizzle
Basically, there isn't enough room to try to fit two src src1/2 swizzle, and SV, even into 64 bit (actually 24) without severely compromising on the number of bits allocated to either swizzle, or SV, or both.