# Links

* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-December/001498.html>

# Notes on requirements for bit allocations

do not try to jam VL or MAXVL in. go with the flow of 24 bits spare.

* 2: SUBVL
* 2: elwidth
* 2: twin-predication (src, dest) elwidth
* 1: select INT or CR predication
* 3: predicate selection and inversion (QTY 2 for tpred)
* 4x2 or 3x3: src1/2/3/dest Vector/Scalar reg
* 5: mode

totals: 24 bits (dest elwidth shared)
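
The budget can be checked with a few lines of arithmetic. This is only a sketch: the field names are illustrative, and the split of the vector/scalar bits (3 registers x 3 bits for single predication, src and dest x 3 bits for twin predication) is an assumption read off the layouts proposed further down this page.

```python
# Field widths in bits, taken from the list above (names are illustrative).
COMMON = {"subvl": 2, "src_elwidth": 2, "dest_elwidth": 2, "ptyp": 1, "mode": 5}

def total_bits(pred_bits, vspec_bits):
    """Sum the common fields plus the predicate and vector/scalar spec bits."""
    return sum(COMMON.values()) + pred_bits + vspec_bits

# Single predication: one 3-bit predicate, 3 registers x 3 bits (assumed split).
single = total_bits(pred_bits=3, vspec_bits=3 * 3)
# Twin predication: two 3-bit predicates, src and dest x 3 bits (assumed split).
twin = total_bits(pred_bits=2 * 3, vspec_bits=2 * 3)
# Single predication with 4 registers x 2 bits.
four_reg = total_bits(pred_bits=3, vspec_bits=4 * 2)
print(single, twin, four_reg)
```

Under these assumptions the 3x3 and twin-predicated cases both use all 24 bits, and the 4x2 case comes to 23.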

<http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-December/001434.html>

## twin predication

twin predication and twin elwidth overrides are extremely important: they allow both the src and dest elwidth to be overridden while keeping the underlying scalar operation intact. for example, mr with elwidth=8 and VL=8 on the src will take a byte at a time from one 64-bit reg and place it into 8x 64-bit regs, zero-extended. more complex operations involve SUBVL and Audio/Video DSP operations, see [[av_opcodes]]
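
The mr example can be modelled in a few lines. This is a rough sketch, not the specified behaviour: the function name is hypothetical, and the choice of element 0 as the least-significant byte of the source register is an assumption.

```python
def mr_src_elwidth8(src_reg, vl=8):
    """Unpack bytes of one 64-bit value into vl zero-extended 64-bit values."""
    # Assumption: element 0 is the least-significant byte of the source register.
    return [(src_reg >> (8 * i)) & 0xFF for i in range(vl)]

# One 64-bit source register fans out into eight destination registers.
dest_regs = mr_src_elwidth8(0x0807060504030201)
print([hex(x) for x in dest_regs])
```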

something like:

| 0-1   | 2-3 | 4-5 | 6    | 7-9  | 10-12 | 13-18 | 19-23 |
| ----- | --- | --- | ---- | ---- | ----- | ----- | ----- |
| subvl | sew | dew | ptyp | psrc | pdst  | vspec | mode  |

* subvl - 1 to 4: scalar / vec2 / vec3 / vec4
* sew / dew - src / dest element width: DEFAULT / 8 / 16 / 32
* ptyp - predication type: INT / CR
* psrc / pdst - src / dest predicate mask selector and inversion
* vspec - 3 bits each for src / dest scalar-vector extension
* mode - 5 bits
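
A field extractor for this layout might look like the following. It is a sketch only: it assumes the bit numbers in the table follow the OpenPOWER convention where bit 0 is the most significant of the 24 bits, and `decode_twin_pred` is a hypothetical name.

```python
# (first_bit, last_bit) per field, numbered as in the table above.
TWIN_PRED_FIELDS = {
    "subvl": (0, 1), "sew": (2, 3), "dew": (4, 5), "ptyp": (6, 6),
    "psrc": (7, 9), "pdst": (10, 12), "vspec": (13, 18), "mode": (19, 23),
}

def decode_twin_pred(prefix24):
    """Split a 24-bit value into named fields (assumes bit 0 is the MSB)."""
    fields = {}
    for name, (first, last) in TWIN_PRED_FIELDS.items():
        width = last - first + 1
        shift = 23 - last  # distance of the field's last bit from the LSB
        fields[name] = (prefix24 >> shift) & ((1 << width) - 1)
    return fields

# Underscores separate the fields in table order, purely for readability.
example = decode_twin_pred(0b01_10_00_1_010_101_000111_00001)
print(example)
```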

## twin predication, CR based

separate src and dest predicates are a critical part of SV: they provide VEXPAND, VREDUCE, VSPLAT, VINSERT and many more operations.
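
How separate masks give packing and spreading behaviour can be sketched as below. The loop is an assumed model of twin predication (source and destination indices each skip their own masked-out elements), with non-zeroing behaviour and integer bitmasks chosen for simplicity; it is not taken from this page.

```python
def twin_pred_move(src, dest, srcmask, dstmask, vl):
    """Copy src elements to dest, skipping elements masked out on either side."""
    dest = list(dest)
    i = j = 0
    while True:
        # Advance each index independently to its next active element.
        while i < vl and not (srcmask >> i) & 1:
            i += 1
        while j < vl and not (dstmask >> j) & 1:
            j += 1
        if i >= vl or j >= vl:
            break
        dest[j] = src[i]
        i += 1
        j += 1
    return dest

# Sparse source mask, full destination mask: active elements are packed together.
packed = twin_pred_move([10, 11, 12, 13], [0, 0, 0, 0], 0b1010, 0b1111, 4)
# Full source mask, sparse destination mask: elements are spread out.
spread = twin_pred_move([10, 11, 12, 13], [0, 0, 0, 0], 0b1111, 0b1010, 4)
print(packed, spread)
```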

Twin CR predication could be done in two ways:

* start from different CRs for the src and dest
* start from the same CR.

With different bits being selectable (CR[0..3]), starting from the same CR makes some sense.

# standard arith ops (single predication)

these are of the form res = op(src1, src2, ...)

| 0-1   | 2-3 | 4-5 | 6    | 7-9  | 10-18 | 19-23 |
| ----- | --- | --- | ---- | ---- | ----- | ----- |
| subvl | sew | dew | ptyp | pred | vspec | mode  |

* subvl - 1 to 4: scalar / vec2 / vec3 / vec4
* sew / dew - src / dest element width: DEFAULT / 8 / 16 / 32
* ptyp - predication type: INT / CR
* pred - predicate mask selector and inversion
* vspec - 2 or 3 bits per reg for src / dest scalar-vector extension
* mode - 5 bits

For 2-op (dest/src1/src2) the vspec may be 3 bits per reg: total 9 bits. For 3-op (dest/src1/2/3) the vspec may be 2 bits per reg: total 8 bits.

Note:

* the operation should always be done at max(srcwidth, dstwidth), unless it can be proven that using the lower width will lead to the same result
* saturation is done on the result at the **dest** elwidth

Some examples on different operation widths:

    u16 / u16 = u8
    256 / 2 = 128 # if we used the smaller width, we'd get 0. Wrong

    u8 * u8 = u16
    255 * 2 = 510 # if we used the smaller width, we'd get 254. Wrong

    u16 + u16 = u8
    256 + 2 = 2   # this is correct whether we use the larger or smaller width
                  # aka hw can optimize narrowing addition
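
The three cases can be reproduced with a small model. This is a sketch with unsigned, wrapping arithmetic only (no saturation), and `op_at_width` is a hypothetical helper rather than anything defined by SV.

```python
import operator

def op_at_width(op, a, b, src_bits, dest_bits, use_max=True):
    """Apply op at max(src, dest) width (or the smaller one), truncate to dest."""
    work = max(src_bits, dest_bits) if use_max else min(src_bits, dest_bits)
    mask = (1 << work) - 1
    result = op(a & mask, b & mask) & mask
    return result & ((1 << dest_bits) - 1)

# Each case is evaluated at the larger width, then at the smaller width.
cases = [
    (operator.floordiv, 256, 2, 16, 8),   # u16 / u16 = u8
    (operator.mul, 255, 2, 8, 16),        # u8 * u8 = u16
    (operator.add, 256, 2, 16, 8),        # u16 + u16 = u8
]
for op, a, b, sbits, dbits in cases:
    wide = op_at_width(op, a, b, sbits, dbits, use_max=True)
    narrow = op_at_width(op, a, b, sbits, dbits, use_max=False)
    print(op.__name__, wide, narrow)
```

Only the addition gives the same answer at both widths, which is the case the note above allows hardware to optimise.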

# Mode

    0 1 2   3      4   description
    -------------------------------------------
    0 0 0   0      0   nothing
    0 0 1   sz     dz  pred zeroing
    0 1 inv CR-bit     Rc=1: ffirst CR sel
    0 1 inv sz     dz  Rc=0: ffirst z/nonz
    1 0 N   sz     dz  sat mode: N=0/1 u/s
    1 1 inv CR-bit     Rc=1: pred-result CR sel
    1 1 inv sz     dz  Rc=0: pred-result z/nonz

Mode types:

* **predicate zeroing** (sz, dz): if predication is enabled, zeros are put into the dest (or used as the src in the case of twin pred) when the predicate bit is zero.
* **ffirst** or data-dependent fail-on-first: see separate section.
* **sat mode** or saturation: clamps the result to a min/max rather than overflowing / wrapping. allows signed and unsigned clamping.
* **pred-result** will test the result (CR testing selects a bit of CR and inverts it, just like branch testing) and if the test fails it is as if the predicate bit was zero. When Rc=1 the CR element (CR0) is however still stored in the CR regfile. This scheme does not apply to crops (crand, cror).
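
One possible reading of the mode table as a decoder is sketched below. Several things here are assumptions: bit 0 is taken as the most significant of the 5 bits, the 2-bit CR-bit field is read from bits 3-4, and any 0 0 0 pattern other than all-zeros is reported as unallocated because the table does not list it.

```python
def decode_mode(mode, rc):
    """Decode a 5-bit mode field; bit 0 is assumed to be the MSB."""
    b = [(mode >> (4 - n)) & 1 for n in range(5)]  # b[0]..b[4] as in the table
    sel = (b[0], b[1])
    if sel == (0, 0):
        if b[2] == 0:
            # Only the all-zeros pattern is listed in the table.
            return {"mode": "nothing" if mode == 0 else "unallocated"}
        return {"mode": "pred-zeroing", "sz": b[3], "dz": b[4]}
    if sel == (1, 0):
        # N=0 selects unsigned saturation, N=1 signed.
        return {"mode": "sat", "signed": bool(b[2]), "sz": b[3], "dz": b[4]}
    name = "ffirst" if sel == (0, 1) else "pred-result"
    if rc:
        return {"mode": name, "inv": b[2], "cr_bit": (b[3] << 1) | b[4]}
    return {"mode": name, "inv": b[2], "sz": b[3], "dz": b[4]}

print(decode_mode(0b01110, rc=1))
print(decode_mode(0b10100, rc=0))
```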

# Notes about rounding, clamp and saturate

One of the issues with vector ops is that integer DSP operations, for example in Audio, must clamp or saturate the result rather than overflow, or ignore the upper bits and become a modulo operation. For Audio this is extremely important, as is providing an indicator of whether saturation occurred. see [[av_opcodes]].
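
As an illustration of clamping with a saturation indicator, here is a minimal sketch. The helper name and the returned flag are assumptions made for the example, not a proposed encoding.

```python
def saturate(value, bits, signed):
    """Clamp value to the range of a bits-wide element; report if it clamped."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    clamped = min(max(value, lo), hi)
    return clamped, clamped != value

# Signed 16-bit audio samples: the sum clamps instead of wrapping negative.
print(saturate(30000 + 10000, 16, signed=True))
# Unsigned 8-bit: 200 + 100 clamps to 255 instead of wrapping to 44.
print(saturate(200 + 100, 8, signed=False))
```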

If there are spare bits it would be very good to look at using some of them to specify the mode, because otherwise an SPR has to be used, which will need to be set and unset. This can get costly.

# Fail-on-first

Data-dependent fail-on-first has two distinct variants: one for LD/ST, the other for arithmetic operations (actually, CR-driven). Note that in each case the assumption is that vector elements are required to appear to be executed in sequential Program Order, element 0 being the first.

* LD/ST ffirst treats the first LD/ST in a vector (element 0) as an ordinary one. Exceptions occur "as normal". However for elements 1 and above, if an exception would occur, then VL is **truncated** to the previous element.
* Data-driven (CR-driven) fail-on-first activates when an Rc=1 or other CR-creating operation (including cmp) produces a result. Similar to branch, an analysis of the CR is performed and if the test fails, the vector operation terminates and discards all element operations at and above the current one, and VL is truncated to the *previous* element. Thus the new VL comprises a vector of results that pass certain criteria (equal to zero, less than zero).
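
The data-driven variant can be modelled as a loop that stops at the first failing test. This is a sketch under assumptions: the CR test is represented by a plain Python predicate on the result, and "truncated to the previous element" is read as the new VL being the count of elements that passed.

```python
import operator

def ffirst_arith(op, test, src1, src2, vl):
    """Run op element by element; truncate VL at the first failing test."""
    results = []
    for i in range(vl):
        res = op(src1[i], src2[i])
        if not test(res):
            # Element i and everything above it is discarded; the new VL
            # covers only the elements before the failing one.
            return results, i
        results.append(res)
    return results, vl

# Subtract until a result of zero is produced (test: result is nonzero).
results, new_vl = ffirst_arith(operator.sub, lambda r: r != 0,
                               [5, 6, 7, 8], [1, 2, 7, 3], vl=4)
print(results, new_vl)
```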

The CR-based data-driven fail-on-first is new and not found in ARM SVE or RVV. It is extremely useful for reducing instruction count, however it requires speculative execution involving modifications of VL to get high-performance implementations.

Where the options provided by selecting from only one bit of the CR being tested (and optional inversion of the same) are insufficient, a vectorised crop (crand, cror) may be used and ffirst applied to that.

# Notes about Swizzle

Basically, there isn't enough room to fit swizzle on two sources (src1/2) plus SV, even into 64 bits (actually the 24 spare bits), without severely compromising on the number of bits allocated to either swizzle, or SV, or both.

therefore the strategy proposed is:

* design 16-bit scalar ops
* use the 11-bit old SV prefix to create 32-bit insns
* when those are embedded into v3.1B 64-bit prefix, the 24 bits are entirely allocated to swizzle.

with 2x12 bits this would mean no need for a complex encoding of swizzle.

if we really do need 2 bits spare then the complex encoder of swizzle could be deployed. (*an analysis shows this to be very unlikely. 7^4 = 2401, which still requires 12 bits to encode*)
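
The arithmetic behind that figure is quick to confirm. The sketch below simply takes the numbers quoted above (seven choices for each of four elements) as given.

```python
options_per_element = 7   # figure quoted in the note above
elements = 4

combinations = options_per_element ** elements
# Bits needed to give every combination a distinct code.
bits_needed = (combinations - 1).bit_length()
print(combinations, bits_needed)

# Two sources at this size fill the 24 spare bits exactly.
print(2 * bits_needed)
```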

# note about INT predicate

    001 ALWAYS (implicit) Operation is not masked

this means that by default 001 will always be present in non-predicated ops, which seems anomalous. would 000 be better to indicate "no predication"?

000 would indicate "the predicate is an immediate of all 1s" i.e. "no operation is masked out"