# Note about naming

the original assessment for SVP from 18 months ago concluded that it should be easy for scalar (non-SV) instructions to get at the exact same scalar registers when in SVP mode. otherwise scalar v3.0B code needs to restrict itself to a massively truncated subset of the scalar registers numbered 0-31 (only r0, r4, r8...), which interferes with ABIs to such an extent that it would compromise SV.

question: has anything changed about the assessment that was done, which concluded that for scalar SVP regs they should overlap completely with scalar ISA regs?

# Notes on requirements for bit allocations

do not try to jam VL or MAXVL in. go with the flow of 24 bits spare.

* 2: SUBVL
* 2: elwidth
* 2: twin-predication (src, dest) elwidth
* 1: select INT or CR predication
* 3: predicate selection and inversion (QTY 2 for tpred)
* 4x2 or 3x3: src1/2/3/dest Vector/Scalar reg
* 5: mode

totals: 24 bits (dest elwidth shared)
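
The budget can be checked with a quick tally. This is only a sketch: the field widths come from the list above, but grouping them into "single predicate, 3x3", "single predicate, 4x2" and "twin predicate, 2x3" layouts is an assumption, and `total_bits` is a made-up helper name.

```python
# Widths shared by every layout, taken from the list above.
COMMON = {"subvl": 2, "src_elwidth": 2, "dest_elwidth": 2, "ptyp": 1, "mode": 5}

def total_bits(num_predicates, regs, bits_per_reg):
    """Sum the prefix bits for a given predicate count and register-tag layout."""
    pred = 3 * num_predicates      # 3 bits each: mask selector plus inversion
    vspec = regs * bits_per_reg    # Vector/Scalar tag bits per register
    return sum(COMMON.values()) + pred + vspec

single_3x3 = total_bits(1, 3, 3)   # dest/src1/src2, 3 bits each
single_4x2 = total_bits(1, 4, 2)   # dest/src1/src2/src3, 2 bits each
twin_2x3 = total_bits(2, 2, 3)     # src/dest, 3 bits each, two predicates
print(single_3x3, single_4x2, twin_2x3)
```

Under these assumptions the 3x3 and twin-predicate layouts use all 24 bits, and the 4x2 layout leaves one bit over.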

http://lists.libre-soc.org/pipermail/libre-soc-dev/2020-December/001434.html

## All zeros indicates "disable SVP"

The defaults for all capabilities of SVP should be zero to indicate "no action". For example: SUBVL=1 is encoded as 0b00; register name prefixes, scalar=0b0; elwidth overrides, DEFAULT=0b00; predication off=0b000; etc.

this way SV may be entirely disabled, leaving an "all zeros" to indicate to v3.1B 64-bit prefixing that the standard OpenPOWER v3.1B encodings are in full effect (and that SV is not). As all zeros meshes with current "reserved" encodings this should work well.


## twin predication

twin predication and twin elwidth overrides are extremely important: they make it possible to override both the src and dest elwidth yet keep the underlying scalar operation intact. examples include mr with an elwidth=8, VL=8 on the src, which will take a byte at a time from one 64-bit reg and place it into 8x 64-bit regs, zero-extended. more complex operations involve SUBVL and Audio/Video DSP operations, see [[av_opcodes]]
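
As an illustration of the mr example, here is a minimal sketch of the intended effect. It assumes element 0 is the least significant byte of the source register (the element ordering is not stated above), and `mr_src_elwidth8` is a hypothetical name, not an instruction or library function.

```python
def mr_src_elwidth8(src_reg, vl=8):
    """Split one 64-bit register into vl bytes, one per destination register."""
    dest_regs = []
    for i in range(vl):
        # Assumption: element i is byte i counting from the least significant end.
        byte = (src_reg >> (8 * i)) & 0xFF
        # Zero-extension: the upper 56 bits of each 64-bit dest register are 0.
        dest_regs.append(byte)
    return dest_regs

regs = mr_src_elwidth8(0x0807060504030201)
print([hex(r) for r in regs])
```

The scalar operation (a plain register move) is unchanged: only the src elwidth override turns it into a byte-unpacking operation.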

the prefix layout could be something like:

| 0-1   | 2-3 | 4-5 | 6    | 7-9  | 10-12 | 13-18 | 19-23 |
|-------|-----|-----|------|------|-------|-------|-------|
| subvl | sew | dew | ptyp | psrc | pdst  | vspec | mode  |

fields:

* subvl - 1 to 4: scalar / vec2 / vec3 / vec4
* sew / dew - DEFAULT / 8 / 16 / 32 element width
* ptyp - predication INT / CR
* psrc / pdst - predicate mask selector and inversion
* vspec - 3 bit src / dest scalar-vector extension
* mode: 5 bits
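
A sketch of packing and unpacking this layout follows. It assumes the bit numbers in the table are MSB0 (bit 0 is the most significant of the 24 bits, as is usual in OpenPOWER documents); `pack` and `unpack` are hypothetical helpers. It also shows the "all zeros means SV disabled" property: leaving every field at its default gives a prefix of 0.

```python
# (name, first bit, last bit), MSB0 numbering assumed.
FIELDS = [
    ("subvl", 0, 1), ("sew", 2, 3), ("dew", 4, 5), ("ptyp", 6, 6),
    ("psrc", 7, 9), ("pdst", 10, 12), ("vspec", 13, 18), ("mode", 19, 23),
]
WIDTH = 24

def pack(**values):
    """Build the 24-bit prefix; fields that are not given default to 0."""
    word = 0
    for name, start, end in FIELDS:
        nbits = end - start + 1
        value = values.get(name, 0)
        if not 0 <= value < (1 << nbits):
            raise ValueError(f"{name} does not fit in {nbits} bits")
        word |= value << (WIDTH - 1 - end)
    return word

def unpack(word):
    """Split a 24-bit prefix back into its named fields."""
    return {
        name: (word >> (WIDTH - 1 - end)) & ((1 << (end - start + 1)) - 1)
        for name, start, end in FIELDS
    }

print(hex(pack()))                        # all defaults
print(unpack(pack(subvl=3, psrc=0b010)))  # round trip
```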

## twin predication, CR based.

separate src and dest predicates are a critical part of SV for provision of VEXPAND, VCOMPRESS, VSPLAT, VINSERT and many more operations.
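
To show why two predicates are enough for VCOMPRESS and VEXPAND, here is a minimal sketch of a twin-predicated mv. It assumes a "skip masked-out elements independently on each side" loop and that bit i of each mask controls element i. Both are illustrative choices, and `twin_pred_mv` is a hypothetical name.

```python
def twin_pred_mv(src, dest, srcmask, dstmask, vl):
    """Copy src elements enabled by srcmask into dest slots enabled by dstmask."""
    out = list(dest)
    i = j = 0
    while True:
        # Skip source elements whose predicate bit is 0.
        while i < vl and not (srcmask >> i) & 1:
            i += 1
        # Skip destination elements whose predicate bit is 0.
        while j < vl and not (dstmask >> j) & 1:
            j += 1
        if i >= vl or j >= vl:
            break
        out[j] = src[i]
        i += 1
        j += 1
    return out

src = [10, 11, 12, 13]
compressed = twin_pred_mv(src, [0] * 4, 0b1010, 0b1111, 4)  # VCOMPRESS-like
expanded = twin_pred_mv(src, [0] * 4, 0b1111, 0b1010, 4)    # VEXPAND-like
print(compressed, expanded)
```

With only the src masked, the selected elements are packed together; with only the dest masked, consecutive elements are spread out.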

Twin CR predication could be done in two ways:

* start from different CRs for the src and dest
* start from the same CR.

With different bits being selectable (CR[0..3]), starting from the same CR makes some sense.

# standard arith ops (single predication)

these are of the form res = op(src1, src2, ...)

| 0-1   | 2-3 | 4-5 | 6    | 7-9  | 10-18 | 19-23 |
|-------|-----|-----|------|------|-------|-------|
| subvl | sew | dew | ptyp | pred | vspec | mode  |

fields:

* subvl - 1 to 4: scalar / vec2 / vec3 / vec4
* sew / dew - DEFAULT / 8 / 16 / 32 element width
* ptyp - predication INT / CR
* pred - predicate mask selector and inversion
* vspec - 2/3 bit src / dest scalar-vector extension
* mode - 5 bits

For 2 op (dest/src1/src2) the vspec may be 3 bits per reg: total 9 bits. for 3 op (dest/src1/2/3) the vspec may be 2 bits per reg: total 8 bits.

Note:

* the operation should always be done at max(srcwidth, dstwidth), unless it can be proven that using the lower width will lead to the same result
* saturation is done on the result at the **dest** elwidth

Some examples on different operation widths:

    u16 / u16 = u8
    256 / 2 = 128 # if we used the smaller width, we'd get 0. Wrong

    u8 * u8 = u16
    255 * 2 = 510 # if we used the smaller width, we'd get 254. Wrong

    u16 + u16 = u8
    256 + 2 = 2 # this is correct whether we use the larger or smaller width
                # aka hw can optimize narrowing addition
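
The three cases above can be reproduced with a small sketch. It models unsigned operands only and truncates (no saturation) at the dest width; `compute` and its parameters are hypothetical names.

```python
import operator

def compute(op, a, b, op_bits, dest_bits):
    """Do op at op_bits wide (unsigned), then truncate the result to dest_bits."""
    op_mask = (1 << op_bits) - 1
    result = op(a & op_mask, b & op_mask) & op_mask
    return result & ((1 << dest_bits) - 1)

# u16 / u16 = u8: the wide and narrow operation widths disagree
div_wide = compute(operator.floordiv, 256, 2, op_bits=16, dest_bits=8)
div_narrow = compute(operator.floordiv, 256, 2, op_bits=8, dest_bits=8)

# u8 * u8 = u16: again the narrow width loses the carry
mul_wide = compute(operator.mul, 255, 2, op_bits=16, dest_bits=16)
mul_narrow = compute(operator.mul, 255, 2, op_bits=8, dest_bits=16)

# u16 + u16 = u8: both widths agree, so hardware may use the narrow one
add_wide = compute(operator.add, 256, 2, op_bits=16, dest_bits=8)
add_narrow = compute(operator.add, 256, 2, op_bits=8, dest_bits=8)

print(div_wide, div_narrow, mul_wide, mul_narrow, add_wide, add_narrow)
```

Passing `op_bits = max(srcwidth, dstwidth)` is the safe default. The narrow width is an optimisation that is only valid where, as with addition, the low bits of the result do not depend on the high bits of the inputs.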


# Notes about Swizzle

Basically, there isn't enough room to fit swizzles for two srcs (src1/2), and SV, even into 64 bit (actually 24) without severely compromising on the number of bits allocated to either swizzle, or SV, or both.

therefore the strategy proposed is:

* design 16-bit scalar ops
* use the 11 bit old SV prefix to create 32-bit insns
* when those are embedded into v3.1B 64 prefix, the 24 bits are entirely allocated to swizzle.

with 2x12 this would mean no need to have complex encoding of swizzle.

if we really do need 2 bits spare then the complex encoder of swizzle could be deployed. (*an analysis shows this to be very unlikely. 7^4 is 2401, which still requires 12 bits to encode*) (that's miscalculated, see Single Swizzle section below.) (it isn't, because you missed out predicate mask skip as the 7th option.)

## Single Swizzle

I expect swizzle to not be common enough to warrant 2 swizzles in a single instruction, therefore the above swizzle strategy is probably unnecessary.

Also, if a swizzle supports up to subvl=4, then 11 bits is sufficient since each swizzle element needs to be able to select 1 of 6 different values: 0, 1, x, y, z, w. 6^4 = 1296 which easily fits in 11 bits (only by dropping "predicate mask" from the list of options, which makes 7 options not 6. see [[mv.swizzle]])

What about subvl=4 that skips one element? src vec is 4 but one of the elements is to be left alone? This is not 6 options, it is 7 options (including "skip", i.e. combining with a predicate mask in effect). note that this is not the same as a vec3-with-a-skip

What could hypothetically be done is: when SUBVL=3 a different encoding is used, one that allows the "skip" to be specified. X Y skip W for example. this would then be interpreted, "actually the vector is vec4 but one element is skipped"

the problem with that is that now SUBVL has become critically dependent on the swizzle, worse than that the swizzle is embedded in the instruction, even worse than that it's encoded in a complex multi-gate fashion.

all of which screams, "this is going in completely the wrong direction". keep it simple. 7 options, 3 bits, 4x3, 12 bits for swizzle, ignore some if SUBVL is 1 2 or 3.
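
The bit counts in this discussion, and the "keep it simple" 4x3 encoding, can be sketched as follows. The numeric value assigned to each option and the placement of lane 0 in the low 3 bits are assumptions made for illustration, and the function names are hypothetical.

```python
# Assumed numbering of the 7 options; value 0b111 is left unused here.
OPTIONS = ["0", "1", "x", "y", "z", "w", "skip"]

def bits_needed(n_options, lanes=4):
    """Bits needed for a dense encoding of n_options ** lanes combinations."""
    return (n_options ** lanes - 1).bit_length()

def encode_swizzle(lanes):
    """Pack up to 4 lane selectors, 3 bits each, lane 0 in the low bits."""
    word = 0
    for pos, name in enumerate(lanes):
        word |= OPTIONS.index(name) << (3 * pos)
    return word

def decode_swizzle(word, subvl=4):
    """Read back subvl lane selectors; higher lanes are simply ignored."""
    return [OPTIONS[(word >> (3 * pos)) & 0b111] for pos in range(subvl)]

print(bits_needed(6), bits_needed(7))
word = encode_swizzle(["x", "y", "skip", "w"])
print(decode_swizzle(word), decode_swizzle(word, subvl=2))
```

Six options need 11 bits when densely packed and seven need 12, so the plain 4x3 layout costs nothing extra over a dense 7-option encoding while needing no multi-gate decoder.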

# note about INT predicate

    001 ALWAYS (implicit) Operation is not masked

this means by default that 001 will always be in nonpredicated ops, which seems anomalous. would 000 be better to indicate "no predication"?

000 would indicate "the predicate is an immediate of all 1s" i.e. "no operation is masked out"

programmerjake:
I picked 0001 to indicate ALWAYS since that matches with the other semantics: the LSB bit is invert-the-mask, and you can think about the table as-if it is really:

this is the opposite of what feels natural. inversion should switch *off* something. also 000 is the canonical "this feature is off by default" number.

the constant should be an immediate of all 1s (not r0), which is the natural way to think of "predication is off".

i get the idea "r0 to be used therefore it is all zeros" but that makes 001 the "default", not 000.

| Value | Mnemonic    |
|-------|-------------|
| 000   | R0 (zero) set to all 1s, naturally means "no predication" |
| 001   | ~R0 (~zero) |
| 010   | R3          |
| 011   | ~R3         |
| 100   | R10         |
| 101   | ~R10        |
| 110   | R30         |
| 111   | ~R30        |
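
One possible reading of this table, as a sketch only: the discussion above is unsettled, so this is not a settled decode. It assumes 000 is special-cased as an immediate of all 1s, that the LSB inverts the mask, that the upper two bits pick r0/r3/r10/r30, and that 001 means the inverted contents of r0. `int_predicate_mask` is a hypothetical name.

```python
PRED_REGS = [0, 3, 10, 30]  # register selected by the upper two bits

def int_predicate_mask(value, regs, vl):
    """Return the vl-bit element mask selected by a 3-bit predicate field."""
    all_ones = (1 << vl) - 1
    if value == 0b000:
        # Assumed special case: immediate of all 1s, nothing is masked out.
        return all_ones
    mask = regs[PRED_REGS[value >> 1]]
    if value & 1:
        # LSB set: invert the mask.
        mask = ~mask
    return mask & all_ones

regs = {0: 0b0101, 3: 0b0011, 10: 0b1000, 30: 0b0000}
for value in range(8):
    print(format(value, "03b"), format(int_predicate_mask(value, regs, 4), "04b"))
```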


# CR Vectorisation

Some thoughts on this: the sensible (sane) number of CRs to have is 64. A case could be made for having 128 but it is an awful lot. 64 CRs also has the advantage that it is only 4x 64 bit registers on a context-switch (programmerjake: yeah, but we already have 256 64-bit registers, a few more won't change much).

A practical issue stems from the fact that accessing the CR regfile on a non-aligned 8-CR boundary during Vector operations would significantly increase internal routing. By aligning Vector Reads/Writes to 8 CRs this requires only 32 bit aligned read/writes. (programmerjake: simple solution -- rename them internally such that CR6 is the first one)

How to number them as vectors gets particularly interesting. A case could be made for treating the 64 CRs as a square, and using CR numbering (CR0-7) to begin VL for-loop incrementing first by row and when rolling over to increment the column. CR6 CR14 ... CR62 then CR7 CR15 ...

When the SV prefix marks them with 2 bits, one of those could be used to indicate scalar, and the other to indicate whether the 3 bit CR number is to be treated as a horizontal vector (CR incrementing straight by 1) or a vertical vector (incrementing by 8)

When there are 3 bits it would be possible to indicate whether to begin from a position offset by 4 (middle of matrix, edge of matrix).

Note: considerable care needs to be taken when putting these horiz/vertical CRs through the Dependency Matrices

Indexing algorithm illustrating how the H/V modes would work. Note that BA is the 3 bit CR register field that normally, in scalar ISA, would reference only CR0-7 as CR[BA].

    for i in range(VL):
        y = i % 8    # row within the 8x8 square
        x = i // 8   # column, incremented on rollover
        if verticalmode:
            CRINDEX = BA + y*8 + x   # step by 8, then move one CR along
        else:
            CRINDEX = BA*8 + i       # step straight by 1
        CR[CRINDEX] = ...
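
The same indexing can be run standalone to see the sequences it produces. `cr_indices` is a hypothetical wrapper around the loop above; it does no range checking or wrapping, so large BA and VL values in vertical mode can go past CR63.

```python
def cr_indices(BA, VL, vertical):
    """List the CR numbers visited for a given 3-bit BA field, VL and mode."""
    indices = []
    for i in range(VL):
        y = i % 8    # row within the 8x8 square
        x = i // 8   # column, incremented on rollover
        if vertical:
            indices.append(BA + y * 8 + x)
        else:
            indices.append(BA * 8 + i)
    return indices

vertical_seq = cr_indices(BA=6, VL=10, vertical=True)
horizontal_seq = cr_indices(BA=1, VL=4, vertical=False)
print(vertical_seq)
print(horizontal_seq)
```

With BA=6 in vertical mode this gives CR6, CR14, ... CR62 and then CR7, CR15, matching the numbering described in the text.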

# Should twin-predication (src=1, dest=1) have DEST SUBVL?

this is tricky: there isn't really enough space unless the reg scalar-vector extension (currently 3 bits per reg) is compacted to only 2 bits each, which would provide 2 extra bits.

so before adding this, an evaluation is needed: *is it necessary*?

what actual operations out of this list need - and work - with a separate SRC and DEST SUBVL?

* mv (the usual way that V* operations are created)
* exts* sign-extension
* rlwinm and other RS-RA shift operations
* LD and ST (treating AGEN as one source)
* FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
* Condition Register ops mfcr, mtcr and other similar

Evaluation:

* mv: yes. these may need merge/split
* exts: no. no transformation.
* rlwinm shift operations: no
* LD and ST: no
* FP ops: no
* CR ops: maybe on mvs, not on arithmetic.

therefore it makes no sense to have DEST SUBVL: the better option is to have special mv operations. see [[mv.vec]]