[[!tag standards]]

# New instructions for CR/INT predication

**DRAFT STATUS**

See:

* main bugreport for crweirds
  <https://bugs.libre-soc.org/show_bug.cgi?id=533>
* <https://bugs.libre-soc.org/show_bug.cgi?id=527>
* <https://bugs.libre-soc.org/show_bug.cgi?id=569>
* <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>

Rationale:

Condition Registers are conceptually perfect for use as predicate masks,
the only problem being that typical Vector ISAs have quite comprehensive
mask-based instructions: set-before-first, popcount and much more.
In fact many Vector ISAs can use Vectors *as* masks, consequently the
entire Vector ISA is usually available for use in creating masks (one
exception being AVX512 which has a dedicated Mask regfile and opcodes).
Duplication of such operations (popcount etc) is not practical for SV
given the strategy of leveraging pre-existing Scalar instructions in a
minimalist way.

With the scalar OpenPOWER v3.0B ISA already having popcnt, cntlz and
others normally seen in Vector Mask operations it makes sense to allow
*both* scalar integers *and* CR-Vectors to be predicate masks. That in
turn means that much more comprehensive interaction between CRs and scalar
Integers is required, because with the CR Predication Modes designating
CR *Fields* (not CR bits) as Predicate Elements, fast transfers between
CR *Fields* and the Integer Register File are needed.

The opportunity is therefore taken to augment CR logical arithmetic
as well, using a mask-based paradigm that takes into consideration
multiple bits of each CR Field (eq/lt/gt/ov). By contrast v3.0B Scalar
CR instructions (crand, crxor) only allow a single bit calculation, and
both mtcr and mfcr are CR-orientated rather than CR *Field* orientated.

Also strangely there is no v3.0 instruction for directly moving CR Fields,
only CR *bits*, so that is corrected here with `mcrfm`. The opportunity
is taken to allow inversion of CR Field bits when they are copied.

Basic concept:

* CR-based instructions that perform simple AND/OR from any four bits
  of a CR field to create a single bit value (0/1) in an integer register
* Inverse of the same, taking a single bit value (0/1) from an integer
  register to selectively target any four bits of a given CR Field
* CR-to-CR version of the same, allowing multiple bits to be AND/OR/XORed
  in one hit.
* Optional Vectorisation of the same when SVP64 is implemented

Purpose:

* To provide a merged version of what is currently a multi-sequence of
  CR operations (crand, cror, crxor) with mfcr and mtcrf, reducing
  instruction count.
* To provide a vectorised version of the same, suitable for advanced
  predication

Useful side-effects:

* mtcrweird when RA=0 is a means to set or clear
  multiple arbitrary CR Field bits simultaneously,
  using immediates embedded within the instruction.
* With SVP64 on the weird instructions there is bit-for-bit interaction
  between GPR predicate masks (r3, r10, r31) and the source
  or destination GPR, in ways that are not possible with other
  SVP64 instructions because normal SVP64 is bit-per-element.
  On these weird instructions the element in effect *is* a bit.
* `mfcrweird` mitigates a need to add `conflictd`, part of
  [[sv/vector_ops]], as well as allowing more complex comparisons.

# Bit ordering

Please see [[svp64/appendix]] regarding CR bit ordering and for
the definition of `CR{n}`.

# Instruction form and pseudocode

**DRAFT** Instruction format (use of MAJOR 19 not approved by
OPF ISA WG):

|0-5|6-10 |11|12-15|16-18|19-20|21-25  |26-30  |31|name      |
|---|-----|--|-----|-----|-----|-------|-------|--|----------|
|19 |RT   |  |mask |BFA  |     |XO[0:4]|XO[5:9]|/ |          |
|19 |     |  |     |     |     |1 //// |00011  |  |rsvd      |
|19 |RT   |M |mask |BFA  | 0 0 |0 mode |00011  |Rc|crrweird  |
|19 |RT   |M |mask |BFA  | 0 1 |0 mode |00011  |Rc|mfcrweird |
|19 |RA   |M |mask |BF   | 1 0 |0 mode |00011  |0 |mtcrrweird|
|19 |RA   |M |mask |BF   | 1 0 |0 mode |00011  |1 |mtcrweird |
|19 |BT   |M |mask |BFA  | 1 1 |0 mode |00011  |0 |crweirder |
|19 |BF //|M |mask |BFA  | 1 1 |0 mode |00011  |1 |mcrfm     |

**crrweird**

mode is encoded in XO and is 4 bits

    crrweird: RT,BFA,M,mask,mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    result = n0|n1|n2|n3 if M else n0&n1&n2&n3
    RT[63] = result # MSB0 numbering, 63 is LSB
    if Rc:
        CR0 = analyse(RT)

When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
SVP64 type operation and as such can use Rc=1 and RC1 Data-dependent
Mode capability.

Also as noted below, element-width override bits normally used
on the source are instead used to allow multiple results to be packed
sequentially into the destination. *Destination elwidth overrides still apply*.
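
The scalar `crrweird` pseudocode above can be modelled in a few lines of Python. This is only an illustrative sketch: the function name and the convention of passing the CR Field, `mask` and `mode` as 4-bit integers are assumptions made here, not part of the draft. Because each `n` bit is computed independently, the sketch does not depend on MSB0 versus LSB0 numbering. Read literally, the AND reduction (M=0) can only yield 1 when all four `mask` bits are set, and the sketch follows that literal reading.

```python
def crrweird(creg, mask, mode, M):
    """Hypothetical model of scalar crrweird.

    creg, mask and mode are 4-bit integers; returns the single
    result bit that the instruction would place in the LSB of RT.
    """
    # n[i] = mask[i] & (mode[i] == creg[i]), all four bits at once:
    # XOR gives 1 where mode and creg differ, so invert it.
    n = mask & ~(mode ^ creg) & 0xF
    if M:
        return int(n != 0)    # n0|n1|n2|n3
    return int(n == 0xF)      # n0&n1&n2&n3


# test one bit of the field for being set, ignoring the others
print(crrweird(creg=0b0100, mask=0b0100, mode=0b0100, M=1))
```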

**mfcrweird**

mode is encoded in XO and is 4 bits

    mfcrweird: RT,BFA,mask,mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    result = n0||n1||n2||n3 # concatenation: 4-bit result
    RT[60:63] = result # MSB0 numbering, 63 is LSB
    if Rc:
        CR0 = analyse(RT)

When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
SVP64 type operation and as such can use Rc=1 and RC1 Data-dependent
Mode capability.

Also as noted below, element-width override bits normally used
on the source are instead used to allow multiple results to be packed
into the destination. *Destination elwidth overrides still apply*.

**mtcrrweird**

mode is encoded in XO and is 4 bits

    mtcrrweird: BF,RA,M,mask,mode

    a = (RA|0)
    n0 = mask[0] & (mode[0] == a[63])
    n1 = mask[1] & (mode[1] == a[62])
    n2 = mask[2] & (mode[2] == a[61])
    n3 = mask[3] & (mode[3] == a[60])
    result = n0 || n1 || n2 || n3
    if M:
        result |= CR{BF} & ~mask
    CR{BF} = result

When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
SVP64 type operation and as such can use RC1 Data-dependent
Mode capability.

**mtcrweird**

    mtcrweird: BF,RA,M,mask,mode

    reg = (RA|0)
    lsb = reg[63] # MSB0 numbering
    n0 = mask[0] & (mode[0] == lsb)
    n1 = mask[1] & (mode[1] == lsb)
    n2 = mask[2] & (mode[2] == lsb)
    n3 = mask[3] & (mode[3] == lsb)
    result = n0 || n1 || n2 || n3
    if M:
        result |= CR{BF} & ~mask
    CR{BF} = result

Note that when M=1 this operation is a Read-Modify-Write on the CR Field
BF. Masked-out bits of the 4-bit CR Field BF will not be changed when
M=1. Correspondingly when M=0 this operation is an overwrite: no read
of BF is required because the masked-out bits of the BF CR Field are
set to zero.

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
type operation that has 3-bit Data-dependent and 3-bit Predicate-result
capability (BF is 3 bits).

**mcrfm** - Move CR Field, masked.

This instruction copies, sets, or inverts parts of a CR Field
into another CR Field. `mcrf` copies only one bit of the CR
from any arbitrary bit to any other arbitrary bit, whereas
`mcrfm` copies an entire 4-bit CR Field (or masked parts thereof).
Unlike `mcrf` the bits of the CR Field may not change position:
the EQ bit from the source may only go into the EQ bit of the
destination (optionally inverted, set, or cleared).

    mcrfm: BF,BFA,M,mask,mode

    result = mask & CR{BFA}
    if M:
        result |= CR{BF} & ~mask
    result ^= mode
    CR{BF} = result

When M=1 this operation is a Read-Modify-Write on the CR Field
BF. Masked-out bits of the 4-bit CR Field BF will not be changed when
M=1. Correspondingly when M=0 this operation is an overwrite: no read
of BF is required because the masked-out bits of the BF CR Field are
set to zero before `mode` is XORed in.

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
type operation that has 3-bit Data-dependent and 3-bit Predicate-result
capability (BF is 3 bits).

*Programmer's note: `mode` being XORed onto the result provides
considerable flexibility. Individual bits of BFA may be copied inverted
to BF by ensuring that `mask` and `mode` have the same bit set. Also,
individual bits in BF may be set to 1 by ensuring that the required bit of
`mask` is set to zero and the same bit in `mode` is set to 1.*
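
The two tricks in the programmer's note can be checked with a small Python model of the `mcrfm` pseudocode. This is a sketch under stated assumptions: the function name is hypothetical, CR Fields are passed as plain 4-bit integers, and the final XOR with `mode` is taken to apply regardless of M, which is what the note requires in order to set bits when M=0.

```python
def mcrfm(cr_bf, cr_bfa, M, mask, mode):
    """Hypothetical model of mcrfm on 4-bit integer CR Fields."""
    result = mask & cr_bfa
    if M:
        # Read-Modify-Write: keep the masked-out bits of BF
        result |= cr_bf & ~mask & 0xF
    result ^= mode            # optional inversion / setting of bits
    return result & 0xF


# copy one bit of BFA inverted into BF, leaving the rest of BF alone:
# mask and mode have the same bit set, M=1
print(bin(mcrfm(cr_bf=0b1010, cr_bfa=0b0100, M=1, mask=0b0100, mode=0b0100)))

# set one bit of BF to 1: mask bit clear, mode bit set, M=0
print(bin(mcrfm(cr_bf=0b0000, cr_bfa=0b0000, M=0, mask=0b0000, mode=0b0001)))
```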

**crweirder**

    crweirder: BT,BFA,M,mask,mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    BF = BT[2:4] # select CR
    bit = BT[0:1] # select bit of CR
    result = n0|n1|n2|n3 if M else n0&n1&n2&n3
    CR{BF}[bit] = result

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
type operation that has 5-bit Data-dependent and 5-bit Predicate-result
capability (BT is 5 bits).

**Example Pseudo-ops:**

    mtcri BF, mode      mtcrweird BF, r0, 0, 0b1111,~mode
    mtcrset BF, mask    mtcrweird BF, r0, 1, mask,0b0000
    mtcrclr BF, mask    mtcrweird BF, r0, 1, mask,0b1111
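
As a sanity check, the three pseudo-ops can be derived from a Python model of the scalar `mtcrweird` pseudocode. This is a minimal sketch, not a reference implementation: the function names are hypothetical, CR Fields and immediates are 4-bit integers, and `RA=0` is modelled simply as a source bit of zero.

```python
def mtcrweird(cr_bf, ra_lsb, M, mask, mode):
    """Hypothetical model of scalar mtcrweird; returns the new CR{BF}."""
    lsb4 = 0xF if ra_lsb else 0x0           # replicate the GPR LSB to 4 bits
    result = mask & ~(mode ^ lsb4) & 0xF    # n[i] = mask[i] & (mode[i] == lsb)
    if M:
        result |= cr_bf & ~mask & 0xF       # keep masked-out bits of BF
    return result


# The pseudo-ops all use RA=0, so the source bit is always zero.
def mtcri(cr_bf, imm):
    return mtcrweird(cr_bf, 0, 0, 0b1111, ~imm & 0xF)

def mtcrset(cr_bf, mask):
    return mtcrweird(cr_bf, 0, 1, mask, 0b0000)

def mtcrclr(cr_bf, mask):
    return mtcrweird(cr_bf, 0, 1, mask, 0b1111)


print(bin(mtcri(0b1111, 0b0101)), bin(mtcrset(0b1000, 0b0011)),
      bin(mtcrclr(0b1111, 0b0011)))
```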

# Vectorised versions involving GPRs

The name "weird" refers to a minor violation of SV rules when it comes
to deriving the Vectorised versions of these instructions.

Normally the progression of the SV for-loop would move on to the
next register. Instead however in the scalar case these instructions
**remain in the same register** and insert or transfer between **bits**
of the scalar integer source or destination. The reason is that when
using CR Fields as predicate masks and there is a need to transfer
into a GPR, again for use as a predicate mask, the CR Field bits
need to be efficiently packed into that one GPR (r3, r10 or r31).

Further useful violation of the normal SV Elwidth override rules allows
for packing (or unpacking) of multiple CR test results into (or out of)
an Integer Element. Note that the CR (source operand) elwidth field is
utilised to determine the bit-packing size (1/2/4/8 with remaining
bits within the Integer element set to zero) whilst the INT (dest
operand) elwidth field still sets the Integer element size as usual
(8/16/32/default).

**crrweird: RT, BB, mask.mode**

    for i in range(VL):
        if BB.isvec:
            creg = CR{BB+i}
        else:
            creg = CR{BB}
        n0 = mask[0] & (mode[0] == creg[0])
        n1 = mask[1] & (mode[1] == creg[1])
        n2 = mask[2] & (mode[2] == creg[2])
        n3 = mask[3] & (mode[3] == creg[3])
        # OR or AND to a single bit
        result = n0|n1|n2|n3 if M else n0&n1&n2&n3
        if RT.isvec:
            # TODO: RT.elwidth override to be also added here
            # note, yes, really, the CR's elwidth field determines
            # the bit-packing into the INT!
            if BB.elwidth == 0b00:
                # pack 1 result into 64-bit registers
                iregs[RT+i][0..62] = 0
                iregs[RT+i][63] = result # sets LSB to result
            if BB.elwidth == 0b01:
                # pack 2 results sequentially into INT registers
                iregs[RT+i//2][0..61] = 0
                iregs[RT+i//2][63-(i%2)] = result
            if BB.elwidth == 0b10:
                # pack 4 results sequentially into INT registers
                iregs[RT+i//4][0..59] = 0
                iregs[RT+i//4][63-(i%4)] = result
            if BB.elwidth == 0b11:
                # pack 8 results sequentially into INT registers
                iregs[RT+i//8][0..55] = 0
                iregs[RT+i//8][63-(i%8)] = result
        else:
            iregs[RT][63-i] = result # results also in scalar INT

Note that:

* in the scalar case the CR-Vector assessment
  is stored bit-wise starting at the LSB of the
  destination scalar INT
* in the INT-vector case the results are packed into LSBs
  of the INT Elements, the packing arrangement depending on both
  elwidth override settings.
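
The packing behaviour can be illustrated with a short Python model of the loop above. It is a sketch only, with several assumptions: the function name and return type (a dict mapping a register offset from RT to its value) are hypothetical, the source CR Fields are supplied as a list of 4-bit integers, the RT elwidth override is ignored (as the TODO in the pseudocode notes), and MSB0 bit `63-k` is modelled as LSB0 bit `k` of a Python integer.

```python
def sv_crrweird(cr_fields, mask, mode, M, cr_elwidth, rt_isvec=True):
    """Hypothetical model of vectorised crrweird result packing."""
    # the CR (source) elwidth field selects how many 1-bit results
    # are packed into each destination register
    per_reg = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}[cr_elwidth]
    regs = {}
    for i, creg in enumerate(cr_fields):       # i runs over range(VL)
        n = mask & ~(mode ^ creg) & 0xF
        bit = int(n != 0) if M else int(n == 0xF)
        if rt_isvec:
            idx, off = divmod(i, per_reg)      # register offset, bit offset
        else:
            idx, off = 0, i                    # scalar: all bits in RT
        regs[idx] = regs.get(idx, 0) | (bit << off)
    return regs


fields = [0b0010, 0b0000, 0b0010, 0b0010, 0b0000, 0b0010]
print(sv_crrweird(fields, 0b0010, 0b0010, 1, cr_elwidth=0b10))
print(sv_crrweird(fields, 0b0010, 0b0010, 1, cr_elwidth=0b10, rt_isvec=False))
```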

**mfcrweird: RT, BFA, mask.mode**

Unlike `crrweird` the results are 4 bits wide, so the packing
will begin to spill over to other destination elements. 8 results per
destination at 4 bits each still fit into a destination elwidth of 32-bit,
but for 16-bit and 8-bit this does not fit, and the results must be split
across to the next element.

When for example destination elwidth is 16-bit (0b10) the following packing
occurs:

- SVRM bits 6:7 equal to 0b00 - one 4-bit result element packed into the
  first 4-bits of the 16-bit destination element (in the first 4 LSBs)
- SVRM bits 6:7 equal to 0b01 - two 4-bit result elements packed into the
  first 8-bits of the 16-bit destination element (in the first 8 LSBs)
- SVRM bits 6:7 equal to 0b10 - four 4-bit result elements packed into each
  16-bit destination element
- SVRM bits 6:7 equal to 0b11 - eight 4-bit result elements, the first four
  of which are packed into the first 16-bit destination element, the
  second four of which are packed into the second 16-bit destination element.

Pseudocode example: note that dest elwidth overrides affect the
packing of results. BB.elwidth in effect requests how many 4-bit
result elements are to be packed, but RT.elwidth determines
the limit. Any parts of the destination elements not containing
results are set to zero.

    for i in range(VL):
        if BB.isvec:
            creg = CR{BB+i}
        else:
            creg = CR{BB}
        n0 = mask[0] & (mode[0] == creg[0])
        n1 = mask[1] & (mode[1] == creg[1])
        n2 = mask[2] & (mode[2] == creg[2])
        n3 = mask[3] & (mode[3] == creg[3])
        result = n0||n1||n2||n3 # 4-bit result
        if RT.isvec:
            # RT.elwidth override can affect the packing
            bwid = {0b00:64, 0b01:8, 0b10:16, 0b11:32}[RT.elwidth]
            # limit is the number of 4-bit results that fit in bwid bits
            t4, t8 = min(4, bwid//4), min(8, bwid//4)
            # yes, really, the CR's elwidth field determines
            # the bit-packing into the INT!
            if BB.elwidth == 0b00:
                # pack 1 result into 64-bit registers
                idx, boff = i, 0
            if BB.elwidth == 0b01:
                # pack 2 results sequentially into INT registers
                idx, boff = i//2, i%2
            if BB.elwidth == 0b10:
                # pack 4 results sequentially into INT registers
                idx, boff = i//t4, i%t4
            if BB.elwidth == 0b11:
                # pack 8 results sequentially into INT registers
                idx, boff = i//t8, i%t8
        else:
            # exceeding VL=16 is UNDEFINED
            idx, boff = 0, i
        iregs[RT+idx][60-boff*4:63-boff*4] = result
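
The slot arithmetic can be tried out in isolation. The Python sketch below computes only the destination element index and the 4-bit slot within it; the function name is hypothetical, and it assumes that the limit imposed by RT.elwidth is the number of 4-bit results that physically fit in one destination element (element width divided by four), which is how the 16-bit list above reads.

```python
def mfcrweird_slot(i, rt_elwidth, cr_elwidth):
    """Hypothetical helper: where does result i land?

    Returns (destination element index, 4-bit slot within that element).
    """
    bwid = {0b00: 64, 0b01: 8, 0b10: 16, 0b11: 32}[rt_elwidth]
    requested = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}[cr_elwidth]
    # the CR elwidth requests a packing density, the INT elwidth caps it
    per_elem = min(requested, bwid // 4)
    return i // per_elem, i % per_elem


# 16-bit destination elements, eight results requested per element:
# only four fit, so results 4..7 spill into the second element
print([mfcrweird_slot(i, 0b10, 0b11) for i in range(8)])
```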

# v3.1 setbc instructions

There are additional setb conditional instructions in v3.1 (p129)

    RT = (CR[BI] == 1) ? 1 : 0

which also negate that, and also return -1 / 0. These are similar to
crweird but do not serve the same purpose. Most notable is that crweird
acts on CR Fields rather than the entire 32-bit CR.

# Predication Examples

Take the following example:

    r10 = 0b00010
    sv.mtcrweird/dm=r10/dz cr8.v, 0, 0b0011.0000

Here, RA is zero, so the source input is zero. The destination is CR Field
8, and the destination predicate mask indicates to target the first two
elements. Destination predicate zeroing is enabled, and the destination
predicate is only set in the 2nd bit. mask is 0b0011, mode is all zeros.

Let us first consider what should go into element 0 (CR Field 8):

* The destination predicate bit is zero, and zeroing is enabled.
* Therefore, what is in the source is irrelevant: the result must
  be zero.
* All four bits of CR Field 8 are therefore set to zero.

Now the second element, CR Field 9 (CR9):

* The 2nd bit of the destination predicate, r10, is 1. Therefore the
  computation of the result is relevant.
* RA is zero therefore the source bit is zero. mask is 0b0011 and mode
  is 0b0000
* When calculating n0 thru n3 we get n0=1, n1=1, n2=0, n3=0
* Therefore, CR9 is set (using LSB0 ordering) to 0b0011, i.e. to mask.

It should be clear that this instruction uses bits of the integer
predicate to decide whether to set CR Fields to `(mask & ~mode)` or
to zero. Thus, in effect, it is the integer predicate that has been
copied into the CR Fields.
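
The walkthrough above can be reproduced with a small Python simulation. It is a sketch under assumptions that the example does not spell out: VL is taken to be 2, M is taken to be 0 (overwrite), and the function name and the `src_bits` parameter (the source GPR, one bit per element, zero when RA=0) are hypothetical.

```python
def sv_mtcrweird_dz(dest_pred, VL, mask, mode, src_bits=0):
    """Hypothetical model of sv.mtcrweird with dest-zeroing and M=0.

    Returns the list of 4-bit values written to successive CR Fields.
    """
    fields = []
    for i in range(VL):
        if not (dest_pred >> i) & 1:
            fields.append(0)           # predicate bit clear, zeroing enabled
            continue
        # on the weird instructions the source element is a single bit
        lsb4 = 0xF if (src_bits >> i) & 1 else 0x0
        fields.append(mask & ~(mode ^ lsb4) & 0xF)
    return fields


# r10 = 0b00010, mask = 0b0011, mode = 0b0000: CR8 is zeroed, CR9 gets mask
print([bin(f) for f in sv_mtcrweird_dz(0b00010, 2, 0b0011, 0b0000)])
```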

By using twin predication, zeroing, and inversion (sm=~r3, dm=r10) for
example, it becomes possible to combine two Integers together in order
to set bits in CR Fields. Likewise there are dozens of ways that CR
Predicates can be used, on the same sv.mtcrweird instruction.