# New instructions for CR/INT predication

* main bugreport for crweirds
  <https://bugs.libre-soc.org/show_bug.cgi?id=533>
* <https://bugs.libre-soc.org/show_bug.cgi?id=527>
* <https://bugs.libre-soc.org/show_bug.cgi?id=569>
* <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>

Condition Registers are conceptually perfect for use as predicate masks,
the only problem being that typical Vector ISAs have quite comprehensive
mask-based instructions: set-before-first, popcount and much more.
In fact many Vector ISAs can use Vectors *as* masks, consequently the
entire Vector ISA is usually available for use in creating masks (one
exception being AVX512 which has a dedicated Mask regfile and opcodes).
Duplication of such operations (popcount etc) is not practical for SV
given the strategy of leveraging pre-existing Scalar instructions in a

With the scalar OpenPOWER v3.0B ISA already having popcnt, cntlz and
others normally seen in Vector Mask operations it makes sense to allow
*both* scalar integers *and* CR-Vectors to be predicate masks. That in
turn means that much more comprehensive interaction between CRs and scalar
Integers is required, because with the CR Predication Modes designating
CR *Fields* (not CR bits) as Predicate Elements, fast transfers between
CR *Fields* and the Integer Register File are needed.

The opportunity is therefore taken to augment CR logical arithmetic
as well, using a mask-based paradigm that takes into consideration
multiple bits of each CR Field (lt/gt/eq/so). By contrast v3.0B Scalar
CR instructions (crand, crxor) only allow a single bit calculation, and
both mtcr and mfcr are CR-orientated rather than CR *Field* orientated.

Also, v3.0 `mcrf` can only copy a CR Field whole and unmodified, and the
CR logical instructions only operate on single CR *bits*, so `mcrfm` is
added here to allow masked copying of CR Fields. The opportunity
is taken to allow inversion of CR Field bits, when copied.

* CR-based instructions that perform simple AND/OR from any four bits
  of a CR field to create a single bit value (0/1) in an integer register
* Inverse of the same, taking a single bit value (0/1) from an integer
  register to selectively target any four bits of a given CR Field
* CR-to-CR version of the same, allowing multiple bits to be AND/OR/XORed
* Optional Vectorisation of the same when SVP64 is implemented

* To provide a merged version of what is currently a multi-sequence of
  CR operations (crand, cror, crxor) with mfcr and mtcrf, reducing
* To provide a vectorised version of the same, suitable for advanced

* mtcrweird when RA=0 is a means to set or clear
  multiple arbitrary CR Field bits simultaneously,
  using immediates embedded within the instruction.
* With SVP64 on the weird instructions there is bit-for-bit interaction
  between GPR predicate masks (r3, r10, r31) and the source
  or destination GPR, in ways that are not possible with other
  SVP64 instructions because normal SVP64 is bit-per-element.
  On these weird instructions the element in effect *is* a bit.
* `mfcrweird` mitigates a need to add `conflictd`, part of
  [[sv/vector_ops]], as well as allowing more complex comparisons.

Please see [[svp64/appendix]] regarding CR bit ordering and for
the definition of `CR{n}`

# Instruction form and pseudocode

**DRAFT** Instruction format (use of MAJOR 19 not approved by

|0-5|6-10 |11|12-15|16-18|19-20|21-25  |26-30  |31|name      |
|---|-----|--|-----|-----|-----|-------|-------|--|----------|
|19 |RT   |  |mask |BFA  |     |XO[0:4]|XO[5:9]|/ |          |
|19 |     |  |     |     |     |1 //// |00011  |  |rsvd      |
|19 |RT   |M |mask |BFA  | 0 0 |0 mode |00011  |Rc|crrweird  |
|19 |RT   |M |mask |BFA  | 0 1 |0 mode |00011  |Rc|mfcrweird |
|19 |RA   |M |mask |BF   | 1 0 |0 mode |00011  |0 |mtcrrweird|
|19 |RA   |M |mask |BF   | 1 0 |0 mode |00011  |1 |mtcrweird |
|19 |BT   |M |mask |BFA  | 1 1 |0 mode |00011  |0 |crweirder |
|19 |BF //|M |mask |BFA  | 1 1 |0 mode |00011  |1 |mcrfm     |

mode is encoded in XO and is 4 bits

    crrweird: RT,BFA,M,mask,mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    result = n0|n1|n2|n3 if M else n0&n1&n2&n3
    RT[63] = result # MSB0 numbering, 63 is LSB

When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
SVP64 type operation and as such can use Rc=1 and RC1 Data-dependent

Also as noted below, element-width override bits normally used
on the source are instead used to allow multiple results to be packed
sequentially into the destination. *Destination elwidth overrides still apply*.

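The scalar `crrweird` pseudocode above can be sketched in Python as an illustration. This is not a reference implementation: the function name and the representation of the CR Field, `mask` and `mode` as 4-bit Python integers are assumptions made here. Because each `n` bit depends only on the same-numbered bit of `mask`, `mode` and the CR Field, the bitwise form does not depend on the bit-numbering convention.

```python
def crrweird(creg: int, m: int, mask: int, mode: int) -> int:
    """Hypothetical model: reduce one 4-bit CR Field to a single bit (0/1)."""
    # n[i] = mask[i] & (mode[i] == creg[i]), for all four bits at once
    n = mask & ~(mode ^ creg) & 0xF
    if m:
        return int(n != 0)   # M=1: OR of n0..n3
    return int(n == 0xF)     # M=0: AND of n0..n3, read literally


# M=1: is either of the two selected bits set in the CR Field?
print(crrweird(0b0100, 1, 0b0110, 0b0110))  # 1
print(crrweird(0b0000, 1, 0b0110, 0b0110))  # 0
```

Read literally, the M=0 (AND) form can only return 1 when all four `mask` bits are set, since any masked-out `n` bit is zero.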
mode is encoded in XO and is 4 bits

    mfcrweird: RT,BFA,mask,mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    result = n0||n1||n2||n3
    RT[60:63] = result # MSB0 numbering, 63 is LSB

When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
SVP64 type operation and as such can use Rc=1 and RC1 Data-dependent

Also as noted below, element-width override bits normally used
on the source are instead used to allow multiple results to be packed
into the destination. *Destination elwidth overrides still apply*

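A minimal Python sketch of the scalar `mfcrweird` computation follows. It assumes (this is not confirmed by the text above) that `mask`, `mode` and the CR Field all use the same MSB0 bit numbering as the destination, so that `n0||n1||n2||n3` is simply the bitwise result; the function name is hypothetical.

```python
def mfcrweird(creg: int, mask: int, mode: int) -> int:
    """Hypothetical model: the 4-bit value written to the low bits of RT."""
    # bit i of the result is mask[i] & (mode[i] == creg[i])
    return mask & ~(mode ^ creg) & 0xF


print(bin(mfcrweird(0b1010, 0b1111, 0b1111)))  # 0b1010: plain copy
print(bin(mfcrweird(0b1010, 0b1111, 0b0000)))  # 0b101: inverted copy
print(bin(mfcrweird(0b1010, 0b0011, 0b1111)))  # 0b10: masked copy
```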
mode is encoded in XO and is 4 bits

    mtcrrweird: BF,RA,M,mask,mode

    n0 = mask[0] & (mode[0] == a[63])
    n1 = mask[1] & (mode[1] == a[62])
    n2 = mask[2] & (mode[2] == a[61])
    n3 = mask[3] & (mode[3] == a[60])
    result = n0 || n1 || n2 || n3
    if M:
        result |= CR{BF} & ~mask

When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
SVP64 type operation and as such can use RC1 Data-dependent

    mtcrweird: BF,RA,M,mask,mode

    lsb = reg[63] # MSB0 numbering
    n0 = mask[0] & (mode[0] == lsb)
    n1 = mask[1] & (mode[1] == lsb)
    n2 = mask[2] & (mode[2] == lsb)
    n3 = mask[3] & (mode[3] == lsb)
    result = n0 || n1 || n2 || n3
    if M:
        result |= CR{BF} & ~mask

Note that when M=1 this operation is a Read-Modify-Write on the CR Field
BF. Masked-out bits of the 4-bit CR Field BF will not be changed when
M=1. Correspondingly when M=0 this operation is an overwrite: no read
of BF is required because the masked-out bits of the BF CR Field are
cleared to zero.

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
type operation that has 3-bit Data-dependent and 3-bit Predicate-result
capability (BF is 3 bits)

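The Read-Modify-Write versus overwrite behaviour of `mtcrweird` can be illustrated with a small Python sketch. The function name, the `ra_val` parameter (the value of the source register, or zero when RA=0) and the use of 4-bit integers for CR Fields are assumptions made for this illustration.

```python
def mtcrweird(cr_bf: int, ra_val: int, m: int, mask: int, mode: int) -> int:
    """Hypothetical model: return the new 4-bit value of CR Field BF."""
    lsb = ra_val & 1                    # only the LSB of the source is tested
    match = mode if lsb else ~mode      # bits where mode[i] == lsb
    result = mask & match & 0xF
    if m:
        result |= cr_bf & ~mask & 0xF   # M=1: masked-out bits of BF are kept
    return result


old = 0b1001
print(bin(mtcrweird(old, 1, 0, 0b0110, 0b0110)))  # 0b110: overwrite
print(bin(mtcrweird(old, 1, 1, 0b0110, 0b0110)))  # 0b1111: read-modify-write
```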
**mcrfm** - Move CR Field, masked.

This instruction copies, sets, or inverts parts of a CR Field
into another CR Field. v3.0B `mcrf` copies an entire 4-bit CR Field
unmodified, and `crmove` (an extended mnemonic of `cror`) copies only
one bit of the CR from any arbitrary bit to any other arbitrary bit, whereas
`mcrfm` copies an entire 4-bit CR Field (or masked parts thereof).
Unlike `crmove` the bits of the CR Field may not change position:
the EQ bit from the source may only go into the EQ bit of the
destination (optionally inverted, set, or cleared).

    mcrfm: BF,BFA,M,mask,mode

    result = mask & CR{BFA}
    if M:
        result |= CR{BF} & ~mask

When M=1 this operation is a Read-Modify-Write on the CR Field
BF. Masked-out bits of the 4-bit CR Field BF will not be changed when
M=1. Correspondingly when M=0 this operation is an overwrite: no read
of BF is required because the masked-out bits of the BF CR Field are

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
type operation that has 3-bit Data-dependent and 3-bit Predicate-result
capability (BF is 3 bits)

*Programmer's note: `mode` being XORed onto the result provides
considerable flexibility. Individual bits of BFA may be copied inverted
to BF by ensuring that `mask` and `mode` have the same bit set. Also,
individual bits in BF may be set to 1 by ensuring that the required bit of
`mask` is set to zero and the same bit in `mode` is set to 1.*

    crweirder: BT,BFA,M,mask,mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    BF = BT[2:4] # select CR
    bit = BT[0:1] # select bit of CR
    result = n0|n1|n2|n3 if M else n0&n1&n2&n3

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
type operation that has 5-bit Data-dependent and 5-bit Predicate-result
capability (BT is 5 bits)

**Example Pseudo-ops:**

    mtcri BF, mode      mtcrweird BF, r0, 0, 0b1111,~mode
    mtcrset BF, mask    mtcrweird BF, r0, 1, mask,0b0000
    mtcrclr BF, mask    mtcrweird BF, r0, 1, mask,0b1111

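The three pseudo-ops can be checked exhaustively against a Python model of `mtcrweird`. This is a sketch under assumptions: the model function and the wrapper names are hypothetical, CR Fields are 4-bit integers, and the source operand is taken to read as zero.

```python
def mtcrweird(cr_bf, ra_val, m, mask, mode):
    """Hypothetical model of the mtcrweird pseudocode (4-bit CR Field)."""
    match = mode if (ra_val & 1) else ~mode
    result = mask & match & 0xF
    if m:
        result |= cr_bf & ~mask & 0xF
    return result


def mtcri(cr_bf, value):    # set BF to an immediate
    return mtcrweird(cr_bf, 0, 0, 0b1111, ~value & 0xF)

def mtcrset(cr_bf, mask):   # set the selected bits of BF
    return mtcrweird(cr_bf, 0, 1, mask, 0b0000)

def mtcrclr(cr_bf, mask):   # clear the selected bits of BF
    return mtcrweird(cr_bf, 0, 1, mask, 0b1111)


# exhaustive check over every old CR Field value and every immediate
for old in range(16):
    for imm in range(16):
        assert mtcri(old, imm) == imm
        assert mtcrset(old, imm) == old | imm
        assert mtcrclr(old, imm) == old & ~imm & 0xF
```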
# Vectorised versions involving GPRs

The name "weird" refers to a minor violation of SV rules when it comes
to deriving the Vectorised versions of these instructions.

Normally the progression of the SV for-loop would move on to the
next register. Instead however in the scalar case these instructions
**remain in the same register** and insert or transfer between **bits**
of the scalar integer source or destination. The reason is that when
using CR Fields as predicate masks and there is a need to transfer
into a GPR, again for use as a predicate mask, the CR Field bits
need to be efficiently packed into that one GPR (r3, r10 or r31).

A further useful violation of the normal SV Elwidth override rules allows
for packing (or unpacking) of multiple CR test results into (or out of)
an Integer Element. Note that the CR (source operand) elwidth field is
utilised to determine the bit-packing size (1/2/4/8 with remaining
bits within the Integer element set to zero) whilst the INT (dest
operand) elwidth field still sets the Integer element size as usual.

**crrweird: RT, BB, mask.mode**

    # for each element i, with creg the CR Field tested for that element
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    # OR or AND to a single bit
    result = n0|n1|n2|n3 if M else n0&n1&n2&n3
    # INT-vector destination: pack according to BB.elwidth
    # TODO: RT.elwidth override to be also added here
    # note, yes, really, the CR's elwidth field determines
    # the bit-packing into the INT!
    if BB.elwidth == 0b00:
        # pack 1 result into 64-bit registers
        iregs[RT+i][0..62] = 0
        iregs[RT+i][63] = result # sets LSB to result
    if BB.elwidth == 0b01:
        # pack 2 results sequentially into INT registers
        iregs[RT+i//2][0..61] = 0
        iregs[RT+i//2][63-(i%2)] = result
    if BB.elwidth == 0b10:
        # pack 4 results sequentially into INT registers
        iregs[RT+i//4][0..59] = 0
        iregs[RT+i//4][63-(i%4)] = result
    if BB.elwidth == 0b11:
        # pack 8 results sequentially into INT registers
        iregs[RT+i//8][0..55] = 0
        iregs[RT+i//8][63-(i%8)] = result
    # scalar destination: results are packed bit-wise into the one INT
    iregs[RT][63-i] = result # results also in scalar INT

* in the scalar case the CR-Vector assessment
  is stored bit-wise starting at the LSB of the
  destination scalar INT
* in the INT-vector case the results are packed into LSBs
  of the INT Elements, the packing arrangement depending on both
  elwidth override settings.

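The two packing cases described in the bullets above can be sketched in Python. This is an illustration only: `pack_crrweird` and its `rt_is_vector` flag are hypothetical names, the 1-bit results are assumed to have been computed already, and destination elwidth overrides are ignored.

```python
def pack_crrweird(results, bb_elwidth, rt_is_vector=True):
    """Pack a list of 1-bit results into a list of integer elements."""
    if rt_is_vector:
        # the CR (source) elwidth field selects 1, 2, 4 or 8 results per element
        per_elem = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}[bb_elwidth]
    else:
        per_elem = 64   # scalar RT: every result goes into the same register
    out = [0] * ((len(results) + per_elem - 1) // per_elem)
    for i, bit in enumerate(results):
        # MSB0 bit 63-(i % per_elem) is LSB0 bit (i % per_elem)
        out[i // per_elem] |= (bit & 1) << (i % per_elem)
    return out


bits = [1, 0, 1, 1, 0, 0, 0, 1, 1]
print(pack_crrweird(bits, 0b10))                      # [13, 8, 1]
print(pack_crrweird(bits, 0b00, rt_is_vector=False))  # [397]
```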
**mfcrweird: RT, BFA, mask.mode**

Unlike `crrweird` the results are 4 bits wide, so the packing
will begin to spill over to other destination elements. 8 results per
destination at 4 bits each still fit into a destination elwidth of 32-bit,
but for 16-bit and 8-bit this does not fit, and the results must be split
across to the next element.

When for example destination elwidth is 16-bit (0b10) the following packing
occurs:

- SVRM bits 6:7 equal to 0b00 - one 4-bit result element packed into the
  first 4-bits of the 16-bit destination element (in the first 4 LSBs)
- SVRM bits 6:7 equal to 0b01 - two 4-bit result elements packed into the
  first 8-bits of the 16-bit destination element (in the first 8 LSBs)
- SVRM bits 6:7 equal to 0b10 - four 4-bit result elements packed into each
  16-bit destination element
- SVRM bits 6:7 equal to 0b11 - eight 4-bit result elements, the first four
  of which are packed into the first 16-bit destination element, the
  second four of which are packed into the second 16-bit destination element.

Pseudocode example: note that dest elwidth overrides affect the
packing of results. BB.elwidth in effect requests how many 4-bit
result elements would like to be packed, but RT.elwidth determines
the limit. Any parts of the destination elements not containing
results are set to zero.

    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    result = n0||n1||n2||n3 # 4-bit result
    # RT.elwidth override can affect the packing
    bwid = {0b00:64, 0b01:8, 0b10:16, 0b11:32}[RT.elwidth]
    # each result is 4 bits: at most bwid//4 fit in one element
    t4, t8 = min(4, bwid//4), min(8, bwid//4)
    # yes, really, the CR's elwidth field determines
    # the bit-packing into the INT!
    if BB.elwidth == 0b00:
        # pack 1 result into 64-bit registers
        idx, boff = i, 0
    if BB.elwidth == 0b01:
        # pack 2 results sequentially into INT registers
        idx, boff = i//2, i%2
    if BB.elwidth == 0b10:
        # pack 4 results sequentially into INT registers
        idx, boff = i//t4, i%t4
    if BB.elwidth == 0b11:
        # pack 8 results sequentially into INT registers
        idx, boff = i//t8, i%t8
    # exceeding VL=16 is UNDEFINED
    iregs[RT+idx][60-boff*4:63-boff*4] = result

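The packing rules described in the prose above (a 16-bit destination element holds at most four 4-bit results) can be sketched in Python as follows. This is an illustration under assumptions: `pack_mfcrweird` is a hypothetical name, the 4-bit results are assumed to be already computed, and the function returns a list of destination *element* values rather than modelling the register file.

```python
def pack_mfcrweird(results, bb_elwidth, rt_elwidth):
    """Pack 4-bit results into destination elements, LSB-first."""
    bwid = {0b00: 64, 0b01: 8, 0b10: 16, 0b11: 32}[rt_elwidth]
    requested = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}[bb_elwidth]
    # BB.elwidth requests a count; RT.elwidth limits it to bwid/4 results
    per_elem = min(requested, bwid // 4)
    out = [0] * ((len(results) + per_elem - 1) // per_elem)
    for i, nibble in enumerate(results):
        idx, boff = divmod(i, per_elem)
        out[idx] |= (nibble & 0xF) << (boff * 4)
    return out


res = [1, 2, 3, 4, 5, 6, 7, 8]
print([hex(e) for e in pack_mfcrweird(res, 0b11, 0b10)])  # ['0x4321', '0x8765']
print([hex(e) for e in pack_mfcrweird(res, 0b11, 0b11)])  # ['0x87654321']
```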
# v3.1 setbc instructions

There are additional `setbc` (Set Boolean Condition) instructions in v3.1 (p129)

    RT = (CR[BI] == 1) ? 1 : 0

There are also variants which negate that, and variants which return
-1 / 0. These are similar to crweird but do not serve the same purpose.
Most notable is that crweird acts on
CR fields rather than the entire 32 bit CR.

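For comparison, the four v3.1 instructions in this family (`setbc`, `setbcr`, `setnbc`, `setnbcr`) can be modelled in a few lines of Python. This is a sketch: the helper `cr_bit` and the representation of the CR as a 32-bit integer indexed in MSB0 order are assumptions of the illustration, and -1 is shown as a 64-bit all-ones value.

```python
MASK64 = (1 << 64) - 1   # -1 as an unsigned 64-bit register value


def cr_bit(cr: int, bi: int) -> int:
    """Bit BI of the 32-bit CR in MSB0 numbering (BI=0 is the top bit)."""
    return (cr >> (31 - bi)) & 1


def setbc(cr, bi):   return 1 if cr_bit(cr, bi) else 0
def setbcr(cr, bi):  return 0 if cr_bit(cr, bi) else 1
def setnbc(cr, bi):  return MASK64 if cr_bit(cr, bi) else 0
def setnbcr(cr, bi): return 0 if cr_bit(cr, bi) else MASK64


cr = 0x2000_0000         # only CR bit 2 (CR0.EQ) is set
print(setbc(cr, 2), setbcr(cr, 2))   # 1 0
print(hex(setnbc(cr, 2)))            # 0xffffffffffffffff
```

Each of these tests exactly one bit of the whole CR, whereas the weird instructions combine up to four bits of one CR Field under `mask` and `mode`.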
# Predication Examples

Take the following example:

    sv.mtcrweird/dm=r10/dz cr8.v, 0, 0b0011.0000

Here, RA is zero, so the source input is zero. The destination is CR Field
8, and the destination predicate mask indicates to target the first two
elements. Destination predicate zeroing is enabled, and the destination
predicate is only set in the 2nd bit. mask is 0b0011, mode is all zeros.

Let us first consider what should go into element 0 (CR Field 8):

* The destination predicate bit is zero, and zeroing is enabled.
* Therefore, what is in the source is irrelevant: the result must
  be zero.
* All four bits of CR Field 8 are therefore set to zero.

Now the second element, CR Field 9 (CR9):

* Bit 2 of the destination predicate, r10, is 1. Therefore the computation
  of the result is relevant.
* RA is zero, therefore bit 2 of the source is zero. mask is 0b0011 and
  mode is 0b0000
* When calculating n0 through n3 we get n0=1, n1=1, n2=0, n3=0
* Therefore, CR9 is set (using LSB0 ordering) to 0b0011, i.e. to mask.

It should be clear that this instruction uses bits of the integer
predicate to decide whether to set CR Fields to `(mask & ~mode)` or
to zero. Thus, in effect, it is the integer predicate that has been
copied into the CR Fields.

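The worked example above can be reproduced with a short Python simulation. Several details are assumptions of this sketch rather than statements from the text: VL is taken as 2, M as 0, r10 as 0b10, both CR Fields initially hold 0b1111, and source predication and elwidth overrides are ignored. All function names are hypothetical.

```python
def mtcrweird(cr_bf, ra_val, m, mask, mode):
    """Hypothetical scalar model of mtcrweird on a 4-bit CR Field."""
    match = mode if (ra_val & 1) else ~mode
    result = mask & match & 0xF
    if m:
        result |= cr_bf & ~mask & 0xF
    return result


def sv_mtcrweird_dz(cr, bf, ra_val, m, mask, mode, dest_pred, vl):
    """Vector loop with destination predicate and zeroing (dz)."""
    for i in range(vl):
        if (dest_pred >> i) & 1:
            cr[bf + i] = mtcrweird(cr[bf + i], ra_val, m, mask, mode)
        else:
            cr[bf + i] = 0   # predicate bit clear and zeroing enabled


cr = {8: 0b1111, 9: 0b1111}  # CR Fields keyed by field number
sv_mtcrweird_dz(cr, 8, 0, 0, 0b0011, 0b0000, dest_pred=0b10, vl=2)
print(cr)  # {8: 0, 9: 3}
```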
By using twin predication, zeroing, and inversion (sm=~r3, dm=r10) for
example, it becomes possible to combine two Integers together in order
to set bits in CR Fields. Likewise there are dozens of ways that CR
Predicates can be used, on the same sv.mtcrweird instruction.