openpower/sv/cr_int_predication.mdwn

[[!tag standards]]

# New instructions for CR/INT predication

**DRAFT STATUS**

See:

* main bugreport for crweirds
  <https://bugs.libre-soc.org/show_bug.cgi?id=533>
* <https://bugs.libre-soc.org/show_bug.cgi?id=527>
* <https://bugs.libre-soc.org/show_bug.cgi?id=569>
* <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>

Rationale:

Condition Registers are conceptually perfect for use as predicate masks,
the only problem being that typical Vector ISAs have quite comprehensive
mask-based instructions: set-before-first, popcount and much more.
In fact many Vector ISAs can use Vectors *as* masks, and consequently the
entire Vector ISA is usually available for use in creating masks (one
exception being AVX512, which has a dedicated Mask regfile and opcodes).
Duplication of such operations (popcount etc.) is not practical for SV
given the strategy of leveraging pre-existing Scalar instructions in a
minimalist way.

With the scalar OpenPOWER v3.0B ISA already having popcnt, cntlz and
others normally seen in Vector Mask operations, it makes sense to allow
*both* scalar integers *and* CR-Vectors to be predicate masks.  That in
turn means that much more comprehensive interaction between CRs and scalar
Integers is required: because the CR Predication Modes designate
CR *Fields* (not CR bits) as Predicate Elements, fast transfers between
CR *Fields* and the Integer Register File are needed.

The opportunity is therefore taken to augment CR logical arithmetic
as well, using a mask-based paradigm that takes into consideration
multiple bits of each CR Field (lt/gt/eq/so).  By contrast v3.0B Scalar
CR instructions (crand, crxor) only allow a single bit calculation, and
both mtcr and mfcr are CR-orientated rather than CR *Field* orientated.

Also, the v3.0 options for directly moving CR contents are limited:
`mcrf` copies all four bits of a CR Field unmodified and `crmove` copies
a single CR *bit*, so neither can copy a masked subset of a Field.
That is addressed here with `mcrfm`. The opportunity
is taken to allow inversion of CR Field bits when copied.

Basic concept:

* CR-based instructions that perform simple AND/OR on any of the four bits
  of a CR Field to create a single bit value (0/1) in an integer register
* Inverse of the same, taking a single bit value (0/1) from an integer
  register to selectively target any of the four bits of a given CR Field
* CR-to-CR version of the same, allowing multiple bits to be AND/OR/XORed
  in one hit.
* Optional Vectorisation of the same when SVP64 is implemented

Purpose:

* To provide a merged version of what is currently a multi-instruction
  sequence of CR operations (crand, cror, crxor) with mfcr and mtcrf,
  reducing instruction count.
* To provide a vectorised version of the same, suitable for advanced
  predication

Useful side-effects:

* mtcrweird when RA=0 is a means to set or clear
  multiple arbitrary CR Field bits simultaneously,
  using immediates embedded within the instruction.
* With SVP64 on the weird instructions there is bit-for-bit interaction
  between GPR predicate masks (r3, r10, r31) and the source
  or destination GPR, in ways that are not possible with other
  SVP64 instructions because normal SVP64 is bit-per-element.
  On these weird instructions the element in effect *is* a bit.
* `mfcrweird` mitigates a need to add `conflictd`, part of
  [[sv/vector_ops]], as well as allowing more complex comparisons.

# Bit ordering

Please see [[svp64/appendix]] regarding CR bit ordering and for
the definition of `CR{n}`.

# Instruction form and pseudocode

**DRAFT** Instruction format (use of MAJOR 19 not approved by
OPF ISA WG):

|0-5|6-10 |11|12-15|16-18|19-20|21-25  |26-30  |31|name       |
|---|---- |--|-----|-----|-----|-----  |-----  |--|----       |
|19 |RT   |  |mask |BFA  |     |XO[0:4]|XO[5:9]|/ |           |
|19 |     |  |     |     |     |1 //// |00011  |  |rsvd       |
|19 |RT   |M |mask |BFA  | 0 0 |0 mode |00011  |Rc|crrweird   |
|19 |RT   |M |mask |BFA  | 0 1 |0 mode |00011  |Rc|mfcrweird  |
|19 |RA   |M |mask |BF   | 1 0 |0 mode |00011  |0 |mtcrrweird |
|19 |RA   |M |mask |BF   | 1 0 |0 mode |00011  |1 |mtcrweird  |
|19 |BT   |M |mask |BFA  | 1 1 |0 mode |00011  |0 |crweirder  |
|19 |BF //|M |mask |BFA  | 1 1 |0 mode |00011  |1 |mcrfm      |

**crrweird**

mode is encoded in XO and is 4 bits

    crrweird: RT,BFA,M,mask,mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    result = n0|n1|n2|n3 if M else n0&n1&n2&n3
    RT[63] = result # MSB0 numbering, 63 is LSB
    if Rc:
        CR0 = analyse(RT)

When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
SVP64 type operation and as such can use Rc=1 and RC1 Data-dependent
Mode capability.

Also as noted below, the element-width override bits normally used
on the source are instead used to allow multiple results to be packed
sequentially into the destination. *Destination elwidth overrides still apply*.

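The scalar `crrweird` pseudocode can be modelled in a few lines of Python. This is only a sketch under stated assumptions: the function name `crrweird_bit` is hypothetical, CR Fields, `mask` and `mode` are plain 4-bit integers, and index `i` is taken to mean the same bit position in all three (the authoritative ordering is in [[svp64/appendix]]). Reading the pseudocode literally, the last call shows that with M=0 the AND of n0..n3 can only be 1 when all four `mask` bits are set.

```python
def crrweird_bit(creg, M, mask, mode):
    """Single result bit for one 4-bit CR Field value (illustrative only)."""
    # bit i of n is 1 when mask enables bit i and creg agrees with mode there
    n = mask & ~(mode ^ creg) & 0xF
    if M:
        return int(n != 0)    # n0|n1|n2|n3
    return int(n == 0xF)      # n0&n1&n2&n3

# M=1: test whether one chosen bit of the field is set
print(crrweird_bit(0b1010, 1, 0b0010, 0b0010))   # 1
# M=0, full mask: test for an exact match of all four bits
print(crrweird_bit(0b1010, 0, 0b1111, 0b1010))   # 1
# M=0, partial mask: masked-out bits contribute 0 to the AND
print(crrweird_bit(0b1010, 0, 0b0011, 0b1010))   # 0
```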
**mfcrweird**

mode is encoded in XO and is 4 bits

    mfcrweird: RT,BFA,mask,mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    result = n0||n1||n2||n3
    RT[60:63] = result # MSB0 numbering, 63 is LSB
    if Rc:
        CR0 = analyse(RT)

When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
SVP64 type operation and as such can use Rc=1 and RC1 Data-dependent
Mode capability.

Also as noted below, the element-width override bits normally used
on the source are instead used to allow multiple results to be packed
into the destination.  *Destination elwidth overrides still apply*.

**mtcrrweird**

mode is encoded in XO and is 4 bits

    mtcrrweird: BF,RA,M,mask,mode

    a = (RA|0)
    n0 = mask[0] & (mode[0] == a[63])
    n1 = mask[1] & (mode[1] == a[62])
    n2 = mask[2] & (mode[2] == a[61])
    n3 = mask[3] & (mode[3] == a[60])
    result = n0 || n1 || n2 || n3
    if M:
        result |= CR{BF} & ~mask
    CR{BF} = result

When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
SVP64 type operation and as such can use RC1 Data-dependent
Mode capability.

**mtcrweird**

    mtcrweird: BF,RA,M,mask,mode

    reg = (RA|0)
    lsb = reg[63] # MSB0 numbering
    n0 = mask[0] & (mode[0] == lsb)
    n1 = mask[1] & (mode[1] == lsb)
    n2 = mask[2] & (mode[2] == lsb)
    n3 = mask[3] & (mode[3] == lsb)
    result = n0 || n1 || n2 || n3
    if M:
        result |= CR{BF} & ~mask
    CR{BF} = result

Note that when M=1 this operation is a Read-Modify-Write on the CR Field
BF. Masked-out bits of the 4-bit CR Field BF will not be changed when
M=1. Correspondingly when M=0 this operation is an overwrite: no read
of BF is required because the masked-out bits of the BF CR Field are
set to zero.

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
type operation that has 3-bit Data-dependent and 3-bit Predicate-result
capability (BF is 3 bits).

**mcrfm** - Move CR Field, masked.

This instruction copies, sets, or inverts parts of a CR Field
into another CR Field.  `crmove` copies only one bit of the CR
from any arbitrary bit to any other arbitrary bit, and `mcrf` copies
an entire 4-bit CR Field unmodified, whereas `mcrfm` copies an entire
4-bit CR Field or masked parts thereof.
Unlike `crmove` the bits of the CR Field may not change position:
the EQ bit from the source may only go into the EQ bit of the
destination (optionally inverted, set, or cleared).

    mcrfm: BF,BFA,M,mask,mode

    result = mask & CR{BFA}
    if M:
        result |= CR{BF} & ~mask
    result ^= mode
    CR{BF} = result

When M=1 this operation is a Read-Modify-Write on the CR Field
BF: masked-out bits of the 4-bit CR Field BF are carried over
before `mode` is XORed in. Correspondingly when M=0 this operation is
an overwrite: no read of BF is required because the masked-out bits of
the BF CR Field are set to zero before `mode` is XORed in.

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
type operation that has 3-bit Data-dependent and 3-bit Predicate-result
capability (BF is 3 bits).

*Programmer's note: `mode` being XORed onto the result provides
considerable flexibility. Individual bits of BFA may be copied inverted
to BF by ensuring that `mask` and `mode` have the same bit set.  Also,
with M=0, individual bits in BF may be set to 1 by ensuring that the
required bit of `mask` is set to zero and the same bit in `mode` is set to 1.*

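The `mcrfm` pseudocode and the Programmer's note can be checked with a small Python model. This is a sketch rather than a reference implementation: `mcrfm_model` is a hypothetical name and CR Fields are represented as plain 4-bit integers. It follows the pseudocode literally, so the XOR with `mode` is applied after the merge with BF; with M=1 a `mode` bit outside `mask` therefore inverts the existing BF bit rather than setting it.

```python
def mcrfm_model(cr_bf, cr_bfa, M, mask, mode):
    """Return the new 4-bit value of CR{BF} (illustrative only)."""
    result = mask & cr_bfa            # selected bits of the source field
    if M:
        result |= cr_bf & ~mask       # keep unselected bits of the destination
    result ^= mode                    # optional per-bit inversion
    return result & 0xF

bf, bfa = 0b1001, 0b0010
# copy bit 2 of BFA inverted into BF, leaving the other BF bits alone
print(bin(mcrfm_model(bf, bfa, 1, 0b0100, 0b0100)))   # 0b1101
# M=0: copy the low two bits of BFA, force bit 3 to 1, zero the rest
print(bin(mcrfm_model(bf, bfa, 0, 0b0011, 0b1000)))   # 0b1010
```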
**crweirder**

    crweirder: BT,BFA,M,mask,mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    BF = BT[2:4] # select CR
    bit = BT[0:1] # select bit of CR
    result = n0|n1|n2|n3 if M else n0&n1&n2&n3
    CR{BF}[bit] = result

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
type operation that has 5-bit Data-dependent and 5-bit Predicate-result
capability (BT is 5 bits).

**Example Pseudo-ops:**

    mtcri BF, mode    mtcrweird BF, r0, 0, 0b1111, ~mode
    mtcrset BF, mask  mtcrweird BF, r0, 1, mask, 0b0000
    mtcrclr BF, mask  mtcrweird BF, r0, 1, mask, 0b1111

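To see why the three pseudo-ops behave as their names suggest, the scalar `mtcrweird` pseudocode can be evaluated with RA=0. In this sketch the Python function names are hypothetical, CR Fields and immediates are plain 4-bit integers, and `ra_lsb` stands for bit 63 (the LSB) of `(RA|0)`.

```python
def mtcrweird_model(cr_bf, ra_lsb, M, mask, mode):
    """Return the new 4-bit value of CR{BF} (illustrative only)."""
    lsb_x4 = 0xF if ra_lsb else 0x0          # the one source bit, seen by all four tests
    n = mask & ~(mode ^ lsb_x4) & 0xF        # n[i] = mask[i] & (mode[i] == lsb)
    if M:
        n |= cr_bf & ~mask & 0xF             # read-modify-write: keep masked-out bits
    return n

# the three pseudo-ops, all with RA=0 so that lsb is 0
def mtcri(cr_bf, imm):
    return mtcrweird_model(cr_bf, 0, 0, 0b1111, ~imm & 0xF)

def mtcrset(cr_bf, mask):
    return mtcrweird_model(cr_bf, 0, 1, mask, 0b0000)

def mtcrclr(cr_bf, mask):
    return mtcrweird_model(cr_bf, 0, 1, mask, 0b1111)

print(bin(mtcri(0b1111, 0b0101)))     # 0b101: field overwritten with the immediate
print(bin(mtcrset(0b1000, 0b0011)))   # 0b1011: low two bits set, bit 3 kept
print(bin(mtcrclr(0b1111, 0b0011)))   # 0b1100: low two bits cleared
```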
# Vectorised versions involving GPRs

The name "weird" refers to a minor violation of SV rules when it comes
to deriving the Vectorised versions of these instructions.

Normally the progression of the SV for-loop would move on to the
next register.  Instead, in the scalar case these instructions
**remain in the same register** and insert or transfer between **bits**
of the scalar integer source or destination.  The reason is that when
using CR Fields as predicate masks and there is a need to transfer
into a GPR, again for use as a predicate mask, the CR Field bits
need to be efficiently packed into that one GPR (r3, r10 or r31).

A further useful violation of the normal SV Elwidth override rules allows
for packing (or unpacking) of multiple CR test results into (or out of)
an Integer Element. Note that the CR (source operand) elwidth field is
utilised to determine the bit-packing size (1/2/4/8 with remaining
bits within the Integer element set to zero) whilst the INT (dest
operand) elwidth field still sets the Integer element size as usual
(8/16/32/default).

**crrweird: RT, BB, mask.mode**

    for i in range(VL):
        if BB.isvec:
            creg = CR{BB+i}
        else:
            creg = CR{BB}
        n0 = mask[0] & (mode[0] == creg[0])
        n1 = mask[1] & (mode[1] == creg[1])
        n2 = mask[2] & (mode[2] == creg[2])
        n3 = mask[3] & (mode[3] == creg[3])
        # OR or AND to a single bit
        result = n0|n1|n2|n3 if M else n0&n1&n2&n3
        if RT.isvec:
            # TODO: RT.elwidth override to be also added here
            # note, yes, really, the CR's elwidth field determines
            # the bit-packing into the INT!
            if BB.elwidth == 0b00:
                # pack 1 result into 64-bit registers
                iregs[RT+i][0..62] = 0
                iregs[RT+i][63] = result # sets LSB to result
            if BB.elwidth == 0b01:
                # pack 2 results sequentially into INT registers
                iregs[RT+i//2][0..61] = 0
                iregs[RT+i//2][63-(i%2)] = result
            if BB.elwidth == 0b10:
                # pack 4 results sequentially into INT registers
                iregs[RT+i//4][0..59] = 0
                iregs[RT+i//4][63-(i%4)] = result
            if BB.elwidth == 0b11:
                # pack 8 results sequentially into INT registers
                iregs[RT+i//8][0..55] = 0
                iregs[RT+i//8][63-(i%8)] = result
        else:
            iregs[RT][63-i] = result # results also in scalar INT

Note that:

* in the scalar case the CR-Vector assessment
  is stored bit-wise starting at the LSB of the
  destination scalar INT
* in the INT-vector case the results are packed into LSBs
  of the INT Elements, the packing arrangement depending on both
  elwidth override settings.

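The two packing behaviours noted above can be illustrated with a short Python sketch. Its assumptions are: all function names are hypothetical; CR Fields are given as a list of 4-bit integers; GPRs are Python integers in which MSB0 bit `63-i` corresponds to `1 << i`; every destination register starts at zero; and the RT.elwidth override (still a TODO in the pseudocode) is not modelled.

```python
def result_bit(creg, M, mask, mode):
    """The single crrweird result bit for one CR Field."""
    n = mask & ~(mode ^ creg) & 0xF
    return int(n != 0) if M else int(n == 0xF)

def crrweird_scalar_rt(cr_fields, M, mask, mode):
    """Scalar RT: result i lands in bit i (counting from the LSB) of one integer."""
    rt = 0
    for i, creg in enumerate(cr_fields):
        rt |= result_bit(creg, M, mask, mode) << i
    return rt

def crrweird_vector_rt(cr_fields, M, mask, mode, bb_elwidth):
    """Vector RT: 1, 2, 4 or 8 results per register, selected by BB.elwidth."""
    per_reg = 1 << bb_elwidth
    nregs = (len(cr_fields) + per_reg - 1) // per_reg
    regs = [0] * nregs
    for i, creg in enumerate(cr_fields):
        regs[i // per_reg] |= result_bit(creg, M, mask, mode) << (i % per_reg)
    return regs

fields = [0b0010, 0b0100, 0b0010, 0b0011]   # VL=4 source CR Fields
# test one bit of every field and gather the answers into a predicate mask
print(bin(crrweird_scalar_rt(fields, 1, 0b0010, 0b0010)))      # 0b1101
# the same tests, packed two per destination register
print(crrweird_vector_rt(fields, 1, 0b0010, 0b0010, 0b01))     # [1, 3]
```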
**mfcrweird: RT, BFA, mask.mode**

Unlike `crrweird` the results are 4 bits wide, so the packing
will begin to spill over to other destination elements.  Eight results per
destination at 4 bits each still fit into a destination elwidth of 32-bit,
but for 16-bit and 8-bit this does not fit, and the results must be split
across to the next element.

When for example destination elwidth is 16-bit (0b10) the following packing
occurs:

- SVRM bits 6:7 equal to 0b00 - one 4-bit result element packed into the
  first 4-bits of the 16-bit destination element (in the first 4 LSBs)
- SVRM bits 6:7 equal to 0b01 - two 4-bit result elements packed into the
  first 8-bits of the 16-bit destination element (in the first 8 LSBs)
- SVRM bits 6:7 equal to 0b10 - four 4-bit result elements packed into each
  16-bit destination element
- SVRM bits 6:7 equal to 0b11 - eight 4-bit result elements, the first four
  of which are packed into the first 16-bit destination element, the
  second four of which are packed into the second 16-bit destination element.

Pseudocode example: note that dest elwidth overrides affect the
packing of results. BB.elwidth in effect requests how many 4-bit
result elements are to be packed, but RT.elwidth determines
the limit. Any parts of the destination elements not containing
results are set to zero.

    for i in range(VL):
        if BB.isvec:
            creg = CR{BB+i}
        else:
            creg = CR{BB}
        n0 = mask[0] & (mode[0] == creg[0])
        n1 = mask[1] & (mode[1] == creg[1])
        n2 = mask[2] & (mode[2] == creg[2])
        n3 = mask[3] & (mode[3] == creg[3])
        result = n0||n1||n2||n3 # 4-bit result
        if RT.isvec:
            # RT.elwidth override can affect the packing
            bwid = {0b00:64, 0b01:8, 0b10:16, 0b11:32}[RT.elwidth]
            # a bwid-bit element holds at most bwid//4 results of 4 bits
            t4, t8 = min(4, bwid//4), min(8, bwid//4)
            # yes, really, the CR's elwidth field determines
            # the bit-packing into the INT!
            if BB.elwidth == 0b00:
                # pack 1 result into 64-bit registers
                idx, boff = i, 0
            if BB.elwidth == 0b01:
                # pack 2 results sequentially into INT registers
                idx, boff = i//2, i%2
            if BB.elwidth == 0b10:
                # pack 4 results sequentially into INT registers
                idx, boff = i//t4, i%t4
            if BB.elwidth == 0b11:
                # pack 8 results sequentially into INT registers
                idx, boff = i//t8, i%t8
        else:
            # exceeding VL=16 is UNDEFINED
            idx, boff = 0, i
        iregs[RT+idx][60-boff*4:63-boff*4] = result

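The interaction between the two elwidth fields can be explored with a small Python sketch that only computes where each 4-bit result goes. It is hedged in several ways: `mfcrweird_slots` is a hypothetical name, only the vector-RT case is modelled, and the number of results per destination element is assumed to be capped at `bwid // 4` (element width divided by the 4-bit result width), which is what the 16-bit prose example describes.

```python
def mfcrweird_slots(vl, bb_elwidth, rt_elwidth):
    """List (idx, boff) for each of the VL results (illustrative only).

    idx  : which destination element receives result i
    boff : which 4-bit slot inside that element, counting from the LSB end
    """
    bwid = {0b00: 64, 0b01: 8, 0b10: 16, 0b11: 32}[rt_elwidth]
    requested = 1 << bb_elwidth            # BB.elwidth asks for 1, 2, 4 or 8
    per_elem = min(requested, bwid // 4)   # RT.elwidth sets the limit
    return [(i // per_elem, i % per_elem) for i in range(vl)]

# 16-bit destination elements, eight results requested per element:
# only four fit, so results 4..7 spill into the second element
print(mfcrweird_slots(8, 0b11, 0b10))
# 16-bit destination elements, two results per element
print(mfcrweird_slots(4, 0b01, 0b10))
```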
# v3.1 setbc instructions

There are additional setbc (set boolean conditional) instructions
in v3.1 (p129):

    RT = (CR[BI] == 1) ? 1 : 0

Variants also negate the test, and also return -1 / 0.  These are similar to
crweird but do not serve the same purpose.  Most notably, crweird acts on
CR Fields rather than on the entire 32-bit CR.

# Predication Examples

Take the following example:

    r10 = 0b00010
    sv.mtcrweird/dm=r10/dz cr8.v, 0, 0b0011.0000

Here, RA is zero, so the source input is zero. The destination is the
Vector of CR Fields starting at CR Field 8, and the first two elements
(CR8 and CR9) are examined below.  Destination predicate zeroing is
enabled, and the destination predicate has only its 2nd bit set.
mask is 0b0011, mode is all zeros.

Let us first consider what should go into element 0 (CR Field 8):

* The destination predicate bit is zero, and zeroing is enabled.
* Therefore, what is in the source is irrelevant: the result must
  be zero.
* All four bits of CR Field 8 are therefore set to zero.

Now the second element, CR Field 9 (CR9):

* The 2nd bit of the destination predicate, r10, is 1. Therefore the
  computation of the result is relevant.
* RA is zero, therefore the source bit is zero.  mask is 0b0011 and
  mode is 0b0000.
* When calculating n0 thru n3 we get n0=1, n1=1, n2=0, n3=0
* Therefore, CR9 is set (using LSB0 ordering) to 0b0011, i.e. to mask.

It should be clear that this instruction uses bits of the integer
predicate to decide whether to set CR Fields to `(mask & ~mode)` or
to zero.  Thus, in effect, it is the integer predicate that has been
copied into the CR Fields.

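The walk-through above can be reproduced with a short Python simulation. It is a sketch under several assumptions that the example does not state explicitly: VL=2, M=0, a CR file modelled as a list of sixteen 4-bit integers, and element `i` taking bit `i` of RA as its source bit (irrelevant here since RA is zero). The name `sv_mtcrweird` is hypothetical.

```python
def sv_mtcrweird(cr, bf, vl, dm, dz, ra, M, mask, mode):
    """Apply a predicated, vectorised mtcrweird to the list `cr` in place."""
    for i in range(vl):
        if not (dm >> i) & 1:                 # element masked out by dm
            if dz:
                cr[bf + i] = 0                # zeroing: whole CR Field cleared
            continue
        src = 0xF if (ra >> i) & 1 else 0x0   # source bit, seen by all four tests
        n = mask & ~(mode ^ src) & 0xF        # n[k] = mask[k] & (mode[k] == bit)
        if M:
            n |= cr[bf + i] & ~mask & 0xF     # keep masked-out bits when M=1
        cr[bf + i] = n

cr = [0b1111] * 16                            # arbitrary starting contents
sv_mtcrweird(cr, bf=8, vl=2, dm=0b00010, dz=True, ra=0, M=0,
             mask=0b0011, mode=0b0000)
print(bin(cr[8]), bin(cr[9]))                 # 0b0 0b11
```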
By using twin predication, zeroing, and inversion (sm=~r3, dm=r10) for
example, it becomes possible to combine two Integers together in order
to set bits in CR Fields.  Likewise there are dozens of ways that CR
Predicates can be used, on the same sv.mtcrweird instruction.