openpower/sv/cr_int_predication.mdwn

[[!tag standards]]

# New instructions for CR/INT predication

**DRAFT STATUS**

See:

* main bugreport for crweirds
  <https://bugs.libre-soc.org/show_bug.cgi?id=533>
* <https://bugs.libre-soc.org/show_bug.cgi?id=527>
* <https://bugs.libre-soc.org/show_bug.cgi?id=569>
* <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>

Rationale:

Condition Registers are conceptually perfect for use as predicate masks,
the only problem being that typical Vector ISAs have quite comprehensive
mask-based instructions: set-before-first, popcount and much more.
In fact many Vector ISAs can use Vectors *as* masks, consequently the
entire Vector ISA is usually available for use in creating masks (one
exception being AVX512 which has a dedicated Mask regfile and opcodes).
Duplication of such operations (popcount etc) is not practical for SV
given the strategy of leveraging pre-existing Scalar instructions in a
minimalist way.

With the scalar OpenPOWER v3.0B ISA already having popcnt, cntlz and
others normally seen in Vector Mask operations it makes sense to allow
*both* scalar integers *and* CR-Vectors to be predicate masks.  That in
turn means that much more comprehensive interaction between CRs and scalar
Integers is required, because with the CR Predication Modes designating
CR *Fields* (not CR bits) as Predicate Elements, fast transfers between
CR *Fields* and the Integer Register File are needed.

The opportunity is therefore taken to augment CR logical arithmetic
as well, using a mask-based paradigm that takes into consideration
multiple bits of each CR Field (lt/gt/eq/so).  By contrast v3.0B Scalar
CR instructions (crand, crxor) only allow a single bit calculation, and
both mtcr and mfcr are CR-orientated rather than CR *Field* orientated.

Also strangely there is no v3.0 instruction for directly moving CR Fields,
only CR *bits*, so that is corrected here with `mcrfm`. The opportunity
is taken to allow inversion of CR Field bits, when copied.

Basic concept:

* CR-based instructions that perform simple AND/OR from any of the four
  bits of a CR field to create a single bit value (0/1) in an integer
  register
* Inverse of the same, taking a single bit value (0/1) from an integer
  register to selectively target any of the four bits of a given CR Field
* CR-to-CR version of the same, allowing multiple bits to be AND/OR/XORed
  in one hit.
* Optional Vectorisation of the same when SVP64 is implemented

Purpose:

* To provide a merged version of what is currently a multi-instruction
  sequence of CR operations (crand, cror, crxor) with mfcr and mtcrf,
  reducing instruction count.
* To provide a vectorised version of the same, suitable for advanced
  predication

Side-effects:

* mtcrweird when RA=0 is a means to set or clear arbitrary CR bits
  using immediates embedded within the instruction.

(Twin) Predication interactions:

* INT twin predication with zeroing is a way to copy an integer into
  CRs without necessarily needing the INT register (RA).  If RA is
  used, it is effectively ANDed (or negate-and-ANDed) with the INT
  Predicate
* CR twin predication with zeroing is likewise a way to interact with
  the incoming integer

This gets particularly powerful if data-dependent predication is also
enabled.  Further explanation is below.

# Bit ordering

Please see [[svp64/appendix]] regarding CR bit ordering and for
the definition of `CR{n}`.

# Instruction form and pseudocode

**DRAFT** Instruction format (use of MAJOR 19 not approved by
OPF ISA WG):

|0-5|6-10 |11|12-15|16-18|19-20|21-25  |26-30  |31|name       |
|---|---- |--|-----|-----|-----|-----  |-----  |--|----       |
|19 |RT   |  |mask |BFA  |     |XO[0:4]|XO[5:9]|/ |           |
|19 |     |  |     |     |     |1 //// |00011  |  |rsvd       |
|19 |RT   |M |mask |BFA  | 0 0 |0 mode |00011  |Rc|crrweird   |
|19 |RT   |M |mask |BFA  | 0 1 |0 mode |00011  |Rc|mfcrrweird |
|19 |RA   |M |mask |BF   | 1 0 |0 mode |00011  |0 |mtcrrweird |
|19 |RA   |M |mask |BF   | 1 0 |0 mode |00011  |1 |mtcrweird  |
|19 |BT   |M |mask |BFA  | 1 1 |0 mode |00011  |0 |crweirder  |
|19 |BF //|M |mask |BFA  | 1 1 |0 mode |00011  |1 |mcrfm      |

**crrweird**

mode is encoded in XO and is 4 bits

    crrweird: RT, BFA, M, mask.mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    # M=1: OR the four tests together, M=0: AND them
    result = n0|n1|n2|n3 if M else n0&n1&n2&n3
    RT[63] = result # MSB0 numbering, 63 is LSB
    if Rc:
        CR0 = analyse(RT)

When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
SVP64 type operation and as such can use Rc=1 and RC1 Data-dependent
Mode capability.

Also as noted below, element-width override bits normally used
on the source are instead used to allow multiple results to be packed
sequentially into the destination. *Destination elwidth overrides still apply*.
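
A minimal Python sketch of the scalar `crrweird` pseudocode above may help when experimenting with `mask`/`mode` combinations. It is an illustration rather than a reference model: the function name and the representation of CR fields, `mask` and `mode` as 4-element bit lists (indexed exactly as the pseudocode indexes them, so no MSB0/LSB0 choice is implied) are assumptions, and the Rc=1 update of CR0 is left out. As the pseudocode is written, a masked-out position contributes 0, so the AND form (M=0) can only return 1 when all four `mask` bits are set.

```python
def crrweird(creg, mask, mode, M):
    """Reduce one CR field to a single 0/1 value (sketch, Rc omitted).

    creg, mask, mode: sequences of four 0/1 ints, indexed [0]..[3]
    exactly as in the pseudocode (hypothetical representation).
    M: 1 selects the OR reduction, 0 selects the AND reduction.
    """
    # n[i] is 1 only when position i is selected by mask AND the CR bit
    # matches the corresponding mode bit
    n = [mask[i] & int(mode[i] == creg[i]) for i in range(4)]
    return int(any(n)) if M else int(all(n))


# OR form: "is position 2 of the field set?", ignoring the other three bits
bit2_set = crrweird([0, 1, 1, 0], mask=[0, 0, 1, 0], mode=[0, 0, 1, 0], M=1)
# AND form: "does the whole field equal the pattern 1,0,1,0?"
exact = crrweird([1, 0, 1, 0], mask=[1, 1, 1, 1], mode=[1, 0, 1, 0], M=0)
```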

**mfcrrweird**

mode is encoded in XO and is 4 bits

    mfcrrweird: RT, BFA, mask.mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    result = n0||n1||n2||n3 # concatenate into a 4-bit value
    RT[60:63] = result # MSB0 numbering, 63 is LSB
    if Rc:
        CR0 = analyse(RT)

When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
SVP64 type operation and as such can use Rc=1 and RC1 Data-dependent
Mode capability.

Also as noted below, element-width override bits normally used
on the source are instead used to allow multiple results to be packed
into the destination.  *Destination elwidth overrides still apply*.

**mtcrrweird**

mode is encoded in XO and is 4 bits

    mtcrrweird: BF, RA, M, mask.mode

    n0 = mask[0] & (mode[0] == RA[63])
    n1 = mask[1] & (mode[1] == RA[62])
    n2 = mask[2] & (mode[2] == RA[61])
    n3 = mask[3] & (mode[3] == RA[60])
    result = n0 || n1 || n2 || n3 # concatenate into a 4-bit value
    if M:
        result |= CR{BF} & ~mask # keep masked-out bits of BF
    CR{BF} = result

When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
SVP64 type operation and as such can use RC1 Data-dependent
Mode capability.
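
The following Python sketch models the `mtcrrweird` pseudocode above under illustrative assumptions: CR fields, `mask` and `mode` are 4-element bit lists indexed as in the pseudocode, RA is a plain Python int, and the pseudocode bit `RA[63-i]` (MSB0) is taken to be `(ra >> i) & 1`. The function name and signature are hypothetical. It shows M=0 overwriting the whole field and M=1 preserving the masked-out bits of the old field.

```python
def mtcrrweird(cr_bf, ra, mask, mode, M):
    """Return the new 4-bit CR field as a list (sketch of mtcrrweird).

    cr_bf, mask, mode: four 0/1 ints each, indexed as in the pseudocode.
    ra: integer register value; pseudocode bit RA[63-i] (MSB0) is
        assumed to be (ra >> i) & 1.
    """
    result = [mask[i] & int(mode[i] == ((ra >> i) & 1)) for i in range(4)]
    if M:
        # read-modify-write: masked-out bits keep their old value
        result = [result[i] | (cr_bf[i] & (1 - mask[i])) for i in range(4)]
    return result


old = [0, 0, 1, 1]
# M=0, full mask, mode all ones: the field becomes the low 4 bits of RA
overwrite = mtcrrweird(old, 0b0101, [1, 1, 1, 1], [1, 1, 1, 1], M=0)
# M=1, only positions 0 and 1 targeted: positions 2 and 3 are preserved
merged = mtcrrweird(old, 0b0001, [1, 1, 0, 0], [1, 1, 1, 1], M=1)
```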

**mtcrweird**

    mtcrweird: BF, RA, M, mask.mode

    reg = (RA|0)
    lsb = reg[63] # MSB0 numbering
    n0 = mask[0] & (mode[0] == lsb)
    n1 = mask[1] & (mode[1] == lsb)
    n2 = mask[2] & (mode[2] == lsb)
    n3 = mask[3] & (mode[3] == lsb)
    result = n0 || n1 || n2 || n3
    if M:
        result |= CR{BF} & ~mask
    CR{BF} = result

Note that when M=1 this operation is a Read-Modify-Write on the CR Field
BF. Masked-out bits of the 4-bit CR Field BF will not be changed when
M=1. Correspondingly when M=0 this operation is an overwrite: no read
of BF is required because the masked-out bits of the BF CR Field are
set to zero.

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
type operation that has 3-bit Data-dependent and 3-bit Predicate-result
capability (BF is 3 bits).

**mcrfm** - Move CR Field, masked.

    mcrfm: BF, BFA, M, mask.mode

    result = mask & CR{BFA}
    if M:
        result |= CR{BF} & ~mask
    result ^= mode
    CR{BF} = result

Note that when M=1 this operation is a Read-Modify-Write on the CR Field
BF: masked-out bits of the 4-bit CR Field BF are carried over from its
old value. Correspondingly when M=0 this operation is an overwrite: no
read of BF is required because the masked-out bits of the BF CR Field
are set to zero. In both cases `mode` is XORed in afterwards, so a
masked-out bit whose `mode` bit is 1 ends up inverted.

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
type operation that has 3-bit Data-dependent and 3-bit Predicate-result
capability (BF is 3 bits).

*Programmer's note: `mode` being XORed onto the result provides
considerable flexibility. Individual bits of BFA may be copied inverted
to BF by ensuring that `mask` and `mode` have the same bit set.  Also,
with M=0, individual bits in BF may be set to 1 by ensuring that the
required bit of `mask` is set to zero and the same bit in `mode` is
set to 1.*
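
The behaviour of `mode` in `mcrfm` can be checked with a short Python sketch of the pseudocode. The function name and the 4-element bit-list representation (indexed as in the pseudocode) are assumptions made for illustration only. The three calls show an inverted copy, a bit forced to 1 with M=0, and the effect of the final XOR on a preserved bit with M=1.

```python
def mcrfm(cr_bf, cr_bfa, mask, mode, M):
    """Masked CR field move with final XOR (sketch of the mcrfm pseudocode).

    All field arguments are lists of four 0/1 ints (hypothetical
    representation, indexed as in the pseudocode).
    """
    result = [mask[i] & cr_bfa[i] for i in range(4)]
    if M:
        # read-modify-write: bring in the masked-out bits of the old BF
        result = [result[i] | (cr_bf[i] & (1 - mask[i])) for i in range(4)]
    # mode is XORed in last, after the optional merge
    return [result[i] ^ mode[i] for i in range(4)]


src = [1, 0, 1, 0]
dst = [0, 0, 0, 0]
# same bits set in mask and mode: every bit of BFA is copied inverted
inverted = mcrfm(dst, src, [1, 1, 1, 1], [1, 1, 1, 1], M=0)
# M=0, position 3 masked out but set in mode: that bit is forced to 1
forced = mcrfm(dst, src, [1, 1, 1, 0], [0, 0, 0, 1], M=0)
# M=1, same operands, old BF position 3 is 1: the XOR flips it to 0
flipped = mcrfm([0, 0, 0, 1], src, [1, 1, 1, 0], [0, 0, 0, 1], M=1)
```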

**crweirder**

    crweirder: BT, BFA, M, mask.mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    BF = BT[2:4] # select CR
    bit = BT[0:1] # select bit of CR
    result = n0|n1|n2|n3 if M else n0&n1&n2&n3
    CR{BF}[bit] = result

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
type operation that has 5-bit Data-dependent and 5-bit Predicate-result
capability (BT is 5 bits).
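
As a rough illustration of `crweirder`, the sketch below applies the same AND/OR reduction to a source CR field and stores the single-bit result into one bit of a destination field. The CR is modelled as a list of eight 4-element bit lists, and the destination field and bit are passed separately rather than decoded from BT; both choices, like the function name, are assumptions of this sketch and not part of the proposal.

```python
def crweirder(cr, bf, bit, bfa, mask, mode, M):
    """Write the AND/OR reduction of CR field bfa into one bit of field bf.

    cr: list of CR fields, each a list of four 0/1 ints (modified in place).
    bf, bit: destination field and bit, i.e. the two parts of BT; how BT
        is split into them is left to the caller (assumption of this sketch).
    """
    creg = cr[bfa]
    n = [mask[i] & int(mode[i] == creg[i]) for i in range(4)]
    cr[bf][bit] = int(any(n)) if M else int(all(n))
    return cr


cr = [[0, 0, 0, 0] for _ in range(8)]
cr[1] = [0, 1, 0, 0]
# set position 2 of field 5 if field 1 has position 0 OR position 1 set
crweirder(cr, bf=5, bit=2, bfa=1, mask=[1, 1, 0, 0], mode=[1, 1, 0, 0], M=1)
```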

**Example Pseudo-ops:**

    mtcri BF, mode    mtcrweird BF, r0, 0, 0b1111.~mode
    mtcrset BF, mask  mtcrweird BF, r0, 1, mask.0b0000
    mtcrclr BF, mask  mtcrweird BF, r0, 1, mask.0b1111
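
The three pseudo-op expansions can be checked exhaustively against the `mtcrweird` pseudocode. The Python below is a sketch under stated assumptions: fields are 4-element bit lists indexed as in the pseudocode, `ra_lsb` stands for `reg[63]` (0 when RA=0), and the helper names are hypothetical. With RA=0 each `n` bit reduces to `mask & ~mode`, which is why `mtcri` passes the inverted immediate as `mode`.

```python
from itertools import product


def mtcrweird(cr_bf, ra_lsb, mask, mode, M):
    """Sketch of the mtcrweird pseudocode; ra_lsb stands for reg[63]."""
    n = [mask[i] & int(mode[i] == ra_lsb) for i in range(4)]
    if M:
        n = [n[i] | (cr_bf[i] & (1 - mask[i])) for i in range(4)]
    return n


def invert(bits):
    return [1 - b for b in bits]


# every possible 4-bit field value, as a list of bits
FIELDS = [list(bits) for bits in product((0, 1), repeat=4)]


def check_pseudo_ops():
    """Try every old field value against every 4-bit immediate."""
    for old in FIELDS:
        for imm in FIELDS:
            # mtcri BF, mode    == mtcrweird BF, r0, 0, 0b1111.~mode
            assert mtcrweird(old, 0, [1] * 4, invert(imm), 0) == imm
            # mtcrset BF, mask  == mtcrweird BF, r0, 1, mask.0b0000
            expect_set = [o | m for o, m in zip(old, imm)]
            assert mtcrweird(old, 0, imm, [0] * 4, 1) == expect_set
            # mtcrclr BF, mask  == mtcrweird BF, r0, 1, mask.0b1111
            expect_clr = [o & (1 - m) for o, m in zip(old, imm)]
            assert mtcrweird(old, 0, imm, [1] * 4, 1) == expect_clr
    return True
```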

# Vectorised versions involving GPRs

The name "weird" refers to a minor violation of SV rules when it comes
to deriving the Vectorised versions of these instructions.

Normally the progression of the SV for-loop would move on to the
next register.  Instead however in the scalar case these instructions
**remain in the same register** and insert or transfer between **bits**
of the scalar integer source or destination.

Further useful violation of the normal SV Elwidth override rules allows
for packing (or unpacking) of multiple CR test results into (or out of)
an Integer Element. Note that the CR (source operand) elwidth field is
utilised to determine the bit-packing size (1/2/4/8 with remaining
bits within the Integer element set to zero) whilst the INT (dest
operand) elwidth field still sets the Integer element size as usual
(8/16/32/default).

**crrweird: RT, BB, M, mask.mode**

    for i in range(VL):
        if BB.isvec:
            creg = CR{BB+i}
        else:
            creg = CR{BB}
        n0 = mask[0] & (mode[0] == creg[0])
        n1 = mask[1] & (mode[1] == creg[1])
        n2 = mask[2] & (mode[2] == creg[2])
        n3 = mask[3] & (mode[3] == creg[3])
        # OR or AND to a single bit
        result = n0|n1|n2|n3 if M else n0&n1&n2&n3
        if RT.isvec:
            # TODO: RT.elwidth override to be also added here
            # note, yes, really, the CR's elwidth field determines
            # the bit-packing into the INT!
            if BB.elwidth == 0b00:
                # pack 1 result into 64-bit registers
                iregs[RT+i][0..62] = 0
                iregs[RT+i][63] = result # sets LSB to result
            if BB.elwidth == 0b01:
                # pack 2 results sequentially into INT registers
                iregs[RT+i//2][0..61] = 0
                iregs[RT+i//2][63-(i%2)] = result
            if BB.elwidth == 0b10:
                # pack 4 results sequentially into INT registers
                iregs[RT+i//4][0..59] = 0
                iregs[RT+i//4][63-(i%4)] = result
            if BB.elwidth == 0b11:
                # pack 8 results sequentially into INT registers
                iregs[RT+i//8][0..55] = 0
                iregs[RT+i//8][63-(i%8)] = result
        else:
            iregs[RT][63-i] = result # results also in scalar INT

Note that:

* in the scalar case the CR-Vector assessment is stored bit-wise
  starting at the LSB of the destination scalar INT
* in the INT-vector case the results are packed into LSBs
  of the INT Elements, the packing arrangement depending on both
  elwidth override settings.
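
The packing described above can be tried out with a small Python sketch. It is only an illustration under several assumptions: BB is a vector, registers are Python ints in which MSB0 bit `63-k` is LSB0 bit `k`, the destination starts from zero (the pseudocode zeroes unused bits only in the vector case), `pack` stands in for the count selected by `BB.elwidth`, and the pending RT.elwidth handling is ignored. The function name is hypothetical.

```python
def sv_crrweird(crs, mask, mode, M, rt_isvec, pack=1):
    """Vectorised crrweird sketch; returns the destination register(s).

    crs: list of VL CR fields (BB assumed to be a vector), each a list
        of four 0/1 ints indexed as in the pseudocode.
    pack: results per destination register (1, 2, 4 or 8), standing in
        for BB.elwidth 0b00..0b11.  Only used when rt_isvec is true.
    """
    VL = len(crs)
    nregs = (VL + pack - 1) // pack if rt_isvec else 1
    iregs = [0] * nregs  # assumption: destinations start out as zero
    for i, creg in enumerate(crs):
        n = [mask[b] & int(mode[b] == creg[b]) for b in range(4)]
        result = int(any(n)) if M else int(all(n))
        if rt_isvec:
            # iregs[RT+i//pack][63-(i%pack)] in MSB0 numbering
            iregs[i // pack] |= result << (i % pack)
        else:
            # scalar RT: one bit per element, starting at the LSB
            iregs[0] |= result << i
    return iregs


crs = [[1, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
sel = [1, 0, 0, 0]  # test position 0 of each field
scalar = sv_crrweird(crs, sel, sel, M=1, rt_isvec=False)
pairs = sv_crrweird(crs, sel, sel, M=1, rt_isvec=True, pack=2)
```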

**mfcrrweird: RT, BFA, mask.mode**

Unlike `crrweird` the results are 4-bit wide, so the packing
will begin to spill over to other destination elements.  8 results per
destination at 4-bits each still fits into destination elwidth at 32-bit,
but for 16-bit and 8-bit obviously this does not fit, and must split
across to the next element.

When for example destination elwidth is 16-bit (0b10) the following packing
occurs:

- SVRM bits 6:7 equal to 0b00 - one 4-bit result element packed into the
  first 4-bits of the 16-bit destination element (in the first 4 LSBs)
- SVRM bits 6:7 equal to 0b01 - two 4-bit result elements packed into the
  first 8-bits of the 16-bit destination element (in the first 8 LSBs)
- SVRM bits 6:7 equal to 0b10 - four 4-bit result elements packed into each
  16-bit destination element
- SVRM bits 6:7 equal to 0b11 - eight 4-bit result elements, the first four
  of which are packed into the first 16-bit destination element, the
  second four of which are packed into the second 16-bit destination element.

Pseudocode example: note that dest elwidth overrides affect the
packing of results. BB.elwidth in effect requests how many 4-bit
result elements are to be packed per destination element, but
RT.elwidth determines the limit. Any parts of the destination elements
not containing results are set to zero.

    for i in range(VL):
        if BB.isvec:
            creg = CR{BB+i}
        else:
            creg = CR{BB}
        n0 = mask[0] & (mode[0] == creg[0])
        n1 = mask[1] & (mode[1] == creg[1])
        n2 = mask[2] & (mode[2] == creg[2])
        n3 = mask[3] & (mode[3] == creg[3])
        result = n0||n1||n2||n3 # 4-bit result
        if RT.isvec:
            # RT.elwidth override can affect the packing
            bwid = {0b00:64, 0b01:8, 0b10:16, 0b11:32}[RT.elwidth]
            # at most bwid//4 four-bit results fit into one element
            t4, t8 = min(4, bwid//4), min(8, bwid//4)
            # yes, really, the CR's elwidth field determines
            # the bit-packing into the INT!
            if BB.elwidth == 0b00:
                # pack 1 result into 64-bit registers
                idx, boff = i, 0
            if BB.elwidth == 0b01:
                # pack 2 results sequentially into INT registers
                idx, boff = i//2, i%2
            if BB.elwidth == 0b10:
                # pack 4 results sequentially into INT registers
                idx, boff = i//t4, i%t4
            if BB.elwidth == 0b11:
                # pack 8 results sequentially into INT registers
                idx, boff = i//t8, i%t8
        else:
            # exceeding VL=16 is UNDEFINED
            idx, boff = 0, i
        iregs[RT+idx][60-boff*4:63-boff*4] = result
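
To make the spill-over rule concrete, the following Python sketch computes only the destination slot `(idx, boff)` of each 4-bit result when RT is a vector. It assumes that the per-element limit is the number of 4-bit results that fit in the destination element (`bwid // 4`), which is what the 16-bit example in the prose above describes; the function name is hypothetical and nothing here is taken from a reference implementation.

```python
def mfcrrweird_slot(i, rt_elwidth, bb_elwidth):
    """Return (idx, boff) for result i when RT is a vector (sketch).

    rt_elwidth: 2-bit destination elwidth encoding (0b00 = default 64-bit).
    bb_elwidth: 2-bit source elwidth field, reused as the packing request.
    """
    bwid = {0b00: 64, 0b01: 8, 0b10: 16, 0b11: 32}[rt_elwidth]
    requested = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}[bb_elwidth]
    # assumption: each result is 4 bits, so bwid // 4 of them fit
    per_elem = min(requested, bwid // 4)
    return i // per_elem, i % per_elem


# 16-bit destination elements, eight results requested per element:
# the first four land in element 0, the next four in element 1
slots = [mfcrrweird_slot(i, 0b10, 0b11) for i in range(8)]
```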

# v3.1 setbc instructions

There are additional setb conditional instructions in v3.1 (p129)

    RT = (CR[BI] == 1) ? 1 : 0

which also negate that, and also return -1 / 0.  These are similar to
crweird but do not serve the same purpose.  Most notable is that crweird
acts on CR fields rather than the entire 32 bit CR.

# Predication Examples

Take the following example:

    r10 = 0b00010
    sv.mtcrweird/dm=r10/dz cr8.v, 0, 0b0011.0000

Here, RA is zero, so the source input is zero. The destination is CR Field
8, and the destination predicate mask indicates to target the first two
elements.  Destination predicate zeroing is enabled, and the destination
predicate is only set in the 2nd bit.  mask is 0b0011, mode is all zeros.

Let us first consider what should go into element 0 (CR Field 8):

* The destination predicate bit is zero, and zeroing is enabled.
* Therefore, what is in the source is irrelevant: the result must
  be zero.
* All four bits of CR Field 8 are therefore set to zero.

Now the second element, CR Field 9 (CR9):

* The 2nd bit of the destination predicate, r10, is 1. Therefore the
  computation of the result is relevant.
* RA is zero, therefore its 2nd bit is zero.  mask is 0b0011 and mode
  is 0b0000
* When calculating n0 thru n3 we get n0=1, n1=1, n2=0, n3=0
* Therefore, CR9 is set (using LSB0 ordering) to 0b0011, i.e. to mask.

It should be clear that this instruction uses bits of the integer
predicate to decide whether to set CR Fields to `(mask & ~mode)` or
to zero.  Thus, in effect, it is the integer predicate that has been
copied into the CR Fields.
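
The walkthrough above can be reproduced with a few lines of Python. This is a sketch under explicit assumptions: VL is 2, M is 0 (the example does not state it), RA is 0 so every source bit is 0, and `mask`/`mode` are bit lists with index 0 as the LSB, so that 0b0011 is written `[1, 1, 0, 0]`. The function name is hypothetical.

```python
def sv_mtcrweird_ra0_dz(VL, dm, mask, mode):
    """Sketch of sv.mtcrweird/dm=.../dz with RA=0 and M=0 (assumed).

    dm: destination predicate as an int, bit i (LSB0) guards element i.
    mask, mode: four 0/1 ints each, index 0 taken to be the LSB.
    Returns the VL CR fields written, starting at the first destination.
    """
    fields = []
    for i in range(VL):
        if (dm >> i) & 1:
            # source bit is 0 because RA=0, so the result is mask & ~mode
            fields.append([mask[b] & int(mode[b] == 0) for b in range(4)])
        else:
            # predicate bit clear and zeroing enabled: field is zeroed
            fields.append([0, 0, 0, 0])
    return fields


# r10 = 0b00010, mask = 0b0011, mode = 0b0000, destination cr8.v
cr8, cr9 = sv_mtcrweird_ra0_dz(VL=2, dm=0b00010,
                               mask=[1, 1, 0, 0], mode=[0, 0, 0, 0])
```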
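
By using twin predication, zeroing, and inversion (sm=~r3, dm=r10) for
example, it becomes possible to combine two Integers together in order
to set bits in CR Fields.  Likewise there are dozens of ways that CR
Predicates can be used, on the same sv.mtcrweird instruction.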
 411 to set bits in CR Fields.  Likewise there are dozens of ways that CR
 412 Predicates can be used, on the same sv.mtcrweird instruction.