[[!tag standards]]

# New instructions for CR/INT predication

**DRAFT STATUS**

See:

* main bugreport for crweirds
  <https://bugs.libre-soc.org/show_bug.cgi?id=533>
* <https://bugs.libre-soc.org/show_bug.cgi?id=527>
* <https://bugs.libre-soc.org/show_bug.cgi?id=569>
* <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>

Rationale:

Condition Registers are conceptually perfect for use as predicate masks.
The one difficulty is that typical Vector ISAs provide quite comprehensive
mask-based instructions: set-before-first, popcount and much more.
In fact many Vector ISAs can use Vectors *as* masks, so the
entire Vector ISA is usually available for use in creating masks (one
exception being AVX512, which has a dedicated Mask regfile and opcodes).
Duplicating such operations (popcount etc.) is not practical for SV,
given the strategy of leveraging pre-existing Scalar instructions in a
minimalist way.

With the scalar OpenPOWER v3.0B ISA already having popcnt, cntlz and
others normally seen in Vector Mask operations, it makes sense to allow
*both* scalar integers *and* CR-Vectors to be predicate masks.  That in
turn means that much more comprehensive interaction between CRs and scalar
Integers is required: because the CR Predication Modes designate
CR *Fields* (not CR bits) as Predicate Elements, fast transfers between
CR *Fields* and the Integer Register File are needed.

The opportunity is therefore taken to augment CR logical arithmetic
as well, using a mask-based paradigm that takes into consideration
multiple bits of each CR Field (lt/gt/eq/so).  By contrast v3.0B Scalar
CR instructions (crand, crxor) only allow a single bit calculation, and
both mtcr and mfcr are CR-orientated rather than CR *Field* orientated.

Also, v3.0 can move a CR Field only as a whole (`mcrf`) or one bit at
a time (the CR logical instructions): there is no instruction for moving
a selected subset of the bits of a CR Field.  That is corrected here with
`mcrfm`. The opportunity is also taken to allow inversion of CR Field bits
when copied.

Basic concept:

* CR-based instructions that perform simple AND/OR from any of the four
  bits of a CR Field to create a single bit value (0/1) in an integer
  register
* Inverse of the same, taking a single bit value (0/1) from an integer
  register to selectively target any of the four bits of a given CR Field
* CR-to-CR version of the same, allowing multiple bits to be AND/OR/XORed
  in one hit
* Optional Vectorisation of the same when SVP64 is implemented

Purpose:

* To provide a merged version of what is currently a multi-instruction
  sequence of CR operations (crand, cror, crxor) with mfcr and mtcrf,
  reducing instruction count
* To provide a vectorised version of the same, suitable for advanced
  predication

Side-effects:

* mtcrweird when RA=0 is a means to set or clear arbitrary CR bits
  using immediates embedded within the instruction.

(Twin) Predication interactions:

* INT twin predication with zeroing is a way to copy an integer into
  CRs without necessarily needing the INT register (RA).  If RA is used,
  it is effectively ANDed (or negate-and-ANDed) with the INT Predicate.
* CR twin predication with zeroing is likewise a way to interact with
  the incoming integer.

This becomes particularly powerful if data-dependent predication is also
enabled.  Further explanation is below.

# Bit ordering.

Please see [[svp64/appendix]] regarding CR bit ordering and for
the definition of `CR{n}`

# Instruction form and pseudocode

**DRAFT** Instruction format (use of MAJOR 19 not approved by
OPF ISA WG):

|0-5|6-10 |11|12-15|16-18|19-20|21-25  |26-30  |31|name       |
|---|---- |--|-----|-----|-----|-----  |-----  |--|----       |
|19 |RT   |  |mask |BFA  |     |XO[0:4]|XO[5:9]|/ |           |
|19 |     |  |     |     |     |1 //// |00011  |  |rsvd       |
|19 |RT   |M |mask |BFA  | 0 0 |0 mode |00011  |Rc|crrweird   |
|19 |RT   |M |mask |BFA  | 0 1 |0 mode |00011  |Rc|mfcrweird  |
|19 |RA   |M |mask |BF   | 1 0 |0 mode |00011  |0 |mtcrrweird |
|19 |RA   |M |mask |BF   | 1 0 |0 mode |00011  |1 |mtcrweird  |
|19 |BT   |M |mask |BFA  | 1 1 |0 mode |00011  |0 |crweirder  |
|19 |BF //|M |mask |BFA  | 1 1 |0 mode |00011  |1 |mcrfm      |

**crrweird**

`mode` is 4 bits and is encoded in XO

    crrweird: RT,BFA,M,mask,mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    result = n0|n1|n2|n3 if M else n0&n1&n2&n3
    RT[63] = result # MSB0 numbering, 63 is LSB
    if Rc:
        CR0 = analyse(RT)

When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
SVP64 type operation and as such can use Rc=1 and RC1 Data-dependent
Mode capability.

Also, as noted below, the element-width override bits normally used
on the source are instead used to allow multiple results to be packed
sequentially into the destination. *Destination elwidth overrides still apply*.
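
The scalar pseudocode above can be modelled in a few lines of Python. This is a minimal sketch, not a reference implementation: the Python function name `crrweird` and the use of 4-element lists for `creg`, `mask` and `mode` (indexed exactly as in the pseudocode, so that no CR bit-ordering decision is needed) are assumptions made for illustration, and Rc handling is left out.

```python
def crrweird(creg, mask, mode, M):
    """Return the single bit that the pseudocode places in RT[63].

    creg, mask and mode are 4-element sequences of 0/1 values, indexed
    the same way as creg[0..3], mask[0..3], mode[0..3] in the pseudocode.
    """
    n = [mask[i] & int(mode[i] == creg[i]) for i in range(4)]
    # M=1 selects the OR reduction, M=0 the AND reduction
    return int(any(n)) if M else int(all(n))


# OR form: one unmasked bit matching mode is enough
any_match = crrweird(creg=[1, 0, 0, 0], mask=[1, 1, 0, 0], mode=[1, 1, 0, 0], M=1)
# AND form with two bits masked out: those bits contribute 0
all_match = crrweird(creg=[1, 0, 0, 0], mask=[1, 1, 0, 0], mode=[1, 0, 0, 0], M=0)
# AND form with a full mask and every bit matching mode
full_mask = crrweird(creg=[1, 0, 1, 0], mask=[1, 1, 1, 1], mode=[1, 0, 1, 0], M=0)
print(any_match, all_match, full_mask)
```

Taken literally, the AND form (M=0) can only return 1 when all four `mask` bits are set, because a masked-out bit contributes a zero to the reduction.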

**mfcrrweird**

`mode` is 4 bits and is encoded in XO

    mfcrrweird: RT,BFA,mask,mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    result = n0||n1||n2||n3
    RT[60:63] = result # MSB0 numbering, 63 is LSB
    if Rc:
        CR0 = analyse(RT)

When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
SVP64 type operation and as such can use Rc=1 and RC1 Data-dependent
Mode capability.

Also, as noted below, the element-width override bits normally used
on the source are instead used to allow multiple results to be packed
into the destination.  *Destination elwidth overrides still apply*.
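
As a rough illustration of the 4-bit variant, the sketch below returns the value written to RT[60:63] as a small integer. It assumes that `||` is concatenation in the usual Power ISA pseudocode sense, so that `n0` lands in RT[60] and `n3` in RT[63] (the LSB); the Python function name and the list representation of `creg`, `mask` and `mode` are illustrative only.

```python
def mfcrrweird(creg, mask, mode):
    """Return the 4-bit value that the pseudocode writes to RT[60:63]."""
    n = [mask[i] & int(mode[i] == creg[i]) for i in range(4)]
    # n0||n1||n2||n3: n0 is the most significant of the four bits
    return (n[0] << 3) | (n[1] << 2) | (n[2] << 1) | n[3]


creg = [1, 0, 1, 0]
same = mfcrrweird(creg, mask=[1, 1, 1, 1], mode=[1, 1, 1, 1])      # plain copy
inverted = mfcrrweird(creg, mask=[1, 1, 1, 1], mode=[0, 0, 0, 0])  # inverted copy
partial = mfcrrweird(creg, mask=[1, 1, 0, 0], mode=[1, 1, 1, 1])   # two bits only
print(bin(same), bin(inverted), bin(partial))
```

With `mode` all ones the selected bits are transferred unchanged, and with `mode` all zeros they are transferred inverted.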

**mtcrrweird**

`mode` is 4 bits and is encoded in XO

    mtcrrweird: BF,RA,M,mask,mode

    n0 = mask[0] & (mode[0] == RA[63])
    n1 = mask[1] & (mode[1] == RA[62])
    n2 = mask[2] & (mode[2] == RA[61])
    n3 = mask[3] & (mode[3] == RA[60])
    result = n0 || n1 || n2 || n3
    if M:
        result |= CR{BF} & ~mask
    CR{BF} = result

When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
SVP64 type operation and as such can use RC1 Data-dependent
Mode capability.
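
The transfer in the GPR-to-CR direction can be sketched the same way. The model below is an assumption-laden illustration: the CR Field, `mask` and `mode` are 4-element lists indexed as in the pseudocode, `ra` is a plain Python integer, and RA[63-i] in MSB0 numbering is taken to be bit `i` counting from the LSB.

```python
def mtcrrweird(cr_bf, ra, M, mask, mode):
    """Return the new contents of CR{BF} as a 4-element list of 0/1."""
    result = []
    for i in range(4):
        ra_bit = (ra >> i) & 1               # RA[63-i] in MSB0 numbering
        n = mask[i] & int(mode[i] == ra_bit)
        if M:
            # read-modify-write: masked-out bits of CR{BF} are kept
            n |= cr_bf[i] & (1 - mask[i])
        result.append(n)
    return result


old = [0, 0, 1, 1]
overwrite = mtcrrweird(old, ra=0b0101, M=0, mask=[1, 1, 0, 0], mode=[1, 1, 1, 1])
merge = mtcrrweird(old, ra=0b0101, M=1, mask=[1, 1, 0, 0], mode=[1, 1, 1, 1])
print(overwrite, merge)
```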

**mtcrweird**

    mtcrweird: BF,RA,M,mask,mode

    reg = (RA|0)
    lsb = reg[63] # MSB0 numbering
    n0 = mask[0] & (mode[0] == lsb)
    n1 = mask[1] & (mode[1] == lsb)
    n2 = mask[2] & (mode[2] == lsb)
    n3 = mask[3] & (mode[3] == lsb)
    result = n0 || n1 || n2 || n3
    if M:
        result |= CR{BF} & ~mask
    CR{BF} = result

Note that when M=1 this operation is a Read-Modify-Write on the CR Field
BF. Masked-out bits of the 4-bit CR Field BF will not be changed when
M=1. Correspondingly when M=0 this operation is an overwrite: no read
of BF is required because the masked-out bits of the BF CR Field are
set to zero.

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
type operation that has 3-bit Data-dependent and 3-bit Predicate-result
capability (BF is 3 bits).

**mcrfm** - Move CR Field, masked.

This instruction copies, sets, or inverts parts of a CR Field
into another CR Field.  The v3.0B CR logical instructions (for example
`crmove`, the extended mnemonic for `cror`) copy only one bit of the CR,
from any arbitrary bit to any other arbitrary bit, and `mcrf` copies an
entire 4-bit CR Field unconditionally, whereas `mcrfm` copies an entire
4-bit CR Field or masked parts thereof.
Unlike the single-bit instructions, the bits of the CR Field may not
change position: the EQ bit from the source may only go into the EQ bit
of the destination (optionally inverted, set, or cleared).

    mcrfm: BF,BFA,M,mask,mode

    result = mask & CR{BFA}
    if M:
        result |= CR{BF} & ~mask
    result ^= mode
    CR{BF} = result

When M=1 this operation is a Read-Modify-Write on the CR Field
BF. Masked-out bits of the 4-bit CR Field BF keep their value when
M=1, unless the corresponding bit of `mode` is set, in which case the
final XOR inverts them. Correspondingly when M=0 this operation is an
overwrite: no read of BF is required because the masked-out bits of the
BF CR Field are set to zero before `mode` is XORed in.

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
type operation that has 3-bit Data-dependent and 3-bit Predicate-result
capability (BF is 3 bits).

*Programmer's note: `mode` being XORed onto the result provides
considerable flexibility. Individual bits of BFA may be copied inverted
to BF by ensuring that `mask` and `mode` have the same bit set.  Also,
with M=0, individual bits in BF may be set to 1 by ensuring that the
required bit of `mask` is set to zero and the same bit in `mode` is
set to 1.*
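
The combinations described in the programmer's note can be checked with a direct transcription of the `mcrfm` pseudocode. In this sketch CR Fields, `mask` and `mode` are plain 4-bit Python integers; because only same-position bitwise operations are involved, no assumption about which bit is EQ, LT and so on is needed. The function name is used here purely as a Python model.

```python
def mcrfm(cr_bf, cr_bfa, M, mask, mode):
    """Return the new 4-bit contents of CR{BF}."""
    result = mask & cr_bfa
    if M:
        result |= cr_bf & ~mask      # keep masked-out bits of the old BF
    result ^= mode                   # mode inverts after the merge
    return result & 0xF


# copy one bit inverted: the same bit is set in mask and in mode
inv_of_one = mcrfm(cr_bf=0, cr_bfa=0b0100, M=0, mask=0b0100, mode=0b0100)
inv_of_zero = mcrfm(cr_bf=0, cr_bfa=0b0000, M=0, mask=0b0100, mode=0b0100)

# set a bit to 1 (M=0): mask bit clear, mode bit set
set_bit = mcrfm(cr_bf=0, cr_bfa=0b1010, M=0, mask=0b1110, mode=0b0001)

# the same mask/mode with M=1 inverts the old BF bit instead of setting it
rmw_bit = mcrfm(cr_bf=0b0001, cr_bfa=0b1010, M=1, mask=0b1110, mode=0b0001)
print(bin(inv_of_one), bin(inv_of_zero), bin(set_bit), bin(rmw_bit))
```

The last case shows why the "set to 1" idiom relies on M=0: with M=1 the old BF bit is merged in first and then inverted by `mode`.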

**crweirder**

    crweirder: BT,BFA,M,mask,mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    BF = BT[2:4] # select CR
    bit = BT[0:1] # select bit of CR
    result = n0|n1|n2|n3 if M else n0&n1&n2&n3
    CR{BF}[bit] = result

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
type operation that has 5-bit Data-dependent and 5-bit Predicate-result
capability (BT is 5 bits).

**Example Pseudo-ops:**

    mtcri BF, mode    mtcrweird BF, r0, 0, 0b1111,~mode
    mtcrset BF, mask  mtcrweird BF, r0, 1, mask,0b0000
    mtcrclr BF, mask  mtcrweird BF, r0, 1, mask,0b1111
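
The three pseudo-ops can be checked exhaustively against a small model of `mtcrweird`. This is a sketch under stated assumptions: CR Fields, `mask` and `mode` are 4-bit Python integers, the RA=0 case is modelled by passing the value 0, and the Python helpers `mtcri`, `mtcrset` and `mtcrclr` simply mirror the expansions listed above.

```python
def mtcrweird(cr_bf, ra_value, M, mask, mode):
    """Return the new 4-bit contents of CR{BF}."""
    lsb = ra_value & 1
    # n_i = mask_i & (mode_i == lsb), evaluated on all four bits at once
    match = mode if lsb else (~mode & 0xF)
    result = mask & match
    if M:
        result |= cr_bf & ~mask & 0xF    # keep masked-out bits of CR{BF}
    return result


def mtcri(cr_bf, imm):
    return mtcrweird(cr_bf, 0, 0, 0b1111, ~imm & 0xF)


def mtcrset(cr_bf, mask):
    return mtcrweird(cr_bf, 0, 1, mask, 0b0000)


def mtcrclr(cr_bf, mask):
    return mtcrweird(cr_bf, 0, 1, mask, 0b1111)


for old in range(16):
    for v in range(16):
        assert mtcri(old, v) == v                  # load an immediate
        assert mtcrset(old, v) == old | v          # set the selected bits
        assert mtcrclr(old, v) == old & ~v & 0xF   # clear the selected bits
print("pseudo-op expansions agree with the model")
```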

# Vectorised versions involving GPRs

The name "weird" refers to a minor violation of SV rules when it comes
to deriving the Vectorised versions of these instructions.

Normally the progression of the SV for-loop would move on to the
next register.  Instead, when the integer operand is scalar, these
instructions **remain in the same register** and insert or transfer
between **bits** of the scalar integer source or destination.  The reason
is that when using CR Fields as predicate masks and there is a need to
transfer into a GPR, again for use as a predicate mask, the CR Field bits
need to be efficiently packed into that one GPR (r3, r10 or r31).

A further useful violation of the normal SV Elwidth override rules allows
for packing (or unpacking) of multiple CR test results into (or out of)
an Integer Element. Note that the CR (source operand) elwidth field is
utilised to determine the bit-packing size (1/2/4/8, with remaining
bits within the Integer element set to zero) whilst the INT (dest
operand) elwidth field still sets the Integer element size as usual
(8/16/32/default).

**crrweird: RT, BB, mask.mode**

    for i in range(VL):
        if BB.isvec:
            creg = CR{BB+i}
        else:
            creg = CR{BB}
        n0 = mask[0] & (mode[0] == creg[0])
        n1 = mask[1] & (mode[1] == creg[1])
        n2 = mask[2] & (mode[2] == creg[2])
        n3 = mask[3] & (mode[3] == creg[3])
        # OR or AND to a single bit
        result = n0|n1|n2|n3 if M else n0&n1&n2&n3
        if RT.isvec:
            # TODO: RT.elwidth override to be also added here
            # note, yes, really, the CR's elwidth field determines
            # the bit-packing into the INT!
            if BB.elwidth == 0b00:
                # pack 1 result into 64-bit registers
                iregs[RT+i][0..62] = 0
                iregs[RT+i][63] = result # sets LSB to result
            if BB.elwidth == 0b01:
                # pack 2 results sequentially into INT registers
                iregs[RT+i//2][0..61] = 0
                iregs[RT+i//2][63-(i%2)] = result
            if BB.elwidth == 0b10:
                # pack 4 results sequentially into INT registers
                iregs[RT+i//4][0..59] = 0
                iregs[RT+i//4][63-(i%4)] = result
            if BB.elwidth == 0b11:
                # pack 8 results sequentially into INT registers
                iregs[RT+i//8][0..55] = 0
                iregs[RT+i//8][63-(i%8)] = result
        else:
            iregs[RT][63-i] = result # results also in scalar INT

Note that:

* in the scalar case the CR-Vector assessment
  is stored bit-wise starting at the LSB of the
  destination scalar INT
* in the INT-vector case the results are packed into the LSBs
  of the INT Elements, the packing arrangement depending on both
  elwidth override settings.
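
The two cases in the notes above can be sketched as runnable Python. This is an illustrative model, not the specification: registers are Python integers in a list, CR Fields are 4-element lists indexed as in the pseudocode, bit `63-k` in MSB0 numbering is treated as bit `k` counting from the LSB, and all function and parameter names (`sv_crrweird`, `rt_isvec`, `bb_elwidth` and so on) are assumptions. As in the pseudocode, the RT.elwidth override is not modelled.

```python
PACK = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}   # BB.elwidth -> results per register


def crrweird_bit(creg, mask, mode, M):
    n = [mask[i] & int(mode[i] == creg[i]) for i in range(4)]
    return int(any(n)) if M else int(all(n))


def sv_crrweird(iregs, crs, RT, BB, VL, M, mask, mode,
                rt_isvec=True, bb_isvec=True, bb_elwidth=0b00):
    for i in range(VL):
        creg = crs[BB + i] if bb_isvec else crs[BB]
        result = crrweird_bit(creg, mask, mode, M)
        if rt_isvec:
            per = PACK[bb_elwidth]
            idx, bit = RT + i // per, i % per
            iregs[idx] &= (1 << per) - 1     # zero everything above the packed bits
        else:
            idx, bit = RT, i                 # scalar RT: bit i of a single register
        iregs[idx] = (iregs[idx] & ~(1 << bit)) | (result << bit)
    return iregs


# eight CR Fields; the test selects index 2 of each Field
crs = [[0, 0, b, 0] for b in (1, 0, 1, 1, 0, 0, 1, 0)]
sel = [0, 0, 1, 0]
scalar = sv_crrweird([0] * 4, crs, RT=0, BB=0, VL=8, M=1, mask=sel, mode=sel,
                     rt_isvec=False)
packed = sv_crrweird([0] * 4, crs, RT=0, BB=0, VL=8, M=1, mask=sel, mode=sel,
                     bb_elwidth=0b10)
print(bin(scalar[0]), [bin(r) for r in packed])
```

With a scalar RT all eight results land in one register, which is the form needed when the GPR is subsequently used as a predicate mask; with a vector RT and a BB elwidth of 0b10 they are split four per register.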

**mfcrrweird: RT, BFA, mask.mode**

Unlike `crrweird` the results are 4 bits wide, so the packing
will begin to spill over to other destination elements.  8 results per
destination at 4 bits each still fit into a destination elwidth of 32-bit,
but for 16-bit and 8-bit this does not fit, and the results must split
across to the next element.

When for example destination elwidth is 16-bit (0b10) the following packing
occurs:

- SVRM bits 6:7 equal to 0b00 - one 4-bit result element packed into the
  first 4-bits of the 16-bit destination element (in the first 4 LSBs)
- SVRM bits 6:7 equal to 0b01 - two 4-bit result elements packed into the
  first 8-bits of the 16-bit destination element (in the first 8 LSBs)
- SVRM bits 6:7 equal to 0b10 - four 4-bit result elements packed into each
  16-bit destination element
- SVRM bits 6:7 equal to 0b11 - eight 4-bit result elements, the first four
  of which are packed into the first 16-bit destination element, the
  second four of which are packed into the second 16-bit destination element.

Pseudocode example: note that dest elwidth overrides affect the
packing of results. BB.elwidth in effect requests how many 4-bit
result elements are to be packed per destination element, but RT.elwidth
determines the limit. Any parts of the destination elements not containing
results are set to zero.

    for i in range(VL):
        if BB.isvec:
            creg = CR{BB+i}
        else:
            creg = CR{BB}
        n0 = mask[0] & (mode[0] == creg[0])
        n1 = mask[1] & (mode[1] == creg[1])
        n2 = mask[2] & (mode[2] == creg[2])
        n3 = mask[3] & (mode[3] == creg[3])
        result = n0||n1||n2||n3 # 4-bit result
        if RT.isvec:
            # RT.elwidth override can affect the packing
            bwid = {0b00:64, 0b01:8, 0b10:16, 0b11:32}[RT.elwidth]
            # at most bwid//4 4-bit results fit in one destination element
            t4, t8 = min(4, bwid//4), min(8, bwid//4)
            # yes, really, the CR's elwidth field determines
            # the bit-packing into the INT!
            if BB.elwidth == 0b00:
                # pack 1 result into 64-bit registers
                idx, boff = i, 0
            if BB.elwidth == 0b01:
                # pack 2 results sequentially into INT registers
                idx, boff = i//2, i%2
            if BB.elwidth == 0b10:
                # pack up to 4 results sequentially into INT registers
                idx, boff = i//t4, i%t4
            if BB.elwidth == 0b11:
                # pack up to 8 results sequentially into INT registers
                idx, boff = i//t8, i%t8
        else:
            # exceeding VL=16 is UNDEFINED
            idx, boff = 0, i
        iregs[RT+idx][60-boff*4:63-boff*4] = result
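
The packing arithmetic for a vector RT can be tried out in isolation. The sketch below makes several assumptions that are not stated in the text: the destination is modelled as a list of elements rather than 64-bit registers, the per-element limit is taken to be `bwid // 4` (the number of 4-bit results that fit in one destination element, consistent with the 16-bit worked example above), and the names `BWID`, `WANT` and `pack` are hypothetical.

```python
BWID = {0b00: 64, 0b01: 8, 0b10: 16, 0b11: 32}   # RT.elwidth -> element width in bits
WANT = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}      # BB.elwidth -> results requested


def pack(results, bb_elwidth, rt_elwidth):
    """Pack 4-bit results into destination elements, lowest nibble first."""
    bwid = BWID[rt_elwidth]
    per = min(WANT[bb_elwidth], bwid // 4)       # RT.elwidth caps the request
    elems = [0] * ((len(results) + per - 1) // per)
    for i, r in enumerate(results):
        idx, boff = i // per, i % per
        elems[idx] |= (r & 0xF) << (4 * boff)    # unused bits stay zero
    return elems


nibbles = [1, 2, 3, 4, 5, 6, 7, 8]
as_16bit = pack(nibbles, bb_elwidth=0b11, rt_elwidth=0b10)
as_32bit = pack(nibbles, bb_elwidth=0b11, rt_elwidth=0b11)
print([hex(e) for e in as_16bit], [hex(e) for e in as_32bit])
```

Requesting eight results per element with a 16-bit destination yields two elements of four results each, matching the last bullet of the 16-bit example.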

# v3.1 setbc instructions

There are additional setb conditional instructions in v3.1 (p129)

    RT = (CR[BI] == 1) ? 1 : 0

There are also variants which negate that test, and variants which
return -1 / 0.  These are similar to crweird but do not serve the same
purpose.  Most notable is that crweird acts on CR Fields rather than
the entire 32-bit CR.

# Predication Examples

Take the following example:

    r10 = 0b00010
    sv.mtcrweird/dm=r10/dz cr8.v, 0, 0b0011.0000

Here, RA is zero, so the source input is zero. The destination is a
vector of CR Fields starting at CR Field 8, and the walk-through below
covers its first two elements.  Destination predicate zeroing is enabled,
and the destination predicate, r10, has only its second bit (bit 1) set.
mask is 0b0011, mode is all zeros.

Let us first consider what should go into element 0 (CR Field 8):

* The destination predicate bit is zero, and zeroing is enabled.
* Therefore, what is in the source is irrelevant: the result must
  be zero.
* All four bits of CR Field 8 are therefore set to zero.

Now the second element, CR Field 9 (CR9):

* The second bit (bit 1) of the destination predicate, r10, is 1.
  Therefore the computation of the result is relevant.
* RA is zero, therefore the source bit is zero.  mask is 0b0011 and
  mode is 0b0000.
* When calculating n0 thru n3 we get n0=1, n1=1, n2=0, n3=0.
* Therefore, CR9 is set (using LSB0 ordering) to 0b0011, i.e. to mask.

It should be clear that this instruction uses bits of the integer
predicate to decide whether to set CR Fields to `(mask & ~mode)` or
to zero.  Thus, in effect, it is the integer predicate that has been
copied into the CR Fields.
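
The walk-through can be reproduced with a short simulation. This is a sketch under several assumptions: VL=2, CR Fields held as 4-bit integers in LSB0 order in a Python list, only destination predication with zeroing modelled, and the helper name `sv_mtcrweird_dz` invented for the illustration.

```python
def mtcrweird(cr_bf, ra_value, M, mask, mode):
    lsb = ra_value & 1
    match = mode if lsb else (~mode & 0xF)   # mode_i == lsb, for all four bits
    result = mask & match
    if M:
        result |= cr_bf & ~mask & 0xF
    return result


def sv_mtcrweird_dz(crs, BF, VL, dest_pred, ra_value, M, mask, mode):
    """Destination-predicated mtcrweird with zeroing (dz) enabled."""
    for i in range(VL):
        if (dest_pred >> i) & 1:
            crs[BF + i] = mtcrweird(crs[BF + i], ra_value, M, mask, mode)
        else:
            crs[BF + i] = 0                  # dz: masked-out elements are zeroed
    return crs


crs = [0b1111] * 16                          # arbitrary starting contents
r10 = 0b00010
sv_mtcrweird_dz(crs, BF=8, VL=2, dest_pred=r10, ra_value=0,
                M=0, mask=0b0011, mode=0b0000)
print(bin(crs[8]), bin(crs[9]))
```

CR8 ends up as zero because its predicate bit is clear, and CR9 ends up as `mask & ~mode`, in line with the two bullet lists above.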

By using twin predication, zeroing, and inversion (sm=~r3, dm=r10) for
example, it becomes possible to combine two Integers together in order
to set bits in CR Fields.  Likewise there are dozens of ways that CR
Predicates can be used, on the same sv.mtcrweird instruction.