openpower/sv/cr_int_predication.mdwn

   1 [[!tag standards]]
   2
   3 # New instructions for CR/INT predication
   4
   5 **DRAFT STATUS**
   6
   7 See:
   8
   9 * main bugreport for crweirds
  10   <https://bugs.libre-soc.org/show_bug.cgi?id=533>
  11 * <https://bugs.libre-soc.org/show_bug.cgi?id=527>
  12 * <https://bugs.libre-soc.org/show_bug.cgi?id=569>
  13 * <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>
  14
  15 Rationale:
  16
  17 Condition Registers are conceptually perfect for use as predicate masks,
  18 the only problem being that typical Vector ISAs have quite comprehensive
  19 mask-based instructions: set-before-first, popcount and much more.
  20 In fact many Vector ISAs can use Vectors *as* masks, consequently the
  21 entire Vector ISA is usually available for use in creating masks (one
  22 exception being AVX512 which has a dedicated Mask regfile and opcodes).
  23 Duplication of such operations (popcount etc) is not practical for SV
  24 given the strategy of leveraging pre-existing Scalar instructions in a
  25 minimalist way.
  26
  27 With the scalar OpenPOWER v3.0B ISA having already popcnt, cntlz and
  28 others normally seen in Vector Mask operations it makes sense to allow
  29 *both* scalar integers *and* CR-Vectors to be predicate masks.  That in
  30 turn means that much more comprehensive interaction between CRs and scalar
  31 Integers is required, because with the CR Predication Modes designating
  32 CR *Fields* (not CR bits) as Predicate Elements, fast transfers between
  33 CR *Fields* and the Integer Register File are needed.
  34
  35 The opportunity is therefore taken to also augment CR logical arithmetic
  36 as well, using a mask-based paradigm that takes into consideration
  37 multiple bits of each CR Field (lt/gt/eq/so).  By contrast v3.0B Scalar
  38 CR instructions (crand, crxor) only allow a single bit calculation, and
  39 both mtcr and mfcr are CR-orientated rather than CR *Field* orientated.
  40
  41 Also strangely there is no v3.0 instruction for directly moving CR Fields,
  42 only CR *bits*, so that is corrected here with `mcrfm`. The opportunity
  43 is taken to allow inversion of CR Field bits, when copied.
  44
  45 Basic concept:
  46
  47 * CR-based instructions that perform simple AND/OR from any four bits
  48   of a CR field to create a single bit value (0/1) in an integer register
  49 * Inverse of the same, taking a single bit value (0/1) from an integer
  50   register to selectively target any four bits of a given CR Field
  51 * CR-to-CR version of the same, allowing multiple bits to be AND/OR/XORed
  52   in one hit.
  53 * Optional Vectorisation of the same when SVP64 is implemented
  54
  55 Purpose:
  56
  57 * To provide a merged version of what is currently a multi-instruction
  58   sequence of CR operations (crand, cror, crxor) with mfcr and mtcrf,
  59   reducing instruction count.
  60 * To provide a vectorised version of the same, suitable for advanced
  61   predication
  62
  63 Useful side-effects:
  64
  65 * mtcrweird when RA=0 is a means to set or clear
  66   multiple arbitrary CR Field bits simultaneously,
  67   using immediates embedded within the instruction.
  68 * With SVP64 on the weird instructions there is bit-for-bit interaction
  69   between GPR predicate masks (r3, r10, r31) and the source
  70   or destination GPR, in ways that are not possible with other
  71   SVP64 instructions because normal SVP64 is bit-per-element.
  72   On these weird instructions the element in effect *is* a bit.
  73 * `mfcrweird` mitigates a need to add `conflictd`, part of
  74   [[sv/vector_ops]], as well as allowing more complex comparisons.
  75
  76 # Bit ordering.
  77
  78 Please see [[svp64/appendix]] regarding CR bit ordering and for
  79 the definition of `CR{n}`
  80
  81 # Instruction form and pseudocode
  82
  83 **DRAFT** Instruction format (use of MAJOR 19 not approved by
  84 OPF ISA WG):
  85
  86 |0-5|6-10 |11|12-15|16-18|19-20|21-25  |26-30  |31|name      |
  87 |---|---- |--|-----|-----|-----|-----  |-----  |--|----      |
  88 |19 |RT   |  |fmsk |BFA  |     |XO[0:4]|XO[5:9]|/ |          |
  89 |19 |     |  |     |     |     |1 //// |00011  |  |rsvd      |
  90 |19 |RT   |M |fmsk |BFA  | 0 0 |0 fmap |00011  |Rc|crrweird  |
  91 |19 |RT   |M |fmsk |BFA  | 0 1 |0 fmap |00011  |Rc|mfcrweird |
  92 |19 |RA   |M |fmsk |BF   | 1 0 |0 fmap |00011  |0 |mtcrrweird |
  93 |19 |RA   |M |fmsk |BF   | 1 0 |0 fmap |00011  |1 |mtcrweird |
  94 |19 |BT   |M |fmsk |BFA  | 1 1 |0 fmap |00011  |0 |crweirder |
  95 |19 |BF //|M |fmsk |BFA  | 1 1 |0 fmap |00011  |1 |mcrfm     |
  96
  97 **crrweird**
  98
  99 fmap is encoded in XO and is 4 bits
 100
 101     crrweird: RT,BFA,M,fmsk,fmap
 102
 103     creg = CR{BFA}
 104     n0 = fmsk[0] & (fmap[0] == creg[0])
 105     n1 = fmsk[1] & (fmap[1] == creg[1])
 106     n2 = fmsk[2] & (fmap[2] == creg[2])
 107     n3 = fmsk[3] & (fmap[3] == creg[3])
 108     n = (n0||n1||n2||n3) & fmsk
 109     result = (n != 0) if M else (n == fmsk)
 110     RT[63] = result # MSB0 numbering, 63 is LSB
 111     If Rc:
 112         CR0 = analyse(RT)
 113
 114 When used with SVP64 Prefixing this is a [[sv/normal]]
 115 SVP64 type operation and as such can use Rc=1 and RC1 Data-dependent
 116 Mode capability.
 117
 118 Also as noted below, the element-width override bits normally used
 119 on the source are instead used to allow multiple results to be packed
 120 sequentially into the destination. *Destination elwidth overrides still apply*.
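
The scalar `crrweird` pseudocode can be modelled in a few lines of Python. This is only a sketch: the CR Field, `fmsk` and `fmap` are held as 4-bit integers that are assumed to share one bit ordering, so the per-bit test `fmsk[i] & (fmap[i] == creg[i])` collapses into a single XOR-and-mask expression. The function name and argument order are illustrative, and Rc handling is left out.

```python
def crrweird(creg, m, fmsk, fmap):
    """Model of the single-bit result that crrweird places in the LSB of RT."""
    # n has bit i set when fmsk[i] is set and creg[i] equals fmap[i]
    n = fmsk & ~(fmap ^ creg) & 0xF
    # M=1: OR of the selected tests; M=0: AND of the selected tests
    return int(n != 0) if m else int(n == fmsk)


creg = 0b0010
# Two bits are selected by fmsk, only one of them matches fmap.
print(crrweird(creg, 1, 0b1100, 0b1000))  # OR mode
print(crrweird(creg, 0, 0b1100, 0b1000))  # AND mode
```

In OR mode (M=1) a single matching bit is enough; in AND mode (M=0) every bit selected by `fmsk` has to match.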
 121
 122 **mfcrweird**
 123
 124 fmap is encoded in XO and is 4 bits
 125
 126     mfcrweird: RT,BFA,fmsk,fmap
 127
 128     creg = CR{BFA}
 129     n0 = fmsk[0] & (fmap[0] == creg[0])
 130     n1 = fmsk[1] & (fmap[1] == creg[1])
 131     n2 = fmsk[2] & (fmap[2] == creg[2])
 132     n3 = fmsk[3] & (fmap[3] == creg[3])
 133     result = n0||n1||n2||n3
 134     RT[60:63] = result # MSB0 numbering, 63 is LSB
 135     If Rc:
 136         CR0 = analyse(RT)
 137
 138 When used with SVP64 Prefixing this is a [[sv/normal]]
 139 SVP64 type operation and as such can use Rc=1 and RC1 Data-dependent
 140 Mode capability.
 141
 142 Also as noted below, the element-width override bits normally used
 143 on the source are instead used to allow multiple results to be packed
 144 into the destination.  *Destination elwidth overrides still apply*.
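
A small Python sketch of the scalar pseudocode above shows how `fmsk` and `fmap` together select between a straight copy, an inverted copy, and a partial copy of a CR Field into the low 4 bits of RT. It assumes that the concatenation `n0||n1||n2||n3` keeps the same bit ordering as the 4-bit operands, which are modelled here as plain integers; the function name is illustrative.

```python
def mfcrweird(creg, fmsk, fmap):
    """Model of the 4-bit value placed in the low 4 bits of RT."""
    # bit i is set when fmsk[i] is set and creg[i] equals fmap[i]
    return fmsk & ~(fmap ^ creg) & 0xF


creg = 0b1010
straight = mfcrweird(creg, 0b1111, 0b1111)  # test every bit for equality with 1
inverted = mfcrweird(creg, 0b1111, 0b0000)  # test every bit for equality with 0
partial = mfcrweird(creg, 0b0011, 0b1111)   # only the two low bits are selected
print(f"{straight:04b} {inverted:04b} {partial:04b}")
```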
 145
 146 **mtcrrweird**
 147
 148 fmap is encoded in XO and is 4 bits
 149
 150     mtcrrweird: BF,RA,M,fmsk,fmap
 151
 152     a = (RA|0)
 153     n0 = fmsk[0] & (fmap[0] == a[63])
 154     n1 = fmsk[1] & (fmap[1] == a[62])
 155     n2 = fmsk[2] & (fmap[2] == a[61])
 156     n3 = fmsk[3] & (fmap[3] == a[60])
 157     result = n0 || n1 || n2 || n3
 158     if M:
 159         result |= CR{BF} & ~fmsk
 160     CR{BF} = result
 161
 162 When used with SVP64 Prefixing this is a [[sv/normal]]
 163 SVP64 type operation and as such can use RC1 Data-dependent
 164 Mode capability.
 165
 166 **mtcrweird**
 167
 168     mtcrweird: BF,RA,M,fmsk,fmap
 169
 170     reg = (RA|0)
 171     lsb = reg[63] # MSB0 numbering
 172     n0 = fmsk[0] & (fmap[0] == lsb)
 173     n1 = fmsk[1] & (fmap[1] == lsb)
 174     n2 = fmsk[2] & (fmap[2] == lsb)
 175     n3 = fmsk[3] & (fmap[3] == lsb)
 176     result = n0 || n1 || n2 || n3
 177     if M:
 178         result |= CR{BF} & ~fmsk
 179     CR{BF} = result
 180
 181 Note that when M=1 this operation is a Read-Modify-Write on the CR Field
 182 BF. Masked-out bits of the 4-bit CR Field BF will not be changed when
 183 M=1. Correspondingly when M=0 this operation is an overwrite: no read
 184 of BF is required because the masked-out bits of the BF CR Field are
 185 set to zero.
 186
 187 When used with SVP64 Prefixing this is a [[sv/cr_ops]] SVP64
 188 type operation that has 3-bit Data-dependent and 3-bit Predicate-result
 189 capability (BF is 3 bits).
 190
 191 **mcrfm** - Move CR Field, masked.
 192
 193 This instruction copies, sets, or inverts parts of a CR Field
 194 into another CR Field.  `mcrf` copies only one bit of the CR
 195 from any arbitrary bit to any other arbitrary bit, whereas
 196 `mcrfm` copies an entire 4-bit CR Field (or masked parts thereof).
 197 Unlike `mcrf` the bits of the CR Field may not change position:
 198 the EQ bit from the source may only go into the EQ bit of the
 199 destination (optionally inverted, set, or cleared).
 200
 201     mcrfm: BF,BFA,M,fmsk,fmap
 202
 203     result = fmsk & CR{BFA}
 204     if M:
 205         result |= CR{BF} & ~fmsk
 206     result ^= fmap
 207     CR{BF} = result
 208
 209 When M=1 this operation is a Read-Modify-Write on the CR Field
 210 BF. Masked-out bits of the 4-bit CR Field BF will not be changed when
 211 M=1. Correspondingly when M=0 this operation is an overwrite: no read
 212 of BF is required because the masked-out bits of the BF CR Field are
 213 set to zero.
 214
 215 When used with SVP64 Prefixing this is a [[sv/cr_ops]] SVP64
 216 type operation that has 3-bit Data-dependent and 3-bit Predicate-result
 217 capability (BF is 3 bits).
 218
 219 *Programmer's note: `fmap` being XORed onto the result provides
 220 considerable flexibility. Individual bits of BFA may be copied inverted
 221 to BF by ensuring that `fmsk` and `fmap` have the same bit set.  Also,
 222 when M=0, individual bits in BF may be set to 1 by ensuring that the required
 223 bit of `fmsk` is set to zero and the same bit in `fmap` is set to 1.*
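
The `mcrfm` pseudocode is short enough to transcribe directly. The sketch below models CR Fields, `fmsk` and `fmap` as 4-bit integers (an assumption of this example, as is the function name) and exercises the cases from the programmer's note.

```python
def mcrfm(cr_bf, cr_bfa, m, fmsk, fmap):
    """Model of the new value of CR Field BF after mcrfm."""
    result = fmsk & cr_bfa
    if m:
        # read-modify-write: bring in the masked-out bits of BF
        result |= cr_bf & ~fmsk & 0xF
    result ^= fmap
    return result & 0xF


src = 0b1010
copy = mcrfm(0b0000, src, 0, 0b1111, 0b0000)      # plain copy
inverted = mcrfm(0b0000, src, 0, 0b1111, 0b1111)  # inverted copy
set_low = mcrfm(0b0000, src, 0, 0b1110, 0b0001)   # copy upper 3 bits, set low bit
print(f"{copy:04b} {inverted:04b} {set_low:04b}")
```

Following the pseudocode literally, `fmap` is XORed in after the read-modify-write step, so with M=1 a bit that is clear in `fmsk` and set in `fmap` ends up as the inverse of the old BF bit rather than being forced to 1.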
 224
 225 **crweirder**
 226
 227     crweirder: BT,BFA,M,fmsk,fmap
 228
 229     creg = CR{BFA}
 230     n0 = fmsk[0] & (fmap[0] == creg[0])
 231     n1 = fmsk[1] & (fmap[1] == creg[1])
 232     n2 = fmsk[2] & (fmap[2] == creg[2])
 233     n3 = fmsk[3] & (fmap[3] == creg[3])
 234     bf = BT[2:4] # select CR field
 235     bit = BT[0:1] # select bit of CR field
 236     n = (n0||n1||n2||n3) & fmsk
 237     result = (n != 0) if M else (n == fmsk)
 238     CR{bf}[bit] = result
 239
 240 When used with SVP64 Prefixing this is a [[sv/cr_ops]] SVP64
 241 type operation that has 5-bit Data-dependent and 5-bit Predicate-result
 242 capability (BT is 5 bits).
 243
 244 **Example Pseudo-ops:**
 245
 246     mtcri BF, fmap    mtcrweird BF, r0, 0, 0b1111,~fmap
 247     mtcrset BF, fmsk  mtcrweird BF, r0, 1, fmsk,0b0000
 248     mtcrclr BF, fmsk  mtcrweird BF, r0, 1, fmsk,0b1111
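
The three pseudo-ops can be checked against the `mtcrweird` pseudocode with a short Python model. This is a sketch under stated assumptions: CR Fields, `fmsk` and `fmap` are 4-bit integers, `r0` as RA reads as zero per `(RA|0)`, and the Python function names simply mirror the mnemonics.

```python
def mtcrweird(cr_bf, ra_val, m, fmsk, fmap):
    """Model of the new value of CR Field BF after mtcrweird."""
    lsb = ra_val & 1
    lsb4 = 0xF if lsb else 0x0          # the LSB compared against every fmap bit
    result = fmsk & ~(fmap ^ lsb4) & 0xF
    if m:
        # read-modify-write: masked-out bits of BF are preserved
        result |= cr_bf & ~fmsk & 0xF
    return result


def mtcri(cr_bf, fmap):
    return mtcrweird(cr_bf, 0, 0, 0b1111, ~fmap & 0xF)


def mtcrset(cr_bf, fmsk):
    return mtcrweird(cr_bf, 0, 1, fmsk, 0b0000)


def mtcrclr(cr_bf, fmsk):
    return mtcrweird(cr_bf, 0, 1, fmsk, 0b1111)


# Exhaustive check over every old field value and every 4-bit immediate.
for old in range(16):
    for imm in range(16):
        assert mtcri(old, imm) == imm
        assert mtcrset(old, imm) == old | imm
        assert mtcrclr(old, imm) == old & ~imm & 0xF
print("pseudo-op expansions agree with the model")
```

With RA=0 the computed bits reduce to `fmsk & ~fmap`, which is why `mtcri` passes the inverted immediate as `fmap`.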
 249
 250 # Vectorised versions involving GPRs
 251
 252 The name "weird" refers to a minor violation of SV rules when it comes
 253 to deriving the Vectorised versions of these instructions.
 254
 255 Normally the progression of the SV for-loop would move on to the
 256 next register.  Instead however in the scalar case these instructions
 257 **remain in the same register** and insert or transfer between **bits**
 258 of the scalar integer source or destination.  The reason is that when
 259 using CR Fields as predicate masks and there is a need to transfer
 260 into a GPR, again for use as a predicate mask, the CR Field bits
 261 need to be efficiently packed into that one GPR (r3, r10 or r31).
 262
 263 A further useful violation of the normal SV Elwidth override rules allows
 264 for packing (or unpacking) of multiple CR test results into (or out of)
 265 an Integer Element. Note that the CR (source operand) elwidth field is
 266 utilised to determine the bit-packing size (1/2/4/8 with remaining
 267 bits within the Integer element set to zero) whilst the INT (dest
 268 operand) elwidth field still sets the Integer element size as usual
 269 (8/16/32/default).
 270
 271 **crrweird: RT, BB, fmsk.fmap**
 272
 273     for i in range(VL):
 274         if BB.isvec:
 275             creg = CR{BB+i}
 276         else:
 277             creg = CR{BB}
 278         n0 = fmsk[0] & (fmap[0] == creg[0])
 279         n1 = fmsk[1] & (fmap[1] == creg[1])
 280         n2 = fmsk[2] & (fmap[2] == creg[2])
 281         n3 = fmsk[3] & (fmap[3] == creg[3])
 282         # OR or AND to a single bit
 283         n = (n0||n1||n2||n3) & fmsk
 284         result = (n != 0) if M else (n == fmsk)
 285         if RT.isvec:
 286             # TODO: RT.elwidth override to be also added here
 287             # note, yes, really, the CR's elwidth field determines
 288             # the bit-packing into the INT!
 289             if BB.elwidth == 0b00:
 290                 # pack 1 result into 64-bit registers
 291                 iregs[RT+i][0..62] = 0
 292                 iregs[RT+i][63] = result # sets LSB to result
 293             if BB.elwidth == 0b01:
 294                 # pack 2 results sequentially into INT registers
 295                 iregs[RT+i//2][0..61] = 0
 296                 iregs[RT+i//2][63-(i%2)] = result
 297             if BB.elwidth == 0b10:
 298                 # pack 4 results sequentially into INT registers
 299                 iregs[RT+i//4][0..59] = 0
 300                 iregs[RT+i//4][63-(i%4)] = result
 301             if BB.elwidth == 0b11:
 302                 # pack 8 results sequentially into INT registers
 303                 iregs[RT+i//8][0..55] = 0
 304                 iregs[RT+i//8][63-(i%8)] = result
 305         else:
 306             iregs[RT][63-i] = result # results also in scalar INT
 307
 308 Note that:
 309
 310 * in the scalar case the CR-Vector assessment
 311   is stored bit-wise starting at the LSB of the
 312    destination scalar INT
 313 * in the INT-vector case the results are packed into LSBs
 314   of the INT Elements, the packing arrangement depending on both
 315   elwidth override settings.
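
The packing behaviour of the vectorised `crrweird` can be sketched in Python as follows. Several simplifications are assumed: the CR source is always a vector (passed as a list of 4-bit integers), destination registers start out as zero, the RT elwidth override is ignored, and `pack` (1, 2, 4 or 8) stands in for the BB.elwidth encodings 0b00 to 0b11. MSB0 bit `63-k` of the pseudocode corresponds to bit `k` counted from the LSB here. All names are illustrative.

```python
def cr_test(creg, m, fmsk, fmap):
    """Single-bit crrweird test on one 4-bit CR Field."""
    n = fmsk & ~(fmap ^ creg) & 0xF
    return int(n != 0) if m else int(n == fmsk)


def sv_crrweird(crs, m, fmsk, fmap, rt_isvec, pack=1):
    """Return a list of registers (vector RT) or one integer (scalar RT)."""
    if not rt_isvec:
        out = 0
        for i, creg in enumerate(crs):
            out |= cr_test(creg, m, fmsk, fmap) << i   # one bit per element
        return out
    regs = [0] * ((len(crs) + pack - 1) // pack)
    for i, creg in enumerate(crs):
        regs[i // pack] |= cr_test(creg, m, fmsk, fmap) << (i % pack)
    return regs


crs = [0b0010, 0b0000, 0b0010, 0b0010, 0b0000]   # VL=5
print(bin(sv_crrweird(crs, 1, 0b0010, 0b0010, rt_isvec=False)))
print(sv_crrweird(crs, 1, 0b0010, 0b0010, rt_isvec=True, pack=2))
```

The scalar-destination form is the one that turns a vector of CR Fields into a GPR predicate mask in a single instruction.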
 316
 317 **mfcrweird: RT, BFA, fmsk.fmap**
 318
 319 Unlike `crrweird` the results are 4 bits wide, so the packing
 320 will begin to spill over to other destination elements.  8 results per
 321 destination at 4 bits each still fit into a destination elwidth of 32-bit,
 322 but for 16-bit and 8-bit this does not fit, and the results must be split
 323 across to the next element.
 324
 325 When for example destination elwidth is 16-bit (0b10) the following packing
 326 occurs:
 327
 328 - SVRM bits 6:7 equal to 0b00 - one 4-bit result element packed into the
 329   first 4-bits of the 16-bit destination element (in the first 4 LSBs)
 330 - SVRM bits 6:7 equal to 0b01 - two 4-bit result elements packed into the
 331   first 8-bits of the 16-bit destination element (in the first 8 LSBs)
 332 - SVRM bits 6:7 equal to 0b10 - four 4-bit result elements packed into each
 333   16-bit destination element
 334 - SVRM bits 6:7 equal to 0b11 - eight 4-bit result elements, the first four
 335   of which are packed into the first 16-bit destination element, the
 336   second four of which are packed into the second 16-bit destination element.
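
The 16-bit list above follows a simple rule: the number of 4-bit results per destination element is the requested count, capped by how many 4-bit results fit into the element. A minimal Python sketch of that rule follows. The helper name is hypothetical, the destination elwidth encoding (0b00 for 64-bit, 0b01 for 8-bit, 0b10 for 16-bit, 0b11 for 32-bit) is taken from the pseudocode in this section, and the requested counts 1/2/4/8 correspond to SVRM bits 6:7.

```python
def results_per_element(src_elwidth, dest_elwidth):
    """How many 4-bit results are packed into each destination element."""
    bwid = {0b00: 64, 0b01: 8, 0b10: 16, 0b11: 32}[dest_elwidth]
    requested = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}[src_elwidth]
    # each result is 4 bits wide, so at most bwid // 4 of them fit
    return min(requested, bwid // 4)


# 16-bit destination elements: the requested 8 is capped at 4
print([results_per_element(s, 0b10) for s in (0b00, 0b01, 0b10, 0b11)])
```

With a cap of 4, result `i` lands in destination element `i // 4` at 4-bit slot `i % 4`, which is the "first four, second four" split described in the final list item.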
 337
 338 Pseudocode example: note that dest elwidth overrides affect the
 339 packing of results. BB.elwidth in effect requests how many 4-bit
 340 result elements would like to be packed, but RT.elwidth determines
 341 the limit. Any parts of the destination elements not containing
 342 results are set to zero.
 343
 344     for i in range(VL):
 345         if BB.isvec:
 346             creg = CR{BB+i}
 347         else:
 348             creg = CR{BB}
 349         n0 = fmsk[0] & (fmap[0] == creg[0])
 350         n1 = fmsk[1] & (fmap[1] == creg[1])
 351         n2 = fmsk[2] & (fmap[2] == creg[2])
 352         n3 = fmsk[3] & (fmap[3] == creg[3])
 353         result = n0||n1||n2||n3 # 4-bit result
 354         if RT.isvec:
 355             # RT.elwidth override limits the packing: bwid//4 results fit
 356             bwid = {0b00:64, 0b01:8, 0b10:16, 0b11:32}[RT.elwidth]
 357             t4, t8 = min(4, bwid//4), min(8, bwid//4)
 358             # yes, really, the CR's elwidth field determines
 359             # the bit-packing into the INT!
 360             if BB.elwidth == 0b00:
 361                 # pack 1 result into 64-bit registers
 362                 idx, boff = i, 0
 363             if BB.elwidth == 0b01:
 364                 # pack 2 results sequentially into INT registers
 365                 idx, boff = i//2, i%2
 366             if BB.elwidth == 0b10:
 367                 # pack 4 results sequentially into INT registers
 368                 idx, boff = i//t4, i%t4
 369             if BB.elwidth == 0b11:
 370                 # pack 8 results sequentially into INT registers
 371                 idx, boff = i//t8, i%t8
 372         else:
 373             # exceeding VL=16 is UNDEFINED
 374             idx, boff = 0, i
 375         iregs[RT+idx][60-boff*4:63-boff*4] = result
 376
 377 # v3.1 setbc instructions
 378
 379 There are additional `setbc`-family conditional instructions in v3.1 (p129)
 380
 381     RT = (CR[BI] == 1) ? 1 : 0
 382
 383 which also have variants that negate the test, and that return -1 / 0.  These
 384 are similar to crweird but do not serve the same purpose.  Most notable is that
 385 crweird acts on CR fields rather than the entire 32-bit CR.
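
For comparison, the four v3.1 variants (`setbc`, `setbcr`, `setnbc`, `setnbcr`) can be sketched in Python. The sketch assumes the CR is held as a 32-bit integer with BI=0 addressing its most significant bit, and represents -1 as a 64-bit all-ones value; the helper `cr_bit` is a name introduced for this example only.

```python
MASK64 = (1 << 64) - 1


def cr_bit(cr, bi):
    """Bit BI of a 32-bit CR image, BI=0 being the most significant bit."""
    return (cr >> (31 - bi)) & 1


def setbc(cr, bi):
    return 1 if cr_bit(cr, bi) else 0


def setbcr(cr, bi):
    return 0 if cr_bit(cr, bi) else 1


def setnbc(cr, bi):
    return MASK64 if cr_bit(cr, bi) else 0      # -1 as 64-bit two's complement


def setnbcr(cr, bi):
    return 0 if cr_bit(cr, bi) else MASK64


cr = 0x80000000   # only the first CR bit is set
print(setbc(cr, 0), setbcr(cr, 0), hex(setnbc(cr, 0)), hex(setnbcr(cr, 1)))
```

Each of these tests exactly one CR bit, whereas the weird instructions combine up to four bits of one CR Field through `fmsk` and `fmap`.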
 386
 387 # Predication Examples
 388
 389 Take the following example:
 390
 391     r10 = 0b00010
 392     sv.mtcrweird/dm=r10/dz cr8.v, 0, 0b0011.0000
 393
 394 Here, RA is zero, so the source input is zero. The destination is CR Field
 395 8, and the destination predicate mask indicates to target the first two
 396 elements.  Destination predicate zeroing is enabled, and the destination
 397 predicate is only set in the 2nd bit.  fmsk is 0b0011, fmap is all zeros.
 398
 399 Let us first consider what should go into element 0 (CR Field 8):
 400
 401 * The destination predicate bit is zero, and zeroing is enabled.
 402 * Therefore, what is in the source is irrelevant: the result must
 403   be zero.
 404 * All four bits of CR Field 8 are therefore set to zero.
 405
 406 Now the second element, CR Field 9 (CR9):
 407
 408 * The second bit of the destination predicate, r10, is 1. Therefore the
 409   computation of the result is relevant.
 410 * RA is zero, therefore its second bit is also zero.  fmsk is 0b0011 and fmap is 0b0000
 411 * When calculating n0 thru n3 (LSB0 ordering) we get n0=1, n1=1, n2=0, n3=0
 412 * Therefore, CR9 is set (using LSB0 ordering) to 0b0011, i.e. to fmsk.
 413
 414 It should be clear that this instruction uses bits of the integer
 415 predicate to decide whether to set CR Fields to `(fmsk & ~fmap)` or
 416 to zero.  Thus, in effect, it is the integer predicate that has been
 417 copied into the CR Fields.
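
The walk-through can be reproduced with a small Python model. It is a sketch resting on several assumptions that the example does not spell out: VL is 2, M is 0, element `i` takes bit `i` of RA as its source bit, and the CR Fields are held in a dictionary keyed by field number. The function name `sv_mtcrweird` is illustrative.

```python
def sv_mtcrweird(cr, bf, ra_val, fmsk, fmap, dm, dz, vl, m=0):
    """Model of a predicated, vectorised mtcrweird on a dict of CR Fields."""
    for i in range(vl):
        if (dm >> i) & 1:
            src = (ra_val >> i) & 1            # the element is a single bit of RA
            src4 = 0xF if src else 0x0
            result = fmsk & ~(fmap ^ src4) & 0xF
            if m:
                result |= cr[bf + i] & ~fmsk & 0xF
            cr[bf + i] = result
        elif dz:
            cr[bf + i] = 0                     # predicate bit clear, zeroing enabled
    return cr


cr = {8: 0b1111, 9: 0b1111}                    # arbitrary starting contents
sv_mtcrweird(cr, bf=8, ra_val=0, fmsk=0b0011, fmap=0b0000,
             dm=0b00010, dz=True, vl=2)
print({k: f"{v:04b}" for k, v in cr.items()})
```

CR Field 8 ends up as zero because its predicate bit is clear, and CR Field 9 ends up as `fmsk & ~fmap`.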
 418
 419 By using twin predication, zeroing, and inversion (sm=~r3, dm=r10) for
 420 example, it becomes possible to combine two Integers together in order
 421 to set bits in CR Fields.  Likewise there are dozens of ways that CR
 422 Predicates can be used, on the same sv.mtcrweird instruction.