openpower/sv/cr_int_predication.mdwn

   1 [[!tag standards]]
   2
   3 # New instructions for CR/INT predication
   4
   5 **DRAFT STATUS**
   6
   7 See:
   8
   9 * main bugreport for crweirds
  10   <https://bugs.libre-soc.org/show_bug.cgi?id=533>
  11 * <https://bugs.libre-soc.org/show_bug.cgi?id=527>
  12 * <https://bugs.libre-soc.org/show_bug.cgi?id=569>
  13 * <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>
  14
  15 Rationale:
  16
  17 Condition Registers are conceptually perfect for use as predicate masks,
  18 the only problem being that typical Vector ISAs have quite comprehensive
  19 mask-based instructions: set-before-first, popcount and much more.
  20 In fact many Vector ISAs can use Vectors *as* masks, consequently the
  21 entire Vector ISA is usually available for use in creating masks (one
  22 exception being AVX512 which has a dedicated Mask regfile and opcodes).
  23 Duplication of such operations (popcount etc) is not practical for SV
  24 given the strategy of leveraging pre-existing Scalar instructions in a
  25 minimalist way.
  26
With the scalar OpenPOWER v3.0B ISA already having popcnt, cntlz and
others normally seen in Vector Mask operations it makes sense to allow
*both* scalar integers *and* CR-Vectors to be predicate masks.  That in
turn means that much more comprehensive interaction between CRs and scalar
Integers is required, because with the CR Predication Modes designating
CR *Fields* (not CR bits) as Predicate Elements, fast transfers between
CR *Fields* and the Integer Register File are needed.
  34
The opportunity is therefore taken to augment CR logical arithmetic
as well, using a mask-based paradigm that takes into consideration
multiple bits of each CR Field (lt/gt/eq/so).  By contrast v3.0B Scalar
CR instructions (crand, crxor) only allow a single bit calculation, and
both mtcr and mfcr are CR-orientated rather than CR *Field* orientated.

Also, strangely, there is no v3.0 instruction for directly moving CR Fields,
only CR *bits*, so that is corrected here with `mcrfm`. The opportunity
is taken to allow inversion of CR Field bits when they are copied.
  44
  45 Basic concept:
  46
* CR-based instructions that perform simple AND/OR on any of the four bits
  of a CR Field to create a single bit value (0/1) in an integer register
* Inverse of the same, taking a single bit value (0/1) from an integer
  register to selectively target any of the four bits of a given CR Field
* CR-to-CR version of the same, allowing multiple bits to be AND/OR/XORed
  in one hit.
* Optional Vectorisation of the same when SVP64 is implemented
  54
  55 Purpose:
  56
* To provide a merged version of what is currently a multi-instruction
  sequence of CR operations (crand, cror, crxor) with mfcr and mtcrf,
  reducing instruction count.
* To provide a vectorised version of the same, suitable for advanced
  predication
  62
  63 Side-effects:
  64
  65 * mtcrweird when RA=0 is a means to set or clear arbitrary CR bits
  66   using immediates embedded within the instruction.
  67
  68 (Twin) Predication interactions:
  69
* INT twin predication with zeroing is a way to copy an integer into
  CRs without necessarily needing the INT register (RA).  If RA is used, it is
  effectively ANDed (or negate-and-ANDed) with the INT Predicate
* CR twin predication with zeroing is likewise a way to interact with
  the incoming integer

This becomes particularly powerful if data-dependent predication is also
enabled.  Further explanation is below.
  78
  79 # Bit ordering.
  80
  81 Please see [[svp64/appendix]] regarding CR bit ordering and for
  82 the definition of `CR{n}`
  83
  84 # Instruction form and pseudocode
  85
  86 **DRAFT** Instruction format (use of MAJOR 19 not approved by
  87 OPF ISA WG):
  88
  89 |0-5|6-10 |11|12-15|16-18|19-20|21-25  |26-30  |31|name      |
  90 |---|---- |--|-----|-----|-----|-----  |-----  |--|----      |
  91 |19 |RT   |  |mask |BFA  |     |XO[0:4]|XO[5:9]|/ |          |
  92 |19 |RT   |M |mask |BFA  | 0 0 |XO[0:4]|0 mode |Rc|crrweird  |
  93 |19 |RA   |M |mask |BF   | 0 1 |XO[0:4]|0 mode |/ |mtcrweird |
  94 |19 |BT   |M |mask |BFA  | 1 0 |XO[0:4]|0 mode |/ |crweirder |
  95 |19 |BF //|M |mask |BFA  | 1 1 |XO[0:4]|0 mode |0 |crweird   |
  96 |19 |BF //|M |mask |BFA  | 1 1 |XO[0:4]|0 mode |1 |mcrfm     |
  97
  98 **crrweird**
  99
 100 mode is encoded in XO and is 4 bits
 101
 102 bit 19=0, bit 20=0
 103
    crrweird: RT, BFA, M, mask.mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    result = n0|n1|n2|n3 if M else n0&n1&n2&n3
    RT[63] = result # MSB0 numbering, 63 is LSB
    if Rc:
        CR0 = analyse(RT)
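The single-bit reduction above can be sketched in Python. This is a minimal illustrative model rather than a reference implementation: the names `bit_msb0` and `crrweird_result` are hypothetical, CR Fields are assumed to be plain 4-bit integers, and the Rc=1 update of CR0 is left out.

```python
def bit_msb0(value, index, width=4):
    """Return bit `index` of `value` in MSB0 numbering (bit 0 is the MSB)."""
    return (value >> (width - 1 - index)) & 1


def crrweird_result(creg, mask, mode, M):
    """Model the bit that crrweird writes to RT[63].

    creg, mask and mode are 4-bit integers; M=1 selects OR, M=0 selects AND.
    """
    n = [bit_msb0(mask, i) & int(bit_msb0(mode, i) == bit_msb0(creg, i))
         for i in range(4)]
    return int(any(n)) if M else int(all(n))


# OR mode: does either of the two leftmost CR Field bits equal 1?
print(crrweird_result(creg=0b0100, mask=0b1100, mode=0b1100, M=1))  # 1
# AND mode: do all four CR Field bits match the pattern 0b1010?
print(crrweird_result(creg=0b1010, mask=0b1111, mode=0b1010, M=0))  # 1
```

Going by the pseudocode as written, a masked-out bit forces its `n` term to zero, so with M=0 (AND) the result can only be 1 when all four bits of `mask` are set.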
 115
 116 When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
 117 SVP64 type operation and as such can use Rc=1 and RC1 Data-dependent
 118 Mode capability
 119
 120 **mtcrweird**
 121
 122 bit 19=0, bit 20=1
 123
 124     mtcrweird: BF, RA, M, mask.mode
 125
 126     reg = (RA|0)
 127     lsb = reg[63] # MSB0 numbering
 128     n0 = mask[0] & (mode[0] == lsb)
 129     n1 = mask[1] & (mode[1] == lsb)
 130     n2 = mask[2] & (mode[2] == lsb)
 131     n3 = mask[3] & (mode[3] == lsb)
 132     result = n0 || n1 || n2 || n3
 133     if M:
 134         result |= CR{BF} & ~mask
 135     CR{BF} = result
 136
 137 Note that when M=1 this operation is a Read-Modify-Write on the CR Field
 138 BF. Masked-out bits of the 4-bit CR Field BF will not be changed when
 139 M=1. Correspondingly when M=0 this operation is an overwrite: no read
 140 of BF is required because the masked-out bits of the BF CR Field are
 141 set to zero.
 142
 143 When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
 144 type operation that has 3-bit Data-dependent and 3-bit Predicate-result
 145 capability (BF is 3 bits)
 146
 147 **crweird**
 148
bit 19=1, bit 20=1, bit 31=0
 150
 151     crweird: BF, BFA, M, mask.mode
 152
 153     creg = CR{BFA}
 154     n0 = mask[0] & (mode[0] == creg[0])
 155     n1 = mask[1] & (mode[1] == creg[1])
 156     n2 = mask[2] & (mode[2] == creg[2])
 157     n3 = mask[3] & (mode[3] == creg[3])
 158     result = n0 || n1 || n2 || n3
 159     if M:
 160         result |= CR{BF} & ~mask
 161     CR{BF} = result
 162
 163 Note that when M=1 this operation is a Read-Modify-Write on the CR Field
 164 BF. Masked-out bits of the 4-bit CR Field BF will not be changed when
 165 M=1. Correspondingly when M=0 this operation is an overwrite: no read
 166 of BF is required because the masked-out bits of the BF CR Field are
 167 set to zero.
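The difference between M=0 (overwrite) and M=1 (Read-Modify-Write) can be sketched as follows. This is an illustrative model only: the name `crweird_model` is hypothetical, CR Fields are assumed to be 4-bit integers, and `||` in the pseudocode is assumed to be bit concatenation, so that each `n` term lands in the same bit position as its `mask` bit.

```python
def crweird_model(cr_bf, cr_bfa, M, mask, mode):
    """Return the new 4-bit value of CR Field BF.

    A result bit is set where `mask` selects the position and the
    CR{BFA} bit equals the corresponding `mode` bit.
    """
    result = 0
    for i in range(4):
        sel = 1 << i
        if (mask & sel) and (mode & sel) == (cr_bfa & sel):
            result |= sel
    if M:
        # Read-Modify-Write: keep the masked-out bits of the old BF value
        result |= cr_bf & ~mask & 0xF
    return result


old_bf, bfa = 0b1100, 0b1010
print(format(crweird_model(old_bf, bfa, 0, 0b0011, 0b0010), "04b"))  # 0011
print(format(crweird_model(old_bf, bfa, 1, 0b0011, 0b0010), "04b"))  # 1111
```

With M=0 the two upper (masked-out) bits of BF are cleared. With M=1 they keep their previous value, which is why a read of BF is needed in that case.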
 168
 169 When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
 170 type operation that has 3-bit Data-dependent and 3-bit Predicate-result
 171 capability (BF is 3 bits)
 172
 173 **mcrfm** - Move CR Field, masked.
 174
bit 19=1, bit 20=1, bit 31=1
 176
 177     mcrfm: BF, BFA, M, mask.mode
 178
 179     result = mask & CR{BFA}
 180     if M:
 181         result |= CR{BF} & ~mask
 182     result ^= mode
 183     CR{BF} = result
 184
Note that when M=1 this operation is a Read-Modify-Write on the CR Field
BF. Masked-out bits of the 4-bit CR Field BF will not be changed when
M=1, unless the corresponding bit of `mode` is set, in which case they
are inverted. Correspondingly when M=0 this operation is an overwrite:
no read of BF is required because the masked-out bits of the BF CR Field
are set to zero before `mode` is XORed in.
 190
 191 When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
 192 type operation that has 3-bit Data-dependent and 3-bit Predicate-result
 193 capability (BF is 3 bits)
 194
*Programmer's note: `mode` being XORed onto the result provides
considerable flexibility. Individual bits of BFA may be copied inverted
to BF by ensuring that `mask` and `mode` have the same bit set.  Also,
when M=0, individual bits in BF may be set to 1 by ensuring that the
required bit of `mask` is set to zero and the same bit in `mode` is set to 1.*
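The two tricks in the programmer's note can be checked with a short Python sketch. It is a direct transliteration of the `mcrfm` pseudocode under the assumption that CR Fields are 4-bit integers; the function name `mcrfm_model` is hypothetical.

```python
def mcrfm_model(cr_bf, cr_bfa, M, mask, mode):
    """Return the new 4-bit value of CR Field BF after a masked move."""
    result = mask & cr_bfa
    if M:
        result |= cr_bf & ~mask
    result ^= mode
    return result & 0xF


bfa = 0b1010

# Inverted copy: the same bits are set in both mask and mode
print(format(mcrfm_model(0b0000, bfa, 0, 0b1111, 0b1111), "04b"))  # 0101

# Copy three bits and force the remaining one to 1 (M=0):
# that bit is clear in mask and set in mode
print(format(mcrfm_model(0b0000, bfa, 0, 0b1110, 0b0001), "04b"))  # 1011
```

Going by the pseudocode as written, the XOR is applied after the optional merge with BF, so with M=1 a `mode` bit on a masked-out position inverts the existing BF bit instead of setting it.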
 200
 201 **crweirder**
 202
bit 19=1, bit 20=0
 204
    crweirder: BT, BFA, M, mask.mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    BF = BT[2:4] # select CR
    bit = BT[0:1] # select bit of CR
    result = n0|n1|n2|n3 if M else n0&n1&n2&n3
    CR{BF}[bit] = result

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
type operation that has 5-bit Data-dependent and 5-bit Predicate-result
capability (BT is 5 bits)
 220
 221 **Example Pseudo-ops:**
 222
 223     mtcri BF, mode    mtcrweird BF, r0, 0, 0b1111.~mode
 224     mtcrset BF, mask  mtcrweird BF, r0, 1, mask.0b0000
 225     mtcrclr BF, mask  mtcrweird BF, r0, 1, mask.0b1111
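The three pseudo-op expansions can be sanity-checked against a small Python model of `mtcrweird`. This is a sketch under stated assumptions: `mtcrweird_model` and the three wrapper functions are hypothetical names, CR Fields are 4-bit integers, RA=0 supplies a zero source bit, and `||` in the pseudocode is taken to be bit concatenation so that each `n` term stays in its `mask` bit position.

```python
def mtcrweird_model(cr_bf, ra, M, mask, mode):
    """Return the new 4-bit value of CR Field BF."""
    lsb = ra & 1
    result = 0
    for i in range(4):
        sel = 1 << i
        if (mask & sel) and bool(mode & sel) == bool(lsb):
            result |= sel
    if M:
        result |= cr_bf & ~mask & 0xF
    return result


def mtcri(cr_bf, imm):
    # overwrite the whole field with a 4-bit immediate
    return mtcrweird_model(cr_bf, 0, 0, 0b1111, ~imm & 0xF)


def mtcrset(cr_bf, mask):
    # set the masked bits, leave the others alone
    return mtcrweird_model(cr_bf, 0, 1, mask, 0b0000)


def mtcrclr(cr_bf, mask):
    # clear the masked bits, leave the others alone
    return mtcrweird_model(cr_bf, 0, 1, mask, 0b1111)


print(format(mtcri(0b1111, 0b0101), "04b"))    # 0101
print(format(mtcrset(0b1000, 0b0011), "04b"))  # 1011
print(format(mtcrclr(0b1111, 0b0110), "04b"))  # 1001
```

Because the source bit is zero, a result bit is produced wherever `mask` is set and `mode` is clear, which is why `mtcri` passes the inverted immediate as `mode`.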
 226
 227 # Vectorised versions
 228
 229 The name "weird" refers to a minor violation of SV rules when it comes
 230 to deriving the Vectorised versions of these instructions.
 231
Normally the progression of the SV for-loop would move on to the
next register.  Instead, however, in the scalar case these instructions
**remain in the same register** and insert or transfer between **bits**
of the scalar integer source or destination.

A further useful violation of the normal SV Elwidth override rules allows
for packing (or unpacking) of multiple CR test results into (or out of)
an Integer Element. Note that the CR (source operand) elwidth field is
utilised to determine the bit-packing size (1/2/4/8 with remaining
bits within the Integer element set to zero) whilst the INT (dest
operand) elwidth field still sets the Integer element size as usual
(8/16/32/default).
 244
    crrweird: RT, BB, M, mask.mode

    for i in range(VL):
        if BB.isvec:
            creg = CR{BB+i}
        else:
            creg = CR{BB}
        n0 = mask[0] & (mode[0] == creg[0])
        n1 = mask[1] & (mode[1] == creg[1])
        n2 = mask[2] & (mode[2] == creg[2])
        n3 = mask[3] & (mode[3] == creg[3])
        # OR or AND to a single bit
        result = n0|n1|n2|n3 if M else n0&n1&n2&n3
        if RT.isvec:
            # TODO: RT.elwidth override to be also added here
            # note, yes, really, the CR's elwidth field determines
            # the bit-packing into the INT!
            if BB.elwidth == 0b00:
                # pack 1 result into 64-bit registers
                iregs[RT+i][0..62] = 0
                iregs[RT+i][63] = result # sets LSB to result
            if BB.elwidth == 0b01:
                # pack 2 results sequentially into INT registers
                iregs[RT+i//2][0..61] = 0
                iregs[RT+i//2][63-(i%2)] = result
            if BB.elwidth == 0b10:
                # pack 4 results sequentially into INT registers
                iregs[RT+i//4][0..59] = 0
                iregs[RT+i//4][63-(i%4)] = result
            if BB.elwidth == 0b11:
                # pack 8 results sequentially into INT registers
                iregs[RT+i//8][0..55] = 0
                iregs[RT+i//8][63-(i%8)] = result
        else:
            iregs[RT][63-i] = result # results also in scalar INT
 280
 281 Note that:
 282
* in the scalar case the CR-Vector assessment
  is stored bit-wise starting at the LSB of the
  destination scalar INT
* in the INT-vector case the results are packed into LSBs
  of the INT Elements, the packing arrangement depending on both
  elwidth override settings.
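The two packing arrangements can be illustrated with a simplified Python model. Several things here are assumptions made for the sketch: the names `crrweird_bit` and `sv_crrweird` are hypothetical, the source is always treated as a CR Vector, the `pack` argument (1/2/4/8) stands in for the CR elwidth field, the INT elwidth override is ignored (it is a TODO in the pseudocode), and the scalar destination is assumed to start out as zero.

```python
def crrweird_bit(creg, mask, mode, M):
    """Reduce one 4-bit CR Field to a single bit, as in scalar crrweird."""
    n = [((mask >> k) & 1) & int(((mode >> k) & 1) == ((creg >> k) & 1))
         for k in range(4)]
    return int(any(n)) if M else int(all(n))


def sv_crrweird(cr_fields, mask, mode, M, rt_isvec, pack=1):
    """Return the list of destination integer values.

    Bit (63 - k) in MSB0 numbering is bit k counted from the LSB,
    so results are shifted in from the LSB upwards.
    """
    results = [crrweird_bit(c, mask, mode, M) for c in cr_fields]
    if not rt_isvec:
        value = 0
        for i, r in enumerate(results):
            value |= r << i                  # iregs[RT][63-i] = result
        return [value]
    regs = [0] * ((len(results) + pack - 1) // pack)
    for i, r in enumerate(results):
        regs[i // pack] |= r << (i % pack)   # iregs[RT+i//pack][63-(i%pack)]
    return regs


crs = [0b1000, 0b0000, 0b1000, 0b1000, 0b0000]    # VL = 5
print(sv_crrweird(crs, 0b1000, 0b1000, 1, rt_isvec=False))         # [13]
print(sv_crrweird(crs, 0b1000, 0b1000, 1, rt_isvec=True, pack=2))  # [1, 3, 0]
print(sv_crrweird(crs, 0b1000, 0b1000, 1, rt_isvec=True, pack=4))  # [13, 0]
```

In the scalar case all five test results end up in one integer (0b01101), whereas in the vector case each integer element receives `pack` results in its low bits.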
 289
 290 # v3.1 setbc instructions
 291
 292 There are additional setb conditional instructions in v3.1 (p129)
 293
 294     RT = (CR[BI] == 1) ? 1 : 0
 295
which also negate that, and also return -1 / 0.  These are similar to
crweird but do not serve the same purpose.  Most notable is that crweird
acts on CR Fields rather than the entire 32 bit CR.
 299
 300 # Predication Examples
 301
 302 Take the following example:
 303
 304     r10 = 0b00010
 305     sv.mtcrweird/dm=r10/dz cr8.v, 0, 0b0011.0000
 306
 307 Here, RA is zero, so the source input is zero. The destination is CR Field
 308 8, and the destination predicate mask indicates to target the first two
 309 elements.  Destination predicate zeroing is enabled, and the destination
 310 predicate is only set in the 2nd bit.  mask is 0b0011, mode is all zeros.
 311
Let us first consider what should go into element 0 (CR Field 8):

* The destination predicate bit is zero, and zeroing is enabled.
* Therefore, what is in the source is irrelevant: the result must
  be zero.
* All four bits of CR Field 8 are therefore set to zero.

Now the second element, CR Field 9 (CR9):

* The second bit of the destination predicate, r10, is 1. Therefore the
  computation of the result is relevant.
* RA is zero, therefore the source bit is zero.  mask is 0b0011 and mode
  is 0b0000
* When calculating n0 thru n3 we get n0=1, n1=1, n2=0, n3=0
* Therefore, CR9 is set (using LSB0 ordering) to 0b0011, i.e. to mask.
 326
 327 It should be clear that this instruction uses bits of the integer
 328 predicate to decide whether to set CR Fields to `(mask & ~mode)` or
 329 to zero.  Thus, in effect, it is the integer predicate that has been
 330 copied into the CR Fields.
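The walkthrough above can be reproduced with a small Python sketch. It is a deliberately reduced model and makes several assumptions: the function names are hypothetical, VL is taken to be 2 (the "first two elements"), only the destination predicate with zeroing is modelled, M is taken as 0, and bits are handled in LSB0 order as in the example.

```python
def mtcrweird_field(src_bit, mask, mode):
    """One element of mtcrweird with M=0, using LSB0 bit ordering."""
    result = 0
    for i in range(4):
        if (mask >> i) & 1 and ((mode >> i) & 1) == src_bit:
            result |= 1 << i
    return result


def sv_mtcrweird_dz(dest_pred, vl, mask, mode, ra=0):
    """Return the CR Field values written for elements 0..vl-1."""
    fields = []
    for i in range(vl):
        if (dest_pred >> i) & 1:
            fields.append(mtcrweird_field(ra & 1, mask, mode))
        else:
            fields.append(0)    # destination zeroing (dz)
    return fields


r10 = 0b00010
crs = sv_mtcrweird_dz(r10, vl=2, mask=0b0011, mode=0b0000)
print([format(f, "04b") for f in crs])   # ['0000', '0011'] for CR8, CR9
```

Each predicate bit selects between `(mask & ~mode)` and zero for its CR Field, which is the sense in which the integer predicate is copied into the CR Fields.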
 331
 332 By using twin predication, zeroing, and inversion (sm=~r3, dm=r10) for
 333 example, it becomes possible to combine two Integers together in order
 334 to set bits in CR Fields.  Likewise there are dozens of ways that CR
 335 Predicates can be used, on the same sv.mtcrweird instruction.