openpower/sv/cr_int_predication.mdwn

# New instructions for CR/INT predication

<!-- hide -->
See:

* main bugreport for crweirds
  <https://bugs.libre-soc.org/show_bug.cgi?id=533>
* <https://bugs.libre-soc.org/show_bug.cgi?id=527>
* <https://bugs.libre-soc.org/show_bug.cgi?id=569>
* <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>
* [[discussion]]
<!-- show-->

## crrweird

CW2-Form

```
    |0     |6   |9 |11|12   |16   |19  |22   |26   |31|
    | PO   | RT    |M |fmsk |BFA  |XO  |fmap | XO  |Rc|
```

* crrweird RT,BFA,M,fmsk,fmap (Rc=0)
* crrweird. RT,BFA,M,fmsk,fmap (Rc=1)

```
    creg <- CR[4*BFA+32:4*BFA+35]
    n <- (¬fmap ^ creg) & fmsk
    result <- (n != 0) if M else (n == fmsk)
    RT <- [0] * 63 || result
    if Rc then
        CR0 <- analyse(RT)
```

When used with SVP64 Prefixing this is a [[sv/normal]]
SVP64 type operation and as such can use Rc=1 and RC1 Data-dependent
Mode capability.

Also as noted below, the element-width override bits normally used
on the source are instead used to allow multiple results to be packed
sequentially into the destination. *Destination elwidth overrides still apply*.

Special registers altered:

```
    CR0        (Rc=1)
```

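The scalar `crrweird` pseudocode can be modelled in a few lines of Python. This is only an illustrative sketch, not part of the specification: the function name `crrweird` and the choice to hold the CR Field, `fmsk` and `fmap` as 4-bit Python integers (first CR bit most significant) are assumptions made here, and the Rc=1 update of CR0 is left out.

```python
def crrweird(creg, fmsk, fmap, M):
    """Hypothetical model of scalar crrweird: returns the value placed in RT.

    creg, fmsk and fmap are 4-bit integers; M is 0 or 1.
    """
    # (¬fmap ^ creg) is a bitwise XNOR: a bit of n is set where creg agrees
    # with fmap, and only the bits selected by fmsk are kept
    n = (~fmap ^ creg) & fmsk & 0xF
    # M=1: any selected bit matching suffices; M=0: all of them must match
    result = (n != 0) if M else (n == fmsk)
    # RT is 63 zero bits followed by the single result bit
    return int(result)


# test "bit 0b0010 of the CR Field is set" on two different field values
print(crrweird(0b0010, fmsk=0b0010, fmap=0b0010, M=0))  # 1
print(crrweird(0b0000, fmsk=0b0010, fmap=0b0010, M=0))  # 0
```

Read this way, `fmap` gives the expected value of each CR Field bit and `fmsk` selects which bits take part in the test.
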
## mfcrrweird

CW2-Form

```
    |0     |6   |9 |11|12   |16   |19  |22   |26   |31|
    | PO   | RT    |M |fmsk |BFA  |XO  |fmap | XO  |Rc|
```

* mfcrrweird RT,BFA,fmsk,fmap (Rc=0)
* mfcrrweird. RT,BFA,fmsk,fmap (Rc=1)

```
    creg = CR[4*BFA+32:4*BFA+35]
    result = (¬fmap ^ creg) & fmsk
    RT = [0] * 60 || result
    if Rc:
        CR0 = analyse(RT)
```

When used with SVP64 Prefixing this is a [[sv/normal]]
SVP64 type operation and as such can use Rc=1 and RC1 Data-dependent
Mode capability.

Also as noted below, the element-width override bits normally used
on the source are instead used to allow multiple results to be packed
into the destination. *Destination elwidth overrides still apply*.

## mtcrrweird

CW-Form

```
    |0     |6   |9 |11|12   |16   |19  |22   |26   |31|
    | PO   | RA    |M |fmsk |BF   |XO  |fmap | XO     |
    | PO   | BT    |M |fmsk |BF   |XO  |fmap | XO     |
    | PO   | BF |  |M |fmsk |BF   |XO  |fmap | XO     |
```

* mtcrrweird BF,RA,M,fmsk,fmap

```
    a = (RA|0)
    creg = a[60:63]
    result = (¬fmap ^ creg) & fmsk
    if M:
        result |= CR[4*BF+32:4*BF+35]  & ~fmsk
    CR[4*BF+32:4*BF+35]  = result
```

When used with SVP64 Prefixing this is a [[sv/normal]]
SVP64 type operation and as such can use RC1 Data-dependent
Mode capability.

Hardware Architectural Note: when M=1 this instruction is a Read-Modify-Write
on the `BF` CR Field. When M=0 it is a more normal Write.

Special Registers Altered:

```
    CR Field BF
```

## mtcrweird

CW-Form

```
    |0     |6   |9 |11|12   |16   |19  |22   |26   |31|
    | PO   | RA    |M |fmsk |BF   |XO  |fmap | XO     |
    | PO   | BT    |M |fmsk |BF   |XO  |fmap | XO     |
    | PO   | BF |  |M |fmsk |BF   |XO  |fmap | XO     |
```

* mtcrweird BF,RA,M,fmsk,fmap

```
    reg = (RA|0)
    creg = reg[63] || reg[63] || reg[63] || reg[63]
    result = (¬fmap ^ creg) & fmsk
    if M:
        result |= CR[4*BF+32:4*BF+35] & ~fmsk
    CR[4*BF+32:4*BF+35]  = result
```

Note that when M=1 this operation is a Read-Modify-Write on the CR Field
BF. Masked-out bits of the 4-bit CR Field BF will not be changed when
M=1. Correspondingly when M=0 this operation is an overwrite: no read
of BF is required because the masked-out bits of the BF CR Field are
set to zero.

When used with SVP64 Prefixing this is a [[sv/cr_ops]] SVP64
type operation that has 3-bit Data-dependent and 3-bit Predicate-result
capability (BF is 3 bits).

Special Registers Altered:

```
    CR Field BF
```

## mcrfm - Move CR Field, masked.

CW-Form

```
    |0     |6   |9 |11|12   |16   |19  |22   |26   |31|
    | PO   | BF |  |M |fmsk |BF   |XO  |fmap | XO     |
```

* mcrfm BF,BFA,M,fmsk,fmap

```
    result = fmsk & CR[4*BFA+32:4*BFA+35]
    if M:
        result |= CR[4*BF+32:4*BF+35]  & ~fmsk
    result ^= fmap
    CR[4*BF+32:4*BF+35]  = result
```

This instruction copies, sets, or inverts parts of a CR Field
into another CR Field.  `crmove` (an extended mnemonic of `cror`)
copies only one bit of the CR from any arbitrary bit to any other
arbitrary bit, whereas `mcrfm` copies an entire 4-bit CR Field (or
masked parts thereof).  Unlike `crmove`, the bits of the CR Field may
not change position: the EQ bit from the source may only go into the
EQ bit of the destination (optionally inverted, set, or cleared).

When M=1 this operation is a Read-Modify-Write on the CR Field
BF: masked-out bits of the 4-bit CR Field BF are carried over from
the previous contents of BF. Correspondingly when M=0 this operation
is an overwrite: no read of BF is required because the masked-out bits
of the BF CR Field start from zero. In both cases `fmap` is then XORed
onto the whole result, as the pseudocode shows, so a masked-out bit
whose `fmap` bit is 1 ends up inverted (M=1) or set to 1 (M=0).

When used with SVP64 Prefixing this is a [[sv/cr_ops]] SVP64
type operation that has 3-bit Data-dependent and 3-bit Predicate-result
capability (BF is 3 bits).

*Programmer's note: `fmap` being XORed onto the result provides
considerable flexibility. Individual bits of BFA may be copied inverted
to BF by ensuring that `fmsk` and `fmap` have the same bit set.  Also,
with M=0, individual bits in BF may be set to 1 by ensuring that the
required bit of `fmsk` is set to zero and the same bit in `fmap` is set to 1.*

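The two tricks in the programmer's note can be checked against a small Python model of the `mcrfm` pseudocode. This is a hedged sketch only: the function name and the representation of CR Fields, `fmsk` and `fmap` as 4-bit integers are assumptions made for illustration. Note that in this model a bit that is clear in `fmsk` and set in `fmap` becomes 1 when M=0, but becomes the inverse of the old BF bit when M=1.

```python
def mcrfm(cr_bf, cr_bfa, M, fmsk, fmap):
    """Hypothetical model of mcrfm: returns the new value of CR Field BF."""
    result = fmsk & cr_bfa              # selected bits come from BFA
    if M:
        result |= cr_bf & ~fmsk & 0xF   # M=1: other bits come from old BF
    result ^= fmap                      # finally XOR fmap onto every bit
    return result & 0xF


# inverted copy of bit 0b0010: the same bit is set in fmsk and fmap
print(bin(mcrfm(0b0000, 0b0010, M=0, fmsk=0b0010, fmap=0b0010)))  # 0b0
# set bit 0b1000 to 1: bit clear in fmsk, set in fmap, with M=0
print(bin(mcrfm(0b0000, 0b0000, M=0, fmsk=0b0000, fmap=0b1000)))  # 0b1000
```
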
Special Registers Altered:

```
    CR Field BF
```

## crweirder

```
    |0     |6   |9 |11|12   |16   |19  |22   |26   |31|
    | PO   | BT    |M |fmsk |BF   |XO  |fmap | XO     |
```

* crweirder BT,BFA,fmsk,fmap

```
    creg = CR[4*BFA+32:4*BFA+35]
    n = (¬fmap ^ creg) & fmsk
    result = (n != 0) if M else (n == fmsk)
    CR[32+BT] = result
```

Special Registers Altered:

```
    CR[BT+32]
```

When used with SVP64 Prefixing this is a [[sv/cr_ops]] SVP64
type operation that has 5-bit Data-dependent
capability (BT is 5 bits).

Hardware Architectural Note: this instruction is always a Read-Modify-Write
on the CR Field containing `BT`.

**Example Pseudo-ops:**

```
    mtcri BF, fmap    mtcrweird BF, r0, 0, 0b1111,~fmap
    mtcrset BF, fmsk  mtcrweird BF, r0, 1, fmsk,0b0000
    mtcrclr BF, fmsk  mtcrweird BF, r0, 1, fmsk,0b1111
```

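The three pseudo-op expansions can be checked exhaustively against a Python model of the `mtcrweird` pseudocode. This is a sketch under stated assumptions: the function names are invented for illustration, CR Fields and the `fmsk`/`fmap` immediates are held as 4-bit integers, and the "expected" behaviour of each pseudo-op (load immediate, set bits, clear bits) is inferred from its name rather than stated anywhere in the text.

```python
def mtcrweird(cr_bf, ra, M, fmsk, fmap):
    """Hypothetical model of mtcrweird: returns the new value of CR Field BF."""
    # reg[63] in MSB0 numbering is the least significant bit of RA,
    # replicated across all four bits of creg
    creg = 0b1111 if ra & 1 else 0b0000
    result = (~fmap ^ creg) & fmsk & 0xF
    if M:
        result |= cr_bf & ~fmsk & 0xF   # keep the masked-out bits of BF
    return result


# the pseudo-ops, with RA=0 so that (RA|0) supplies the value zero
def mtcri(cr_bf, fmap):
    return mtcrweird(cr_bf, 0, 0, 0b1111, ~fmap & 0xF)

def mtcrset(cr_bf, fmsk):
    return mtcrweird(cr_bf, 0, 1, fmsk, 0b0000)

def mtcrclr(cr_bf, fmsk):
    return mtcrweird(cr_bf, 0, 1, fmsk, 0b1111)


# compare each expansion with the behaviour its name suggests (assumed)
for old in range(16):
    for v in range(16):
        assert mtcri(old, v) == v                  # load a 4-bit immediate
        assert mtcrset(old, v) == old | v          # set the selected bits
        assert mtcrclr(old, v) == old & ~v & 0xF   # clear the selected bits
```
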
----------

\newpage{}

# Vectorized versions involving GPRs

The name "weird" refers to a minor violation of SV rules when it comes
to deriving the Vectorized versions of these instructions.

Normally the progression of the SV for-loop would move on to the
next register.  Instead, in the scalar case these instructions
**remain in the same register** and insert or transfer between **bits**
of the scalar integer source or destination.  The reason is that when
CR Fields are used as predicate masks and need to be transferred
into a GPR, again for use as a predicate mask, the CR Field bits
need to be efficiently packed into that one GPR (r3, r10 or r31).

A further useful violation of the normal SV Elwidth override rules allows
for packing (or unpacking) of multiple CR test results into (or out of)
an Integer Element. Note that the CR (source operand) elwidth field is
utilised to determine the bit-packing size (1/2/4/8 with remaining
bits within the Integer element set to zero) whilst the INT (dest
operand) elwidth field still sets the Integer element size as usual
(8/16/32/default).

**sv.crrweird: RT, BB, fmsk, fmap**

```
    for i in range(VL):
        if BB.isvec: # Vector CR Field source?
            creg = CR{BB+i}
        else:
            creg = CR{BB}
        n = (¬fmap ^ creg) & fmsk
        result = (n != 0) if M else (n == fmsk)
        if RT.isvec:
            # TODO: RT.elwidth override to be also added here
            # note, yes, really, the CR's elwidth field determines
            # the bit-packing into the INT!
            if BB.elwidth == 0b00:
                # pack 1 result into 64-bit registers
                iregs[RT+i][0..62] = 0
                iregs[RT+i][63] = result # sets LSB to result
            if BB.elwidth == 0b01:
                # pack 2 results sequentially into INT registers
                iregs[RT+i//2][0..61] = 0
                iregs[RT+i//2][63-(i%2)] = result
            if BB.elwidth == 0b10:
                # pack 4 results sequentially into INT registers
                iregs[RT+i//4][0..59] = 0
                iregs[RT+i//4][63-(i%4)] = result
            if BB.elwidth == 0b11:
                # pack 8 results sequentially into INT registers
                iregs[RT+i//8][0..55] = 0
                iregs[RT+i//8][63-(i%8)] = result
        else:
            # scalar RT destination: exceeding VL=64 is UNDEFINED
            iregs[RT][63-i] = result # results also in scalar INT
            # only mapreduce mode (/mr) allows continuation here
            if not SVRM.mapreduce: break
```

Note that:

* in the scalar case the CR-Vector assessment
  is stored bit-wise starting at the LSB of the
  destination scalar INT
* in the INT-vector case the results are packed into LSBs
  of the INT Elements, the packing arrangement depending on both
  elwidth override settings.

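The scalar-destination branch of the pseudocode can be sketched in Python to show how a Vector of CR Field tests collapses into one integer usable as a predicate mask. The function name, the list-of-4-bit-integers representation of the CR Fields, and the `rt` and `mapreduce` parameters are all assumptions made for this illustration.

```python
def sv_crrweird_scalar_rt(cr_fields, fmsk, fmap, M, rt=0, mapreduce=True):
    """Hypothetical model of sv.crrweird with a Vector CR source (one list
    entry per element, so VL = len(cr_fields)) and a scalar RT."""
    for i, creg in enumerate(cr_fields):
        n = (~fmap ^ creg) & fmsk & 0xF
        result = int((n != 0) if M else (n == fmsk))
        # iregs[RT][63-i] in MSB0 numbering is bit i counting from the LSB
        rt = (rt & ~(1 << i)) | (result << i)
        if not mapreduce:
            break   # without /mr only element 0 is processed
    return rt


fields = [0b0010, 0b0100, 0b0010, 0b0001]
print(bin(sv_crrweird_scalar_rt(fields, fmsk=0b0010, fmap=0b0010, M=0)))
# 0b101
```

Elements 0 and 2 pass the test, so bits 0 and 2 of the scalar result are set.
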
**mfcrrweird: RT, BFA, fmsk, fmap**

Unlike `crrweird` the results are 4 bits wide, so the packing
will begin to spill over to other destination elements.  8 results per
destination at 4 bits each still fit into a destination elwidth of 32-bit,
but for 16-bit and 8-bit this does not fit, and the results must split
across to the next element.

When for example destination elwidth is 16-bit (0b10) the following packing
occurs:

- SVRM bits 6:7 equal to 0b00 - one 4-bit result element packed into the
  first 4-bits of the 16-bit destination element (in the first 4 LSBs)
- SVRM bits 6:7 equal to 0b01 - two 4-bit result elements packed into the
  first 8-bits of the 16-bit destination element (in the first 8 LSBs)
- SVRM bits 6:7 equal to 0b10 - four 4-bit result elements packed into each
  16-bit destination element
- SVRM bits 6:7 equal to 0b11 - eight 4-bit result elements, the first four
  of which are packed into the first 16-bit destination element, the
  second four of which are packed into the second 16-bit destination element.

Pseudocode example: note that dest elwidth overrides affect the
packing of results. BB.elwidth in effect requests how many 4-bit
result elements are to be packed, but RT.elwidth determines
the limit. Any parts of the destination elements not containing
results are set to zero.

```
    for i in range(VL):
        if BB.isvec:
            creg = CR{BB+i}
        else:
            creg = CR{BB}
        result = (¬fmap ^ creg) & fmsk # 4-bit result
        if RT.isvec:
            # RT.elwidth override can affect the packing
            bwid = {0b00:64, 0b01:8, 0b10:16, 0b11:32}[RT.elwidth]
            # bwid//4 is the number of 4-bit results that fit in one element
            t4, t8 = min(4, bwid//4), min(8, bwid//4)
            # yes, really, the CR's elwidth field determines
            # the bit-packing into the INT!
            if BB.elwidth == 0b00:
                # pack 1 result into 64-bit registers
                idx, boff = i, 0
            if BB.elwidth == 0b01:
                # pack 2 results sequentially into INT registers
                idx, boff = i//2, i%2
            if BB.elwidth == 0b10:
                # pack 4 results sequentially into INT registers
                idx, boff = i//t4, i%t4
            if BB.elwidth == 0b11:
                # pack 8 results sequentially into INT registers
                idx, boff = i//t8, i%t8
        else:
            # scalar RT destination: exceeding VL=16 is UNDEFINED
            idx, boff = 0, i
        # store 4-bit result in Vector starting from RT
        iregs[RT+idx][60-boff*4:63-boff*4] = result
        if not RT.isvec:
            # only mapreduce mode (/mr) allows continuation here
            if not SVRM.mapreduce: break
```

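The index arithmetic for a Vector destination can be isolated into a small Python helper. This is a sketch under assumptions: the helper and table names are invented here, and the per-element limit is taken to be the number of 4-bit results that fit in one destination element (element width divided by 4), which is the reading that reproduces the 16-bit packing list given earlier (eight requested results splitting four and four).

```python
ELEMENT_BITS = {0b00: 64, 0b01: 8, 0b10: 16, 0b11: 32}   # from RT.elwidth
REQUESTED = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}         # from BB.elwidth


def pack_position(i, bb_elwidth, rt_elwidth):
    """Return (idx, boff) for result i: the destination element index and
    the 4-bit slot within that element. Hypothetical helper."""
    fits = ELEMENT_BITS[rt_elwidth] // 4      # 4-bit slots per element
    per_element = min(REQUESTED[bb_elwidth], fits)
    return i // per_element, i % per_element


# 16-bit destination elements, eight results requested per element
print([pack_position(i, 0b11, 0b10) for i in range(8)])
# [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (1, 3)]
```
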
# Predication Examples

Take the following example:

```
    r10 = 0b00010
    sv.mtcrweird/dm=r10/dz cr8.v, 0, 0b0011.0000
```

Here, RA is zero, so the source input is zero. The destination is CR Field
8, and the destination predicate mask targets the first two
elements.  Destination predicate zeroing is enabled, and the destination
predicate is only set in the 2nd bit.  fmsk is 0b0011, fmap is all zeros.

Let us first consider what should go into element 0 (CR Field 8):

* The destination predicate bit is zero, and zeroing is enabled.
* Therefore, what is in the source is irrelevant: the result must
  be zero.
* All four bits of CR Field 8 are therefore set to zero.

Now the second element, CR Field 9 (CR9):

* The 2nd bit of the destination predicate, r10 (bit 1 in LSB0
  numbering), is 1. Therefore the computation of the result is relevant.
* RA is zero, therefore the source bit for this element is zero.
  fmsk is 0b0011 and fmap is 0b0000.
* When calculating n0 thru n3 (LSB0 ordering) we get n0=1, n1=1, n2=0, n3=0.
* Therefore, CR9 is set (using LSB0 ordering) to 0b0011, i.e. to fmsk.

It should be clear that this instruction uses bits of the integer
predicate to decide whether to set CR Fields to `(fmsk & ~fmap)` or
to zero.  Thus, in effect, it is the integer predicate that has been
copied into the CR Fields.

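The worked example can be replayed with a short Python sketch. Several things here are assumptions made for illustration and are not stated in the text: the function name, VL=2 (taken from "the first two elements"), M=0 (the example does not show an M operand), a 16-entry list standing in for the CR Fields, and the arbitrary starting value 0b1111 in every field.

```python
def sv_mtcrweird_zero_src(cr, bf, fmsk, fmap, dm, dz, VL, M=0):
    """Hypothetical model of the example: sv.mtcrweird with RA=0, a Vector
    of CR Fields starting at bf, and an integer destination predicate dm."""
    cr = list(cr)
    for i in range(VL):
        if (dm >> i) & 1:                     # element enabled
            creg = 0b0000                     # source is zero in this example
            result = (~fmap ^ creg) & fmsk & 0xF
            if M:
                result |= cr[bf + i] & ~fmsk & 0xF
            cr[bf + i] = result
        elif dz:                              # masked out, zeroing enabled
            cr[bf + i] = 0
    return cr


cr = [0b1111] * 16                            # arbitrary starting contents
cr = sv_mtcrweird_zero_src(cr, bf=8, fmsk=0b0011, fmap=0b0000,
                           dm=0b00010, dz=True, VL=2)
print(bin(cr[8]), bin(cr[9]))                 # 0b0 0b11
```
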
By using twin predication, zeroing, and inversion (sm=~r3, dm=r10) for
example, it becomes possible to combine two Integers together in order
to set bits in CR Fields.  Likewise there are dozens of ways that CR
Predicates can be used, on the same sv.mtcrweird instruction.


[[!tag standards]]
