openpower/sv/cr_int_predication.mdwn

[[!tag standards]]

# New instructions for CR/INT predication

**DRAFT STATUS**

See:

* main bugreport for crweirds
  <https://bugs.libre-soc.org/show_bug.cgi?id=533>
* <https://bugs.libre-soc.org/show_bug.cgi?id=527>
* <https://bugs.libre-soc.org/show_bug.cgi?id=569>
* <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>

Rationale:

Condition Registers are conceptually perfect for use as predicate masks. The only problem is that typical Vector ISAs have quite comprehensive mask-based instructions: set-before-first, popcount and much more.  In fact many Vector ISAs can use Vectors *as* masks, so the entire Vector ISA is available for creating masks.  This is not practical for SV, given its premise of minimising the number of added instructions.

With the scalar OpenPOWER v3.0B ISA already having popcnt, cntlz and others normally seen in Vector Mask operations, it makes sense to allow *both* scalar integers *and* CR-Vectors to be predicate masks.  That in turn means that much more comprehensive interaction between CRs and scalar Integers is required.

The opportunity is therefore taken to augment CR logical arithmetic as well, using a mask-based paradigm that takes into consideration multiple bits of each CR Field (lt/gt/eq/so).  By contrast
v3.0B Scalar CR instructions (crand, crxor) only allow a single bit calculation.

Basic concept:

* CR-based instructions that perform simple AND/OR/XOR from any of the four bits
  of a CR field to create a single bit value (0/1) in an integer register
* Inverse of the same, taking a single bit value (0/1) from an integer
  register to selectively target any of the four bits of a given CR Field
* CR-to-CR version of the same, allowing multiple bits to be AND/OR/XORed
  in one hit.
* Optional Vectorisation of the same when SVP64 is implemented

Purpose:

* To provide a merged version of what is currently a multi-instruction sequence of
  CR operations (crand, cror, crxor) with mfcr and mtcrf, reducing
  instruction count.
* To provide a vectorised version of the same, suitable for advanced
  predication

Side-effects:

* mtcrweird when RA=0 is a means to set or clear arbitrary CR bits
  using immediates embedded within the instruction.

(Twin) Predication interactions:

* INT twin predication with zeroing is a way to copy an integer into CRs without necessarily needing the INT register (RA).  If RA is used, it is effectively ANDed (or negate-and-ANDed) with the INT Predicate.
* CR twin predication with zeroing is likewise a way to interact with the incoming integer.

This gets particularly powerful if data-dependent predication is also enabled.  Further explanation is below.

# Bit ordering.

IBM chose MSB0 for the OpenPOWER v3.0B specification.  This makes things slightly hair-raising, and the relationship between the CR and the CR Field
numbers is not clearly defined.  To make it clear we define a new
term, `CR{n}`.
`CR{n}` refers to `CR0` when `n=0` and consequently, for CR0-7, is defined, in v3.0B pseudocode, as:

    CR{n} = CR[32+n*4:35+n*4]

Also note that for SVP64 the relationship for the sequential
numbering of elements is to the CR **fields** within
the CR Register, not to individual bits within the CR register.

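As a rough illustration of the `CR{n}` term, the following Python sketch extracts and replaces a CR Field in a 32-bit CR value. It assumes that `CR{0}` is `CR0` and occupies MSB0 bits 32 to 35 of the 64-bit CR, so that it is the most significant nibble of the 32-bit value. The helper names `cr_field` and `set_cr_field` are hypothetical and not part of any specification.

```python
def cr_field(cr: int, n: int) -> int:
    """Return the 4-bit CR Field CR{n} from a 32-bit CR value.

    Assumption: `cr` holds MSB0 bits 32..63 of the CR, so CR{0} (CR0)
    is the most significant nibble of `cr`.
    """
    if not 0 <= n <= 7:
        raise ValueError("scalar CR Field number must be 0..7")
    return (cr >> (4 * (7 - n))) & 0xF


def set_cr_field(cr: int, n: int, value: int) -> int:
    """Return a copy of the 32-bit CR with CR{n} replaced by `value`."""
    if not 0 <= n <= 7:
        raise ValueError("scalar CR Field number must be 0..7")
    shift = 4 * (7 - n)
    return ((cr & ~(0xF << shift)) | ((value & 0xF) << shift)) & 0xFFFFFFFF
```

With `cr = 0x12345678`, `cr_field(cr, 0)` gives `0x1` and `cr_field(cr, 7)` gives `0x8`.
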
# Instruction form and pseudocode

**DRAFT** Instruction format (use of MAJOR 19 not approved by
OPF ISA WG):

|0-5|6-10 |11|12-15|16-18|19-20|21-25  |26-30  |31|name      |
|---|---- |--|-----|-----|-----|-----  |-----  |--|----      |
|19 |RT   |  |mask |BFA  |     |XO[0:4]|XO[5:9]|/ |          |
|19 |RT   |M |mask |BFA  | 0 0 |XO[0:4]|0 mode |Rc|crrweird  |
|19 |RA   |M |mask |BF   | 0 1 |XO[0:4]|0 mode |/ |mtcrweird |
|19 |BFT//|M |mask |BFA  | 1 0 |XO[0:4]|0 mode |/ |crweirder |
|19 |BF   |M |mask |BFA  | 1 1 |XO[0:4]|0 mode |/ |crweird   |

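The field positions in the table can be read off a 32-bit instruction word as in the sketch below. This is only an illustration of the MSB0 bit ranges listed in the table: the encoding is a draft, the function names `bits` and `decode_weird` are hypothetical, and placing the fixed `0` of the `0 mode` column at bit 26 (leaving mode in bits 27 to 30) is an assumption.

```python
def bits(insn: int, start: int, end: int) -> int:
    """Extract MSB0-numbered bits start..end (inclusive) of a 32-bit word."""
    width = end - start + 1
    return (insn >> (31 - end)) & ((1 << width) - 1)


def decode_weird(insn: int) -> dict:
    """Split a 32-bit word into the draft fields from the table."""
    return {
        "major": bits(insn, 0, 5),
        "reg": bits(insn, 6, 10),        # RT, RA, BFT or BF, per variant
        "M": bits(insn, 11, 11),
        "mask": bits(insn, 12, 15),
        "bf_or_bfa": bits(insn, 16, 18),
        "variant": bits(insn, 19, 20),   # selects one of the four rows
        "xo_hi": bits(insn, 21, 25),     # XO[0:4]
        "mode": bits(insn, 27, 30),      # assumption: bit 26 is the fixed 0
        "Rc": bits(insn, 31, 31),
    }
```
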
**crrweird**

mode is encoded in XO and is 4 bits

bit 19=0, bit 20=0

    crrweird: RT, BFA, M, mask.mode

    creg = CR{BFA}
    # each n is 1 if the bit is unmasked and the CR bit matches mode
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    # OR (M=1) or AND (M=0) down to a single bit
    result = n0|n1|n2|n3 if M else n0&n1&n2&n3
    RT[63] = result # MSB0 numbering, 63 is LSB
    if Rc:
        CR0 = analyse(RT)

When used with SVP64 Prefixing this is a [[openpower/sv/normal]] SVP64 type operation and as
such can use Rc=1 and RC1 Data-dependent Mode capability.

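A minimal Python sketch of the scalar crrweird pseudocode follows. It treats `creg`, `mask` and `mode` as 4-bit integers and relies on the per-bit tests being bitwise-parallel, so the result does not depend on whether bits are numbered MSB0 or LSB0, provided all three use the same convention. The function returns only the bit written to the LSB of RT; the Rc=1 update of CR0 is not modelled.

```python
def crrweird(creg: int, mask: int, mode: int, M: int) -> int:
    """Return the single result bit that crrweird places in the LSB of RT."""
    # n has a 1 wherever the mask bit is set and the CR bit equals the mode bit
    n = mask & ~(mode ^ creg) & 0xF
    if M:
        return int(n != 0)    # n0|n1|n2|n3
    return int(n == 0xF)      # n0&n1&n2&n3
```

For example, with the usual CR Field layout (LT, GT, EQ, SO from the most significant bit), `crrweird(0b0010, 0b0010, 0b0010, 1)` tests "EQ is set" and returns 1. As the pseudocode is written, the AND form (M=0) can only return 1 when mask is `0b1111`, because a masked-out bit always contributes a 0.
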
**mtcrweird**

bit 19=0, bit 20=1

    mtcrweird: BF, RA, M, mask.mode

    reg = (RA|0)
    lsb = reg[63] # MSB0 numbering
    n0 = mask[0] & (mode[0] == lsb)
    n1 = mask[1] & (mode[1] == lsb)
    n2 = mask[2] & (mode[2] == lsb)
    n3 = mask[3] & (mode[3] == lsb)
    result = n0 || n1 || n2 || n3 # concatenate into a 4-bit value
    if M:
        result |= CR{BF} & ~mask # keep the masked-out bits of BF
    CR{BF} = result

Note that when M=1 this operation is a Read-Modify-Write on the CR Field
BF. Masked-out bits of the 4-bit CR Field BF will not be changed when
M=1. Correspondingly when M=0 this operation is an overwrite: no read
of BF is required because the masked-out bits of the BF CR Field are
set to zero.

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64 type operation that has
3-bit Data-dependent and 3-bit Predicate-result capability
(BF is 3 bits).

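The Read-Modify-Write versus overwrite behaviour can be seen in the following sketch of the scalar mtcrweird pseudocode. The 4-bit values are plain integers and the function returns the new contents of CR Field BF. The parameter names `cr_bf` and `ra_value` are hypothetical; `ra_value` stands for `(RA|0)`.

```python
def mtcrweird(cr_bf: int, ra_value: int, mask: int, mode: int, M: int) -> int:
    """Return the new 4-bit value of CR Field BF."""
    lsb = ra_value & 1
    lsb4 = 0xF if lsb else 0x0             # the LSB compared against all 4 mode bits
    result = mask & ~(mode ^ lsb4) & 0xF   # n0 || n1 || n2 || n3
    if M:
        result |= cr_bf & ~mask & 0xF      # M=1: masked-out bits are preserved
    return result
```

With an old field value of `0b1010`, a source of zero, `mask=0b0011` and `mode=0b0000`, the M=0 form yields `0b0011` (masked-out bits zeroed), whereas the M=1 form yields `0b1011` (masked-out bits kept).
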
**crweird**

bit 19=1, bit 20=0

    crweird: BF, BFA, M, mask.mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    result = n0 || n1 || n2 || n3 # concatenate into a 4-bit value
    if M:
        result |= CR{BF} & ~mask # keep the masked-out bits of BF
    CR{BF} = result

Note that when M=1 this operation is a Read-Modify-Write on the CR Field
BF. Masked-out bits of the 4-bit CR Field BF will not be changed when
M=1. Correspondingly when M=0 this operation is an overwrite: no read
of BF is required because the masked-out bits of the BF CR Field are
set to zero.

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64 type operation that has
3-bit Data-dependent and 3-bit Predicate-result capability
(BF is 3 bits).

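The CR-to-CR form differs from mtcrweird only in that each mode bit is compared against the corresponding bit of CR Field BFA, not against a single integer bit. A small sketch, with hypothetical parameter names `cr_bf` and `cr_bfa` for the current contents of the two fields:

```python
def crweird(cr_bf: int, cr_bfa: int, mask: int, mode: int, M: int) -> int:
    """Return the new 4-bit value of CR Field BF."""
    result = mask & ~(mode ^ cr_bfa) & 0xF   # n0 || n1 || n2 || n3
    if M:
        result |= cr_bf & ~mask & 0xF        # M=1: masked-out bits are preserved
    return result
```

Following the pseudocode, `mask=0b1111, mode=0b1111, M=0` copies BFA into BF, and `mask=0b1111, mode=0b0000, M=0` copies its bitwise inverse.
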
**crweirder**

bit 19=1, bit 20=1

    crweirder: BT, BFA, M, mask.mode

    creg = CR{BFA}
    n0 = mask[0] & (mode[0] == creg[0])
    n1 = mask[1] & (mode[1] == creg[1])
    n2 = mask[2] & (mode[2] == creg[2])
    n3 = mask[3] & (mode[3] == creg[3])
    BF = BT[2:4] # select CR
    bit = BT[0:1] # select bit of CR
    # OR (M=1) or AND (M=0) down to a single bit
    result = n0|n1|n2|n3 if M else n0&n1&n2&n3
    CR{BF}[bit] = result

When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64 type operation that has
5-bit Data-dependent and 5-bit Predicate-result capability
(BFT is 5 bits).

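The single-bit CR-to-CR form can be sketched as below. To avoid guessing how the 5-bit operand is split, this illustration takes the destination field number `bf` and the bit selector `bit` as separate, already-decoded arguments. It also assumes that `bit` uses MSB0 numbering within the 4-bit field (bit 0 is the most significant). Both choices are assumptions of the sketch, and `cr_fields` is a hypothetical list of 4-bit field values.

```python
def crweirder(cr_fields: list, bf: int, bit: int, bfa: int,
              mask: int, mode: int, M: int) -> None:
    """Write the one-bit test result into bit `bit` of cr_fields[bf]."""
    creg = cr_fields[bfa]
    n = mask & ~(mode ^ creg) & 0xF
    result = int(n != 0) if M else int(n == 0xF)
    shift = 3 - bit                            # assumption: MSB0 within the field
    cleared = cr_fields[bf] & ~(1 << shift) & 0xF
    cr_fields[bf] = cleared | (result << shift)
```

Only the selected bit of the destination field changes; the other three bits are left as they were.
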
**Example Pseudo-ops:**

    mtcri BF, mode    mtcrweird BF, r0, 0, 0b1111.~mode
    mtcrset BF, mask  mtcrweird BF, r0, 1, mask.0b0000
    mtcrclr BF, mask  mtcrweird BF, r0, 1, mask.0b1111

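The three pseudo-ops can be checked against the mtcrweird pseudocode with a short model. In this sketch the source is taken as zero (the `(RA|0)` case), 4-bit values are plain integers, and the Python function names mirror the pseudo-ops purely for illustration.

```python
def mtcrweird(cr_bf: int, ra_value: int, mask: int, mode: int, M: int) -> int:
    """Model of the scalar mtcrweird pseudocode; returns the new CR Field."""
    lsb4 = 0xF if ra_value & 1 else 0x0
    result = mask & ~(mode ^ lsb4) & 0xF
    if M:
        result |= cr_bf & ~mask & 0xF
    return result


def mtcri(old: int, imm: int) -> int:
    """mtcrweird BF, 0, 0, 0b1111.~imm : load a 4-bit immediate."""
    return mtcrweird(old, 0, 0b1111, ~imm & 0xF, 0)


def mtcrset(old: int, mask: int) -> int:
    """mtcrweird BF, 0, 1, mask.0b0000 : set the masked bits."""
    return mtcrweird(old, 0, mask, 0b0000, 1)


def mtcrclr(old: int, mask: int) -> int:
    """mtcrweird BF, 0, 1, mask.0b1111 : clear the masked bits."""
    return mtcrweird(old, 0, mask, 0b1111, 1)
```

Checking all 16 by 16 combinations of old field value and immediate confirms that `mtcri` yields the immediate, `mtcrset` yields `old | mask`, and `mtcrclr` yields `old & ~mask`.
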
# Vectorised versions

The name "weird" refers to a minor violation of SV rules when it comes to deriving the Vectorised versions of these instructions.

Normally the progression of the SV for-loop would move on to the next register.
Instead, in the scalar case, these instructions **remain in the same register** and insert or transfer between **bits** of the scalar integer source or destination.

A further useful violation of the normal SV Elwidth override rules allows
for packing (or unpacking) of multiple CR test results into
(or out of) an Integer Element. Note
that the CR (source operand) elwidth field is utilised to determine the
bit-packing size (1/2/4/8, with remaining bits within the Integer element
set to zero) whilst the INT (dest operand) elwidth field still sets
the Integer element size as usual (8/16/32/default).

    crrweird: RT, BB, M, mask.mode

    for i in range(VL):
        if BB.isvec:
            creg = CR{BB+i}
        else:
            creg = CR{BB}
        n0 = mask[0] & (mode[0] == creg[0])
        n1 = mask[1] & (mode[1] == creg[1])
        n2 = mask[2] & (mode[2] == creg[2])
        n3 = mask[3] & (mode[3] == creg[3])
        # OR or AND to a single bit
        result = n0|n1|n2|n3 if M else n0&n1&n2&n3
        if RT.isvec:
            # TODO: RT.elwidth override to be also added here
            # note, yes, really, the CR's elwidth field determines
            # the bit-packing into the INT!
            if BB.elwidth == 0b00:
                # pack 1 result into 64-bit registers
                iregs[RT+i][0..62] = 0
                iregs[RT+i][63] = result # sets LSB to result
            if BB.elwidth == 0b01:
                # pack 2 results sequentially into INT registers
                iregs[RT+i//2][0..61] = 0
                iregs[RT+i//2][63-(i%2)] = result
            if BB.elwidth == 0b10:
                # pack 4 results sequentially into INT registers
                iregs[RT+i//4][0..59] = 0
                iregs[RT+i//4][63-(i%4)] = result
            if BB.elwidth == 0b11:
                # pack 8 results sequentially into INT registers
                iregs[RT+i//8][0..55] = 0
                iregs[RT+i//8][63-(i%8)] = result
        else:
            iregs[RT][63-i] = result # results also in scalar INT

Note that:

* in the scalar case the CR-Vector assessment
  is stored bit-wise starting at the LSB of the
  destination scalar INT
* in the INT-vector case the results are packed into LSBs
  of the INT Elements, the packing arrangement depending on both
  elwidth override settings.

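The packing behaviour can be explored with the following Python sketch of the vectorised crrweird loop. It models only what the pseudocode shows: the RT elwidth override (marked TODO there) and predication are left out. Register files are hypothetical Python lists (`iregs` holding 64-bit integers, `cr_fields` holding 4-bit values), and MSB0 bit `63-k` is mapped to LSB0 bit `k`.

```python
def sv_crrweird(iregs: list, cr_fields: list, RT: int, BB: int, VL: int,
                mask: int, mode: int, M: int,
                rt_isvec: bool = True, bb_isvec: bool = True,
                bb_elwidth: int = 0b00) -> None:
    """Vectorised crrweird: pack one test result per element into iregs."""
    pack = 1 << bb_elwidth            # 1, 2, 4 or 8 results per INT register
    for i in range(VL):
        creg = cr_fields[BB + i] if bb_isvec else cr_fields[BB]
        n = mask & ~(mode ^ creg) & 0xF
        result = int(n != 0) if M else int(n == 0xF)
        if rt_isvec:
            reg, pos = RT + i // pack, i % pack
            iregs[reg] &= (1 << pack) - 1     # zero the bits above the packed group
        else:
            reg, pos = RT, i                  # scalar RT: one bit per element
        iregs[reg] = (iregs[reg] & ~(1 << pos)) | (result << pos)
```

For instance, testing EQ (`mask=mode=0b0010, M=1`) on four CR Fields holding `0b0010, 0b0100, 0b0010, 0b0010` gives the results 1, 0, 1, 1. With a scalar RT, or with a vector RT and `bb_elwidth=0b10`, these land in a single register as `0b1101`.
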
# v3.1 setbc instructions

There are additional setbc conditional instructions in v3.1 (p129):

    RT = (CR[BI] == 1) ? 1 : 0

Variants of these also negate the condition, and also return -1 / 0.  These are similar to crweird but do not serve the same purpose.  Most notably, crweird acts on CR Fields rather than on the entire 32-bit CR.

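For comparison, the v3.1 setbc family (setbc, setbcr, setnbc, setnbcr) can be sketched as below. The function name `setbc_family` is hypothetical, and the sketch assumes a 32-bit CR value in which `BI` is an MSB0 bit index, so that `BI=0` is the LT bit of CR0.

```python
def setbc_family(cr: int, BI: int) -> dict:
    """Return the RT value each setbc variant produces for CR bit BI."""
    bit = (cr >> (31 - BI)) & 1       # MSB0 bit BI of the 32-bit CR
    ones = (1 << 64) - 1              # -1 as a 64-bit value
    return {
        "setbc": bit,                     # 1 if set, else 0
        "setbcr": bit ^ 1,                # reversed: 0 if set, else 1
        "setnbc": ones if bit else 0,     # -1 if set, else 0
        "setnbcr": 0 if bit else ones,    # reversed: 0 if set, else -1
    }
```

Each variant tests exactly one bit selected from the whole CR, whereas the crweird family combines up to four bits of one CR Field under a mask.
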
# Predication Examples

Take the following example:

    r10 = 0b00010
    sv.mtcrweird/dm=r10/dz cr8.v, 0, 0b0011.0000

Here, RA is zero, so the source input is zero. The destination
is CR Field 8, and the destination predicate mask is considered
here for the first two elements.  Destination predicate zeroing is
enabled, and the destination predicate is only set in the 2nd bit.
mask is 0b0011, mode is all zeros.

Let us first consider what should go into element 0 (CR Field 8):

* The destination predicate bit is zero, and zeroing is enabled.
* Therefore, what is in the source is irrelevant: the result must
  be zero.
* Therefore all four bits of CR Field 8 are set to zero.

Now the second element, CR Field 9 (CR9):

* Bit 1 (the 2nd bit) of the destination predicate, r10, is 1. Therefore the computation
  of the result is relevant.
* RA is zero, therefore the source bit is zero.  mask is 0b0011 and mode is 0b0000.
* When calculating n0 thru n3 we get n0=1, n1=1, n2=0, n3=0
* Therefore, CR9 is set (using LSB0 ordering) to 0b0011, i.e. to mask.

It should be clear that this instruction uses bits of the integer
predicate to decide whether to set CR Fields to `(mask & ~mode)`
or to zero.  Thus, in effect, it is the integer predicate that has
been copied into the CR Fields.

By using twin predication, zeroing, and inversion (sm=~r3, dm=r10) for example, it becomes possible to combine two Integers together in
order to set bits in CR Fields.
Likewise there are dozens of ways that CR Predicates can be used, on the
same sv.mtcrweird instruction.
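
The worked example can be replayed with a small Python model. This is a sketch under several assumptions: VL is 2, M is 0 (the example omits it), RA is zero so every source bit is zero, and only destination predication with zeroing is modelled. The function name `sv_mtcrweird_ra0` and the list `cr_fields` are hypothetical.

```python
def sv_mtcrweird_ra0(cr_fields: list, BF: int, mask: int, mode: int,
                     M: int, VL: int, dm: int, dz: bool = True) -> list:
    """Vectorised mtcrweird with RA=0 and an integer destination predicate."""
    for i in range(VL):
        if not (dm >> i) & 1:
            if dz:
                cr_fields[BF + i] = 0      # zeroing: masked-out element cleared
            continue
        # RA=0 means lsb=0, so n_i = mask[i] & (mode[i] == 0)
        result = mask & ~mode & 0xF
        if M:
            result |= cr_fields[BF + i] & ~mask & 0xF
        cr_fields[BF + i] = result
    return cr_fields


cr = [0b1111] * 16                         # arbitrary starting contents
sv_mtcrweird_ra0(cr, BF=8, mask=0b0011, mode=0b0000, M=0, VL=2, dm=0b00010)
```

After the call, `cr[8]` is `0b0000` (predicate bit clear, zeroing applied) and `cr[9]` is `0b0011` (predicate bit set, result is `mask & ~mode`), while fields beyond VL are untouched.
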