+[[!tag standards]]
+
# New instructions for CR/INT predication
+**DRAFT STATUS**
+
See:
-* <https://bugs.libre-soc.org/show_bug.cgi?id=533>
+* main bugreport for crweirds
+ <https://bugs.libre-soc.org/show_bug.cgi?id=533>
+* <https://bugs.libre-soc.org/show_bug.cgi?id=527>
+* <https://bugs.libre-soc.org/show_bug.cgi?id=569>
+* <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>
+
+Rationale:
+
+Condition Registers are conceptually perfect for use as predicate masks,
+the only problem being that typical Vector ISAs have quite comprehensive
+mask-based instructions: set-before-first, popcount and much more.
+In fact many Vector ISAs can use Vectors *as* masks, consequently the
+entire Vector ISA is usually available for use in creating masks (one
+exception being AVX512 which has a dedicated Mask regfile and opcodes).
+Duplication of such operations (popcount etc) is not practical for SV
+given the strategy of leveraging pre-existing Scalar instructions in a
+minimalist way.
+
+With the scalar OpenPOWER v3.0B ISA already having popcnt, cntlz and
+others normally seen in Vector Mask operations it makes sense to allow
+*both* scalar integers *and* CR-Vectors to be predicate masks. That in
+turn means that much more comprehensive interaction between CRs and scalar
+Integers is required, because with the CR Predication Modes designating
+CR *Fields* (not CR bits) as Predicate Elements, fast transfers between
+CR *Fields* and the Integer Register File are needed.
+
+The opportunity is therefore taken to augment CR logical arithmetic
+as well, using a mask-based paradigm that takes into consideration
+multiple bits of each CR Field (lt/gt/eq/so). By contrast v3.0B Scalar
+CR instructions (crand, crxor) only allow a single bit calculation, and
+both mtcr and mfcr are CR-orientated rather than CR *Field* orientated.
+
+Also strangely there is no v3.0 instruction for directly moving CR Fields,
+only CR *bits*, so that is corrected here with `mcrfm`. The opportunity
+is taken to allow inversion of CR Field bits, when copied.
Basic concept:
-* CR-based instructions that perform simple AND/OR/XOR from all four bits
- of a CR to create a single bit value (0/1) in an integer register
+* CR-based instructions that perform simple AND/OR from any four bits
+ of a CR field to create a single bit value (0/1) in an integer register
* Inverse of the same, taking a single bit value (0/1) from an integer
- register to selectively target all four bits of a given CR
+ register to selectively target any four bits of a given CR Field
* CR-to-CR version of the same, allowing multiple bits to be AND/OR/XORed
in one hit.
-* Vectorisation of the same
+* Optional Vectorisation of the same when SVP64 is implemented
Purpose:
Side-effects:
-* mtcrweird when RA=0 is a means to set or clear arbitrary CR bits from immediates
+* mtcrweird when RA=0 is a means to set or clear arbitrary CR bits
+ using immediates embedded within the instruction.
+
+(Twin) Predication interactions:
+
+* INT twin predication with zeroing is a way to copy an integer into
+ CRs without necessarily needing the INT register (RA). If RA is used, it is
+ effectively ANDed (or negate-and-ANDed) with the INT Predicate.
+* CR twin predication with zeroing is likewise a way to interact with
+ the incoming integer.
+
+This gets particularly powerful if data-dependent predication is also
+enabled. Further explanation is below.
+
+# Bit ordering.
+
+Please see [[svp64/appendix]] regarding CR bit ordering and for
+the definition of `CR{n}`
# Instruction form and pseudocode
- | 0-5 | 6-10 | 11 | 12-15 | 16-18 | 19-20 | 21-30 | 31 |
- | 19 | RT | 0 | mask | BB | 0 / | XO | / |
- | 19 | RA | 1 | mask | BB | 0 / | XO | / |
- | 19 | BT // | 0 | mask | BB | 1 / | XO | / |
- | 19 | BFT | 1 | mask | BB | 1 / | XO | / |
+**DRAFT** Instruction format (use of MAJOR 19 not approved by
+OPF ISA WG):
+
+|0-5|6-10 |11|12-15|16-18|19-20|21-25 |26-30 |31|name |
+|---|---- |--|-----|-----|-----|----- |----- |--|---- |
+|19 |RT | |mask |BFA | |XO[0:4]|XO[5:9]|/ | |
+|19 | | | | | |1 //// |00011 | |rsvd |
+|19 |RT |M |mask |BFA | 0 0 |1 mode |00011 |Rc|crrweird |
+|19 |RT |M |mask |BFA | 0 1 |1 mode |00011 |Rc|mfcrweird |
+|19 |RA |M |mask |BF | 0 0 |0 mode |00011 |1 |mtcrrweird |
+|19 |RA |M |mask |BF | 0 1 |0 mode |00011 |0 |mtcrweird |
+|19 |BT |M |mask |BFA | 0 1 |0 mode |00011 |1 |crweirder |
+|19 |BF //|M |mask |BFA | 1 1 |0 mode |00011 |0 |crweird |
+|19 |BF //|M |mask |BFA | 1 1 |0 mode |00011 |1 |mcrfm |
+
+**crrweird**
mode is encoded in XO and is 4 bits
-bit 11=0, bit 19=0
+bit 19=0, bit 20=0
- crrweird: RT, BB, mask.mode
+ crrweird: RT, BFA, M, mask.mode
- creg = CRfile[32+BB*4:36+BB*4]
+ creg = CR{BFA}
n0 = mask[0] & (mode[0] == creg[0])
n1 = mask[1] & (mode[1] == creg[1])
n2 = mask[2] & (mode[2] == creg[2])
n3 = mask[3] & (mode[3] == creg[3])
- RT[0] = n0|n1|n2|n3
+ result = n0|n1|n2|n3 if M else n0&n1&n2&n3
+ RT[63] = result # MSB0 numbering, 63 is LSB
+ If Rc:
+ CR0 = analyse(RT)
+
+When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
+SVP64 type operation and as such can use Rc=1 and RC1 Data-dependent
+Mode capability
-bit 11=1, bit 19=0
+Also as noted below, element-width override bits normally used
+on the source are instead used to allow multiple results to be packed
+sequentially into the destination. *Destination elwidth overrides still apply*.
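
A minimal Python sketch of the scalar `crrweird` pseudocode may help when checking operand combinations. It assumes (this is not part of the draft) that `creg`, `mask` and `mode` are passed as 4-bit integers sharing one bit-numbering convention; since every comparison is between same-position bits, the choice of convention does not change the outcome. The name `crrweird_model` is hypothetical.

```python
def crrweird_model(creg, mask, mode, M):
    """Illustrative model only: returns the single bit placed in RT's LSB."""
    # Bit i of n is set where mask[i] is 1 and mode[i] equals creg[i].
    n = mask & ~(mode ^ creg) & 0xF
    if M:
        return int(n != 0)    # OR of n0..n3
    return int(n == 0xF)      # AND of n0..n3


# Is either of the two masked bits of the CR Field equal to zero?
print(crrweird_model(creg=0b1010, mask=0b0011, mode=0b0000, M=1))
```

Transcribed literally, the AND form (M=0) can only return 1 when all four `mask` bits are set, because a cleared mask bit forces the corresponding `n` term to zero.
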
- mtcrweird: RA, BB, mask.mode
+**mfcrrweird**
+
+mode is encoded in XO and is 4 bits
+
+bit 19=0, bit 20=0
+
+ mfcrrweird: RT, BFA, mask.mode
+
+ creg = CR{BFA}
+ n0 = mask[0] & (mode[0] == creg[0])
+ n1 = mask[1] & (mode[1] == creg[1])
+ n2 = mask[2] & (mode[2] == creg[2])
+ n3 = mask[3] & (mode[3] == creg[3])
+ result = n0||n1||n2||n3
+ RT[60:63] = result # MSB0 numbering, 63 is LSB
+ If Rc:
+ CR0 = analyse(RT)
+
+When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
+SVP64 type operation and as such can use Rc=1 and RC1 Data-dependent
+Mode capability.
+
+Also as noted below, element-width override bits normally used
+on the source are instead used to allow multiple results to be packed
+into the destination. *Destination elwidth overrides still apply*.
+
+**mtcrrweird**
+
+mode is encoded in XO and is 4 bits
+
+bit 19=0, bit 20=0
+
+ mtcrrweird: BF, RA, M, mask.mode
+
+ n0 = mask[0] & (mode[0] == RA[63])
+ n1 = mask[1] & (mode[1] == RA[62])
+ n2 = mask[2] & (mode[2] == RA[61])
+ n3 = mask[3] & (mode[3] == RA[60])
+ result = n0 || n1 || n2 || n3
+ if M:
+ result |= CR{BF} & ~mask
+ CR{BF} = result
+
+When used with SVP64 Prefixing this is a [[openpower/sv/normal]]
+SVP64 type operation and as such can use RC1 Data-dependent
+Mode capability
+
+**mtcrweird**
+
+bit 19=0, bit 20=1
+
+ mtcrweird: BF, RA, M, mask.mode
reg = (RA|0)
- n0 = mask[0] & (mode[0] == reg[0])
- n1 = mask[1] & (mode[1] == reg[0])
- n2 = mask[2] & (mode[2] == reg[0])
- n3 = mask[3] & (mode[3] == reg[0])
- CRfile[32+BB*4:36+BB*4] = n0 || n1 || n2 || n3
+ lsb = reg[63] # MSB0 numbering
+ n0 = mask[0] & (mode[0] == lsb)
+ n1 = mask[1] & (mode[1] == lsb)
+ n2 = mask[2] & (mode[2] == lsb)
+ n3 = mask[3] & (mode[3] == lsb)
+ result = n0 || n1 || n2 || n3
+ if M:
+ result |= CR{BF} & ~mask
+ CR{BF} = result
+
+Note that when M=1 this operation is a Read-Modify-Write on the CR Field
+BF. Masked-out bits of the 4-bit CR Field BF will not be changed when
+M=1. Correspondingly when M=0 this operation is an overwrite: no read
+of BF is required because the masked-out bits of the BF CR Field are
+set to zero.
+
+When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
+type operation that has 3-bit Data-dependent and 3-bit Predicate-result
+capability (BF is 3 bits)
+
+**crweird**
-bit 11=0, bit 19=1
+bit 19=1, bit 20=0, bit 30=0
- crweird: BT, BB, mask.mode
+ crweird: BF, BFA, M, mask.mode
- creg = CRfile[32+BB*4:36+BB*4]
+ creg = CR{BFA}
n0 = mask[0] & (mode[0] == creg[0])
n1 = mask[1] & (mode[1] == creg[1])
n2 = mask[2] & (mode[2] == creg[2])
n3 = mask[3] & (mode[3] == creg[3])
- CRfile[32+BT*4:36+BT*4] = n0 || n1 || n2 || n3
+ result = n0 || n1 || n2 || n3
+ if M:
+ result |= CR{BF} & ~mask
+ CR{BF} = result
-bit 11=1, bit 19=1
+Note that when M=1 this operation is a Read-Modify-Write on the CR Field
+BF. Masked-out bits of the 4-bit CR Field BF will not be changed when
+M=1. Correspondingly when M=0 this operation is an overwrite: no read
+of BF is required because the masked-out bits of the BF CR Field are
+set to zero.
- crweirder: BFT, BB, mask.mode
+When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
+type operation that has 3-bit Data-dependent and 3-bit Predicate-result
+capability (BF is 3 bits)
- creg = CRfile[32+BB*4:36+BB*4]
+**mcrfm** - Move CR Field, masked.
+
+bit 19=1, bit 20=0, bit 30=1
+
+ mcrfm: BF, BFA, M, mask.mode
+
+ result = mask & CR{BFA}
+ if M:
+ result |= CR{BF} & ~mask
+ result ^= mode
+ CR{BF} = result
+
+Note that when M=1 this operation is a Read-Modify-Write on the CR Field
+BF. Masked-out bits of the 4-bit CR Field BF will not be changed when
+M=1. Correspondingly when M=0 this operation is an overwrite: no read
+of BF is required because the masked-out bits of the BF CR Field are
+set to zero.
+
+When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
+type operation that has 3-bit Data-dependent and 3-bit Predicate-result
+capability (BF is 3 bits)
+
+*Programmer's note: `mode` being XORed onto the result provides
+considerable flexibility. Individual bits of BFA may be copied inverted
+to BF by ensuring that `mask` and `mode` have the same bit set. Also,
+when M=0, individual bits in BF may be set to 1 by ensuring that the
+required bit of `mask` is set to zero and the same bit in `mode` is set to 1.*
+
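The `mcrfm` pseudocode and the Programmer's note can be exercised with a small Python model. This is a sketch under stated assumptions, not a reference implementation: CR Fields, `mask` and `mode` are modelled as 4-bit integers, and the name `mcrfm_model` is hypothetical.

```python
def mcrfm_model(cr_bf, cr_bfa, M, mask, mode):
    """Illustrative model only: returns the new 4-bit value of CR Field BF."""
    result = mask & cr_bfa
    if M:
        # Read-Modify-Write: keep the masked-out bits of the old BF value.
        result |= cr_bf & ~mask
    result ^= mode
    return result & 0xF


# Copy all four bits of BFA inverted (mask and mode both all ones).
print(bin(mcrfm_model(cr_bf=0b0000, cr_bfa=0b1010, M=0, mask=0b1111, mode=0b1111)))
# Force one bit of BF to 1 (mask bit clear, mode bit set, M=0).
print(bin(mcrfm_model(cr_bf=0b0000, cr_bfa=0b1010, M=0, mask=0b0000, mode=0b0100)))
```

Note that in this literal transcription the XOR with `mode` is applied after the M=1 merge, so with M=1 a masked-out bit of BF is inverted where `mode` is 1 rather than being forced to 1.
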
+**crweirder**
+
+bit 19=1, bit 20=1
+
+ crweirder: BT, BFA, M, mask.mode
+
+ creg = CR{BFA}
n0 = mask[0] & (mode[0] == creg[0])
n1 = mask[1] & (mode[1] == creg[1])
n2 = mask[2] & (mode[2] == creg[2])
n3 = mask[3] & (mode[3] == creg[3])
- CRfile[32+BFT] = n0|n1|n2|n3
+ BF = BT[2:4] # select CR
+ bit = BT[0:1] # select bit of CR
+ result = n0|n1|n2|n3 if M else n0&n1&n2&n3
+ CR{BF}[bit] = result
+
+When used with SVP64 Prefixing this is a [[openpower/sv/cr_ops]] SVP64
+type operation that has 5-bit Data-dependent and 5-bit Predicate-result
+capability (BT is 5 bits).
+
+**Example Pseudo-ops:**
+
+ mtcri BF, mode mtcrweird BF, r0, 0, 0b1111.~mode
+ mtcrset BF, mask mtcrweird BF, r0, 1, mask.0b0000
+ mtcrclr BF, mask mtcrweird BF, r0, 1, mask.0b1111
+
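The three pseudo-ops can be checked against the `mtcrweird` pseudocode with a short Python model. This is a sketch only: CR Fields, `mask` and `mode` are assumed to be 4-bit integers, `ra` is the value of `(RA|0)`, and all function names here are hypothetical.

```python
def mtcrweird_model(cr_bf, ra, M, mask, mode):
    """Illustrative model only: returns the new 4-bit value of CR Field BF."""
    lsb4 = 0xF if (ra & 1) else 0x0          # RA's LSB compared against each mode bit
    result = mask & ~(mode ^ lsb4) & 0xF     # n0 || n1 || n2 || n3
    if M:
        result |= cr_bf & ~mask & 0xF        # Read-Modify-Write of masked-out bits
    return result


def mtcri(cr_bf, mode):
    # mtcrweird BF, r0, 0, 0b1111.~mode
    return mtcrweird_model(cr_bf, 0, 0, 0b1111, ~mode & 0xF)


def mtcrset(cr_bf, mask):
    # mtcrweird BF, r0, 1, mask.0b0000
    return mtcrweird_model(cr_bf, 0, 1, mask, 0b0000)


def mtcrclr(cr_bf, mask):
    # mtcrweird BF, r0, 1, mask.0b1111
    return mtcrweird_model(cr_bf, 0, 1, mask, 0b1111)


print(bin(mtcri(0b1111, 0b0101)))    # field loaded with the immediate
print(bin(mtcrset(0b1000, 0b0011)))  # masked bits set, others kept
print(bin(mtcrclr(0b1011, 0b0011)))  # masked bits cleared, others kept
```

With RA=0 the compared bit is always zero, which is why `mtcri` inverts its immediate and why `mtcrset` and `mtcrclr` differ only in `mode`.
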
+# Vectorised versions
+
+The name "weird" refers to a minor violation of SV rules when it comes
+to deriving the Vectorised versions of these instructions.
+
+Normally the progression of the SV for-loop would move on to the
+next register. Instead however in the scalar case these instructions
+**remain in the same register** and insert or transfer between **bits**
+of the scalar integer source or destination.
+
+Further useful violation of the normal SV Elwidth override rules allows
+for packing (or unpacking) of multiple CR test results into (or out of)
+an Integer Element. Note that the CR (source operand) elwidth field is
+utilised to determine the bit-packing size (1/2/4/8 with remaining
+bits within the Integer element set to zero) whilst the INT (dest
+operand) elwidth field still sets the Integer element size as usual
+(8/16/32/default).
+
+**crrweird: RT, BB, mask.mode**
+
+    for i in range(VL):
+        if BB.isvec:
+            creg = CR{BB+i}
+        else:
+            creg = CR{BB}
+        n0 = mask[0] & (mode[0] == creg[0])
+        n1 = mask[1] & (mode[1] == creg[1])
+        n2 = mask[2] & (mode[2] == creg[2])
+        n3 = mask[3] & (mode[3] == creg[3])
+        # OR or AND to a single bit
+        result = n0|n1|n2|n3 if M else n0&n1&n2&n3
+        if RT.isvec:
+            # TODO: RT.elwidth override to be also added here
+            # note, yes, really, the CR's elwidth field determines
+            # the bit-packing into the INT!
+            if BB.elwidth == 0b00:
+                # pack 1 result into 64-bit registers
+                iregs[RT+i][0..62] = 0
+                iregs[RT+i][63] = result # sets LSB to result
+            if BB.elwidth == 0b01:
+                # pack 2 results sequentially into INT registers
+                iregs[RT+i//2][0..61] = 0
+                iregs[RT+i//2][63-(i%2)] = result
+            if BB.elwidth == 0b10:
+                # pack 4 results sequentially into INT registers
+                iregs[RT+i//4][0..59] = 0
+                iregs[RT+i//4][63-(i%4)] = result
+            if BB.elwidth == 0b11:
+                # pack 8 results sequentially into INT registers
+                iregs[RT+i//8][0..55] = 0
+                iregs[RT+i//8][63-(i%8)] = result
+        else:
+            iregs[RT][63-i] = result # results also in scalar INT
+
+Note that:
+
+* in the scalar case the CR-Vector assessment
+ is stored bit-wise starting at the LSB of the
+ destination scalar INT
+* in the INT-vector case the results are packed into LSBs
+ of the INT Elements, the packing arrangement depending on both
+ elwidth override settings.
+
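The bit-packing rule for Vectorised `crrweird` can be illustrated in Python. This sketch assumes that the per-element test results have already been computed, that registers are plain integers using LSB0 shifts (MSB0 bit `63-(i%n)` is LSB0 bit `i%n`), and that RT.elwidth is left at its default. The name `pack_results` is hypothetical.

```python
def pack_results(results, per_reg):
    """Illustrative model only: pack 1-bit results, per_reg (1/2/4/8) to a register."""
    n_regs = (len(results) + per_reg - 1) // per_reg
    regs = [0] * n_regs                      # unused bits stay zero
    for i, r in enumerate(results):
        regs[i // per_reg] |= (r & 1) << (i % per_reg)
    return regs


tests = [1, 0, 1, 1, 0, 1]
print(pack_results(tests, 1))   # one result per register
print(pack_results(tests, 4))   # four results per register
```

The scalar-destination case behaves like a single register that receives every result, element `i` landing in LSB0 bit `i`.
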
+**mfcrrweird: RT, BFA, mask.mode**
+
+Unlike `crrweird` the results are 4 bits wide, so the packing
+will begin to spill over to other destination elements. 8 results per
+destination at 4 bits each still fit into a destination elwidth of 32-bit,
+but for 16-bit and 8-bit this does not fit, and the results must be split
+across to the next element.
+
+When for example destination elwidth is 16-bit (0b10) the following packing
+occurs:
+
+- SVRM bits 6:7 equal to 0b00 - one 4-bit result element packed into the
+ first 4-bits of the 16-bit destination element (in the first 4 LSBs)
+- SVRM bits 6:7 equal to 0b01 - two 4-bit result elements packed into the
+ first 8-bits of the 16-bit destination element (in the first 8 LSBs)
+- SVRM bits 6:7 equal to 0b10 - four 4-bit result elements packed into each
+ 16-bit destination element
+- SVRM bits 6:7 equal to 0b11 - eight 4-bit result elements, the first four
+ of which are packed into the first 16-bit destination element, the
+ second four of which are packed into the second 16-bit destination element.
+
+Pseudocode example: note that dest elwidth overrides affect the
+packing of results. BB.elwidth in effect requests how many 4-bit
+result elements are to be packed per destination element, but RT.elwidth
+determines the limit. Any parts of the destination elements not containing
+results are set to zero.
+
+    for i in range(VL):
+        if BB.isvec:
+            creg = CR{BB+i}
+        else:
+            creg = CR{BB}
+        n0 = mask[0] & (mode[0] == creg[0])
+        n1 = mask[1] & (mode[1] == creg[1])
+        n2 = mask[2] & (mode[2] == creg[2])
+        n3 = mask[3] & (mode[3] == creg[3])
+        result = n0||n1||n2||n3 # 4-bit result
+        if RT.isvec:
+            # RT.elwidth override can affect the packing:
+            # an element of bwid bits holds at most bwid//4 4-bit results
+            bwid = {0b00:64, 0b01:8, 0b10:16, 0b11:32}[RT.elwidth]
+            t4, t8 = min(4, bwid//4), min(8, bwid//4)
+            # yes, really, the CR's elwidth field determines
+            # the bit-packing into the INT!
+            if BB.elwidth == 0b00:
+                # pack 1 result into 64-bit registers
+                idx, boff = i, 0
+            if BB.elwidth == 0b01:
+                # pack 2 results sequentially into INT registers
+                idx, boff = i//2, i%2
+            if BB.elwidth == 0b10:
+                # pack 4 results sequentially into INT registers
+                idx, boff = i//t4, i%t4
+            if BB.elwidth == 0b11:
+                # pack 8 results sequentially into INT registers
+                idx, boff = i//t8, i%t8
+        else:
+            # exceeding VL=16 is UNDEFINED
+            idx, boff = 0, i
+        iregs[RT+idx][60-boff*4:63-boff*4] = result
+
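A small Python model can make the spill-over behaviour of Vectorised `mfcrrweird` concrete. It is a sketch under assumptions: the 4-bit results are supplied directly, destination elements are plain integers with LSB0 shifts, and each destination element holds at most `elwidth_bits // 4` results, as the 16-bit example in the prose describes. The name `pack_nibbles` is hypothetical.

```python
def pack_nibbles(results, requested, elwidth_bits):
    """Illustrative model only: pack 4-bit results into destination elements.

    requested    -- results per element asked for by the source elwidth (1/2/4/8)
    elwidth_bits -- destination element width in bits (8/16/32/64)
    """
    per_elem = min(requested, elwidth_bits // 4)   # destination width is the limit
    n_elems = (len(results) + per_elem - 1) // per_elem
    elems = [0] * n_elems                          # unused bits stay zero
    for i, r in enumerate(results):
        idx, boff = divmod(i, per_elem)
        elems[idx] |= (r & 0xF) << (4 * boff)
    return elems


# Eight results requested per element, but 16-bit elements only hold four.
print([hex(e) for e in pack_nibbles([1, 2, 3, 4, 5, 6, 7, 8], 8, 16)])
```
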
+
+
+# v3.1 setbc instructions
+
+There are additional setb conditional instructions in v3.1 (p129)
+
+ RT = (CR[BI] == 1) ? 1 : 0
+
+which also negate that, and also return -1 / 0. These are similar to
+crweird but do not serve the same purpose. Most notable is that crweird
+acts on CR Fields rather than the entire 32-bit CR.
+
+# Predication Examples
+
+Take the following example:
+
+ r10 = 0b00010
+ sv.mtcrweird/dm=r10/dz cr8.v, 0, 0b0011.0000
+
+Here, RA is zero, so the source input is zero. The destination is CR Field
+8, and the destination predicate mask indicates to target the first two
+elements. Destination predicate zeroing is enabled, and the destination
+predicate is only set in the 2nd bit. mask is 0b0011, mode is all zeros.
+
+Let us first consider what should go into element 0 (CR Field 8):
+
+* The destination predicate bit is zero, and zeroing is enabled.
+* Therefore, what is in the source is irrelevant: the result must
+ be zero.
+* Therefore all four bits of CR Field 8 are set to zero.
+
-Pseudo-op:
+Now the second element, CR Field 9 (CR9):
- mtcri BB, mode mtcrweird r0, BB, 0b1111.~mode
- mtcrset BB, mask mtcrweird r0, BB, mask.0b0000
- mtcrclr BB, mask mtcrweird r0, BB, mask.0b1111
+* Bit 2 of the destination predicate, r10, is 1. Therefore the computation
+ of the result is relevant.
+* RA is zero therefore bit 2 is zero. mask is 0b0011 and mode is 0b0000.
+* When calculating n0 thru n3 we get n0=1, n1=1, n2=0, n3=0.
+* Therefore, CR9 is set (using LSB0 ordering) to 0b0011, i.e. to mask.
+
+It should be clear that this instruction uses bits of the integer
+predicate to decide whether to set CR Fields to `(mask & ~mode)` or
+to zero. Thus, in effect, it is the integer predicate that has been
+copied into the CR Fields.
+
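
The worked example can be reproduced with a short Python model. This is a sketch under several assumptions that are not stated in the draft: VL is taken to be 2, M is taken to be 0, predicate bit `i` (LSB0) governs element `i`, and CR Fields are 4-bit integers. The name `sv_mtcrweird_dz` is hypothetical.

```python
def sv_mtcrweird_dz(pred, vl, mask, mode, ra=0):
    """Illustrative model only: sv.mtcrweird with dest predicate and zeroing."""
    lsb4 = 0xF if (ra & 1) else 0x0
    fields = []
    for i in range(vl):
        if (pred >> i) & 1:
            # Predicate bit set: compute the mtcrweird result (M=0 overwrite).
            fields.append(mask & ~(mode ^ lsb4) & 0xF)
        else:
            # Predicate bit clear with zeroing enabled: element is zeroed.
            fields.append(0)
    return fields


# r10 = 0b00010, mask = 0b0011, mode = 0b0000, RA = 0
cr8, cr9 = sv_mtcrweird_dz(pred=0b00010, vl=2, mask=0b0011, mode=0b0000)
print(bin(cr8), bin(cr9))
```

Each output field is either `mask & ~mode` or zero, selected by the corresponding predicate bit, which is the sense in which the integer predicate is copied into the CR Fields.
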
+By using twin predication, zeroing, and inversion (sm=~r3, dm=r10) for
+example, it becomes possible to combine two Integers together in order
+to set bits in CR Fields. Likewise there are dozens of ways that CR
+Predicates can be used, on the same sv.mtcrweird instruction.