-# SV Vector Operations.
+[[!tag standards]]
-The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example. This section includes such examples. Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
+# SV Vector-assist Operations.
-However some of these actually could be added to a scalar ISA as bitmanipulation instructions. These are separated out into their own section.
-Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
-
-.
Links:
+* [[discussion]]
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
-* <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
+* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
+* <https://bugs.libre-soc.org/show_bug.cgi?id=865> implementation in simulator
* <https://bugs.libre-soc.org/show_bug.cgi?id=213>
* <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
- out of scope for this document
+ out of scope for this document [[openpower/sv/3d_vector_ops]]
* [[simple_v_extension/specification/bitmanip]] previous version,
contains pseudocode for sof, sif, sbf
+* <https://en.m.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set#TBM_(Trailing_Bit_Manipulation)>
-# Vector
-
-## conflictd
-
-This is based on the AVX512 conflict detection instruction. Internally the logic is used to detect address conflicts in multi-issue LD/ST operations. Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion. the instruction may be used for histograms (computed in parallel)
-
- input = [100, 100, 3, 100, 5, 100, 100, 3]
- conflict result = [
- 0b00000000, // Note: first element always zero
- 0b00000001, // 100 is present on #0
- 0b00000000,
- 0b00000011, // 100 is present on #0 and #1
- 0b00000000,
- 0b00001011, // 100 is present on #0, #1, #3
- 0b00011011, // .. and #4
- 0b00000100 // 3 is present on #2
- ]
-
-Pseudocode:
-
- for i in range(VL):
- for j in range(1, i):
- if src1[i] == src2[j]:
- result[j] |= 1<<i
-
-## iota
-
-Based on RVV vmiota. vmiota may be viewed as a cumulative variant of popcount, generating multiple results. successive iterations include more and more bits of the bitstream being tested.
-
-When masked, only the bits not masked out are included in the count process.
-
- viota RT/v, RA, RB
-
-Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This will be equivalent to RVV vid.m which is a pseudo-op, here (RA=0).
-
-Example
-
- 7 6 5 4 3 2 1 0 Element number
-
- 1 0 0 1 0 0 0 1 v2 contents
- viota.m v4, v2 # Unmasked
- 2 2 2 1 1 1 1 0 v4 result
-
- 1 1 1 0 1 0 1 1 v0 contents
- 1 0 0 1 0 0 0 1 v2 contents
- 2 3 4 5 6 7 8 9 v4 contents
- viota.m v4, v2, v0.t # Masked
- 1 1 1 5 1 7 1 0 v4 results
+The core Power ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism.
+Therefore there are not that many cases where *actual* Vector
+instructions are needed. If they are, they are more "assistance"
+functions. Two traditional Vector instructions were initially
+considered (conflictd and vmiota) however they may be synthesised
+from existing SVP64 instructions: vmiota may use [[svstep]].
+Details in [[discussion]]
- def iota(RT, RA, RB):
- mask = RB ? iregs[RB] : 0b111111...1
- val = RA ? iregs[RA] : 0b111111...1
- for i in range(VL):
- if RA.scalar:
- testmask = (1<<i)-1 # only count below
- to_test = val & testmask & mask
- iregs[RT+i] = popcount(to_test)
+Notes:
-a Vector CR-based version of the same, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2]
-where bit 2 is inv, bits 0:1 select the bit of the CR.
+* Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
+* Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations. See [[sv/cr_int_predication]].
- def test_CR_bit(CR, BO):
- return CR[BO[0:1]] == BO[2]
+# Mask-suited Bitmanipulation
- def iotacr(RT, BA, BO):
- mask = get_src_predicate()
- count = 0
- for i in range(VL):
- if mask & (1<<i) == 0: continue
- iregs[RT+i] = count
- if test_CR_bit(CR[i+BA], BO):
- count += 1
+Based on RVV masked set-before-first, set-after-first etc.
+and Intel and AMD Bitmanip instructions made generalised then
+advanced further to include masks, this is a single instruction
+covering 24 individual instructions in other ISAs.
+*(sbf/sof/sif moved to [[discussion]])*
-the variant of iotacr which is vidcr, this is not appropriate to have BA=0, plus, it is pointless to have it anyway. The integer version covers it, by not reading the int regfile at all.
+BM2-Form
-# Scalar
+|0..5 |6..10|11..15|16..20|21-25|26|27..31| Form |
+|------|-----|------|------|-----|--|------|------|
+| PO | RS | RA | RB |bm |L | XO | BM2-Form |
-These may all be viewed as suitable for fitting into a scalar bitmanip extension.
+* bmask RS,RA,RB,bm,L
-## sbfm
+The patterns within the pseudocode for AMD TBM and x86 BMI1 are
+as follows:
- sbfm RT, RA, RB!=0
+* first pattern A: two options `x` or `~x`
+* second pattern B: three options `|` `&` or `^`
+* third pattern C: four options `x+1`, `x-1`, `~(x+1)` or `(~x)+1`
-Example
+Thus it makes sense to create a single instruction
+that covers all of these. A crucial addition that is essential
+for Scalable Vector usage as Predicate Masks, is the second mask parameter
+(RB). The additional paramater, L, if set, will leave bits of RA masked
+by RB unaltered, otherwise those bits are set to zero. Note that when `RB=0`
+then instead of reading from the register file the mask is set to all ones.
- 7 6 5 4 3 2 1 0 Bit index
+The lower two bits of `bm` set to 0b11 are `RESERVED`. An illegal instruction
+trap must be raised.
- 1 0 0 1 0 1 0 0 v3 contents
- vmsbf.m v2, v3
- 0 0 0 0 0 0 1 1 v2 contents
+Executable pseudocode demo:
- 1 0 0 1 0 1 0 1 v3 contents
- vmsbf.m v2, v3
- 0 0 0 0 0 0 0 0 v2
+```
+[[!inline pages="openpower/sv/bmask.py" quick="yes" raw="yes" ]]
+```
- 0 0 0 0 0 0 0 0 v3 contents
- vmsbf.m v2, v3
- 1 1 1 1 1 1 1 1 v2
+# Carry-lookahead
- 1 1 0 0 0 0 1 1 RB vcontents
- 1 0 0 1 0 1 0 0 v3 contents
- vmsbf.m v2, v3, v0.t
- 0 1 x x x x 1 1 v2 contents
+As a single scalar 32-bit instruction, up to 64 carry-propagation bits
+may be computed. When the output is then used as a Predicate mask it can
+be used to selectively perform the "add carry" of biginteger math, with
+`sv.addi/sm=rN RT.v, RA.v, 1`.
-The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
+* cprop RT,RA,RB
+* cprop. RT,RA,RB
pseudocode:
- def sbf(rd, rs1, rs2):
- rd = 0
- # start setting if no predicate or if 1st predicate bit set
- setting_mode = rs2 == x0 or (regs[rs2] & 1)
- while i < XLEN:
- bit = 1<<i
- if rs2 != x0 and (regs[rs2] & bit):
- # reset searching
- setting_mode = False
- if setting_mode:
- if regs[rs1] & bit: # found a bit in rs1: stop setting rd
- setting_mode = False
- else:
- regs[rd] |= bit
- else if rs2 != x0: # searching mode
- if (regs[rs2] & bit):
- setting_mode = True # back into "setting" mode
- i += 1
-
-## sifm
-
-The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
-
- sifm RT, RA, RB!=0
-
- # Example
-
- 7 6 5 4 3 2 1 0 Bit number
-
- 1 0 0 1 0 1 0 0 v3 contents
- vmsif.m v2, v3
- 0 0 0 0 0 1 1 1 v2 contents
-
- 1 0 0 1 0 1 0 1 v3 contents
- vmsif.m v2, v3
- 0 0 0 0 0 0 0 1 v2
-
- 1 1 0 0 0 0 1 1 RB vcontents
- 1 0 0 1 0 1 0 0 v3 contents
- vmsif.m v2, v3, v0.t
- 1 1 x x x x 1 1 v2 contents
-
-Pseudo-code:
-
- def sif(rd, rs1, rs2):
- rd = 0
- setting_mode = rs2 == x0 or (regs[rs2] & 1)
-
- while i < XLEN:
- bit = 1<<i
-
- # only reenable when predicate in use, and bit valid
- if !setting_mode && rs2 != x0:
- if (regs[rs2] & bit):
- # back into "setting" mode
- setting_mode = True
-
- # skipping mode
- if !setting_mode:
- # skip any more 1s
- if regs[rs1] & bit == 1:
- i += 1
- continue
-
- # setting mode, search for 1
- regs[rd] |= bit # always set during search
- if regs[rs1] & bit: # found a bit in rs1:
- setting_mode = False
- # next loop starts skipping
+ P = (RA)
+ G = (RB)
+ RT = ((P|G)+G)^P
- i += 1
+X-Form
+| 0.5|6.10|11.15|16.20| 21..30 |31| name | Form |
+| -- | -- | --- | --- | --------- |--| ---- | ------- |
+| NN | RT | RA | RB | 0110001110 |Rc| cprop | X-Form |
-## vmsof
+used not just for carry lookahead, also a special type of predication mask operation.
-The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
+* <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
+* <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
+* <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
+* <https://i.stack.imgur.com/QSLKY.png>
+* <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
+ `((P|G)+G)^P`
+* <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
- sofm RT, RA, RB
+From QLSKY.png:
-Example
+```
+ x0 = nand(CIn, P0)
+ C0 = nand(x0, ~G0)
- 7 6 5 4 3 2 1 0 Bit number
+ x1 = nand(CIn, P0, P1)
+ y1 = nand(G0, P1)
+ C1 = nand(x1, y1, ~G1)
- 1 0 0 1 0 1 0 0 v3 contents
- vmsof.m v2, v3
- 0 0 0 0 0 1 0 0 v2 contents
+ x2 = nand(CIn, P0, P1, P2)
+ y2 = nand(G0, P1, P2)
+ z2 = nand(G1, P2)
+ C1 = nand(x2, y2, z2, ~G2)
- 1 0 0 1 0 1 0 1 v3 contents
- vmsof.m v2, v3
- 0 0 0 0 0 0 0 1 v2
+ # Gen*
+ x3 = nand(G0, P1, P2, P3)
+ y3 = nand(G1, P2, P3)
+ z3 = nand(G2, P3)
+ G* = nand(x3, y3, z3, ~G3)
+```
- 1 1 0 0 0 0 1 1 RB vcontents
- 1 1 0 1 0 1 0 0 v3 contents
- vmsof.m v2, v3, v0.t
- 0 1 x x x x 0 0 v2 content
+```
+ P = (A | B) & Ci
+ G = (A & B)
+```
-Pseudo-code:
+Stackoverflow algorithm `((P|G)+G)^P` works on the cumulated bits of P and G from associated vector units (P and G are integers here). The result of the algorithm is the new carry-in which already includes ripple, one bit of carry per element.
- def sof(rd, rs1, rs2):
- rd = 0
- setting_mode = rs2 == x0 or (regs[rs2] & 1)
+```
+ At each id, compute C[id] = A[id]+B[id]+0
+ Get G[id] = C[id] > radix -1
+ Get P[id] = C[id] == radix-1
+ Join all P[id] together, likewise G[id]
+ Compute newC = ((P|G)+G)^P
+ result[id] = (C[id] + newC[id]) % radix
+```
- while i < XLEN:
- bit = 1<<i
+two versions: scalar int version and CR based version.
- # only reenable when predicate in use, and bit valid
- if !setting_mode && rs2 != x0:
- if (regs[rs2] & bit):
- # back into "setting" mode
- setting_mode = True
+scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument. the end bits go into XER.CA and CR0.ge
- # skipping mode
- if !setting_mode:
- # skip any more 1s
- if regs[rs1] & bit == 1:
- i += 1
- continue
+vector version takes CR0.so as carry in, stores in CR0.so and CR.ge end bits.
- # setting mode, search for 1
- if regs[rs1] & bit: # found a bit in rs1:
- regs[rd] |= bit # only set when search succeeds
- setting_mode = False
- # next loop starts skipping
+if zero (no propagation) then CR0.eq is zero
- i += 1
+CR based version, TODO.