+[[!tag standards]]
+
# SV Vector Operations.
-The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVC512 conflictd for example. This section includes such examples. Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
+TODO merge old standards page [[simple_v_extension/vector_ops/]]
+
+The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd, for example. This section includes such examples. Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors).
+
+Notes:
-However some of these actually could be added to a scalar ISA as bitmanipulation instructions. These are separated out into their own section.
-Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
+* Some of these actually could be added to a scalar ISA as bitmanipulation instructions. These are separated out into their own section.
+* Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
+* Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations. See [[sv/cr_int_predication]].
-.
Links:
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
## conflictd
-This is based on the AVX512 conflict detection instruction. Internally the logic is used to detect address conflicts in multi-issue LD/ST operations. Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion
+This is based on the AVX512 conflict detection instruction. Internally the logic is used to detect address conflicts in multi-issue LD/ST operations. Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion. The instruction may be used for histograms (computed in parallel).
input = [100, 100, 3, 100, 5, 100, 100, 3]
conflict result = [
When masked, only the bits not masked out are included in the count process.
- viota.m vd, vs2, vm
+ viota RT/v, RA, RB
+
+Note that RA=0 indicates a test against all 1s, so the instruction generates the vector sequence [0, 1, 2, ... VL-1]. This is equivalent to RVV vid.v, which here is a pseudo-op (viota with RA=0).
Example
1 1 1 5 1 7 1 0 v4 results
def iota(RT, RA, RB):
- mask = iregs[RB] # or if zero, all 1s.
+ mask = RB ? iregs[RB] : 0b111111...1
+ val = RA ? iregs[RA] : 0b111111...1
for i in range(VL):
+ if RA.scalar:
testmask = (1<<i)-1 # only count below
- to_test = iregs[RA] & testmask & mask
+ to_test = val & testmask & mask
iregs[RT+i] = popcount(to_test)
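
The scalar-RA case of the pseudocode above can be tried out with a minimal Python sketch. The register file is replaced by plain arguments, and `ra_val=None` / `mask=None` stand in for "register number is zero, use all 1s"; these parameter names are illustrative assumptions, not part of the specification.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")

def iota(vl, ra_val=None, mask=None):
    """Sketch of the scalar-RA iota: element i receives the number of set
    bits below bit i in (ra_val & mask).

    ra_val=None and mask=None model RA=0 / RB=0 (assumed here to mean
    "all 1s", as in the pseudocode above).
    """
    all_ones = (1 << vl) - 1
    val = all_ones if ra_val is None else ra_val
    msk = all_ones if mask is None else mask
    result = []
    for i in range(vl):
        testmask = (1 << i) - 1          # only count bits below i
        result.append(popcount(val & testmask & msk))
    return result

print(iota(8))                       # [0, 1, 2, 3, 4, 5, 6, 7]
print(iota(8, ra_val=0b10010001))    # [0, 1, 1, 1, 1, 2, 2, 2]
```

With no source value given, every lower bit counts, which is why the RA=0 form yields the element-index sequence.
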
-TODO: a Vector CR-based version of the same, due to CRs being used for predication.
+A Vector CR-based version of the same is also needed, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2],
+where bit 2 is inv and bits 0:1 select the bit of the CR.
+
+ def test_CR_bit(CR, BO):
+ return CR[BO[0:1]] == BO[2]
- def iotacr(RT, BA):
+ def iotacr(RT, BA, BO):
mask = get_src_predicate()
count = 0
for i in range(VL):
if mask & (1<<i) == 0: continue
iregs[RT+i] = count
- if test_CR_bit(CR[i+BA]):
+ if test_CR_bit(CR[i+BA], BO):
count += 1
+Regarding vidcr as a variant of iotacr: it is not appropriate to use BA=0 for this, and it is pointless to have it anyway, because the integer version covers it by not reading the int regfile at all.
+
# Scalar
These may all be viewed as suitable for fitting into a scalar bitmanip extension.
# next loop starts skipping
i += 1
+
+# Carry-lookahead
+
+Used not just for carry lookahead, but also as a special type of predication mask operation.
+
+* <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
+* <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
+* <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
+* <https://i.stack.imgur.com/QSLKY.png>
+* <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
+ `((P|G)+G)^P`
+* <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>
+
+From QSLKY.png:
+
+```
+ x0 = nand(CIn, P0)
+ C0 = nand(x0, ~G0)
+
+ x1 = nand(CIn, P0, P1)
+ y1 = nand(G0, P1)
+ C1 = nand(x1, y1, ~G1)
+
+ x2 = nand(CIn, P0, P1, P2)
+ y2 = nand(G0, P1, P2)
+ z2 = nand(G1, P2)
+ C2 = nand(x2, y2, z2, ~G2)
+
+ # Gen*
+ x3 = nand(G0, P1, P2, P3)
+ y3 = nand(G1, P2, P3)
+ z3 = nand(G2, P3)
+ G* = nand(x3, y3, z3, ~G3)
+```
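
The NAND network can be checked against the usual ripple recurrence `C[i] = G[i] | (P[i] & C[i-1])`. The Python sketch below is only a sanity check under that assumption: the function names are illustrative, the third carry is called `c2`, and the group generate `G*` is compared with the carry out of the fourth stage when the carry-in is zero.

```python
from itertools import product

def nand(*bits):
    """N-input NAND over 0/1 values."""
    return 0 if all(bits) else 1

def lookahead(cin, p, g):
    """Carries C0..C2 and group generate G* for one 4-bit block.
    p and g are sequences of four 0/1 values (P0..P3, G0..G3)."""
    inv = lambda b: 1 - b
    c0 = nand(nand(cin, p[0]), inv(g[0]))
    c1 = nand(nand(cin, p[0], p[1]), nand(g[0], p[1]), inv(g[1]))
    c2 = nand(nand(cin, p[0], p[1], p[2]), nand(g[0], p[1], p[2]),
              nand(g[1], p[2]), inv(g[2]))
    gstar = nand(nand(g[0], p[1], p[2], p[3]), nand(g[1], p[2], p[3]),
                 nand(g[2], p[3]), inv(g[3]))
    return c0, c1, c2, gstar

def ripple(cin, p, g):
    """Reference: carry out of each stage, computed one stage at a time."""
    carries = []
    c = cin
    for pi, gi in zip(p, g):
        c = gi | (pi & c)
        carries.append(c)
    return carries

# Exhaustive comparison over all 2**9 input combinations.
for cin, *pg in product((0, 1), repeat=9):
    p, g = pg[:4], pg[4:]
    c0, c1, c2, gstar = lookahead(cin, p, g)
    assert [c0, c1, c2] == ripple(cin, p, g)[:3]
    assert gstar == ripple(0, p, g)[3]
```
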
+
+```
+ P = (A | B) & Ci
+ G = (A & B)
+```
+
+The Stack Overflow algorithm `((P|G)+G)^P` works on the accumulated bits of P and G from the associated vector units (P and G are integers here). The result of the algorithm is the new carry-in, which already includes ripple: one bit of carry per element.
+
+```
+ At each id, compute C[id] = A[id]+B[id]+0
+ Get G[id] = C[id] > radix-1
+ Get P[id] = C[id] == radix-1
+ Join all P[id] together, likewise G[id]
+ Compute newC = ((P|G)+G)^P
+ result[id] = (C[id] + newC[id]) % radix
+```
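
The steps above can be sketched in Python for a quick check. Numbers are held as little-endian lists of limbs, P and G are packed into integers with one bit per limb, and there is no external carry-in. The helper names (`bigadd`, `to_limbs`, `from_limbs`) are hypothetical and exist only for this example.

```python
def to_limbs(value, radix, n):
    """Split a non-negative integer into n little-endian limbs."""
    return [(value // radix**i) % radix for i in range(n)]

def from_limbs(limbs, radix):
    """Inverse of to_limbs."""
    return sum(d * radix**i for i, d in enumerate(limbs))

def bigadd(a, b, radix):
    """Add two equal-length limb vectors using ((P|G)+G)^P.
    Returns (result limbs, carry out of the top limb)."""
    n = len(a)
    c = [x + y for x, y in zip(a, b)]            # per-limb sums, no carry
    g = sum(1 << i for i, s in enumerate(c) if s > radix - 1)   # generate
    p = sum(1 << i for i, s in enumerate(c) if s == radix - 1)  # propagate
    new_c = ((p | g) + g) ^ p                    # bit i = carry INTO limb i
    result = [(c[i] + ((new_c >> i) & 1)) % radix for i in range(n)]
    carry_out = (new_c >> n) & 1
    return result, carry_out

# 999 + 1 in radix 10: the carry ripples through two propagating limbs.
res, cout = bigadd(to_limbs(999, 10, 4), to_limbs(1, 10, 4), 10)
print(res, cout)    # [0, 0, 0, 1] 0
```

The integer addition `(P|G)+G` is what performs the ripple: a generating limb produces a carry into the next bit, and a run of propagating limbs passes it along, so no explicit shift is needed.
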
+
+Two versions: a scalar int version and a CR-based version.
+
+The scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument. The end bits go into XER.CA and CR0.ge.
+
+The vector version takes CR0.so as carry-in and stores the end bits in CR0.so and CR0.ge.
+
+If zero (no propagation) then CR0.eq is zero.
+
+CR-based version: TODO.