+[[!tag standards]]
+
# SV Vector Operations.
-The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVC512 conflictd for example. This section includes such examples. Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors)
+The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd, for example. This section includes such examples. Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors).
+
+Notes:
-However some of these actually could be added to a scalar ISA as bitmanipulation instructions. These are separated out into their own section.
-Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU)
+* Some of these could actually be added to a scalar ISA as bit-manipulation instructions. These are separated out into their own section.
+* Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU).
+* Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations. See [[sv/cr_int_predication]].
-.
Links:
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
+* <http://0x80.pl/notesen/2016-10-23-avx512-conflict-detection.html> conflictd example
* <https://bugs.libre-soc.org/show_bug.cgi?id=213>
* <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
  out of scope for this document
+* [[simple_v_extension/specification/bitmanip]] previous version,
+ contains pseudocode for sof, sif, sbf
# Vector
## conflictd
-This is based on the AVX512 conflict detection instruction. Internally the logic is used to detect address conflicts in LD/ST operations. Two arrays of indices are given.
+This is based on the AVX512 conflict detection instruction. Internally the logic is used to detect address conflicts in multi-issue LD/ST operations. Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion. The instruction may be used for histograms (computed in parallel).
+
+    input = [100, 100, 3, 100, 5, 100, 100, 3]
+    conflict result = [
+        0b00000000, // Note: first element always zero
+        0b00000001, // 100 is present on #0
+        0b00000000,
+        0b00000011, // 100 is present on #0 and #1
+        0b00000000,
+        0b00001011, // 100 is present on #0, #1, #3
+        0b00101011, // .. and #5
+        0b00000100  // 3 is present on #2
+    ]
+
+Pseudocode:
+
+    # bit j of result[i] is set when an earlier element j matches element i
+    for i in range(VL):
+        result[i] = 0
+        for j in range(i):
+            if src1[i] == src2[j]:
+                result[i] |= 1<<j
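The triangular comparison can be tried out with a short Python model. This is a sketch only: the function name `conflictd_model` and the use of plain lists are assumptions made for illustration, and the bit convention (bit j of result element i marks a match with earlier element j) is taken from the worked example above.

```python
def conflictd_model(src1, src2):
    """Illustrative model, not the instruction definition.

    Bit j of result[i] is set when src1[i] equals src2[j] for j < i.
    """
    result = [0] * len(src1)
    for i in range(len(src1)):
        for j in range(i):              # only earlier elements: triangular
            if src1[i] == src2[j]:
                result[i] |= 1 << j
    return result


values = [100, 100, 3, 100, 5, 100, 100, 3]
masks = conflictd_model(values, values)

# Histogram use: the number of set bits says how many times the value
# has already been seen earlier in the vector.
ranks = [bin(m).count("1") for m in masks]
```

Here `masks` comes out as `[0, 1, 0, 3, 0, 11, 43, 4]` and `ranks` as `[0, 1, 0, 2, 0, 3, 4, 1]`: the last occurrence of 100 has rank 4, so a histogram bucket for 100 would end at 5.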
## iota
-Based on RVV vmiota. vmiota may be viewed as a cumulative variant of cntlz, where instead of stopping at the first zero with a count to produce a single scalar result, the process continues on, producing another element at the next encounter of a 1.
+Based on RVV vmiota. vmiota may be viewed as a cumulative variant of popcount, generating multiple results. Successive iterations include more and more bits of the bitstream being tested.
+
+When masked, only the bits not masked out are included in the count process.
+
+ viota RT/v, RA, RB
+
+Note that RA=0 indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This is equivalent to RVV vid.v, which here is a pseudo-op (RA=0).
+
+Example
+
+     7 6 5 4 3 2 1 0   Element number
+
+     1 0 0 1 0 0 0 1   v2 contents
+                       viota.m v4, v2 # Unmasked
+     2 2 2 1 1 1 1 0   v4 result
+
+     1 1 1 0 1 0 1 1   v0 contents
+     1 0 0 1 0 0 0 1   v2 contents
+     2 3 4 5 6 7 8 9   v4 contents
+                       viota.m v4, v2, v0.t # Masked
+     1 1 1 5 1 7 1 0   v4 results
+
+ def iota(RT, RA, RB):
+ mask = RB ? iregs[RB] : 0b111111...1
+ val = RA ? iregs[RA] : 0b111111...1
+ for i in range(VL):
+ if RA.scalar:
+ testmask = (1<<i)-1 # only count below
+ to_test = val & testmask & mask
+ iregs[RT+i] = popcount(to_test)
+
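The worked example above can be reproduced with a small Python model. It is a sketch under stated assumptions: `iota_model`, `vl` and `dest` are hypothetical names, registers are modelled as plain integers and lists, and masked-out elements are assumed to keep their previous contents (as the masked example shows), which the pseudocode above does not spell out.

```python
def iota_model(val, mask=None, vl=8, dest=None):
    """Illustrative model of a cumulative popcount.

    Element i receives the number of set bits of val strictly below
    bit i, counting only bits that are enabled in mask.
    """
    if mask is None:
        mask = (1 << vl) - 1                 # no predicate: all enabled
    out = list(dest) if dest is not None else [0] * vl
    for i in range(vl):
        if not (mask >> i) & 1:
            continue                         # assumption: element left as-is
        below = (1 << i) - 1                 # only count bits below i
        out[i] = bin(val & mask & below).count("1")
    return out


unmasked = iota_model(0b10010001)
masked = iota_model(0b10010001, mask=0b11101011,
                    dest=[9, 8, 7, 6, 5, 4, 3, 2])
sequence = iota_model(0b11111111)            # all 1s, as for RA=0
```

Lists are indexed from element 0, so they read in the opposite order to the table: `unmasked` is `[0, 1, 1, 1, 1, 2, 2, 2]`, `masked` is `[0, 1, 7, 1, 5, 1, 1, 1]`, and `sequence` is `[0, 1, 2, 3, 4, 5, 6, 7]`.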
+A Vector CR-based version of the same is also needed, due to CRs being used for predication. This would use the same testing mechanism as branch: BO[0:2],
+where bit 2 is inv and bits 0:1 select the bit of the CR.
+
+    def test_CR_bit(CR, BO):
+        return CR[BO[0:1]] == BO[2]
+
+    def iotacr(RT, BA, BO):
+        mask = get_src_predicate()
+        count = 0
+        for i in range(VL):
+            if mask & (1<<i) == 0: continue
+            iregs[RT+i] = count
+            if test_CR_bit(CR[i+BA], BO):
+                count += 1
+
+There is no vidcr variant of iotacr: giving BA=0 a special meaning is not appropriate, and the variant would be pointless anyway. The integer version covers it, by not reading the int regfile at all.
# Scalar
The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
+Pseudocode:
+
+    def sbf(rd, rs1, rs2):
+        regs[rd] = 0
+        i = 0
+        # start setting if no predicate or if 1st predicate bit set
+        setting_mode = rs2 == x0 or (regs[rs2] & 1)
+        while i < XLEN:
+            bit = 1<<i
+            if rs2 != x0 and (regs[rs2] & bit):
+                # reset searching
+                setting_mode = False
+            if setting_mode:
+                if regs[rs1] & bit: # found a bit in rs1: stop setting rd
+                    setting_mode = False
+                else:
+                    regs[rd] |= bit
+            elif rs2 != x0: # searching mode
+                if (regs[rs2] & bit):
+                    setting_mode = True # back into "setting" mode
+            i += 1
+
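The prose description of vmsbf.m can be modelled in a few lines of Python. This sketch follows the RVV-style wording (inactive elements are simply skipped) rather than the re-enable behaviour of the pseudocode; `sbf_model` and `xlen` are hypothetical names, and inactive result bits are left at zero here purely for convenience.

```python
def sbf_model(src, mask=None, xlen=8):
    """Illustrative set-before-first over the active bits of src."""
    if mask is None:
        mask = (1 << xlen) - 1       # no predicate: every bit is active
    result = 0
    for i in range(xlen):
        bit = 1 << i
        if not mask & bit:
            continue                 # inactive element: not written here
        if src & bit:
            break                    # first active 1: it and the rest stay 0
        result |= bit                # active element before the first 1
    return result


before_first = sbf_model(0b10010100)                   # 0b00000011
first_at_zero = sbf_model(0b10010101)                  # 0b00000000
no_bits_set = sbf_model(0b00000000)                    # 0b11111111
predicated = sbf_model(0b10010100, mask=0b11000011)    # 0b01000011
```

The third call illustrates the last sentence of the description: with no set bit in the source, every active destination bit is written with a 1.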
+## sifm
+
+The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.
+
+ sifm RT, RA, RB!=0
+
+ # Example
+
+     7 6 5 4 3 2 1 0   Bit number
+
+     1 0 0 1 0 1 0 0   v3 contents
+                       vmsif.m v2, v3
+     0 0 0 0 0 1 1 1   v2 contents
+
+     1 0 0 1 0 1 0 1   v3 contents
+                       vmsif.m v2, v3
+     0 0 0 0 0 0 0 1   v2
+
+     1 1 0 0 0 0 1 1   RB contents
+     1 0 0 1 0 1 0 0   v3 contents
+                       vmsif.m v2, v3, v0.t
+     1 1 x x x x 1 1   v2 contents
+
+Pseudo-code:
+
+    def sif(rd, rs1, rs2):
+        regs[rd] = 0
+        i = 0
+        setting_mode = rs2 == x0 or (regs[rs2] & 1)
+
+        while i < XLEN:
+            bit = 1<<i
+
+            # only reenable when predicate in use, and bit valid
+            if not setting_mode and rs2 != x0:
+                if (regs[rs2] & bit):
+                    # back into "setting" mode
+                    setting_mode = True
+
+            # skipping mode: nothing is set until reenabled
+            if not setting_mode:
+                i += 1
+                continue
+
+            # setting mode, search for 1
+            regs[rd] |= bit # always set during search
+            if regs[rs1] & bit: # found a bit in rs1:
+                setting_mode = False
+                # next loop starts skipping
+
+            i += 1
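The three sifm examples above can be checked against a short Python model. As a sketch it assumes RVV-style masking (inactive elements are skipped and, for convenience, left at zero where the table shows `x`); `sif_model` and `xlen` are hypothetical names.

```python
def sif_model(src, mask=None, xlen=8):
    """Illustrative set-including-first over the active bits of src."""
    if mask is None:
        mask = (1 << xlen) - 1       # no predicate: every bit is active
    result = 0
    for i in range(xlen):
        bit = 1 << i
        if not mask & bit:
            continue                 # inactive element: not written here
        result |= bit                # set up to and including the first 1
        if src & bit:
            break                    # first active 1 found: stop
    return result


example_1 = sif_model(0b10010100)                     # 0b00000111
example_2 = sif_model(0b10010101)                     # 0b00000001
example_3 = sif_model(0b10010100, mask=0b11000011)    # 0b11000011
```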
-## vmsif
## vmsof
+The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.
+
+ sofm RT, RA, RB
+
+Example
+
+     7 6 5 4 3 2 1 0   Bit number
+
+     1 0 0 1 0 1 0 0   v3 contents
+                       vmsof.m v2, v3
+     0 0 0 0 0 1 0 0   v2 contents
+
+     1 0 0 1 0 1 0 1   v3 contents
+                       vmsof.m v2, v3
+     0 0 0 0 0 0 0 1   v2
+
+     1 1 0 0 0 0 1 1   RB contents
+     1 1 0 1 0 1 0 0   v3 contents
+                       vmsof.m v2, v3, v0.t
+     0 1 x x x x 0 0   v2 content
+
+Pseudo-code:
+
+    def sof(rd, rs1, rs2):
+        regs[rd] = 0
+        i = 0
+        setting_mode = rs2 == x0 or (regs[rs2] & 1)
+
+        while i < XLEN:
+            bit = 1<<i
+
+            # only reenable when predicate in use, and bit valid
+            if not setting_mode and rs2 != x0:
+                if (regs[rs2] & bit):
+                    # back into "setting" mode
+                    setting_mode = True
+
+            # skipping mode: nothing is set until reenabled
+            if not setting_mode:
+                i += 1
+                continue
+
+            # setting mode, search for 1
+            if regs[rs1] & bit: # found a bit in rs1:
+                regs[rd] |= bit # only set when search succeeds
+                setting_mode = False
+                # next loop starts skipping
+
+            i += 1
+
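In the unpredicated case these operations reduce to well-known scalar bit tricks, which is one reason they are candidates for a scalar bit-manipulation extension. The Python sketch below is illustrative only: the function names are hypothetical, the predicate is modelled RVV-style by ANDing it into the source, and an 8-bit width is assumed to match the examples.

```python
XLEN = 8
ALL = (1 << XLEN) - 1


def sof_bits(src, mask=ALL):
    """Set-only-first: isolate the lowest set bit among active bits."""
    active = src & mask
    return active & -active          # x & -x keeps only the lowest 1


def sif_bits(src):
    """Set-including-first (no predicate): ones up to and including it."""
    return (src ^ (src - 1)) & ALL


def sbf_bits(src):
    """Set-before-first (no predicate): ones strictly below the first 1."""
    return (~src & (src - 1)) & ALL


only_first = sof_bits(0b10010100)                     # 0b00000100
predicated = sof_bits(0b11010100, mask=0b11000011)    # 0b01000000
including = sif_bits(0b10010100)                      # 0b00000111
before = sbf_bits(0b10010100)                         # 0b00000011
```

Python integers are unbounded, so the final `& ALL` is what trims the results to the assumed register width; a source of zero gives all 1s for `sif_bits` and `sbf_bits`, and zero for `sof_bits`.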
+# Carry-lookahead
+
+Used not just for carry lookahead, but also as a special type of predication mask operation.
+
+* <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
+* <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
+* <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
+* <https://i.stack.imgur.com/QSLKY.png>
+* <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
+ `((P|G)+G)^P`
+
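The expression `((P|G)+G)^P` from the last link can be checked numerically. The sketch below makes several assumptions that are not stated on this page: P is the per-position propagate mask (`a ^ b`), G is the per-position generate mask (`a & b`), there is no incoming carry, and bit i of the result means "position i receives a carry". The function names are hypothetical.

```python
def carry_in_mask(p, g):
    """((P|G)+G)^P: bit i is set when position i receives a carry."""
    return ((p | g) + g) ^ p


def ripple_carry_in(a, b, width):
    """Reference model: propagate the carry one position at a time."""
    carry, mask = 0, 0
    for i in range(width):
        mask |= carry << i                       # carry entering position i
        abit, bbit = (a >> i) & 1, (b >> i) & 1
        carry = (abit & bbit) | ((abit ^ bbit) & carry)
    return mask | (carry << width)               # final carry-out


# Position 0 generates, positions 1 and 2 propagate:
# the carry ripples into positions 1, 2 and 3.
rippled = carry_in_mask(0b0110, 0b0001)          # 0b1110

# Exhaustive 4-bit comparison against the ripple reference.
agrees = all(
    carry_in_mask(a ^ b, a & b) == ripple_carry_in(a, b, 4)
    for a in range(16)
    for b in range(16)
)
```

The identity works because P and G never overlap, so `(P|G)+G` equals `P + 2*G`, which is just `a + b`; XORing with P strips the sum bits and leaves the carries.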
+Two versions: a scalar int version and a CR-based version.
+
+The scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument. The end bits go into XER.CA and CR0.ge.
+
+The vector version takes CR0.so as carry in, and stores the end bits in CR0.so and CR.ge.
+
+If zero (no propagation) then CR0.eq is zero.
+
+CR-based version: TODO.