From cd2ab2c86cf4141b39353fa47f9b66d47d7977bf Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Mon, 20 Jun 2022 17:34:19 +0100
Subject: [PATCH] add copy of vector_ops page as discussion

---
 openpower/sv/vector_ops/discussion.mdwn | 302 ++++++++++++++++++++++++
 1 file changed, 302 insertions(+)
 create mode 100644 openpower/sv/vector_ops/discussion.mdwn

diff --git a/openpower/sv/vector_ops/discussion.mdwn b/openpower/sv/vector_ops/discussion.mdwn
new file mode 100644
index 000000000..9efe21825
--- /dev/null
+++ b/openpower/sv/vector_ops/discussion.mdwn
@@ -0,0 +1,302 @@
+[[!tag standards]]
+
+# SV Vector Operations.
+
+Links:
+
+*
+* conflictd example
+*
+*
+* specialist vector ops
+  out of scope for this document [[openpower/sv/3d_vector_ops]]
+* [[simple_v_extension/specification/bitmanip]] previous version,
+  contains pseudocode for sof, sif, sbf
+
+The core OpenPOWER ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism. However, certain classes of instructions only make sense in a Vector context: AVX512 conflictd for example. This section includes such examples. Many of them are from the RISC-V Vector ISA (with thanks to the efforts of RVV's contributors).
+
+Notes:
+
+* Some of these actually could be added to a scalar ISA as bit-manipulation instructions. These are separated out into their own section.
+* Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document is for more general-purpose instructions that underpin and are critical to general-purpose Vector workloads (including GPU and VPU).
+* Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations. See [[sv/cr_int_predication]].
+
+# Vector
+
+Both of these instructions may be synthesised from SVP64 Vector
+instructions.  conflictd is an O(N^2) instruction based on
+`sv.cmpi` and iota is an O(N) instruction based on `sv.addi`
+with the appropriate predication.
+
+## conflictd
+
+This is based on the AVX512 conflict detection instruction. Internally the logic is used to detect address conflicts in multi-issue LD/ST operations. Two arrays of values are given: the indices are compared and duplicates reported in a triangular fashion. The instruction may be used for histograms (computed in parallel).
+
+    input = [100, 100, 3, 100, 5, 100, 100, 3]
+    conflict result = [
+         0b00000000,    // Note: first element always zero
+         0b00000001,    // 100 is present on #0
+         0b00000000,
+         0b00000011,    // 100 is present on #0 and #1
+         0b00000000,
+         0b00001011,    // 100 is present on #0, #1, #3
+         0b00101011,    // .. and #5
+         0b00000100     // 3 is present on #2
+    ]
+
+Pseudocode:
+
+    for i in range(VL):
+        for j in range(i):
+            if src1[i] == src2[j]:
+                result[i] |= 1<<j
+
+*
+*
+
+## iota
+
+Based on RVV vmiota. vmiota may be viewed as a cumulative variant of popcount, generating multiple results. Successive iterations include more and more bits of the bitstream being tested.
+
+When masked, only the bits not masked out are included in the count process.
+
+    viota RT/v, RA, RB
+
+Note that when RA=0 this indicates to test against all 1s, resulting in the instruction generating a vector sequence [0, 1, 2... VL-1]. This is equivalent to RVV vid.v, which here is a pseudo-op (RA=0).
+
+Example
+
+    7 6 5 4 3 2 1 0   Element number
+
+    1 0 0 1 0 0 0 1   v2 contents
+                      viota.m v4, v2 # Unmasked
+    2 2 2 1 1 1 1 0   v4 result
+
+    1 1 1 0 1 0 1 1   v0 contents
+    1 0 0 1 0 0 0 1   v2 contents
+    2 3 4 5 6 7 8 9   v4 contents
+                      viota.m v4, v2, v0.t # Masked
+    1 1 1 5 1 7 1 0   v4 results
+
+    def iota(RT, RA, RB):
+        mask = RB ? iregs[RB] : 0b111111...1
+        val = RA ? iregs[RA] : 0b111111...1
+        for i in range(VL):
+            if RA.scalar:
+                testmask = (1<
+*
+*
+*
+*
+  `((P|G)+G)^P`
+*
+
+From QLSKY.png:
+
+```
+    x0 = nand(CIn, P0)
+    C0 = nand(x0, ~G0)
+
+    x1 = nand(CIn, P0, P1)
+    y1 = nand(G0, P1)
+    C1 = nand(x1, y1, ~G1)
+
+    x2 = nand(CIn, P0, P1, P2)
+    y2 = nand(G0, P1, P2)
+    z2 = nand(G1, P2)
+    C2 = nand(x2, y2, z2, ~G2)
+
+    # Gen*
+    x3 = nand(G0, P1, P2, P3)
+    y3 = nand(G1, P2, P3)
+    z3 = nand(G2, P3)
+    G* = nand(x3, y3, z3, ~G3)
+```
+
+```
+    P = (A | B) & Ci
+    G = (A & B)
+```
+
+The Stackoverflow algorithm `((P|G)+G)^P` works on the accumulated bits of P and G from the associated vector units (P and G are integers here). The result of the algorithm is the new carry-in, which already includes ripple, with one bit of carry per element.
+
+```
+    At each id, compute C[id] = A[id]+B[id]+0
+    Get G[id] = C[id] > radix -1
+    Get P[id] = C[id] == radix-1
+    Join all P[id] together, likewise G[id]
+    Compute newC = ((P|G)+G)^P
+    result[id] = (C[id] + newC[id]) % radix
+```
+
+Two versions: a scalar int version and a CR-based version.
+
+The scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument. The end bits go into XER.CA and CR0.ge.
+
+The vector version takes CR0.so as carry-in and stores the end bits in CR0.so and CR.ge.
+
+If zero (no propagation) then CR0.eq is zero.
+
+CR-based version, TODO.
-- 
2.30.2
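
The six `((P|G)+G)^P` steps listed in the patch can be sketched in Python as below. This is a minimal illustration and not the proposed instruction. It assumes little-endian digit vectors held as plain lists and a carry-in of zero, and the names `vector_carry_add` and `digits_to_int` are hypothetical helpers introduced only for this example. Because an element cannot both generate and propagate, `(P|G)+G` equals `P + (G<<1)`, so bit `i` of the result is the carry into element `i` and bit `VL` is the overall carry-out.

```python
def vector_carry_add(a, b, radix):
    """Add two equal-length little-endian digit vectors.

    Illustrative sketch only: the list-of-ints layout and the function
    name are assumptions. Returns (result_digits, carry_out).
    """
    if len(a) != len(b):
        raise ValueError("operand vectors must have the same length")
    n = len(a)
    # Step 1: independent per-element sums with a carry-in of zero.
    c = [a[i] + b[i] for i in range(n)]
    # Steps 2-4: join the generate (G) and propagate (P) flags into
    # integers, one bit per element.
    g = 0
    p = 0
    for i in range(n):
        if c[i] > radix - 1:
            g |= 1 << i
        elif c[i] == radix - 1:
            p |= 1 << i
    # Step 5: bit i of new_c is the carry into element i, ripple included;
    # bit n is the carry out of the whole vector.
    new_c = ((p | g) + g) ^ p
    # Step 6: apply the carry-in to each element, reduce modulo the radix.
    result = [(c[i] + ((new_c >> i) & 1)) % radix for i in range(n)]
    carry_out = (new_c >> n) & 1
    return result, carry_out


def digits_to_int(digits, radix):
    """Convert a little-endian digit vector to a Python int (for checking)."""
    value = 0
    for d in reversed(digits):
        value = value * radix + d
    return value


if __name__ == "__main__":
    # 1999 + 1 in radix 10, least significant digit first.
    digits, carry = vector_carry_add([9, 9, 9, 1], [1, 0, 0, 0], 10)
    print(digits, carry)  # [0, 0, 0, 2] 0
```

Packing P and G into ordinary integers lets a single scalar add perform the ripple across all elements, which is why the per-element work in steps 1 and 6 remains fully parallel.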