# SV Vector Operations

* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
* <https://bugs.libre-soc.org/show_bug.cgi?id=213>
* <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
  out of scope for this document [[openpower/sv/3d_vector_ops]]
* [[simple_v_extension/specification/bitmanip]] previous version,
  contains pseudocode for sof, sif, sbf
* <https://en.m.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set#TBM_(Trailing_Bit_Manipulation)>

The core Power ISA was designed as a scalar ISA: SV provides a level of
abstraction that adds variable-length, element-independent parallelism.
There are therefore few cases where *actual* Vector instructions are
needed, and those that are needed act more as "assistance" functions.
Two traditional Vector instructions were initially considered
(conflictd and vmiota), but both may be synthesised from existing
SVP64 instructions: details in [[discussion]].

* Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document covers the more general-purpose instructions that underpin, and are critical to, general-purpose Vector workloads (including GPU and VPU).
* Instructions that adapt CRs for use as predicate masks are covered separately, by the crweird operations. See [[sv/cr_int_predication]].

    7 6 5 4 3 2 1 0   Bit index

    1 0 0 1 0 1 0 0   v3 contents
    0 0 0 0 0 0 1 1   v2 contents

    1 0 0 1 0 1 0 1   v3 contents

    0 0 0 0 0 0 0 0   v3 contents

    1 1 0 0 0 0 1 1   RB contents
    1 0 0 1 0 1 0 0   v3 contents
    0 1 x x x x 1 1   v2 contents

The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.

Executable pseudocode demo:

[[!inline quick="yes" raw="yes" pages="openpower/sv/sbf.py"]]
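
As a rough illustration of the set-before-first behaviour described above, the following Python sketch models the mask registers as plain integers. The function name `sbf_sketch`, the `width` and `old` parameters, and the choice to leave inactive elements at their previous destination value are assumptions made for this example; it is not the contents of `sbf.py`.

```python
def sbf_sketch(src, mask=None, width=8, old=0):
    """Set-before-first on integer bitmasks (illustrative sketch only).

    src   : source mask, bit 0 is element 0
    mask  : predicate; None means every element is active (assumption)
    width : number of elements (hypothetical parameter)
    old   : previous destination, kept for inactive elements (assumption)
    """
    if mask is None:
        mask = (1 << width) - 1
    result = old
    found = False
    for i in range(width):
        bit = 1 << i
        if not mask & bit:
            continue            # inactive element: destination untouched
        if src & bit:
            found = True        # first set active source bit reached
        if found:
            result &= ~bit      # this and all following active elements: 0
        else:
            result |= bit       # active elements before the first set bit: 1
    return result
```

With the inputs from the bit tables above, `sbf_sketch(0b10010100)` gives `0b00000011`, and an all-zero source gives `0b11111111`.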

The vector mask set-including-first instruction is similar to set-before-first, except it also includes the element with a set bit.

    7 6 5 4 3 2 1 0   Bit number

    1 0 0 1 0 1 0 0   v3 contents
    0 0 0 0 0 1 1 1   v2 contents

    1 0 0 1 0 1 0 1   v3 contents

    1 1 0 0 0 0 1 1   RB contents
    1 0 0 1 0 1 0 0   v3 contents
    1 1 x x x x 1 1   v2 contents

Executable pseudocode demo:

[[!inline quick="yes" raw="yes" pages="openpower/sv/sif.py"]]
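
Set-including-first can also be expressed without a loop, using the lowest-set-bit trick familiar from the x86 bit-manipulation extensions. The sketch below is a minimal illustration, not the contents of `sif.py`: the name `sif_sketch` and the `width` parameter are hypothetical, inactive elements are simply returned as 0, and the no-set-bit case is assumed to follow the set-before-first rule (all active elements set).

```python
def sif_sketch(src, mask=None, width=8):
    """Set-including-first on integer bitmasks (illustrative sketch only)."""
    full = (1 << width) - 1
    mask = full if mask is None else mask & full
    active = src & mask              # only active elements take part
    if active == 0:
        return mask                  # assumption: no set bit, all active set
    lowest = active & -active        # isolate the first set active bit
    # every bit up to and including the first set bit, active elements only
    return (lowest | (lowest - 1)) & mask
```

For example, `sif_sketch(0b10010100)` gives `0b00000111`, matching the first table above.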

The vector mask set-only-first instruction is similar to set-before-first, except it only sets the first element with a bit set, if any.

    7 6 5 4 3 2 1 0   Bit number

    1 0 0 1 0 1 0 0   v3 contents
    0 0 0 0 0 1 0 0   v2 contents

    1 0 0 1 0 1 0 1   v3 contents

    1 1 0 0 0 0 1 1   RB contents
    1 1 0 1 0 1 0 0   v3 contents
    0 1 x x x x 0 0   v2 contents

Executable pseudocode demo:

[[!inline quick="yes" raw="yes" pages="openpower/sv/sof.py"]]
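
Set-only-first reduces to isolating the lowest set bit among the active elements. The following sketch is one possible way to model it on integers. The name `sof_sketch` and the `width` parameter are hypothetical, inactive elements are returned as 0 rather than left unchanged, and it is not the contents of `sof.py`.

```python
def sof_sketch(src, mask=None, width=8):
    """Set-only-first on integer bitmasks (illustrative sketch only)."""
    full = (1 << width) - 1
    mask = full if mask is None else mask & full
    active = src & mask          # ignore source bits of inactive elements
    return active & -active      # lowest set bit only, or 0 if none is set
```

Using the predicated example from the table above, `sof_sketch(0b11010100, mask=0b11000011)` gives `0b01000000`: element 6 is the first active element with its bit set.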

Used not just for carry lookahead but also as a special type of predication mask operation.

* <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
* <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
* <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
* <https://i.stack.imgur.com/QSLKY.png>
* <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
* <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>

    x1 = nand(CIn, P0, P1)
    C1 = nand(x1, y1, ~G1)

    x2 = nand(CIn, P0, P1, P2)
    y2 = nand(G0, P1, P2)
    C2 = nand(x2, y2, z2, ~G2)

    x3 = nand(G0, P1, P2, P3)
    y3 = nand(G1, P2, P3)
    G* = nand(x3, y3, z3, ~G3)
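
To make the NAND-gate equations above easier to check, here is a small Python sketch of a 4-bit lookahead block. Several terms are not defined in the equations as shown, so `x0`, `C0`, `y1`, `z2` and `z3` are filled in here by following the visible pattern and should be treated as assumptions, as should the function name `lookahead_sketch` and the name `C2` for the third carry.

```python
def nand(*bits):
    """N-input NAND on 0/1 values."""
    return 0 if all(bits) else 1


def lookahead_sketch(cin, p, g):
    """4-bit carry lookahead from NAND gates (illustrative sketch only).

    p, g: sequences of four 0/1 values (P0..P3 and G0..G3).
    Returns (C0, C1, C2, Gstar) where Gstar is the group generate.
    """
    P0, P1, P2, P3 = p
    G0, G1, G2, G3 = g
    inv = lambda b: b ^ 1                 # single-bit NOT

    x0 = nand(cin, P0)                    # assumed term
    C0 = nand(x0, inv(G0))                # assumed term

    x1 = nand(cin, P0, P1)
    y1 = nand(G0, P1)                     # assumed term
    C1 = nand(x1, y1, inv(G1))

    x2 = nand(cin, P0, P1, P2)
    y2 = nand(G0, P1, P2)
    z2 = nand(G1, P2)                     # assumed term
    C2 = nand(x2, y2, z2, inv(G2))

    x3 = nand(G0, P1, P2, P3)
    y3 = nand(G1, P2, P3)
    z3 = nand(G2, P3)                     # assumed term
    Gstar = nand(x3, y3, z3, inv(G3))
    return C0, C1, C2, Gstar
```

By De Morgan, each output is the usual sum-of-products form, for example `C1 = G1 | (G0 & P1) | (CIn & P0 & P1)`, so the results can be compared against a simple ripple of `c = G[i] | (P[i] & c)`.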

The Stack Overflow algorithm `((P|G)+G)^P` works on the accumulated bits of P and G from the associated vector units (P and G are integers here). Its result is the new carry-in, with ripple already included: one bit of carry per element.

    At each id, compute C[id] = A[id]+B[id]+0
    Get G[id] = C[id] > radix-1
    Get P[id] = C[id] == radix-1
    Join all P[id] together, likewise G[id]
    Compute newC = ((P|G)+G)^P
    result[id] = (C[id] + newC[id]) % radix
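
The steps above can be sketched in Python as follows, with the operands held as little-endian lists of limbs. The function name `bigint_add_sketch`, the `carry_in` argument and the returned carry-out are assumptions added for illustration; the steps themselves only describe a carry-in of 0. The trick relies on P and G never being set for the same limb, so `(P|G)+G` equals `P + (G<<1)`: each generate bit lands on the next limb up and ripples through any run of propagate bits.

```python
def bigint_add_sketch(a, b, radix=256, carry_in=0):
    """Limb-parallel big-integer add (illustrative sketch only).

    a, b: equal-length lists of limbs, least significant first,
          each limb in range(radix).
    Returns (result_limbs, carry_out).
    """
    assert len(a) == len(b)
    n = len(a)
    # Independent per-limb sums: no carry passes between limbs here.
    c = [x + y for x, y in zip(a, b)]
    # Pack generate and propagate flags, one bit per limb.
    g = sum(1 << i for i, s in enumerate(c) if s > radix - 1)
    p = sum(1 << i for i, s in enumerate(c) if s == radix - 1)
    # Bit i of new_c is the carry into limb i; bit n is the carry out.
    new_c = ((p | g) + g + carry_in) ^ p
    result = [(c[i] + ((new_c >> i) & 1)) % radix for i in range(n)]
    carry_out = (new_c >> n) & 1
    return result, carry_out
```

For example, `bigint_add_sketch([0xFF, 0xFF, 0x00], [0x01, 0x00, 0x00])` returns `([0, 0, 1], 0)`, which is 0xFFFF + 1 = 0x10000 in base 256.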

Two versions: a scalar int version and a CR-based version.

The scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument. The end bits go into XER.CA and CR0.ge.

The vector version takes CR0.so as carry in, and stores the end bits in CR0.so and CR.ge.

If zero (no propagation) then CR0.eq is zero.

CR-based version: TODO.