[[!tag standards]]

# SV Vector Operations

Links:

* [[discussion]]
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
* <https://bugs.libre-soc.org/show_bug.cgi?id=213>
* <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
  out of scope for this document [[openpower/sv/3d_vector_ops]]
* [[simple_v_extension/specification/bitmanip]] previous version,
  contains pseudocode for sof, sif, sbf
* <https://en.m.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set#TBM_(Trailing_Bit_Manipulation)>

The core Power ISA was designed as scalar: SV provides a level of abstraction to add variable-length element-independent parallelism.
Therefore there are not that many cases where *actual* Vector
instructions are needed. Where they are, they act more as "assistance"
functions. Two traditional Vector instructions were initially
considered (conflictd and vmiota); however, they may be synthesised
from existing SVP64 instructions: details in [[discussion]].

Notes:

* Instructions suited to 3D GPU workloads (dotproduct, crossproduct, normalise) are out of scope: this document covers the more general-purpose instructions that underpin, and are critical to, general-purpose Vector workloads (including GPU and VPU).
* Instructions related to the adaptation of CRs for use as predicate masks are covered separately, by crweird operations. See [[sv/cr_int_predication]].

# sbfm

    sbfm RT, RA, RB!=0

Example

    7 6 5 4 3 2 1 0   Bit index

    1 0 0 1 0 1 0 0   v3 contents
                      vmsbf.m v2, v3
    0 0 0 0 0 0 1 1   v2 contents

    1 0 0 1 0 1 0 1   v3 contents
                      vmsbf.m v2, v3
    0 0 0 0 0 0 0 0   v2 contents

    0 0 0 0 0 0 0 0   v3 contents
                      vmsbf.m v2, v3
    1 1 1 1 1 1 1 1   v2 contents

    1 1 0 0 0 0 1 1   RB contents
    1 0 0 1 0 1 0 0   v3 contents
                      vmsbf.m v2, v3, v0.t
    0 1 x x x x 1 1   v2 contents

The vmsbf.m instruction takes a mask register as input and writes results to a mask register. The instruction writes a 1 to all active mask elements before the first source element that is a 1, then writes a 0 to that element and all following active elements. If there is no set bit in the source vector, then all active elements in the destination are written with a 1.
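
As a rough illustration of the behaviour just described, here is a minimal Python sketch of a masked set-before-first on plain integers. The name `sbf_sketch`, the 8-bit default width, and the choice to write 0 into inactive positions (shown as `x` in the example) are assumptions made for illustration only; this is not the sbf.py pseudocode.

```python
def sbf_sketch(src, mask=None, width=8):
    """Set-before-first: 1 in every active bit below the first active set bit of src.

    Hypothetical helper for illustration; inactive bits are returned as 0.
    """
    if mask is None:
        mask = (1 << width) - 1  # unmasked: every element is active
    result = 0
    for i in range(width):
        bit = 1 << i
        if not (mask & bit):
            continue  # inactive element: left as 0 in this sketch
        if src & bit:
            break  # first active set bit: it and all later elements stay 0
        result |= bit  # active element before the first set bit
    return result
```

With the masked example above, `sbf_sketch(0b10010100, mask=0b11000011)` gives `0b01000011`, matching `0 1 x x x x 1 1` with the `x` positions read as 0.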

Executable pseudocode demo:

```
[[!inline quick="yes" raw="yes" pages="openpower/sv/sbf.py"]]
```

# sifm

The vector mask set-including-first instruction is similar to set-before-first, except that it also includes the first element with a set bit.

    sifm RT, RA, RB!=0

Example

    7 6 5 4 3 2 1 0   Bit number

    1 0 0 1 0 1 0 0   v3 contents
                      vmsif.m v2, v3
    0 0 0 0 0 1 1 1   v2 contents

    1 0 0 1 0 1 0 1   v3 contents
                      vmsif.m v2, v3
    0 0 0 0 0 0 0 1   v2 contents

    1 1 0 0 0 0 1 1   RB contents
    1 0 0 1 0 1 0 0   v3 contents
                      vmsif.m v2, v3, v0.t
    1 1 x x x x 1 1   v2 contents
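
Set-including-first can also be written without a loop, in the style of the x86 bit-manipulation idioms linked at the top of this page. The sketch below assumes plain Python integers, a hypothetical helper name `sif_sketch`, an 8-bit default width, and 0 written to inactive positions; it is an illustration, not the sif.py pseudocode.

```python
def sif_sketch(src, mask=None, width=8):
    """Set-including-first: 1 in every active bit up to and including
    the first active set bit of src (hypothetical illustrative helper)."""
    full = (1 << width) - 1
    if mask is None:
        mask = full  # unmasked: every element is active
    active = src & mask  # only active elements can be the "first set bit"
    # x ^ (x - 1) sets the lowest set bit of x and every bit below it.
    # For x == 0, Python yields 0 ^ -1 == -1, which truncates to all ones.
    return (active ^ (active - 1)) & mask & full
```

For the masked example, `sif_sketch(0b10010100, mask=0b11000011)` returns `0b11000011`.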

Executable pseudocode demo:

```
[[!inline quick="yes" raw="yes" pages="openpower/sv/sif.py"]]
```

# sofm

The vector mask set-only-first instruction is similar to set-before-first, except that it sets only the first element with a bit set, if any.

    sofm RT, RA, RB

Example

    7 6 5 4 3 2 1 0   Bit number

    1 0 0 1 0 1 0 0   v3 contents
                      vmsof.m v2, v3
    0 0 0 0 0 1 0 0   v2 contents

    1 0 0 1 0 1 0 1   v3 contents
                      vmsof.m v2, v3
    0 0 0 0 0 0 0 1   v2 contents

    1 1 0 0 0 0 1 1   RB contents
    1 1 0 1 0 1 0 0   v3 contents
                      vmsof.m v2, v3, v0.t
    0 1 x x x x 0 0   v2 contents
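
Set-only-first reduces to isolating the lowest set bit among the active elements. The following is a minimal sketch under the same assumptions as before: plain Python integers, a hypothetical name `sof_sketch`, an 8-bit default width, and inactive positions returned as 0.

```python
def sof_sketch(src, mask=None, width=8):
    """Set-only-first: 1 only in the first active set bit of src, if any
    (hypothetical illustrative helper)."""
    full = (1 << width) - 1
    if mask is None:
        mask = full  # unmasked: every element is active
    active = src & mask  # ignore set bits in inactive elements
    # x & -x isolates the lowest set bit of x; it is 0 when x == 0.
    return active & -active & full
```

For the masked example, `sof_sketch(0b11010100, mask=0b11000011)` returns `0b01000000`.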

Executable pseudocode demo:

```
[[!inline quick="yes" raw="yes" pages="openpower/sv/sof.py"]]
```

# Carry-lookahead

Used not just for carry lookahead, but also as a special type of predication mask operation.

* <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
* <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
* <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
* <https://i.stack.imgur.com/QSLKY.png>
* <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
  `((P|G)+G)^P`
* <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>

From QSLKY.png:

```
x0 = nand(CIn, P0)
C0 = nand(x0, ~G0)

x1 = nand(CIn, P0, P1)
y1 = nand(G0, P1)
C1 = nand(x1, y1, ~G1)

x2 = nand(CIn, P0, P1, P2)
y2 = nand(G0, P1, P2)
z2 = nand(G1, P2)
C2 = nand(x2, y2, z2, ~G2)

# Gen*
x3 = nand(G0, P1, P2, P3)
y3 = nand(G1, P2, P3)
z3 = nand(G2, P3)
G* = nand(x3, y3, z3, ~G3)
```
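
The NAND network above can be checked exhaustively against ordinary addition. The sketch below is illustrative only: it assumes the textbook per-bit definitions `G = A & B` and `P = A | B`, models `nand` and `inv` as small Python helpers, and uses the hypothetical names `lookahead` and `check_all`. The third carry output of the network is returned as `c2`.

```python
from itertools import product


def nand(*bits):
    """N-input NAND on 0/1 values."""
    return 0 if all(bits) else 1


def inv(bit):
    """Single-bit inverter, standing in for ~G in the equations."""
    return 1 - bit


def lookahead(cin, p, g):
    """Return (c0, c1, c2, gstar) from 4-element lists of P and G bits."""
    c0 = nand(nand(cin, p[0]), inv(g[0]))
    c1 = nand(nand(cin, p[0], p[1]), nand(g[0], p[1]), inv(g[1]))
    c2 = nand(nand(cin, p[0], p[1], p[2]), nand(g[0], p[1], p[2]),
              nand(g[1], p[2]), inv(g[2]))
    gstar = nand(nand(g[0], p[1], p[2], p[3]), nand(g[1], p[2], p[3]),
                 nand(g[2], p[3]), inv(g[3]))
    return c0, c1, c2, gstar


def check_all():
    """Compare with integer addition for every 4-bit A, B and carry-in."""
    for a, b, cin in product(range(16), range(16), (0, 1)):
        p = [((a | b) >> i) & 1 for i in range(4)]
        g = [((a & b) >> i) & 1 for i in range(4)]
        c0, c1, c2, gstar = lookahead(cin, p, g)
        for i, carry in enumerate((c0, c1, c2)):
            low = (1 << (i + 1)) - 1  # bits 0..i
            # carry out of bit i when adding the low i+1 bits plus cin
            if carry != ((a & low) + (b & low) + cin) >> (i + 1):
                return False
        # group generate: carry out of all four bits with carry-in of 0
        if gstar != (a + b) >> 4:
            return False
    return True
```

Under these assumptions `check_all()` returns `True`, which is one way to confirm that each output is the usual sum-of-products carry term in NAND-NAND form.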

```
P = (A | B) & Ci
G = (A & B)
```

The Stackoverflow algorithm `((P|G)+G)^P` works on the accumulated P and G bits from the associated vector units: P and G are integers here, holding one bit per element. The result of the algorithm is the new carry-in, which already includes ripple: one bit of carry per element.

```
At each id, compute C[id] = A[id]+B[id]+0
Get G[id] = C[id] > radix-1
Get P[id] = C[id] == radix-1
Join all P[id] together, likewise G[id]
Compute newC = ((P|G)+G)^P
result[id] = (C[id] + newC[id]) % radix
```
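
The steps above can be sketched in Python as follows. The names `bigint_add`, `a`, `b` and `radix` are illustrative; the sketch assumes limbs are stored least-significant first and that bit `id` of the joined P and G integers corresponds to element `id`. It is a model of the algorithm, not of any proposed instruction.

```python
def bigint_add(a, b, radix=2**64):
    """Add two equal-length little-endian limb lists.

    Returns (result_limbs, carry_out). Hypothetical illustrative helper.
    """
    n = len(a)
    # Per-element sum with a carry-in of 0
    c = [x + y for x, y in zip(a, b)]
    # Join the per-element generate and propagate flags into integers
    g = sum(1 << i for i in range(n) if c[i] > radix - 1)
    p = sum(1 << i for i in range(n) if c[i] == radix - 1)
    # Bit i of new_c is the carry into element i, ripple included
    new_c = ((p | g) + g) ^ p
    result = [(c[i] + ((new_c >> i) & 1)) % radix for i in range(n)]
    # Bit n of new_c is the carry out of the top element
    carry_out = (new_c >> n) & 1
    return result, carry_out
```

For example, with `radix=10`, `bigint_add([9, 9, 9], [1, 0, 0], radix=10)` gives `([0, 0, 0], 1)`, i.e. 999 + 1 = 1000: here G is `0b001`, P is `0b110`, and the formula yields `0b1110`.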

Two versions: scalar int version and CR based version.

The scalar int version acts as a scalar carry-propagate, reading XER.CA as input, P and G as regs, and taking a radix argument. The end bits go into XER.CA and CR0.ge.

The vector version takes CR0.so as carry in, and stores the end bits in CR0.so and CR.ge.

If zero (no propagation) then CR0.eq is zero.

CR based version, TODO.