[[!tag standards]]

# SV Vector Operations

Links:

* [[discussion]]
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-May/004884.html>
* <https://bugs.libre-soc.org/show_bug.cgi?id=865> implementation in simulator
* <https://bugs.libre-soc.org/show_bug.cgi?id=213>
* <https://bugs.libre-soc.org/show_bug.cgi?id=142> specialist vector ops
  out of scope for this document [[openpower/sv/3d_vector_ops]]
* [[simple_v_extension/specification/bitmanip]] previous version,
  contains pseudocode for sof, sif, sbf
* <https://en.m.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set#TBM_(Trailing_Bit_Manipulation)>

The core Power ISA was designed as scalar: SV provides a level of
abstraction that adds variable-length, element-independent parallelism.
There are therefore few cases where *actual* Vector instructions are
needed, and those that are needed act more as "assistance" functions.
Two traditional Vector instructions were initially considered
(conflictd and vmiota), however both may be synthesised from existing
SVP64 instructions: details in [[discussion]].

Notes:

* Instructions suited to 3D GPU workloads (dotproduct, crossproduct,
  normalise) are out of scope: this document covers the more
  general-purpose instructions that underpin, and are critical to,
  general-purpose Vector workloads (including GPU and VPU).
* Instructions that adapt CRs for use as predicate masks are covered
  separately, by the crweird operations. See [[sv/cr_int_predication]].

# Mask-suited Bitmanipulation

Based on RVV masked set-before-first, set-after-first etc., and on the
Intel and AMD Bitmanip instructions, generalised and then advanced
further to include masks, this is a single instruction covering 24
individual instructions in other ISAs.
*(sbf/sof/sif moved to [[discussion]])*

BM2-Form

|0     |6    |11    |16    |21-25|26|27..31|
|------|-----|------|------|-----|--|------|
| PO   |  RS |  RA  |  RB  |mode |L |   XO |

* bmask RT,RA,RB,mode,L

The patterns within the pseudocode for AMD TBM and x86 BMI1 are
as follows:

* first pattern A: `x / ~x`
* second pattern B: `| / & / ^`
* third pattern C: `x+1 / x-1 / ~(x+1) / -x`

Thus it makes sense to create a single instruction
that covers all of these.
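
As a rough illustration of how the three patterns combine, the sketch
below builds every `A op C` combination in Python and checks a few of
them against well-known BMI1 and TBM operations. The names used
(`PATTERN_A`, `combine` and so on) and the 64-bit width are assumptions
made for this example only: this is not the `bmask` mode encoding, and
the masking operand is left out entirely.

```python
# Hypothetical sketch: names and the 64-bit width are illustrative only,
# and this is not the bmask mode encoding.
MASK64 = (1 << 64) - 1

PATTERN_A = {"x": lambda x: x, "~x": lambda x: ~x}
PATTERN_B = {
    "|": lambda a, c: a | c,
    "&": lambda a, c: a & c,
    "^": lambda a, c: a ^ c,
}
PATTERN_C = {
    "x+1": lambda x: x + 1,
    "x-1": lambda x: x - 1,
    "~(x+1)": lambda x: ~(x + 1),
    "-x": lambda x: -x,
}


def combine(x, a, b, c):
    """Compute A(x) <op B> C(x), truncated to 64 bits."""
    return PATTERN_B[b](PATTERN_A[a](x), PATTERN_C[c](x)) & MASK64


# 2 choices for A, 3 for B, 4 for C
ALL_COMBINATIONS = [
    (a, b, c) for a in PATTERN_A for b in PATTERN_B for c in PATTERN_C
]

x = 0b10110000
blsr = combine(x, "x", "&", "x-1")     # BMI1 BLSR: clear lowest set bit
blsi = combine(x, "x", "&", "-x")      # BMI1 BLSI: isolate lowest set bit
blsmsk = combine(x, "x", "^", "x-1")   # BMI1 BLSMSK: mask up to lowest set bit
tzmsk = combine(x, "~x", "&", "x-1")   # TBM TZMSK: mask of trailing zeros
print(len(ALL_COMBINATIONS), bin(blsr), bin(blsi), bin(blsmsk), bin(tzmsk))
```

The three tables multiply out to 2 x 3 x 4 = 24 combinations, which is
where the count of 24 individual instructions comes from.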

Executable pseudocode demo:

```
[[!inline quick="yes" raw="yes" pages="openpower/sv/bmask.py"]]
```

# Carry-lookahead

A single scalar 32-bit instruction can compute up to 64
carry-propagation bits. When the output is then used as a Predicate
mask, it selectively performs the "add carry" step of biginteger math,
with `sv.addi/sm=rN RT.v, RA.v, 1`.

* cprop RT,RA,RB
* cprop. RT,RA,RB

Pseudocode:

    P = (RA)
    G = (RB)
    RT = ((P|G)+G)^P
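
The pseudocode translates almost directly into Python. The sketch below
is illustrative only: the function name `cprop` simply mirrors the
mnemonic, and truncating the result to 64 bits is an assumption of this
example, since the handling of the bit carried out of the top is not
stated here.

```python
XLEN = 64  # assumed register width


def cprop(ra, rb, xlen=XLEN):
    """Sketch of RT = ((P|G)+G)^P with P=(RA), G=(RB).

    Bit i of the result is the carry *into* position i. Truncating to
    xlen bits is an assumption of this sketch.
    """
    mask = (1 << xlen) - 1
    p = ra & mask
    g = rb & mask
    return (((p | g) + g) ^ p) & mask


# element 0 generates a carry, elements 1 and 2 propagate it onward
rt = cprop(0b0110, 0b0001)
print(bin(rt))  # 0b1110
```

In the example, the carry generated at position 0 arrives at positions
1, 2 and 3, so those are the bits set in the result.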

X-Form

| 0..5 |6..10|11..15|16..20| 21..30     |31| name  | Form   |
| ---- | --- | ---- | ---- | ---------- |--| ----- | ------ |
| NN   | RT  | RA   | RB   | 0110001110 |Rc| cprop | X-Form |

This is used not just for carry lookahead: it is also a special type of
predication mask operation.

* <https://www.geeksforgeeks.org/carry-look-ahead-adder/>
* <https://media.geeksforgeeks.org/wp-content/uploads/digital_Logic6.png>
* <https://electronics.stackexchange.com/questions/20085/whats-the-difference-with-carry-look-ahead-generator-block-carry-look-ahead-ge>
* <https://i.stack.imgur.com/QSLKY.png>
* <https://stackoverflow.com/questions/27971757/big-integer-addition-code>
  `((P|G)+G)^P`
* <https://en.m.wikipedia.org/wiki/Carry-lookahead_adder>

From QSLKY.png:

```
x0 = nand(CIn, P0)
C0 = nand(x0, ~G0)

x1 = nand(CIn, P0, P1)
y1 = nand(G0, P1)
C1 = nand(x1, y1, ~G1)

x2 = nand(CIn, P0, P1, P2)
y2 = nand(G0, P1, P2)
z2 = nand(G1, P2)
C2 = nand(x2, y2, z2, ~G2)

# Gen*
x3 = nand(G0, P1, P2, P3)
y3 = nand(G1, P2, P3)
z3 = nand(G2, P3)
G* = nand(x3, y3, z3, ~G3)
```
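
One way to sanity-check the gate equations is to evaluate them
exhaustively and compare against the usual ripple recurrence, where the
carry out of each bit is `G | (P & carry-in)`. The following Python
sketch does this for the three carry outputs and for the group-generate
output. The helper names (`nand`, `lookahead`, `ripple`) are made up
for this example, and the third carry output is called `c2` here.

```python
from itertools import product


def nand(*bits):
    """N-input NAND on 0/1 values."""
    return 0 if all(bits) else 1


def lookahead(cin, p, g):
    """Evaluate the NAND network for 4 P/G bit pairs (lists of 0/1)."""
    ng = [1 - b for b in g]  # inverted generate inputs (~G)
    c0 = nand(nand(cin, p[0]), ng[0])
    c1 = nand(nand(cin, p[0], p[1]), nand(g[0], p[1]), ng[1])
    c2 = nand(nand(cin, p[0], p[1], p[2]), nand(g[0], p[1], p[2]),
              nand(g[1], p[2]), ng[2])
    gen = nand(nand(g[0], p[1], p[2], p[3]), nand(g[1], p[2], p[3]),
               nand(g[2], p[3]), ng[3])
    return c0, c1, c2, gen


def ripple(cin, p, g):
    """Reference: carry-out of each bit is G | (P & carry-in)."""
    carries, c = [], cin
    for pi, gi in zip(p, g):
        c = gi | (pi & c)
        carries.append(c)
    return carries


mismatches = 0
for cin, *pg in product((0, 1), repeat=9):
    p, g = pg[:4], pg[4:]
    c0, c1, c2, gen = lookahead(cin, p, g)
    if [c0, c1, c2] != ripple(cin, p, g)[:3]:
        mismatches += 1
    if gen != ripple(0, p, g)[3]:  # group generate ignores the carry-in
        mismatches += 1
print("mismatches:", mismatches)
```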

```
P = (A | B) & Ci
G = (A & B)
```

The Stackoverflow algorithm `((P|G)+G)^P` works on the accumulated bits
of P and G from the associated vector units (P and G are integers
here). The result is the new carry-in, which already includes ripple:
one bit of carry per element.

```
At each id, compute C[id] = A[id]+B[id]+0
Get G[id] = C[id] > radix-1
Get P[id] = C[id] == radix-1
Join all P[id] together, likewise G[id]
Compute newC = ((P|G)+G)^P
result[id] = (C[id] + newC[id]) % radix
```
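
The steps above can be sketched in Python as follows. This is a minimal
illustration rather than the SVP64 implementation: `bigint_add`, its
little-endian list-of-limbs representation and the returned carry-out
are assumptions of this example. Bit `i` of `new_c` is the carry into
element `i`, and the bit just above the last element is the overall
carry-out.

```python
def bigint_add(a, b, radix):
    """Add two equal-length little-endian limb lists, limbs in [0, radix)."""
    n = len(a)
    # independent per-element sums, each with a carry-in of zero
    c = [x + y for x, y in zip(a, b)]
    # pack one Generate and one Propagate bit per element into integers
    g = sum(1 << i for i, s in enumerate(c) if s > radix - 1)
    p = sum(1 << i for i, s in enumerate(c) if s == radix - 1)
    # all carries, including ripple, in one integer operation
    new_c = ((p | g) + g) ^ p
    # apply each element's carry-in and wrap to the radix
    result = [(s + ((new_c >> i) & 1)) % radix for i, s in enumerate(c)]
    carry_out = (new_c >> n) & 1
    return result, carry_out


# 999 + 1 in radix 10: the carry ripples through every digit
digits, carry = bigint_add([9, 9, 9], [1, 0, 0], 10)
print(digits, carry)  # [0, 0, 0] 1
```

The only operation that touches more than one element is the single
integer computation of `new_c`; everything else is element-independent,
which is what makes the final step suitable for a predicated add.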

Two versions: a scalar int version and a CR based version.

The scalar int version acts as a scalar carry-propagate, reading XER.CA
as input, P and G as regs, and taking a radix argument. The end bits go
into XER.CA and CR0.ge.

The vector version takes CR0.so as carry-in, and stores the end bits in
CR0.so and CR.ge.

If zero (no propagation) then CR0.eq is zero.

CR based version: TODO.