[[!tag standards]]

# Scalar OpenPOWER Audio and Video Opcodes

The fundamental principle of SV is a hardware for-loop. Therefore the first (and in nearly 100% of cases the only) place to put Vector operations is in the *scalar* ISA. However, only by analysing those scalar opcodes *in* an SV Vectorisation context does it become clear why they are needed and how they may be designed.

This page therefore has an accompanying discussion at <https://bugs.libre-soc.org/show_bug.cgi?id=230> on the evolution of suitable opcodes.

Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=863> add pseudocode etc.
* <https://bugs.libre-soc.org/show_bug.cgi?id=234> hardware implementation
* <https://bugs.libre-soc.org/show_bug.cgi?id=910> mins/maxs zero-option?
* [[vpu]]
* [[sv/int_fp_mv]]
* [[openpower/isa/av]] pseudocode
* TODO review HP 1994-6 PA-RISC MAX <https://en.m.wikipedia.org/wiki/Multimedia_Acceleration_eXtensions>
* <https://en.m.wikipedia.org/wiki/Sum_of_absolute_differences>
* List of MMX instructions <https://cs.fit.edu/~mmahoney/cse3101/mmx.html>

# Summary

In advance, the summary of base scalar operations that need to be added is:

| instruction    | pseudocode                                             |
| -------------- | ------------------------------------------------------ |
| average-add    | result = (src1 + src2 + 1) >> 1                        |
| abs-diff       | result = abs(src1 - src2)                              |
| abs-accumulate | result += abs(src1 - src2)                             |
| (un)signed min | result = (src1 < src2) ? src1 : src2 (use bitmanip)    |
| (un)signed max | result = (src1 > src2) ? src1 : src2 (use bitmanip)    |
| bitwise sel    | (a ? b : c) - use [[sv/bitmanip]] ternary              |
| int/fp move    | covered by [[sv/int_fp_mv]]                            |

Implemented at the [[openpower/isa/av]] pseudocode page.

All other capabilities (saturate in particular) are achieved with [[sv/svp64]] modes and swizzle. Note that minmax and ternary are added in bitmanip.
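
As a rough illustration of the table above, the following Python sketch models the proposed scalar operations on unbounded integers and then applies one of them element by element, the way a hardware for-loop would. The function names (`avg_add`, `abs_diff`, `abs_accumulate`, `bitwise_sel`, `sv_loop`) are invented for this example and are not taken from the [[openpower/isa/av]] pseudocode. Element width, truncation and saturation are deliberately ignored.

```python
def avg_add(src1, src2):
    # average-add: the +1 rounds halves upwards
    return (src1 + src2 + 1) >> 1

def abs_diff(src1, src2):
    return abs(src1 - src2)

def abs_accumulate(result, src1, src2):
    # the destination is also a source (read-modify-write)
    return result + abs(src1 - src2)

def bitwise_sel(a, b, c):
    # per-bit (a ? b : c); this sketch assumes non-negative b and c
    return (a & b) | (~a & c)

def sv_loop(op, vl, src1, src2):
    # hypothetical model of the SV hardware for-loop:
    # the same scalar op is issued once per element, VL times
    return [op(src1[i], src2[i]) for i in range(vl)]

print(sv_loop(avg_add, 4, [1, 2, 3, 4], [2, 2, 8, 5]))  # [2, 2, 6, 5]
```

The point of the sketch is only that nothing vector-specific appears in the operations themselves: the loop is supplied separately.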

# Audio

The fundamental principle for these instructions is:

* identify the scalar primitive
* assume that longer runs of scalars will have Simple-V vectorisation applied
* assume that "swizzle" may be applied at the (vec2 - SUBVL=2) Vector level,
  (even if that involves a mv.swizzle which may be macro-op fused)
  in order to perform the necessary HI/LO selection normally hard-coded
  into SIMD ISAs.

Thus for example, where OpenPOWER VSX has vpkswss, this would be achieved in SV with simply:

* addition of a scalar ext/clamp instruction
* 1st op, swizzle-selection vec2 "select X only" from source to dest:
  dest.X = extclamp(src.X)
* 2nd op, swizzle-select vec2 "select Y only" from source to dest:
  dest.Y = extclamp(src.Y)

Macro-op fusion may be used to detect that these two interleave cleanly, overlapping the vec2.X with vec2.Y to produce a single vec2.XY operation.
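
The two-step vpkswss replacement described above can be sketched in Python as follows. This is a behavioural model only: `extclamp` stands in for the proposed scalar ext/clamp instruction (here saturating a signed value to 16 bits, as vpkswss does for word-to-halfword), and `pack_vec2` is a hypothetical helper that issues the "X only" and "Y only" operations as two separate passes over a list of vec2 elements.

```python
def extclamp(value, bits=16):
    # saturate a signed value into the range of a `bits`-wide signed integer
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

def pack_vec2(src):
    # src is a list of (x, y) pairs, i.e. SUBVL=2 elements
    dest = [[0, 0] for _ in src]
    for i, (x, _) in enumerate(src):   # 1st op: swizzle "select X only"
        dest[i][0] = extclamp(x)
    for i, (_, y) in enumerate(src):   # 2nd op: swizzle "select Y only"
        dest[i][1] = extclamp(y)
    return dest

print(pack_vec2([(70000, -5), (-70000, 32768)]))
# [[32767, -5], [-32768, 32767]]
```

Because the two passes write disjoint halves of each vec2, they can be overlapped without changing the result, which is the property the macro-op fusion relies on.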

## Scalar element operations

* clamping / saturation for signed and unsigned. Best done similarly to FP rounding modes, i.e. with an SPR.
* average-add: result = (src1 + src2 + 1) >> 1
* abs-diff: result = (src1 > src2) ? (src1-src2) : (src2-src1)
* signed min/max

# Video

TODO

* DCT <https://users.cs.cf.ac.uk/Dave.Marshall/Multimedia/node231.html>
* <https://www.nayuki.io/page/fast-discrete-cosine-transform-algorithms>
* Absolute-diff Accumulation, used in Motion Estimation

# VSX SIMD

Useful parts of VSX, and how they might map.

## vpks[\*][\*]s (vec_pack*)

Signed and unsigned, these are N-to-M (N=64/32/16, M=32/16/8) chop/clamp/sign/zero-extend operations. They may be implemented by a clamped move to a smaller elwidth.

The other direction, vec_unpack widening ops, may need some way to tell whether to sign-extend or zero-extend.

*scalar extsw/b/h gives one set, mv gives another. src elwidth override and dest elwidth override provide the pack/unpack*.

Implemented by [[sv/mv.vec]] RM Pack/Unpack mode as long as these instructions
have that RM Mode.

## vavgs\* (vec_avg)

Signed and unsigned, 8/16/32: these are all of the form:

    result = truncate((a + b + 1) >> 1)

*These do not exist in the scalar ISA and would need to be added. Essentially it is a type of post-processing involving the CA bit, so it could be included in the existing scalar pipeline ALU*
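
To show why the carry bit matters, here is a small Python sketch of the unsigned case. It assumes the sum `a + b + 1` is formed in a `width`-bit adder whose carry-out plays the role of CA; the name `avg_u` and the `width` parameter are illustrative, not part of any specification.

```python
def avg_u(a, b, width=8):
    mask = (1 << width) - 1
    total = (a & mask) + (b & mask) + 1   # may need width+1 bits
    carry = total >> width                # the bit that would land in CA
    low = total & mask                    # what a width-bit adder keeps
    # shift right by one, feeding the carry back in as the new top bit
    return (carry << (width - 1)) | (low >> 1)

print(avg_u(200, 100))              # 150
print(((200 + 100 + 1) & 0xFF) >> 1)  # 22: wrong, the carry was lost
```

The second line of output shows what happens if the intermediate sum is truncated to the element width before the shift, which is why the operation is a post-processing step on the adder output rather than a plain add followed by a shift.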

## vabsdu\* (vec_abs)

Unsigned 8/16/32: these are all of the form:

    result = (src1 > src2) ? truncate(src1-src2) :
                             truncate(src2-src1)

*These do not exist in the scalar ISA and would need to be added*

## abs-accumulate

Signed and unsigned variants needed:

    result += (src1 > src2) ? truncate(src1-src2) :
                              truncate(src2-src1)

*These do not exist in the scalar ISA and would need to be added*
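
Abs-accumulate applied over a run of elements gives the Sum of Absolute Differences used in motion estimation. The Python sketch below models the unsigned form; `abs_diff_u` and `sad` are names chosen for this example, and the accumulator is assumed to be wider than the elements (here simply an unbounded Python integer) so that it does not overflow.

```python
def abs_diff_u(src1, src2, width=8):
    mask = (1 << width) - 1
    src1 &= mask
    src2 &= mask
    # subtract the smaller from the larger so the result is never negative
    if src1 > src2:
        return (src1 - src2) & mask
    return (src2 - src1) & mask

def sad(block_a, block_b, width=8):
    result = 0
    for a, b in zip(block_a, block_b):
        result += abs_diff_u(a, b, width)   # one abs-accumulate per element
    return result

print(sad([10, 200, 7, 0], [12, 100, 7, 255]))  # 357
```

A lower result means the two pixel blocks are a closer match.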

## vmaxs\* / vmaxu\* (and min)

Signed and unsigned, 8/16/32: these are all of the form:

    result = (src1 > src2) ? src1 : src2 # max
    result = (src1 < src2) ? src1 : src2 # min

*These do not exist in the scalar INTEGER ISA and would need to be added*.
There are additionally no scalar FP min/max, either. These also
need to be added.

Also it makes sense for both the integer and FP variants
to have Rc=1 modes, where those modes are based on the
respective cmp (or fsel / isel) behaviour. In other words,
the Rc=1 setting is based on the *comparison* of the
two inputs, rather than on which of the two inputs was
returned by the min/max opcode.

    result = (src1 > src2) ? src1 : src2 # max
    CR0 = CR_compute(src2-src1) # not based on result
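
A minimal Python model of the Rc=1 idea follows. It is a sketch under assumptions: CR0 is reduced to a tuple of (LT, GT, EQ) bits with the SO bit left out, the comparison is done directly on the two inputs as a cmp instruction would, and the names `cr_compare` and `max_rc` are invented here.

```python
def cr_compare(src1, src2):
    # (LT, GT, EQ) as a cmp of src1 against src2 would set them
    return (int(src1 < src2), int(src1 > src2), int(src1 == src2))

def max_rc(src1, src2):
    result = src1 if src1 > src2 else src2
    cr0 = cr_compare(src1, src2)   # from the inputs, not from `result`
    return result, cr0

print(max_rc(3, 7))   # (7, (1, 0, 0))
print(max_rc(7, 3))   # (7, (0, 1, 0))
```

Both calls return the same result value, 7, yet CR0 differs, so a following branch or predicate can tell which input won. A conventional Rc=1 test of the result against zero would not carry that information.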

## vmerge operations

Their main point was to work around the odd/even multiplies. SV swizzles and mv.x should handle all cases.

These take two src vectors of various widths and splice them together. The best technique to cover these is a simple, straightforward predicated pair of mv operations, inverting the predicate in the second case, or, alternatively, a pair of vec2 (SUBVL=2) swizzled operations.

In the swizzle case the first instruction would be destvec2.X = srcvec2.X and the second would swizzle-select Y: destvec2.Y = srcvec2.Y. Macro-op fusion in both the predicated variant and the swizzle variant would interleave the two into the same SIMD backend ALUs (or macro-op fusion identifies the patterns).
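
The predicated variant can be modelled in a few lines of Python. This is a simplified sketch: `predicated_mv` and `merge` are hypothetical names, the predicate is a list of booleans rather than a register bitmask, and each element is copied to the same index it came from. The real vmerge instructions also offset into the high or low half of each source, which this model does not attempt.

```python
def predicated_mv(dest, src, predicate):
    # copy only the elements whose predicate bit is set
    for i, enabled in enumerate(predicate):
        if enabled:
            dest[i] = src[i]

def merge(src_a, src_b, predicate):
    dest = [0] * len(predicate)
    predicated_mv(dest, src_a, predicate)                         # 1st mv
    predicated_mv(dest, src_b, [not p for p in predicate])        # 2nd mv, inverted
    return dest

print(merge([1, 2, 3, 4], [10, 20, 30, 40], [True, False, True, False]))
# [1, 20, 3, 40]
```

Since the second predicate is the exact inverse of the first, every destination element is written exactly once, so the two operations never conflict.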

With twin predication the elwidth can be overridden on both src and dest such that either straight scalar mv or extsw/b/h can be used to provide the combinations of coverage needed, with only 2 actual instructions (plus vector prefixing).

See [[sv/mv.vec]] and [[sv/mv.swizzle]]

## Float estimates

    vec_expte - float 2^x
    vec_loge - float log2(x)
    vec_re - float 1/x
    vec_rsqrte - float 1/sqrt(x)

The spec says the max relative inaccuracy is 1/4096.

*In conjunction with the FPSPR "accuracy" bit, these could be done by assigning meaning to the "sat mode" SVP64 bits in a FP context. 0b00 is IEEE754 FP, 0b01 is 2^12 accuracy for FP32. These can be applied to standard scalar FP ops*

The other alternative is to use the "single precision" FP operations on a 32-bit elwidth override. As explained in [[sv/fcvt]] this halves the precision,
operating at FP16 accuracy but storing in a FP32 format.

## vec_madd(s) - FMA, multiply-add, optionally saturated

    a * b + c

*Standard scalar madd*

## vec_msum(s) - horizontal gather multiply-add, optionally saturated

This should be separated into a horizontal multiply and a horizontal add. How a horizontal operation would work in SV is TBD: how wide is it, etc.

    a.x + a.y + a.z ...
    a.x * a.y * a.z ...

*This would realistically need to be done with a loop doing a mapreduce sequence. I looked very early on at doing this type of operation and concluded it would be better done with a series of halvings each time, as separate instructions: VL=16 then VL=8 then 4 then 2 and finally one scalar, i.e. not an actual part of SV at all. An OoO multi-issue engine would be more than capable of dealing with the Dependencies.*
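
The series-of-halvings approach can be sketched in Python as below. The helper name `reduce_by_halving` is made up for this example, the register file is modelled as a plain list, and the element count is assumed to be a power of two. Each pass of the inner loop corresponds to one vector instruction issued at the given VL.

```python
import operator

def reduce_by_halving(regs, op):
    regs = list(regs)
    vl = len(regs)
    assert vl > 0 and vl & (vl - 1) == 0, "sketch assumes a power-of-two length"
    while vl > 1:
        vl //= 2
        # one vector op at this VL: fold the upper half onto the lower half
        for i in range(vl):
            regs[i] = op(regs[i], regs[i + vl])
    return regs[0]

print(reduce_by_halving(range(1, 33), operator.add))  # 528
print(reduce_by_halving(range(1, 9), operator.mul))   # 40320
```

For 32 input elements the passes run at VL=16, 8, 4, 2 and finally a single scalar operation, matching the sequence described above. Within one pass the element operations are independent of each other, which is what lets a multi-issue engine overlap them.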

That has the issue that it's a massive PITA to code, plus it's slow. Plus there's the "access to non-4-offset regs stalls". Even if there's no ready operation, it should be made easier and faster than a manual mapreduce loop.

--

As a mid-solution, 4-element gathers were discussed. 4 elements would also make them useful for pixel packing, not just the general vector gather. This is because OR and ADD are the same operation when bits don't overlap.

    gather-add: d = a.x + a.y + a.z + a.w
    gather-mul: d = a.x * a.y * a.z * a.w

But can the SV loop increment the src reg # by 4? Hmm.

The idea then leads to the opposite operation, a 1-to-4 bit scatter instruction. Like gather, it could be implemented with a normal loop, but it's faster for certain uses.

    bit-scatter dest, src, bits

    bit-scatter rd, rs, 8 # assuming source and dest are 32-bit wide
    rd = (rs >> 0 * 8) & (2^8 - 1)
    rd+1 = (rs >> 1 * 8) & (2^8 - 1)
    rd+2 = (rs >> 2 * 8) & (2^8 - 1)
    rd+3 = (rs >> 3 * 8) & (2^8 - 1)

So at the start you have an RGBA packed pixel in one 32-bit register; at the end you have each channel separated into its own register, in the low bits, and ANDed so only the relevant bits are there.
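
The bit-scatter pseudocode, and the observation that gather-add doubles as pixel packing, can be checked with a short Python sketch. The function names are illustrative, the four destination registers are modelled as a list, and the gather side assumes each channel has already been shifted into its own non-overlapping field.

```python
def bit_scatter(rs, bits, count=4):
    mask = (1 << bits) - 1
    # element i receives field i of rs, moved down to the low bits
    return [(rs >> (i * bits)) & mask for i in range(count)]

def gather_add(elements):
    d = 0
    for e in elements:
        d += e
    return d

pixel = 0xAABBCCDD
channels = bit_scatter(pixel, 8)
print([hex(c) for c in channels])   # ['0xdd', '0xcc', '0xbb', '0xaa']

# shift each channel back to its field, then ADD (same as OR here)
shifted = [c << (8 * i) for i, c in enumerate(channels)]
print(hex(gather_add(shifted)))     # 0xaabbccdd
```

Because the shifted fields never share a set bit position, no carries are generated and the addition behaves exactly like a bitwise OR.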

## vec_mul*

There should be both a same-width multiply and a widening multiply, in signed and unsigned versions, optionally saturated.

    u8 * u8 = u8
    u8 * u8 = u16

For 8,16,32,64, resulting in 8,16,32,64,128.

*All of these can be done with SV elwidth overrides, as long as the dest is no greater than 128. SV specifically does not do 128 bit arithmetic. Instead, vec2.X mul-lo followed by vec2.Y mul-hi can be macro-op fused to get at the full 128 bit internal result. Specifying e.g. src elwidth=8 and dest elwidth=16 will give a widening multiply*
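
The effect of the source and destination elwidth overrides on an unsigned multiply can be modelled as follows. This is a sketch only: `mul_elwidth` and `mul_lo_hi` are hypothetical helpers, signed and saturating variants are not covered, and the lo/hi split stands in for the mul-lo and mul-hi pair mentioned above.

```python
def mul_elwidth(a, b, src_width, dest_width):
    src_mask = (1 << src_width) - 1
    dest_mask = (1 << dest_width) - 1
    # operands are read at the source width, the product is
    # truncated to the destination width
    return ((a & src_mask) * (b & src_mask)) & dest_mask

def mul_lo_hi(a, b, width=64):
    mask = (1 << width) - 1
    full = (a & mask) * (b & mask)      # 2*width-bit internal result
    return full & mask, full >> width   # (mul-lo, mul-hi)

print(mul_elwidth(200, 200, 8, 8))    # 64: u8 * u8 = u8, upper bits lost
print(mul_elwidth(200, 200, 8, 16))   # 40000: u8 * u8 = u16, widening
```

With equal widths the upper half of the product is discarded. Doubling the destination width keeps it, which is all that "widening" means here.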

(Now added `madded` which is twin-half 64x64->HI64/LO64 in [[sv/biginteger]])

## vec_rl - rotate left

    (a << x) | (a >> (WIDTH - x))

*Standard scalar rlwinm*
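
For reference, the rotate formula can be written as runnable Python once the element width is made explicit. The `width` parameter and the reduction of the rotate amount modulo the width are choices made for this sketch, not something stated above.

```python
def rotl(a, x, width=32):
    mask = (1 << width) - 1
    a &= mask
    x %= width          # a rotate by `width` is the same as a rotate by 0
    return ((a << x) | (a >> (width - x))) & mask

print(hex(rotl(0x12345678, 8)))   # 0x34567812
print(hex(rotl(0x80000001, 1)))   # 0x3
```

The final mask is needed because Python integers are unbounded: without it, the bits shifted out of the top would remain in the result instead of being discarded.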

## vec_sel - bitwise select

    (a ? b : c)

*This does not exist in the scalar ISA and would need to be added*

Interesting operation: in Tom Forsyth's video on Larrabee, they added a logical ternary lookup table op, which can cover this and more. Similar to crops 2-2 bit lookup.

* <http://0x80.pl/articles/avx512-ternary-functions.html>
* <https://github.com/WojciechMula/ternary-logic/blob/master/py/show-function.py>
* [[sv/bitmanip]]
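
To make the lookup-table idea concrete, here is a Python sketch of a three-input bitwise LUT operation. The name `ternlog` and the index ordering (a as the most significant index bit, then b, then c) are assumptions made for this example, following the convention of the AVX-512 instruction discussed in the first link. The encoding chosen in [[sv/bitmanip]] may differ.

```python
def ternlog(a, b, c, imm8, width=64):
    result = 0
    for bit in range(width):
        # the three input bits at this position form a 3-bit table index
        idx = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1)
        # the indexed bit of the 8-bit immediate is the output bit
        result |= ((imm8 >> idx) & 1) << bit
    return result

SELECT = 0xCA   # truth table of (a ? b : c) under the ordering above
XOR3 = 0x96     # truth table of a ^ b ^ c

print(bin(ternlog(0b1100, 0b1010, 0b0101, SELECT, width=4)))  # 0b1001
print(bin(ternlog(0b1100, 0b1010, 0b0101, XOR3, width=4)))    # 0b11
```

One instruction with an 8-bit immediate therefore covers bitwise select and every other three-input boolean function, 256 in total.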

## vec_splat - scatter

Implemented using swizzle/predicate.

## vec_perm - permute

Implemented using swizzle, mv.x.

## vec_*c[tl]z, vec_popcnt - count leading/trailing zeroes, set bits

Bit counts.

    ctz - count trailing zeroes
    clz - count leading zeroes
    popcnt - count set bits

*These all exist in the scalar ISA*