working version, need to document limits in precision
[libreriscv.git] / openpower / sv / twin_butterfly.mdwn
1 # Introduction
2
3 <!-- hide -->
4 * <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
5 * <https://libre-soc.org/openpower/sv/biginteger/> for format and
6 information about implicit RS/FRS
7 * <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
8 * [[openpower/isa/svfparith]]
9 * [[openpower/isa/svfixedarith]]
10 * [[openpower/sv/rfc/ls016]]
11 <!-- show -->
12
13 Although best used with SVP64 REMAP these instructions may be used in a Scalar-only
14 context to save considerably on DCT, DFT and FFT processing. Whilst some hardware
15 implementations may not necessarily implement them efficiently (slower Micro-coding)
16 savings still come from the reduction in temporary registers as well as instruction
17 count.
18
19 # Rationale for Twin Butterfly Integer DCT Instruction(s)
20
21 The number of general-purpose uses for DCT is huge. The number of
22 instructions needed instead of these Twin-Butterfly instructions is also
23 huge (**eight**) and given that it is extremely common to explicitly
24 loop-unroll them quantity hundreds to thousands of instructions are
25 dismayingly common (for all ISAs).
26
27 The goal is to implement instructions that calculate the expression:
28
29 ```
30 fdct_round_shift((a +/- b) * c)
31 ```
32
33 For the single-coefficient butterfly instruction, and:
34
35 ```
36 fdct_round_shift(a * c1 +/- b * c2)
37 ```
38
39 For the double-coefficient butterfly instruction.
40
41 In a 32-bit context `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`
42
43 ```
44 #define ROUND_POWER_OF_TWO(value, n) \
45 (((value) + (1 << ((n)-1))) >> (n))
46 ```
47
48 These instructions are at the core of **ALL** FDCT calculations in many
49 major video codecs, including -but not limited to- VP8/VP9, AV1, etc.
50 ARM includes special instructions to optimize these operations, although
51 they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.
52
53 The suggestion is to have a single instruction to calculate both values
54 `((a + b) * c) >> N`, and `((a - b) * c) >> N`. The instruction will
55 run in accumulate mode, so in order to calculate the 2-coeff version
56 one would just have to call the same instruction with different order a,
57 b and a different constant c.
58
59 Example taken from libvpx
60 <https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:
61
62 ```
63 #include <stdint.h>
64 #define ROUND_POWER_OF_TWO(value, n) \
65 (((value) + (1 << ((n)-1))) >> (n))
66 void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
67 t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
68 t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
69 }
70 ```
71
72 8 instructions are required - replaced by just the one (maddsubrs):
73
74 ```
75 add 9,5,4
76 subf 5,5,4
77 mullw 9,9,6
78 mullw 5,5,6
79 addi 9,9,8192
80 addi 5,5,8192
81 srawi 9,9,14
82 srawi 5,5,14
83 ```
84
85 -------
86
87 \newpage{}
88
89 ## Integer Butterfly Multiply Add/Sub FFT/DCT
90
91 **Add the following to Book I Section 3.3.9.1**
92
93 A-Form
94
95 ```
96 |0 |6 |11 |16 |21 |26 |31 |
97 | PO | RT | RA | RB | SH | XO |Rc |
98 ```
99
100 * maddsubrs RT,RA,SH,RB
101
102 Pseudo-code:
103
104 ```
105 n <- SH
106 sum <- (RT) + (RA)
107 diff <- (RT) - (RA)
108 prod1 <- MULS(RB, sum)
109 prod1_lo <- prod1[XLEN:(XLEN*2)-1]
110 prod2 <- MULS(RB, diff)
111 prod2_lo <- prod2[XLEN:(XLEN*2)-1]
112 if n = 0 then
113 RT <- prod1_lo
114 RS <- prod2_lo
115 else
116 round <- [0]*XLEN
117 round[XLEN -n] <- 1
118 prod1_lo <- prod1_lo + round
119 prod2_lo <- prod2_lo + round
120 m <- MASK(n, (XLEN-1))
121 res1 <- ROTL64(prod1_lo, XLEN-n) & m
122 res2 <- ROTL64(prod2_lo, XLEN-n) & m
123 signbit1 <- prod1_lo[0]
124 signbit2 <- prod2_lo[0]
125 smask1 <- ([signbit1]*XLEN) & ¬m
126 smask2 <- ([signbit2]*XLEN) & ¬m
127 RT <- (res1 | smask1)
128 RS <- (res2 | smask2)
129 ```
130
131 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`
132
133 Similar to `RTp`, this instruction produces an implicit result, `RS`,
134 which under Scalar circumstances is defined as `RT+1`. For SVP64 if
135 `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
136 the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
137
138 Special Registers Altered:
139
140 ```
141 None
142 ```
143
144 -------
145
146 \newpage{}
147
148 # Twin Butterfly Floating-Point DCT and FFT Instruction(s)
149
150 **Add the following to Book I Section 4.6.6.3**
151
152 ## Floating-Point Twin Multiply-Add DCT [Single]
153
154 X-Form
155
156 ```
157 |0 |6 |11 |16 |21 |31 |
158 | PO | FRT | FRA | FRB | XO |Rc |
159 ```
160
161 * fdmadds FRT,FRA,FRB (Rc=0)
162
163 Pseudo-code:
164
165 ```
166 FRS <- FPADD32(FRT, FRB)
167 sub <- FPSUB32(FRT, FRB)
168 FRT <- FPMUL32(FRA, sub)
169 ```
170
171 The two IEEE754-FP32 operations
172
173 ```
174 FRS <- [(FRT) + (FRB)]
175 FRT <- [(FRT) - (FRB)] * (FRA)
176 ```
177
178 are simultaneously performed.
179
180 The Floating-Point operand in register FRT is added to the floating-point
181 operand in register FRB and the result stored in FRS.
182
183 Using the exact same operand input register values from FRT and FRB
184 that were used to create FRS, the Floating-Point operand in register
185 FRB is subtracted from the floating-point operand in register FRT and
186 the result then rounded before being multiplied by FRA to create an
187 intermediate result that is stored in FRT.
188
189 The add into FRS is treated exactly as `fadds`. The creation of the
190 result FRT is **not** the same as that of `fmsubs`, but is instead as if
191 `fsubs` were performed first followed by `fmuls`. The creation of FRS
192 and FRT are treated as parallel independent operations which occur at
193 the same time.
194
195 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`
196
197 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
198 which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
199 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
200 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
201
202 Special Registers Altered:
203
204 ```
205 FPRF FR FI
206 FX OX UX XX
207 VXSNAN VXISI VXIMZ
208 ```
209
210 ## Floating-Point Multiply-Add FFT [Single]
211
212 X-Form
213
214 ```
215 |0 |6 |11 |16 |21 |31 |
216 | PO | FRT | FRA | FRB | XO |Rc |
217 ```
218
219 * ffmadds FRT,FRA,FRB (Rc=0)
220
221 Pseudo-code:
222
223 ```
224 FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
225 FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
226 ```
227
228 The two operations
229
230 ```
231 FRS <- -([(FRT) * (FRA)] - (FRB))
232 FRT <- [(FRT) * (FRA)] + (FRB)
233 ```
234
235 are performed.
236
237 The floating-point operand in register FRT is multiplied by the
238 floating-point operand in register FRA. The floating-point operand in
239 register FRB is added to this intermediate result, and the intermediate
240 stored in FRS.
241
242 Using the exact same values of FRT, FRT and FRB as used to create
243 FRS, the floating-point operand in register FRT is multiplied by the
244 floating-point operand in register FRA. The floating-point operand
245 in register FRB is subtracted from this intermediate result, and the
246 intermediate stored in FRT.
247
248 FRT is created as if a `fmadds` operation had been performed. FRS is
249 created as if a `fnmsubs` operation had simultaneously been performed
250 with the exact same register operands, in parallel, independently,
251 at exactly the same time.
252
253 FRT is a Read-Modify-Write operation.
254
255 Note that if Rc=1 an Illegal Instruction is raised.
256 Rc=1 is `RESERVED`
257
258 Similar to `FRTp`, this instruction produces an implicit result,
259 `FRS`, which under Scalar circumstances is defined as `FRT+1`.
260 For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
261 Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
262 (Max Vector Length).
263
264 Special Registers Altered:
265
266 ```
267 FPRF FR FI
268 FX OX UX XX
269 VXSNAN VXISI VXIMZ
270 ```
271
272 ## Floating-Point Twin Multiply-Add DCT
273
274 X-Form
275
276 ```
277 |0 |6 |11 |16 |21 |31 |
278 | PO | FRT | FRA | FRB | XO |Rc |
279 ```
280
281 * fdmadd FRT,FRA,FRB (Rc=0)
282
283 Pseudo-code:
284
285 ```
286 FRS <- FPADD64(FRT, FRB)
287 sub <- FPSUB64(FRT, FRB)
288 FRT <- FPMUL64(FRA, sub)
289 ```
290
291 The two IEEE754-FP64 operations
292
293 ```
294 FRS <- [(FRT) + (FRB)]
295 FRT <- [(FRT) - (FRB)] * (FRA)
296 ```
297
298 are simultaneously performed.
299
300 The Floating-Point operand in register FRT is added to the floating-point
301 operand in register FRB and the result stored in FRS.
302
303 Using the exact same operand input register values from FRT and FRB
304 that were used to create FRS, the Floating-Point operand in register
305 FRB is subtracted from the floating-point operand in register FRT and
306 the result then rounded before being multiplied by FRA to create an
307 intermediate result that is stored in FRT.
308
309 The add into FRS is treated exactly as `fadd`. The creation of the
310 result FRT is **not** the same as that of `fmsub`, but is instead as if
311 `fsub` were performed first followed by `fmuls. The creation of FRS
312 and FRT are treated as parallel independent operations which occur at
313 the same time.
314
315 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`
316
317 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
318 which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
319 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
320 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
321
322 Special Registers Altered:
323
324 ```
325 FPRF FR FI
326 FX OX UX XX
327 VXSNAN VXISI VXIMZ
328 ```
329
330 ## Floating-Point Twin Multiply-Add FFT
331
332 X-Form
333
334 ```
335 |0 |6 |11 |16 |21 |31 |
336 | PO | FRT | FRA | FRB | XO |Rc |
337 ```
338
339 * ffmadd FRT,FRA,FRB (Rc=0)
340
341 Pseudo-code:
342
343 ```
344 FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
345 FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
346 ```
347
348 The two operations
349
350 ```
351 FRS <- -([(FRT) * (FRA)] - (FRB))
352 FRT <- [(FRT) * (FRA)] + (FRB)
353 ```
354
355 are performed.
356
357 The floating-point operand in register FRT is multiplied by the
358 floating-point operand in register FRA. The float- ing-point operand in
359 register FRB is added to this intermediate result, and the intermediate
360 stored in FRS.
361
362 Using the exact same values of FRT, FRT and FRB as used to create
363 FRS, the floating-point operand in register FRT is multiplied by the
364 floating-point operand in register FRA. The float- ing-point operand
365 in register FRB is subtracted from this intermediate result, and the
366 intermediate stored in FRT.
367
368 FRT is created as if a `fmadd` operation had been performed. FRS is
369 created as if a `fnmsub` operation had simultaneously been performed
370 with the exact same register operands, in parallel, independently,
371 at exactly the same time.
372
373 FRT is a Read-Modify-Write operation.
374
375 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`
376
377 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
378 which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
379 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
380 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
381
382 Special Registers Altered:
383
384 ```
385 FPRF FR FI
386 FX OX UX XX
387 VXSNAN VXISI VXIMZ
388 ```
389
390
391 ## Floating-Point Add FFT/DCT [Single]
392
393 A-Form
394
395 ```
396 |0 |6 |11 |16 |21 |26 |31 |
397 | PO | FRT | FRA | FRB | / | XO |Rc |
398 ```
399
400 * ffadds FRT,FRA,FRB (Rc=0)
401
402 Pseudo-code:
403
404 ```
405 FRT <- FPADD32(FRA, FRB)
406 FRS <- FPSUB32(FRB, FRA)
407 ```
408
409 Special Registers Altered:
410
411 ```
412 FPRF FR FI
413 FX OX UX XX
414 VXSNAN VXISI
415 ```
416
417 ## Floating-Point Add FFT/DCT [Double]
418
419 A-Form
420
421 ```
422 |0 |6 |11 |16 |21 |26 |31 |
423 | PO | FRT | FRA | FRB | / | XO |Rc |
424 ```
425
426 * ffadd FRT,FRA,FRB (Rc=0)
427
428 Pseudo-code:
429
430 ```
431 FRT <- FPADD64(FRA, FRB)
432 FRS <- FPSUB64(FRB, FRA)
433 ```
434
435 Special Registers Altered:
436
437 ```
438 FPRF FR FI
439 FX OX UX XX
440 VXSNAN VXISI
441 ```
442
443 ## Floating-Point Subtract FFT/DCT [Single]
444
445 A-Form
446
447 ```
448 |0 |6 |11 |16 |21 |26 |31 |
449 | PO | FRT | FRA | FRB | / | XO |Rc |
450 ```
451
452 * ffsubs FRT,FRA,FRB (Rc=0)
453
454 Pseudo-code:
455
456 ```
457 FRT <- FPSUB32(FRB, FRA)
458 FRS <- FPADD32(FRA, FRB)
459 ```
460
461 Special Registers Altered:
462
463 ```
464 FPRF FR FI
465 FX OX UX XX
466 VXSNAN VXISI
467 ```
468
469 ## Floating-Point Subtract FFT/DCT [Double]
470
471 A-Form
472
473 ```
474 |0 |6 |11 |16 |21 |26 |31 |
475 | PO | FRT | FRA | FRB | / | XO |Rc |
476 ```
477
478 * ffsub FRT,FRA,FRB (Rc=0)
479
480 Pseudo-code:
481
482 ```
483 FRT <- FPSUB64(FRB, FRA)
484 FRS <- FPADD64(FRA, FRB)
485 ```
486
487 Special Registers Altered:
488
489 ```
490 FPRF FR FI
491 FX OX UX XX
492 VXSNAN VXISI
493 ```