fix wrong paste of pseudo-code
[libreriscv.git] / openpower / sv / twin_butterfly.mdwn
1 # Introduction
2
3 <!-- hide -->
4 * <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
5 * <https://libre-soc.org/openpower/sv/biginteger/> for format and
6 information about implicit RS/FRS
7 * <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
8 * [[openpower/isa/svfparith]]
9 * [[openpower/isa/svfixedarith]]
10 * [[openpower/sv/rfc/ls016]]
11 <!-- show -->
12
13 Although best used with SVP64 REMAP these instructions may be used in a Scalar-only
14 context to save considerably on DCT, DFT and FFT processing. Whilst some hardware
15 implementations may not necessarily implement them efficiently (slower Micro-coding)
16 savings still come from the reduction in temporary registers as well as instruction
17 count.
18
19 # Rationale for Twin Butterfly Integer DCT Instruction(s)
20
21 The number of general-purpose uses for DCT is huge. The number of
22 instructions needed instead of these Twin-Butterfly instructions is also
23 huge (**eight**) and given that it is extremely common to explicitly
24 loop-unroll them quantity hundreds to thousands of instructions are
25 dismayingly common (for all ISAs).
26
27 The goal is to implement instructions that calculate the expression:
28
29 ```
30 fdct_round_shift((a +/- b) * c)
31 ```
32
33 For the single-coefficient butterfly instruction, and:
34
35 ```
36 fdct_round_shift(a * c1 +/- b * c2)
37 ```
38
39 For the double-coefficient butterfly instruction.
40
41 In a 32-bit context `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`
42
43 ```
44 #define ROUND_POWER_OF_TWO(value, n) \
45 (((value) + (1 << ((n)-1))) >> (n))
46 ```
47
48 These instructions are at the core of **ALL** FDCT calculations in many
49 major video codecs, including -but not limited to- VP8/VP9, AV1, etc.
50 ARM includes special instructions to optimize these operations, although
51 they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.
52
53 The suggestion is to have a single instruction to calculate both values
54 `((a + b) * c) >> N`, and `((a - b) * c) >> N`. The instruction will
55 run in accumulate mode, so in order to calculate the 2-coeff version
56 one would just have to call the same instruction with different order a,
57 b and a different constant c.
58
59 Example taken from libvpx
60 <https://chromium.googlesource.com/webm/libvpx/+/refs/tags/v1.13.0/vpx_dsp/fwd_txfm.c#132>:
61
62 ```
63 #include <stdint.h>
64 #define ROUND_POWER_OF_TWO(value, n) \
65 (((value) + (1 << ((n)-1))) >> (n))
66 void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
67 t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
68 t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
69 }
70 ```
71
72 8 instructions are required - replaced by just the one (maddsubrs):
73
74 ```
75 add 9,5,4
76 subf 5,5,4
77 mullw 9,9,6
78 mullw 5,5,6
79 addi 9,9,8192
80 addi 5,5,8192
81 srawi 9,9,14
82 srawi 5,5,14
83 ```
84
85 -------
86
87 \newpage{}
88
89 ## Integer Butterfly Multiply Add/Sub FFT/DCT
90
91 **Add the following to Book I Section 3.3.9.1**
92
93 A-Form
94
95 ```
96 |0 |6 |11 |16 |21 |26 |31 |
97 | PO | RT | RA | RB | SH | XO |Rc |
98 ```
99
100 * maddsubrs RT,RA,SH,RB
101
102 Pseudo-code:
103
104 ```
105 n <- SH
106 sum <- (RT) + (RA)
107 diff <- (RT) - (RA)
108 prod1 <- MULS(RB, sum)
109 prod2 <- MULS(RB, diff)
110 if n = 0 then
111 prod1_lo <- prod1[XLEN:(XLEN*2) - 1]
112 prod2_lo <- prod2[XLEN:(XLEN*2) - 1]
113 RT <- prod1_lo
114 RS <- prod2_lo
115 else
116 round <- [0]*(XLEN*2)
117 round[XLEN*2 - n] <- 1
118 prod1 <- prod1 + round
119 prod2 <- prod2 + round
120 m <- MASK(XLEN - n - 2, XLEN - 1)
121 res1 <- prod1[XLEN - n:XLEN*2 - n - 1]
122 res2 <- prod2[XLEN - n:XLEN*2 - n - 1]
123 signbit1 <- prod1[0]
124 signbit2 <- prod2[0]
125 smask1 <- ([signbit1]*XLEN) & ¬m
126 smask2 <- ([signbit2]*XLEN) & ¬m
127 RT <- (res1 | smask1)
128 RS <- (res2 | smask2)
129 ```
130
131 Similar to `RTp`, this instruction produces an implicit result, `RS`,
132 which under Scalar circumstances is defined as `RT+1`. For SVP64 if
133 `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
134 the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
135
136 Special Registers Altered:
137
138 ```
139 None
140 ```
141
142 # [DRAFT] Integer Butterfly Multiply Add/Sub and Accumulate FFT/DCT
143
144 A-Form
145
146 * maddrs RT,RA,SH,RB
147
148 Pseudo-code:
149
150 ```
151 n <- SH
152 prod <- MULS(RB, RA)
153 if n = 0 then
154 prod_lo <- prod[XLEN:(XLEN*2) - 1]
155 RT <- (RT) + prod_lo
156 RS <- (RS) - prod_lo
157 else
158 res1[0:XLEN*2-1] <- (EXTSXL((RT)[0], 1) || (RT)) + prod
159 res2[0:XLEN*2-1] <- (EXTSXL((RS)[0], 1) || (RS)) - prod
160 round <- [0]*XLEN*2
161 round[XLEN*2 - n] <- 1
162 res1 <- res1 + round
163 res2 <- res2 + round
164 signbit1 <- res1[0]
165 signbit2 <- res2[0]
166 m <- MASK(XLEN -n - 2, XLEN - 1)
167 res1 <- res1[XLEN - n:XLEN*2 - n -1]
168 res2 <- res2[XLEN - n:XLEN*2 - n -1]
169 smask1 <- ([signbit1]*XLEN) & ¬m
170 smask2 <- ([signbit2]*XLEN) & ¬m
171 RT <- (res1 | smask1)
172 RS <- (res2 | smask2)
173 ```
174
175 Special Registers Altered:
176
177 None
178
179 Similar to `RTp`, this instruction produces an implicit result, `RS`,
180 which under Scalar circumstances is defined as `RT+1`. For SVP64 if
181 `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
182 the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
183
184 This instruction is supposed to be used in complement to the maddsubrs
185 to produce the double-coefficient butterfly instruction. In order for that
186 to work, instead of passing c2 as coefficient, we have to pass c2-c1 instead.
187
188 In essence, we are calculating the quantity `a * c1 +/- b * c1` first, with
189 `maddsubrs` *without* shifting (so `SH=0`) and then we add/sub `b * (c2-c1)`
190 from the previous `RT`/`RS`, and *then* do the shifting.
191
192 In the following example, assume `a` in `R1`, `b` in `R10`, `c1` in `R11` and `c2 - c1` in `R12`.
193 The first instruction will put `a * c1 + b * c1` in `R1` (`RT`), `a * c1 - b * c1` in `RS`
194 (here, `RS = RT +1`, so `R2`).
195 Then, `maddrs` will add `b * (c2 - c1)` to `R1` (`RT`), and subtract it from `R2` (`RS`), and then
196 round shift right both quantities 14 bits:
197
198 ```
199 maddsubrs 1,10,0,11
200 maddrs 1,10,14,12
201 ```
202
203 In scalar code, that would take ~16 instructions for both operations.
204
205 -------
206
207 \newpage{}
208
209 # Twin Butterfly Floating-Point DCT and FFT Instruction(s)
210
211 **Add the following to Book I Section 4.6.6.3**
212
213 ## Floating-Point Twin Multiply-Add DCT [Single]
214
215 X-Form
216
217 ```
218 |0 |6 |11 |16 |21 |31 |
219 | PO | FRT | FRA | FRB | XO |Rc |
220 ```
221
222 * fdmadds FRT,FRA,FRB (Rc=0)
223
224 Pseudo-code:
225
226 ```
227 FRS <- FPADD32(FRT, FRB)
228 sub <- FPSUB32(FRT, FRB)
229 FRT <- FPMUL32(FRA, sub)
230 ```
231
232 The two IEEE754-FP32 operations
233
234 ```
235 FRS <- [(FRT) + (FRB)]
236 FRT <- [(FRT) - (FRB)] * (FRA)
237 ```
238
239 are simultaneously performed.
240
241 The Floating-Point operand in register FRT is added to the floating-point
242 operand in register FRB and the result stored in FRS.
243
244 Using the exact same operand input register values from FRT and FRB
245 that were used to create FRS, the Floating-Point operand in register
246 FRB is subtracted from the floating-point operand in register FRT and
247 the result then rounded before being multiplied by FRA to create an
248 intermediate result that is stored in FRT.
249
250 The add into FRS is treated exactly as `fadds`. The creation of the
251 result FRT is **not** the same as that of `fmsubs`, but is instead as if
252 `fsubs` were performed first followed by `fmuls`. The creation of FRS
253 and FRT are treated as parallel independent operations which occur at
254 the same time.
255
256 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`
257
258 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
259 which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
260 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
261 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
262
263 Special Registers Altered:
264
265 ```
266 FPRF FR FI
267 FX OX UX XX
268 VXSNAN VXISI VXIMZ
269 ```
270
271 ## Floating-Point Multiply-Add FFT [Single]
272
273 X-Form
274
275 ```
276 |0 |6 |11 |16 |21 |31 |
277 | PO | FRT | FRA | FRB | XO |Rc |
278 ```
279
280 * ffmadds FRT,FRA,FRB (Rc=0)
281
282 Pseudo-code:
283
284 ```
285 FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
286 FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
287 ```
288
289 The two operations
290
291 ```
292 FRS <- -([(FRT) * (FRA)] - (FRB))
293 FRT <- [(FRT) * (FRA)] + (FRB)
294 ```
295
296 are performed.
297
298 The floating-point operand in register FRT is multiplied by the
299 floating-point operand in register FRA. The floating-point operand in
300 register FRB is added to this intermediate result, and the intermediate
301 stored in FRS.
302
303 Using the exact same values of FRT, FRT and FRB as used to create
304 FRS, the floating-point operand in register FRT is multiplied by the
305 floating-point operand in register FRA. The floating-point operand
306 in register FRB is subtracted from this intermediate result, and the
307 intermediate stored in FRT.
308
309 FRT is created as if a `fmadds` operation had been performed. FRS is
310 created as if a `fnmsubs` operation had simultaneously been performed
311 with the exact same register operands, in parallel, independently,
312 at exactly the same time.
313
314 FRT is a Read-Modify-Write operation.
315
316 Note that if Rc=1 an Illegal Instruction is raised.
317 Rc=1 is `RESERVED`
318
319 Similar to `FRTp`, this instruction produces an implicit result,
320 `FRS`, which under Scalar circumstances is defined as `FRT+1`.
321 For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
322 Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
323 (Max Vector Length).
324
325 Special Registers Altered:
326
327 ```
328 FPRF FR FI
329 FX OX UX XX
330 VXSNAN VXISI VXIMZ
331 ```
332
333 ## Floating-Point Twin Multiply-Add DCT
334
335 X-Form
336
337 ```
338 |0 |6 |11 |16 |21 |31 |
339 | PO | FRT | FRA | FRB | XO |Rc |
340 ```
341
342 * fdmadd FRT,FRA,FRB (Rc=0)
343
344 Pseudo-code:
345
346 ```
347 FRS <- FPADD64(FRT, FRB)
348 sub <- FPSUB64(FRT, FRB)
349 FRT <- FPMUL64(FRA, sub)
350 ```
351
352 The two IEEE754-FP64 operations
353
354 ```
355 FRS <- [(FRT) + (FRB)]
356 FRT <- [(FRT) - (FRB)] * (FRA)
357 ```
358
359 are simultaneously performed.
360
361 The Floating-Point operand in register FRT is added to the floating-point
362 operand in register FRB and the result stored in FRS.
363
364 Using the exact same operand input register values from FRT and FRB
365 that were used to create FRS, the Floating-Point operand in register
366 FRB is subtracted from the floating-point operand in register FRT and
367 the result then rounded before being multiplied by FRA to create an
368 intermediate result that is stored in FRT.
369
370 The add into FRS is treated exactly as `fadd`. The creation of the
371 result FRT is **not** the same as that of `fmsub`, but is instead as if
372 `fsub` were performed first followed by `fmuls. The creation of FRS
373 and FRT are treated as parallel independent operations which occur at
374 the same time.
375
376 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`
377
378 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
379 which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
380 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
381 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
382
383 Special Registers Altered:
384
385 ```
386 FPRF FR FI
387 FX OX UX XX
388 VXSNAN VXISI VXIMZ
389 ```
390
391 ## Floating-Point Twin Multiply-Add FFT
392
393 X-Form
394
395 ```
396 |0 |6 |11 |16 |21 |31 |
397 | PO | FRT | FRA | FRB | XO |Rc |
398 ```
399
400 * ffmadd FRT,FRA,FRB (Rc=0)
401
402 Pseudo-code:
403
404 ```
405 FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
406 FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
407 ```
408
409 The two operations
410
411 ```
412 FRS <- -([(FRT) * (FRA)] - (FRB))
413 FRT <- [(FRT) * (FRA)] + (FRB)
414 ```
415
416 are performed.
417
418 The floating-point operand in register FRT is multiplied by the
419 floating-point operand in register FRA. The float- ing-point operand in
420 register FRB is added to this intermediate result, and the intermediate
421 stored in FRS.
422
423 Using the exact same values of FRT, FRT and FRB as used to create
424 FRS, the floating-point operand in register FRT is multiplied by the
425 floating-point operand in register FRA. The float- ing-point operand
426 in register FRB is subtracted from this intermediate result, and the
427 intermediate stored in FRT.
428
429 FRT is created as if a `fmadd` operation had been performed. FRS is
430 created as if a `fnmsub` operation had simultaneously been performed
431 with the exact same register operands, in parallel, independently,
432 at exactly the same time.
433
434 FRT is a Read-Modify-Write operation.
435
436 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`
437
438 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
439 which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
440 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
441 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
442
443 Special Registers Altered:
444
445 ```
446 FPRF FR FI
447 FX OX UX XX
448 VXSNAN VXISI VXIMZ
449 ```
450
451
452 ## Floating-Point Add FFT/DCT [Single]
453
454 A-Form
455
456 ```
457 |0 |6 |11 |16 |21 |26 |31 |
458 | PO | FRT | FRA | FRB | / | XO |Rc |
459 ```
460
461 * ffadds FRT,FRA,FRB (Rc=0)
462
463 Pseudo-code:
464
465 ```
466 FRT <- FPADD32(FRA, FRB)
467 FRS <- FPSUB32(FRB, FRA)
468 ```
469
470 Special Registers Altered:
471
472 ```
473 FPRF FR FI
474 FX OX UX XX
475 VXSNAN VXISI
476 ```
477
478 ## Floating-Point Add FFT/DCT [Double]
479
480 A-Form
481
482 ```
483 |0 |6 |11 |16 |21 |26 |31 |
484 | PO | FRT | FRA | FRB | / | XO |Rc |
485 ```
486
487 * ffadd FRT,FRA,FRB (Rc=0)
488
489 Pseudo-code:
490
491 ```
492 FRT <- FPADD64(FRA, FRB)
493 FRS <- FPSUB64(FRB, FRA)
494 ```
495
496 Special Registers Altered:
497
498 ```
499 FPRF FR FI
500 FX OX UX XX
501 VXSNAN VXISI
502 ```
503
504 ## Floating-Point Subtract FFT/DCT [Single]
505
506 A-Form
507
508 ```
509 |0 |6 |11 |16 |21 |26 |31 |
510 | PO | FRT | FRA | FRB | / | XO |Rc |
511 ```
512
513 * ffsubs FRT,FRA,FRB (Rc=0)
514
515 Pseudo-code:
516
517 ```
518 FRT <- FPSUB32(FRB, FRA)
519 FRS <- FPADD32(FRA, FRB)
520 ```
521
522 Special Registers Altered:
523
524 ```
525 FPRF FR FI
526 FX OX UX XX
527 VXSNAN VXISI
528 ```
529
530 ## Floating-Point Subtract FFT/DCT [Double]
531
532 A-Form
533
534 ```
535 |0 |6 |11 |16 |21 |26 |31 |
536 | PO | FRT | FRA | FRB | / | XO |Rc |
537 ```
538
539 * ffsub FRT,FRA,FRB (Rc=0)
540
541 Pseudo-code:
542
543 ```
544 FRT <- FPSUB64(FRB, FRA)
545 FRS <- FPADD64(FRA, FRB)
546 ```
547
548 Special Registers Altered:
549
550 ```
551 FPRF FR FI
552 FX OX UX XX
553 VXSNAN VXISI
554 ```