https://bugs.libre-soc.org/show_bug.cgi?id=985
[libreriscv.git] / openpower / sv / twin_butterfly.mdwn
1 # Introduction
2
3 <!-- hide -->
4 * <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
5 * <https://libre-soc.org/openpower/sv/biginteger/> for format and
6 information about implicit RS/FRS
7 * <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
8 * [[openpower/isa/svfparith]]
9 * [[openpower/isa/svfixedarith]]
10 * [[openpower/sv/rfc/ls016]]
11 <!-- show -->
12
13 Although best used with SVP64 REMAP these instructions may be used in a Scalar-only
14 context to save considerably on DCT, DFT and FFT processing. Whilst some hardware
15 implementations may not necessarily implement them efficiently (slower Micro-coding)
16 savings still come from the reduction in temporary registers as well as instruction
17 count.
18
19 # Rationale for Twin Butterfly Integer DCT Instruction(s)
20
21 The number of general-purpose uses for DCT is huge. The number of
22 instructions needed instead of these Twin-Butterfly instructions is also
23 huge (**eight**) and given that it is extremely common to explicitly
24 loop-unroll them quantity hundreds to thousands of instructions are
25 dismayingly common (for all ISAs).
26
27 The goal is to implement instructions that calculate the expression:
28
29 ```
30 fdct_round_shift((a +/- b) * c)
31 ```
32
33 For the single-coefficient butterfly instruction, and:
34
35 ```
36 fdct_round_shift(a * c1 +/- b * c2)
37 ```
38
39 For the double-coefficient butterfly instruction.
40
41 In a 32-bit context `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`
42
43 ```
44 #define ROUND_POWER_OF_TWO(value, n) \
45 (((value) + (1 << ((n)-1))) >> (n))
46 ```
47
48 These instructions are at the core of **ALL** FDCT calculations in many
49 major video codecs, including -but not limited to- VP8/VP9, AV1, etc.
50 ARM includes special instructions to optimize these operations, although
51 they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.
52
53 The suggestion is to have a single instruction to calculate both values
54 `((a + b) * c) >> N`, and `((a - b) * c) >> N`. The instruction will
55 run in accumulate mode, so in order to calculate the 2-coeff version
56 one would just have to call the same instruction with different order a,
57 b and a different constant c.
58
59 Example taken from libvpx
60 <https://chromium.googlesource.com/webm/libvpx/+/refs/tags/v1.13.0/vpx_dsp/fwd_txfm.c#132>:
61
62 ```
63 #include <stdint.h>
64 #define ROUND_POWER_OF_TWO(value, n) \
65 (((value) + (1 << ((n)-1))) >> (n))
66 void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
67 t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
68 t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
69 }
70 ```
71
72 8 instructions are required - replaced by just the one (maddsubrs):
73
74 ```
75 add 9,5,4
76 subf 5,5,4
77 mullw 9,9,6
78 mullw 5,5,6
79 addi 9,9,8192
80 addi 5,5,8192
81 srawi 9,9,14
82 srawi 5,5,14
83 ```
84
85 -------
86
87 \newpage{}
88
89 ## Integer Butterfly Multiply Add/Sub FFT/DCT
90
91 **Add the following to Book I Section 3.3.9.1**
92
93 A-Form
94
95 ```
96 |0 |6 |11 |16 |21 |26 |31 |
97 | PO | RT | RA | RB | SH | XO |Rc |
98 ```
99
100 * maddsubrs RT,RA,RB,SH
101
102 Pseudo-code:
103
104 ```
105 n <- SH
106 sum <- (RT[0] || RT) + (RA[0] || RA)
107 diff <- (RT[0] || RT) - (RA[0] || RA)
108 prod1 <- MULS(RB, sum)
109 prod2 <- MULS(RB, diff)
110 if n = 0 then
111 prod1_lo <- prod1[XLEN+1:(XLEN*2)]
112 prod2_lo <- prod2[XLEN+1:(XLEN*2)]
113 RT <- prod1_lo
114 RS <- prod2_lo
115 else
116 round <- [0]*(XLEN*2 + 1)
117 round[XLEN*2 - n + 1] <- 1
118 prod1 <- prod1 + round
119 prod2 <- prod2 + round
120 res1 <- prod1[XLEN - n + 1:XLEN*2 - n]
121 res2 <- prod2[XLEN - n + 1:XLEN*2 - n]
122 RT <- res1
123 RS <- res2
124 ```
125
126 Similar to `RTp`, this instruction produces an implicit result, `RS`,
127 which under Scalar circumstances is defined as `RT+1`. For SVP64 if
128 `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
129 the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
130
131 Special Registers Altered:
132
133 ```
134 None
135 ```
136
137 # [DRAFT] Integer Butterfly Multiply Add and Round Shift FFT/DCT
138
139 A-Form
140
141 * maddrs RT,RA,RB,SH
142
143 Pseudo-code:
144
145 ```
146 n <- SH
147 prod <- MULS(RB, RA)
148 if n = 0 then
149 prod_lo <- prod[XLEN:(XLEN*2) - 1]
150 RT <- (RT) + prod_lo
151 else
152 res[0:XLEN*2-1] <- (EXTSXL((RT)[0], 1) || (RT)) + prod
153 round <- [0]*XLEN*2
154 round[XLEN*2 - n] <- 1
155 res <- res + round
156 RT <- res[XLEN - n:XLEN*2 - n -1]
157 ```
158
159 Special Registers Altered:
160
161 None
162
163 # [DRAFT] Integer Butterfly Multiply Sub and Round Shift FFT/DCT
164
165 A-Form
166
167 * msubrs RT,RA,RB,SH
168
169 Pseudo-code:
170
171 ```
172 n <- SH
173 prod <- MULS(RB, RA)
174 if n = 0 then
175 prod_lo <- prod[XLEN:(XLEN*2) - 1]
176 RT <- (RT) - prod_lo
177 else
178 res[0:XLEN*2-1] <- (EXTSXL((RT)[0], 1) || (RT)) - prod
179 round <- [0]*XLEN*2
180 round[XLEN*2 - n] <- 1
181 res <- res + round
182 RT <- res[XLEN - n:XLEN*2 - n -1]
183 ```
184
185 Special Registers Altered:
186
187 None
188
189
190 This pair of instructions is supposed to be used in complement to the maddsubrs
191 to produce the double-coefficient butterfly instruction. In order for that
192 to work, instead of passing c2 as coefficient, we have to pass c2-c1 instead.
193
194 In essence, we are calculating the quantity `a * c1 +/- b * c1` first, with
195 `maddsubrs` *without* shifting (so `SH=0`) and then we add/sub `b * (c2-c1)`
196 from the previous `RT`, and *then* do the shifting.
197
198 In the following example, assume `a` in `R1`, `b` in `R10`, `c1` in `R11` and `c2 - c1` in `R12`.
199 The first instruction will put `a * c1 + b * c1` in `R1` (`RT`), `a * c1 - b * c1` in `RS`
200 (here, `RS = RT +1`, so `R2`).
201 Then, `maddrs` will add `b * (c2 - c1)` to `R1` (`RT`), and `msubrs` will subtract it from `R2` (`RS`), and then
202 round shift right both quantities 14 bits:
203
204 ```
205 maddsubrs 1,10,0,11
206 maddrs 1,10,12,14
207 msubrs 2,10,12,14
208 ```
209
210 In scalar code, that would take ~16 instructions for both operations.
211
212 -------
213
214 \newpage{}
215
216 # Twin Butterfly Floating-Point DCT and FFT Instruction(s)
217
218 **Add the following to Book I Section 4.6.6.3**
219
220 ## Floating-Point Twin Multiply-Add DCT [Single]
221
222 X-Form
223
224 ```
225 |0 |6 |11 |16 |21 |31 |
226 | PO | FRT | FRA | FRB | XO |Rc |
227 ```
228
229 * fdmadds FRT,FRA,FRB (Rc=0)
230
231 Pseudo-code:
232
233 ```
234 FRS <- FPADD32(FRT, FRB)
235 sub <- FPSUB32(FRT, FRB)
236 FRT <- FPMUL32(FRA, sub)
237 ```
238
239 The two IEEE754-FP32 operations
240
241 ```
242 FRS <- [(FRT) + (FRB)]
243 FRT <- [(FRT) - (FRB)] * (FRA)
244 ```
245
246 are simultaneously performed.
247
248 The Floating-Point operand in register FRT is added to the floating-point
249 operand in register FRB and the result stored in FRS.
250
251 Using the exact same operand input register values from FRT and FRB
252 that were used to create FRS, the Floating-Point operand in register
253 FRB is subtracted from the floating-point operand in register FRT and
254 the result then rounded before being multiplied by FRA to create an
255 intermediate result that is stored in FRT.
256
257 The add into FRS is treated exactly as `fadds`. The creation of the
258 result FRT is **not** the same as that of `fmsubs`, but is instead as if
259 `fsubs` were performed first followed by `fmuls`. The creation of FRS
260 and FRT are treated as parallel independent operations which occur at
261 the same time.
262
263 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`
264
265 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
266 which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
267 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
268 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
269
270 Special Registers Altered:
271
272 ```
273 FPRF FR FI
274 FX OX UX XX
275 VXSNAN VXISI VXIMZ
276 ```
277
278 ## Floating-Point Multiply-Add FFT [Single]
279
280 X-Form
281
282 ```
283 |0 |6 |11 |16 |21 |31 |
284 | PO | FRT | FRA | FRB | XO |Rc |
285 ```
286
287 * ffmadds FRT,FRA,FRB (Rc=0)
288
289 Pseudo-code:
290
291 ```
292 FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
293 FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
294 ```
295
296 The two operations
297
298 ```
299 FRS <- -([(FRT) * (FRA)] - (FRB))
300 FRT <- [(FRT) * (FRA)] + (FRB)
301 ```
302
303 are performed.
304
305 The floating-point operand in register FRT is multiplied by the
306 floating-point operand in register FRA. The floating-point operand in
307 register FRB is added to this intermediate result, and the intermediate
308 stored in FRS.
309
310 Using the exact same values of FRT, FRT and FRB as used to create
311 FRS, the floating-point operand in register FRT is multiplied by the
312 floating-point operand in register FRA. The floating-point operand
313 in register FRB is subtracted from this intermediate result, and the
314 intermediate stored in FRT.
315
316 FRT is created as if a `fmadds` operation had been performed. FRS is
317 created as if a `fnmsubs` operation had simultaneously been performed
318 with the exact same register operands, in parallel, independently,
319 at exactly the same time.
320
321 FRT is a Read-Modify-Write operation.
322
323 Note that if Rc=1 an Illegal Instruction is raised.
324 Rc=1 is `RESERVED`
325
326 Similar to `FRTp`, this instruction produces an implicit result,
327 `FRS`, which under Scalar circumstances is defined as `FRT+1`.
328 For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
329 Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
330 (Max Vector Length).
331
332 Special Registers Altered:
333
334 ```
335 FPRF FR FI
336 FX OX UX XX
337 VXSNAN VXISI VXIMZ
338 ```
339
340 ## Floating-Point Twin Multiply-Add DCT
341
342 X-Form
343
344 ```
345 |0 |6 |11 |16 |21 |31 |
346 | PO | FRT | FRA | FRB | XO |Rc |
347 ```
348
349 * fdmadd FRT,FRA,FRB (Rc=0)
350
351 Pseudo-code:
352
353 ```
354 FRS <- FPADD64(FRT, FRB)
355 sub <- FPSUB64(FRT, FRB)
356 FRT <- FPMUL64(FRA, sub)
357 ```
358
359 The two IEEE754-FP64 operations
360
361 ```
362 FRS <- [(FRT) + (FRB)]
363 FRT <- [(FRT) - (FRB)] * (FRA)
364 ```
365
366 are simultaneously performed.
367
368 The Floating-Point operand in register FRT is added to the floating-point
369 operand in register FRB and the result stored in FRS.
370
371 Using the exact same operand input register values from FRT and FRB
372 that were used to create FRS, the Floating-Point operand in register
373 FRB is subtracted from the floating-point operand in register FRT and
374 the result then rounded before being multiplied by FRA to create an
375 intermediate result that is stored in FRT.
376
377 The add into FRS is treated exactly as `fadd`. The creation of the
378 result FRT is **not** the same as that of `fmsub`, but is instead as if
379 `fsub` were performed first followed by `fmuls. The creation of FRS
380 and FRT are treated as parallel independent operations which occur at
381 the same time.
382
383 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`
384
385 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
386 which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
387 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
388 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
389
390 Special Registers Altered:
391
392 ```
393 FPRF FR FI
394 FX OX UX XX
395 VXSNAN VXISI VXIMZ
396 ```
397
398 ## Floating-Point Twin Multiply-Add FFT
399
400 X-Form
401
402 ```
403 |0 |6 |11 |16 |21 |31 |
404 | PO | FRT | FRA | FRB | XO |Rc |
405 ```
406
407 * ffmadd FRT,FRA,FRB (Rc=0)
408
409 Pseudo-code:
410
411 ```
412 FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
413 FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
414 ```
415
416 The two operations
417
418 ```
419 FRS <- -([(FRT) * (FRA)] - (FRB))
420 FRT <- [(FRT) * (FRA)] + (FRB)
421 ```
422
423 are performed.
424
425 The floating-point operand in register FRT is multiplied by the
426 floating-point operand in register FRA. The float- ing-point operand in
427 register FRB is added to this intermediate result, and the intermediate
428 stored in FRS.
429
430 Using the exact same values of FRT, FRT and FRB as used to create
431 FRS, the floating-point operand in register FRT is multiplied by the
432 floating-point operand in register FRA. The float- ing-point operand
433 in register FRB is subtracted from this intermediate result, and the
434 intermediate stored in FRT.
435
436 FRT is created as if a `fmadd` operation had been performed. FRS is
437 created as if a `fnmsub` operation had simultaneously been performed
438 with the exact same register operands, in parallel, independently,
439 at exactly the same time.
440
441 FRT is a Read-Modify-Write operation.
442
443 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`
444
445 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
446 which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
447 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
448 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
449
450 Special Registers Altered:
451
452 ```
453 FPRF FR FI
454 FX OX UX XX
455 VXSNAN VXISI VXIMZ
456 ```
457
458
459 ## Floating-Point Add FFT/DCT [Single]
460
461 A-Form
462
463 ```
464 |0 |6 |11 |16 |21 |26 |31 |
465 | PO | FRT | FRA | FRB | / | XO |Rc |
466 ```
467
468 * ffadds FRT,FRA,FRB (Rc=0)
469
470 Pseudo-code:
471
472 ```
473 FRT <- FPADD32(FRA, FRB)
474 FRS <- FPSUB32(FRB, FRA)
475 ```
476
477 Special Registers Altered:
478
479 ```
480 FPRF FR FI
481 FX OX UX XX
482 VXSNAN VXISI
483 ```
484
485 ## Floating-Point Add FFT/DCT [Double]
486
487 A-Form
488
489 ```
490 |0 |6 |11 |16 |21 |26 |31 |
491 | PO | FRT | FRA | FRB | / | XO |Rc |
492 ```
493
494 * ffadd FRT,FRA,FRB (Rc=0)
495
496 Pseudo-code:
497
498 ```
499 FRT <- FPADD64(FRA, FRB)
500 FRS <- FPSUB64(FRB, FRA)
501 ```
502
503 Special Registers Altered:
504
505 ```
506 FPRF FR FI
507 FX OX UX XX
508 VXSNAN VXISI
509 ```
510
511 ## Floating-Point Subtract FFT/DCT [Single]
512
513 A-Form
514
515 ```
516 |0 |6 |11 |16 |21 |26 |31 |
517 | PO | FRT | FRA | FRB | / | XO |Rc |
518 ```
519
520 * ffsubs FRT,FRA,FRB (Rc=0)
521
522 Pseudo-code:
523
524 ```
525 FRT <- FPSUB32(FRB, FRA)
526 FRS <- FPADD32(FRA, FRB)
527 ```
528
529 Special Registers Altered:
530
531 ```
532 FPRF FR FI
533 FX OX UX XX
534 VXSNAN VXISI
535 ```
536
537 ## Floating-Point Subtract FFT/DCT [Double]
538
539 A-Form
540
541 ```
542 |0 |6 |11 |16 |21 |26 |31 |
543 | PO | FRT | FRA | FRB | / | XO |Rc |
544 ```
545
546 * ffsub FRT,FRA,FRB (Rc=0)
547
548 Pseudo-code:
549
550 ```
551 FRT <- FPSUB64(FRB, FRA)
552 FRS <- FPADD64(FRA, FRB)
553 ```
554
555 Special Registers Altered:
556
557 ```
558 FPRF FR FI
559 FX OX UX XX
560 VXSNAN VXISI
561 ```