update pseudo-code
[libreriscv.git] / openpower / sv / twin_butterfly.mdwn
1 # Introduction
2
3 <!-- hide -->
4 * <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
5 * <https://libre-soc.org/openpower/sv/biginteger/> for format and
6 information about implicit RS/FRS
7 * <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
8 * [[openpower/isa/svfparith]]
9 * [[openpower/isa/svfixedarith]]
10 * [[openpower/sv/rfc/ls016]]
11 <!-- show -->
12
13 Although best used with SVP64 REMAP these instructions may be used in a Scalar-only
14 context to save considerably on DCT, DFT and FFT processing. Whilst some hardware
15 implementations may not necessarily implement them efficiently (slower Micro-coding)
16 savings still come from the reduction in temporary registers as well as instruction
17 count.
18
19 # Rationale for Twin Butterfly Integer DCT Instruction(s)
20
21 The number of general-purpose uses for DCT is huge. The number of
22 instructions needed instead of these Twin-Butterfly instructions is also
23 huge (**eight**) and given that it is extremely common to explicitly
24 loop-unroll them quantity hundreds to thousands of instructions are
25 dismayingly common (for all ISAs).
26
27 The goal is to implement instructions that calculate the expression:
28
29 ```
30 fdct_round_shift((a +/- b) * c)
31 ```
32
33 For the single-coefficient butterfly instruction, and:
34
35 ```
36 fdct_round_shift(a * c1 +/- b * c2)
37 ```
38
39 For the double-coefficient butterfly instruction.
40
41 In a 32-bit context `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`
42
43 ```
44 #define ROUND_POWER_OF_TWO(value, n) \
45 (((value) + (1 << ((n)-1))) >> (n))
46 ```
47
48 These instructions are at the core of **ALL** FDCT calculations in many
49 major video codecs, including -but not limited to- VP8/VP9, AV1, etc.
50 ARM includes special instructions to optimize these operations, although
51 they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.
52
53 The suggestion is to have a single instruction to calculate both values
54 `((a + b) * c) >> N`, and `((a - b) * c) >> N`. The instruction will
55 run in accumulate mode, so in order to calculate the 2-coeff version
56 one would just have to call the same instruction with different order a,
57 b and a different constant c.
58
59 Example taken from libvpx
60 <https://chromium.googlesource.com/webm/libvpx/+/refs/tags/v1.13.0/vpx_dsp/fwd_txfm.c#132>:
61
62 ```
63 #include <stdint.h>
64 #define ROUND_POWER_OF_TWO(value, n) \
65 (((value) + (1 << ((n)-1))) >> (n))
66 void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
67 t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
68 t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
69 }
70 ```
71
72 8 instructions are required - replaced by just the one (maddsubrs):
73
74 ```
75 add 9,5,4
76 subf 5,5,4
77 mullw 9,9,6
78 mullw 5,5,6
79 addi 9,9,8192
80 addi 5,5,8192
81 srawi 9,9,14
82 srawi 5,5,14
83 ```
84
85 -------
86
87 \newpage{}
88
89 ## Integer Butterfly Multiply Add/Sub FFT/DCT
90
91 **Add the following to Book I Section 3.3.9.1**
92
93 A-Form
94
95 ```
96 |0 |6 |11 |16 |21 |26 |31 |
97 | PO | RT | RA | RB | SH | XO |Rc |
98 ```
99
100 * maddsubrs RT,RA,SH,RB
101
102 Pseudo-code:
103
104 ```
105 n <- SH
106 sum <- (RT) + (RA)
107 diff <- (RT) - (RA)
108 prod1 <- MULS(RB, sum)
109 prod2 <- MULS(RB, diff)
110 if n = 0 then
111 prod1_lo <- prod1[XLEN:(XLEN*2) - 1]
112 prod2_lo <- prod2[XLEN:(XLEN*2) - 1]
113 RT <- prod1_lo
114 RS <- prod2_lo
115 else
116 round <- [0]*(XLEN*2)
117 round[XLEN*2 - n] <- 1
118 prod1 <- prod1 + round
119 prod2 <- prod2 + round
120 m <- MASK(XLEN - n - 2, XLEN - 1)
121 res1 <- prod1[XLEN - n:XLEN*2 - n - 1]
122 res2 <- prod2[XLEN - n:XLEN*2 - n - 1]
123 signbit1 <- prod1[0]
124 signbit2 <- prod2[0]
125 smask1 <- ([signbit1]*XLEN) & ¬m
126 smask2 <- ([signbit2]*XLEN) & ¬m
127 RT <- (res1 | smask1)
128 RS <- (res2 | smask2)
129 ```
130
131 Similar to `RTp`, this instruction produces an implicit result, `RS`,
132 which under Scalar circumstances is defined as `RT+1`. For SVP64 if
133 `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
134 the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
135
136 Special Registers Altered:
137
138 ```
139 None
140 ```
141
142 # [DRAFT] Integer Butterfly Multiply Add/Sub and Accumulate FFT/DCT
143
144 A-Form
145
146 * maddrs RT,RA,SH,RB
147
148 Pseudo-code:
149
150 n <- SH
151 prod <- MULS(RB, RA)
152 if n = 0 then
153 prod_lo <- prod[XLEN:(XLEN*2) - 1]
154 RT <- (RT) + prod_lo
155 n <- SH
156 sum <- (RT) + (RA)
157 diff <- (RT) - (RA)
158 prod1 <- MULS(RB, sum)
159 prod1_lo <- prod1[XLEN:(XLEN*2)-1]
160 prod2 <- MULS(RB, diff)
161 prod2_lo <- prod2[XLEN:(XLEN*2)-1]
162 if n = 0 then
163 RT <- prod1_lo
164 RS <- prod2_lo
165 else
166 round <- [0]*XLEN
167 round[XLEN -n] <- 1
168 prod1_lo <- prod1_lo + round
169 prod2_lo <- prod2_lo + round
170 m <- MASK(n, (XLEN-1))
171 res1 <- ROTL64(prod1_lo, XLEN-n) & m
172 res2 <- ROTL64(prod2_lo, XLEN-n) & m
173 signbit1 <- prod1_lo[0]
174 signbit2 <- prod2_lo[0]
175 smask1 <- ([signbit1]*XLEN) & ¬m
176 smask2 <- ([signbit2]*XLEN) & ¬m
177 RT <- (res1 | smask1)
178 RS <- (res2 | smask2)
179
180 Special Registers Altered:
181
182 None
183
184 Similar to `RTp`, this instruction produces an implicit result, `RS`,
185 which under Scalar circumstances is defined as `RT+1`. For SVP64 if
186 `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
187 the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
188
189 This instruction is supposed to be used in complement to the maddsubrs
190 to produce the double-coefficient butterfly instruction. In order for that
191 to work, instead of passing c2 as coefficient, we have to pass c2-c1 instead.
192
193 In essence, we are calculating the quantity `a * c1 +/- b * c1` first, with
194 `maddsubrs` *without* shifting (so `SH=0`) and then we add/sub `b * (c2-c1)`
195 from the previous `RT`/`RS`, and *then* do the shifting.
196
197 In the following example, assume `a` in `R1`, `b` in `R10`, `c1` in `R11` and `c2 - c1` in `R12`.
198 The first instruction will put `a * c1 + b * c1` in `R1` (`RT`), `a * c1 - b * c1` in `RS`
199 (here, `RS = RT +1`, so `R2`).
200 Then, `maddrs` will add `b * (c2 - c1)` to `R1` (`RT`), and subtract it from `R2` (`RS`), and then
201 round shift right both quantities 14 bits:
202
203 ```
204 maddsubrs 1,10,0,11
205 maddrs 1,10,14,12
206 ```
207
208 In scalar code, that would take ~16 instructions for both operations.
209
210 -------
211
212 \newpage{}
213
214 # Twin Butterfly Floating-Point DCT and FFT Instruction(s)
215
216 **Add the following to Book I Section 4.6.6.3**
217
218 ## Floating-Point Twin Multiply-Add DCT [Single]
219
220 X-Form
221
222 ```
223 |0 |6 |11 |16 |21 |31 |
224 | PO | FRT | FRA | FRB | XO |Rc |
225 ```
226
227 * fdmadds FRT,FRA,FRB (Rc=0)
228
229 Pseudo-code:
230
231 ```
232 FRS <- FPADD32(FRT, FRB)
233 sub <- FPSUB32(FRT, FRB)
234 FRT <- FPMUL32(FRA, sub)
235 ```
236
237 The two IEEE754-FP32 operations
238
239 ```
240 FRS <- [(FRT) + (FRB)]
241 FRT <- [(FRT) - (FRB)] * (FRA)
242 ```
243
244 are simultaneously performed.
245
246 The Floating-Point operand in register FRT is added to the floating-point
247 operand in register FRB and the result stored in FRS.
248
249 Using the exact same operand input register values from FRT and FRB
250 that were used to create FRS, the Floating-Point operand in register
251 FRB is subtracted from the floating-point operand in register FRT and
252 the result then rounded before being multiplied by FRA to create an
253 intermediate result that is stored in FRT.
254
255 The add into FRS is treated exactly as `fadds`. The creation of the
256 result FRT is **not** the same as that of `fmsubs`, but is instead as if
257 `fsubs` were performed first followed by `fmuls`. The creation of FRS
258 and FRT are treated as parallel independent operations which occur at
259 the same time.
260
261 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`
262
263 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
264 which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
265 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
266 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
267
268 Special Registers Altered:
269
270 ```
271 FPRF FR FI
272 FX OX UX XX
273 VXSNAN VXISI VXIMZ
274 ```
275
276 ## Floating-Point Multiply-Add FFT [Single]
277
278 X-Form
279
280 ```
281 |0 |6 |11 |16 |21 |31 |
282 | PO | FRT | FRA | FRB | XO |Rc |
283 ```
284
285 * ffmadds FRT,FRA,FRB (Rc=0)
286
287 Pseudo-code:
288
289 ```
290 FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
291 FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
292 ```
293
294 The two operations
295
296 ```
297 FRS <- -([(FRT) * (FRA)] - (FRB))
298 FRT <- [(FRT) * (FRA)] + (FRB)
299 ```
300
301 are performed.
302
303 The floating-point operand in register FRT is multiplied by the
304 floating-point operand in register FRA. The floating-point operand in
305 register FRB is added to this intermediate result, and the intermediate
306 stored in FRS.
307
308 Using the exact same values of FRT, FRT and FRB as used to create
309 FRS, the floating-point operand in register FRT is multiplied by the
310 floating-point operand in register FRA. The floating-point operand
311 in register FRB is subtracted from this intermediate result, and the
312 intermediate stored in FRT.
313
314 FRT is created as if a `fmadds` operation had been performed. FRS is
315 created as if a `fnmsubs` operation had simultaneously been performed
316 with the exact same register operands, in parallel, independently,
317 at exactly the same time.
318
319 FRT is a Read-Modify-Write operation.
320
321 Note that if Rc=1 an Illegal Instruction is raised.
322 Rc=1 is `RESERVED`
323
324 Similar to `FRTp`, this instruction produces an implicit result,
325 `FRS`, which under Scalar circumstances is defined as `FRT+1`.
326 For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
327 Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
328 (Max Vector Length).
329
330 Special Registers Altered:
331
332 ```
333 FPRF FR FI
334 FX OX UX XX
335 VXSNAN VXISI VXIMZ
336 ```
337
338 ## Floating-Point Twin Multiply-Add DCT
339
340 X-Form
341
342 ```
343 |0 |6 |11 |16 |21 |31 |
344 | PO | FRT | FRA | FRB | XO |Rc |
345 ```
346
347 * fdmadd FRT,FRA,FRB (Rc=0)
348
349 Pseudo-code:
350
351 ```
352 FRS <- FPADD64(FRT, FRB)
353 sub <- FPSUB64(FRT, FRB)
354 FRT <- FPMUL64(FRA, sub)
355 ```
356
357 The two IEEE754-FP64 operations
358
359 ```
360 FRS <- [(FRT) + (FRB)]
361 FRT <- [(FRT) - (FRB)] * (FRA)
362 ```
363
364 are simultaneously performed.
365
366 The Floating-Point operand in register FRT is added to the floating-point
367 operand in register FRB and the result stored in FRS.
368
369 Using the exact same operand input register values from FRT and FRB
370 that were used to create FRS, the Floating-Point operand in register
371 FRB is subtracted from the floating-point operand in register FRT and
372 the result then rounded before being multiplied by FRA to create an
373 intermediate result that is stored in FRT.
374
375 The add into FRS is treated exactly as `fadd`. The creation of the
376 result FRT is **not** the same as that of `fmsub`, but is instead as if
377 `fsub` were performed first followed by `fmuls. The creation of FRS
378 and FRT are treated as parallel independent operations which occur at
379 the same time.
380
381 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`
382
383 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
384 which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
385 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
386 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
387
388 Special Registers Altered:
389
390 ```
391 FPRF FR FI
392 FX OX UX XX
393 VXSNAN VXISI VXIMZ
394 ```
395
396 ## Floating-Point Twin Multiply-Add FFT
397
398 X-Form
399
400 ```
401 |0 |6 |11 |16 |21 |31 |
402 | PO | FRT | FRA | FRB | XO |Rc |
403 ```
404
405 * ffmadd FRT,FRA,FRB (Rc=0)
406
407 Pseudo-code:
408
409 ```
410 FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
411 FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
412 ```
413
414 The two operations
415
416 ```
417 FRS <- -([(FRT) * (FRA)] - (FRB))
418 FRT <- [(FRT) * (FRA)] + (FRB)
419 ```
420
421 are performed.
422
423 The floating-point operand in register FRT is multiplied by the
424 floating-point operand in register FRA. The float- ing-point operand in
425 register FRB is added to this intermediate result, and the intermediate
426 stored in FRS.
427
428 Using the exact same values of FRT, FRT and FRB as used to create
429 FRS, the floating-point operand in register FRT is multiplied by the
430 floating-point operand in register FRA. The float- ing-point operand
431 in register FRB is subtracted from this intermediate result, and the
432 intermediate stored in FRT.
433
434 FRT is created as if a `fmadd` operation had been performed. FRS is
435 created as if a `fnmsub` operation had simultaneously been performed
436 with the exact same register operands, in parallel, independently,
437 at exactly the same time.
438
439 FRT is a Read-Modify-Write operation.
440
441 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`
442
443 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
444 which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
445 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
446 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
447
448 Special Registers Altered:
449
450 ```
451 FPRF FR FI
452 FX OX UX XX
453 VXSNAN VXISI VXIMZ
454 ```
455
456
457 ## Floating-Point Add FFT/DCT [Single]
458
459 A-Form
460
461 ```
462 |0 |6 |11 |16 |21 |26 |31 |
463 | PO | FRT | FRA | FRB | / | XO |Rc |
464 ```
465
466 * ffadds FRT,FRA,FRB (Rc=0)
467
468 Pseudo-code:
469
470 ```
471 FRT <- FPADD32(FRA, FRB)
472 FRS <- FPSUB32(FRB, FRA)
473 ```
474
475 Special Registers Altered:
476
477 ```
478 FPRF FR FI
479 FX OX UX XX
480 VXSNAN VXISI
481 ```
482
483 ## Floating-Point Add FFT/DCT [Double]
484
485 A-Form
486
487 ```
488 |0 |6 |11 |16 |21 |26 |31 |
489 | PO | FRT | FRA | FRB | / | XO |Rc |
490 ```
491
492 * ffadd FRT,FRA,FRB (Rc=0)
493
494 Pseudo-code:
495
496 ```
497 FRT <- FPADD64(FRA, FRB)
498 FRS <- FPSUB64(FRB, FRA)
499 ```
500
501 Special Registers Altered:
502
503 ```
504 FPRF FR FI
505 FX OX UX XX
506 VXSNAN VXISI
507 ```
508
509 ## Floating-Point Subtract FFT/DCT [Single]
510
511 A-Form
512
513 ```
514 |0 |6 |11 |16 |21 |26 |31 |
515 | PO | FRT | FRA | FRB | / | XO |Rc |
516 ```
517
518 * ffsubs FRT,FRA,FRB (Rc=0)
519
520 Pseudo-code:
521
522 ```
523 FRT <- FPSUB32(FRB, FRA)
524 FRS <- FPADD32(FRA, FRB)
525 ```
526
527 Special Registers Altered:
528
529 ```
530 FPRF FR FI
531 FX OX UX XX
532 VXSNAN VXISI
533 ```
534
535 ## Floating-Point Subtract FFT/DCT [Double]
536
537 A-Form
538
539 ```
540 |0 |6 |11 |16 |21 |26 |31 |
541 | PO | FRT | FRA | FRB | / | XO |Rc |
542 ```
543
544 * ffsub FRT,FRA,FRB (Rc=0)
545
546 Pseudo-code:
547
548 ```
549 FRT <- FPSUB64(FRB, FRA)
550 FRS <- FPADD64(FRA, FRB)
551 ```
552
553 Special Registers Altered:
554
555 ```
556 FPRF FR FI
557 FX OX UX XX
558 VXSNAN VXISI
559 ```