* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

Although best used with SVP64 REMAP, these instructions may be used in a Scalar-only
context to save considerably on DCT, DFT and FFT processing. Whilst some hardware
implementations may not necessarily implement them efficiently (slower Micro-coding),
savings still come from the reduction in temporary registers as well as instruction
count.

# Rationale for Twin Butterfly Integer DCT Instruction(s)

The number of general-purpose uses for DCT is huge. The number of
instructions needed instead of these Twin-Butterfly instructions is also
large (**eight**), and given that it is extremely common to explicitly
loop-unroll them, sequences of hundreds to thousands of instructions are
dismayingly common (for all ISAs).

The goal is to implement instructions that calculate the expression:

    fdct_round_shift((a +/- b) * c)

for the single-coefficient butterfly instruction, and:

    fdct_round_shift(a * c1 +/- b * c2)

for the double-coefficient butterfly instruction.

In a 32-bit context `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
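
As a quick check of the rounding behaviour, the following C sketch evaluates the macro at a few values. The helper name `fdct_round_shift_14` is introduced here purely for illustration and is not part of the proposal.

```c
#include <stdint.h>

#define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))

/* Illustrative helper (hypothetical name): round-shift right by 14 bits.
 * Adding 1 << 13 before the shift makes exact halves round upwards. */
static int32_t fdct_round_shift_14(int32_t x)
{
    return ROUND_POWER_OF_TWO(x, 14);
}
```

With this helper, an input of `8191` (just under half of `1 << 14`) gives 0, `8192` (exactly half) gives 1, and `24576` (one and a half) gives 2.
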

These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including (but not limited to) VP8/VP9 and AV1.
ARM includes special instructions to optimize these operations, although
they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.

The suggestion is to have a single instruction to calculate both values
`((a + b) * c) >> N`, and `((a - b) * c) >> N`. The instruction will
run in accumulate mode, so in order to calculate the 2-coeff version
one would just have to call the same instruction with `a` and `b` in a
different order and a different constant `c`.

Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/tags/v1.13.0/vpx_dsp/fwd_txfm.c#132>:

    #include <stdint.h>

    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))

    void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
        t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
        t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
    }

In scalar code 8 instructions are required; `maddsubrs` replaces them with just one.

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   |  RT  |  RA  |  RB  |  SH  |  XO  |Rc |

* maddsubrs RT,RA,RB,SH

    sum <- (RT[0] || RT) + (RA[0] || RA)
    diff <- (RT[0] || RT) - (RA[0] || RA)
    prod1 <- MULS(RB, sum)
    prod2 <- MULS(RB, diff)
    prod1_lo <- prod1[XLEN+1:(XLEN*2)]
    prod2_lo <- prod2[XLEN+1:(XLEN*2)]
    round <- [0]*(XLEN*2 + 1)
    round[XLEN*2 - n + 1] <- 1
    prod1 <- prod1 + round
    prod2 <- prod2 + round
    res1 <- prod1[XLEN - n + 1:XLEN*2 - n]
    res2 <- prod2[XLEN - n + 1:XLEN*2 - n]

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
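
The arithmetic of the `maddsubrs` pseudocode can be sketched in C as follows. This is a minimal model under stated assumptions, not a reference implementation: it assumes 16-bit elements so that every intermediate fits in `int64_t`, it takes the shift amount `n` directly as a parameter (how `n` is derived from `SH` is not modelled), and it relies on arithmetic right shift of negative values and wrapping narrowing conversions, as provided by mainstream compilers. The names `twin16_t` and `maddsubrs_model` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical result pair: rt models RT, rs models the implicit RS. */
typedef struct {
    int16_t rt;
    int16_t rs;
} twin16_t;

/* Illustrative model of maddsubrs for 16-bit elements (assumption). */
static twin16_t maddsubrs_model(int16_t rt, int16_t ra, int16_t rb, unsigned n)
{
    /* Sum and difference are formed one bit wider than the inputs. */
    int64_t sum  = (int64_t)rt + ra;
    int64_t diff = (int64_t)rt - ra;

    /* Both are multiplied by the same coefficient in RB. */
    int64_t prod1 = (int64_t)rb * sum;
    int64_t prod2 = (int64_t)rb * diff;

    if (n != 0) {
        /* Add half of 2^n, then shift right by n (round-shift). */
        int64_t round = (int64_t)1 << (n - 1);
        prod1 = (prod1 + round) >> n;
        prod2 = (prod2 + round) >> n;
    }

    /* Keep only the low 16 bits of each result. */
    twin16_t out = { (int16_t)prod1, (int16_t)prod2 };
    return out;
}
```

For the inputs `rt=100`, `ra=50`, `rb=11585`, `n=14` the model yields 106 and 35, the same values the libvpx-style `twin_int` function computes for `x0=100`, `x1=50`.
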

Special Registers Altered:

# [DRAFT] Integer Butterfly Multiply Add and Round Shift FFT/DCT

    prod_lo <- prod[XLEN:(XLEN*2) - 1]
    res[0:XLEN*2-1] <- (EXTSXL((RT)[0], 1) || (RT)) + prod
    round[XLEN*2 - n] <- 1
    RT <- res[XLEN - n:XLEN*2 - n - 1]

Special Registers Altered:

# [DRAFT] Integer Butterfly Multiply Sub and Round Shift FFT/DCT

    prod_lo <- prod[XLEN:(XLEN*2) - 1]
    res[0:XLEN*2-1] <- (EXTSXL((RT)[0], 1) || (RT)) - prod
    round[XLEN*2 - n] <- 1
    RT <- res[XLEN - n:XLEN*2 - n - 1]

Special Registers Altered:

This pair of instructions is intended to complement `maddsubrs` in
producing the double-coefficient butterfly. For that to work, `c2-c1`
must be passed as the coefficient instead of `c2`.

In essence, we first calculate the quantities `a * c1 +/- b * c1` with
`maddsubrs` *without* shifting (so `SH=0`), then add `b * (c2-c1)` to, or
subtract it from, the previous results, and only *then* do the shifting.

As an example, assume `a` in `R1`, `b` in `R10`, `c1` in `R11` and `c2 - c1` in `R12`.
The first instruction will put `a * c1 + b * c1` in `R1` (`RT`) and `a * c1 - b * c1` in `RS`
(here, `RS = RT + 1`, so `R2`).
Then, `maddrs` will add `b * (c2 - c1)` to `R1` (`RT`), and `msubrs` will subtract it from `R2` (`RS`),
and both then round-shift their result right by 14 bits.

In scalar code, that would take ~16 instructions for both operations.
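
The algebra behind this decomposition can be checked with a short C sketch. It is an illustration under assumptions rather than a model of the instructions themselves: intermediates are kept at full 64-bit width (whereas `maddsubrs` with `SH=0` retains only the low `XLEN` bits), arithmetic right shift of negative values is assumed, and the function names are hypothetical.

```c
#include <stdint.h>

/* Round-shift right by 14 bits, as in fdct_round_shift. */
static int32_t round_shift14(int64_t x)
{
    return (int32_t)((x + (1 << 13)) >> 14);
}

/* Direct form: fdct_round_shift(a * c1 +/- b * c2). */
static void two_coeff_direct(int32_t out[2], int32_t a, int32_t b,
                             int32_t c1, int32_t c2)
{
    out[0] = round_shift14((int64_t)a * c1 + (int64_t)b * c2);
    out[1] = round_shift14((int64_t)a * c1 - (int64_t)b * c2);
}

/* Split form: a maddsubrs-like step without shifting, followed by
 * maddrs-like and msubrs-like steps using the coefficient c2 - c1. */
static void two_coeff_split(int32_t out[2], int32_t a, int32_t b,
                            int32_t c1, int32_t c2)
{
    int64_t rt = ((int64_t)a + b) * c1;   /* a*c1 + b*c1 */
    int64_t rs = ((int64_t)a - b) * c1;   /* a*c1 - b*c1 */
    int64_t k  = (int64_t)c2 - c1;        /* coefficient passed second */

    out[0] = round_shift14(rt + b * k);   /* a*c1 + b*c2, then shift */
    out[1] = round_shift14(rs - b * k);   /* a*c1 - b*c2, then shift */
}
```

Both functions return the same pair for any inputs whose intermediates fit in 64 bits; for example `a=100`, `b=50`, `c1=15137`, `c2=6270` gives 112 and 73 in either form.
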

# Twin Butterfly Floating-Point DCT and FFT Instruction(s)

**Add the following to Book I Section 4.6.6.3**

## Floating-Point Twin Multiply-Add DCT [Single]

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  |  XO  |Rc |

* fdmadds FRT,FRA,FRB (Rc=0)

    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)

The two IEEE754-FP32 operations

    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)

are simultaneously performed.

The Floating-Point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the Floating-Point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then rounded before being multiplied by FRA to create an
intermediate result that is stored in FRT.

The add into FRS is treated exactly as `fadds`. The creation of the
result FRT is **not** the same as that of `fmsubs`, but is instead as if
`fsubs` were performed first followed by `fmuls`. The creation of FRS
and FRT are treated as parallel independent operations which occur at
exactly the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
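
The data flow of `fdmadds` can be sketched in C as follows. This is a minimal model, assuming the host `float` type is IEEE754 FP32 and ignoring exception flags and rounding-mode control; the names `fdmadds_result_t` and `fdmadds_model` are hypothetical.

```c
/* Hypothetical result pair: frt models FRT, frs models the implicit FRS. */
typedef struct {
    float frt;
    float frs;
} fdmadds_result_t;

/* Illustrative model of fdmadds: both outputs are computed from the
 * same original input values of FRT and FRB. */
static fdmadds_result_t fdmadds_model(float frt, float fra, float frb)
{
    fdmadds_result_t out;

    /* Sum, as if by fadds. */
    out.frs = frt + frb;

    /* Difference is rounded to FP32 first (as if by fsubs) ... */
    float sub = frt - frb;

    /* ... and only then multiplied (as if by fmuls), so the
     * subtract-then-multiply is not fused. */
    out.frt = fra * sub;

    return out;
}
```

For example, `fdmadds_model(3.0f, 0.5f, 1.0f)` produces `frs = 4.0f` and `frt = 1.0f`.
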

Special Registers Altered:

## Floating-Point Multiply-Add FFT [Single]

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  |  XO  |Rc |

* ffmadds FRT,FRA,FRB (Rc=0)

    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result
stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
FRT, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result stored in FRS.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).
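
A rough C model of the `ffmadds` data flow is given below, following the `FPMULADD32` pseudocode (FRT as if by `fmadds`, FRS as if by `fnmsubs`). It is a sketch under assumptions: the product of two FP32 values is formed exactly in `double`, but the final addition is rounded to `double` and then to `float`, which in rare cases may differ from a true single-rounding fused operation. Exception flags are ignored, and the names `ffmadds_result_t` and `ffmadds_model` are hypothetical.

```c
/* Hypothetical result pair: frt models FRT, frs models the implicit FRS. */
typedef struct {
    float frt;
    float frs;
} ffmadds_result_t;

/* Illustrative (approximate) model of ffmadds. */
static ffmadds_result_t ffmadds_model(float frt, float fra, float frb)
{
    ffmadds_result_t out;

    /* FP32 x FP32 fits exactly in a double (48-bit significand). */
    double prod = (double)frt * (double)fra;

    /* FRT <- (FRT * FRA) + FRB, as if by fmadds. */
    out.frt = (float)(prod + (double)frb);

    /* FRS <- -((FRT * FRA) - FRB), as if by fnmsubs,
     * using the same original operand values. */
    out.frs = (float)(-(prod - (double)frb));

    return out;
}
```

For example, `ffmadds_model(2.0f, 3.0f, 1.0f)` produces `frt = 7.0f` and `frs = -5.0f`.
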

Special Registers Altered:

## Floating-Point Twin Multiply-Add DCT

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  |  XO  |Rc |

* fdmadd FRT,FRA,FRB (Rc=0)

    FRS <- FPADD64(FRT, FRB)
    sub <- FPSUB64(FRT, FRB)
    FRT <- FPMUL64(FRA, sub)

The two IEEE754-FP64 operations

    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)

are simultaneously performed.

The Floating-Point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the Floating-Point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then rounded before being multiplied by FRA to create an
intermediate result that is stored in FRT.

The add into FRS is treated exactly as `fadd`. The creation of the
result FRT is **not** the same as that of `fmsub`, but is instead as if
`fsub` were performed first followed by `fmul`. The creation of FRS
and FRT are treated as parallel independent operations which occur at
exactly the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Twin Multiply-Add FFT

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  |  XO  |Rc |

* ffmadd FRT,FRA,FRB (Rc=0)

    FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result
stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
FRT, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result stored in FRS.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Add FFT/DCT [Single]

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   | FRT  | FRA  | FRB  |  /   |  XO  |Rc |

* ffadds FRT,FRA,FRB (Rc=0)

    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)
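
The pseudocode for `ffadds`, and for its counterpart `ffsubs` which swaps the two destinations, can be sketched in C as below. This assumes the host `float` type is IEEE754 FP32 and ignores exception flags; the type and function names are hypothetical.

```c
/* Hypothetical result pair: frt models FRT, frs models the implicit FRS. */
typedef struct {
    float frt;
    float frs;
} ffadds_result_t;

/* ffadds: FRT receives the sum, FRS receives FRB - FRA. */
static ffadds_result_t ffadds_model(float fra, float frb)
{
    ffadds_result_t out;
    out.frt = fra + frb;
    out.frs = frb - fra;
    return out;
}

/* ffsubs: the same two values, with the destinations exchanged. */
static ffadds_result_t ffsubs_model(float fra, float frb)
{
    ffadds_result_t out;
    out.frt = frb - fra;
    out.frs = fra + frb;
    return out;
}
```

For example, with `fra = 1.5f` and `frb = 4.0f`, `ffadds_model` gives `frt = 5.5f` and `frs = 2.5f`, while `ffsubs_model` gives `frt = 2.5f` and `frs = 5.5f`.
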

Special Registers Altered:

## Floating-Point Add FFT/DCT [Double]

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   | FRT  | FRA  | FRB  |  /   |  XO  |Rc |

* ffadd FRT,FRA,FRB (Rc=0)

    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)

Special Registers Altered:

## Floating-Point Subtract FFT/DCT [Single]

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   | FRT  | FRA  | FRB  |  /   |  XO  |Rc |

* ffsubs FRT,FRA,FRB (Rc=0)

    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)

Special Registers Altered:

## Floating-Point Subtract FFT/DCT [Double]

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   | FRT  | FRA  | FRB  |  /   |  XO  |Rc |

* ffsub FRT,FRA,FRB (Rc=0)

    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)

Special Registers Altered: