* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

Although best used with SVP64 REMAP, these instructions may also be used in a
Scalar-only context to save considerably on DCT, DFT and FFT processing. Whilst
some hardware implementations may not implement them efficiently (for example
through slower micro-coding), savings still come from the reduction in temporary
registers as well as instruction count.

# Rationale for Twin Butterfly Integer DCT Instruction(s)

The number of general-purpose uses for DCT is huge. The number of
instructions needed in place of each of these Twin-Butterfly instructions
is also large (**eight**), and given that it is extremely common to
explicitly loop-unroll them, sequences of hundreds to thousands of
instructions are dismayingly common (for all ISAs).

The goal is to implement instructions that calculate the expression:

```
fdct_round_shift((a +/- b) * c)
```

for the single-coefficient butterfly instruction, and:

```
fdct_round_shift(a * c1 +/- b * c2)
```

for the double-coefficient butterfly instruction.

In a 32-bit context `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

```
#define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))
```
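
As a quick illustration of the rounding behaviour, the macro above can be wrapped in a pair of helper functions. This is only a sketch: the names `round_power_of_two` and `fdct_round_shift_model` are hypothetical, 64-bit intermediates are assumed so that a product cannot overflow before the shift, and only non-negative inputs are considered because right-shifting negative values is implementation-defined in C.

```c
#include <stdint.h>

/* Hypothetical helper mirroring ROUND_POWER_OF_TWO: add half of 2^n,
 * then shift right by n. Assumes 1 <= n <= 62 and value >= 0. */
int64_t round_power_of_two(int64_t value, unsigned n)
{
    return (value + ((int64_t)1 << (n - 1))) >> n;
}

/* fdct_round_shift as defined in the text for the 32-bit context (n = 14). */
int64_t fdct_round_shift_model(int64_t x)
{
    return round_power_of_two(x, 14);
}
```

For example, taking 11585 purely as a sample 14-bit fixed-point coefficient, `fdct_round_shift_model(3 * 11585)` adds 8192 to 34755 and shifts right by 14, giving 2. An input of exactly 8192 (half of 2^14) rounds up to 1, while 8191 rounds down to 0.
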

These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including (but not limited to) VP8/VP9 and AV1.
ARM includes special instructions to optimize these operations, although
they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.

The suggestion is to have a single instruction that calculates both values,
`((a + b) * c) >> N` and `((a - b) * c) >> N`. The instruction will
run in accumulate mode, so to calculate the 2-coefficient version
one would just call the same instruction again with `a` and `b` in a
different order and with a different constant `c`.

Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/tags/v1.13.0/vpx_dsp/fwd_txfm.c#132>:

```
#include <stdint.h>

#define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))

/* Twin butterfly: both rounded results share the same coefficient. */
void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
    t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
    t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
}
```

Eight instructions are required for this, replaced by just the one (`maddsubrs`):

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

```
|0     |6     |11    |16    |21    |26    |31 |
| PO   | RT   | RA   | RB   | SH   | XO   |Rc |
```

* maddsubrs RT,RA,SH,RB

```
prod1 <- MULS(RB, sum)
prod2 <- MULS(RB, diff)
prod1_lo <- prod1[XLEN:(XLEN*2) - 1]
prod2_lo <- prod2[XLEN:(XLEN*2) - 1]
round <- [0]*(XLEN*2)
round[XLEN*2 - n] <- 1
prod1 <- prod1 + round
prod2 <- prod2 + round
m <- MASK(XLEN - n - 2, XLEN - 1)
res1 <- prod1[XLEN - n:XLEN*2 - n - 1]
res2 <- prod2[XLEN - n:XLEN*2 - n - 1]
smask1 <- ([signbit1]*XLEN) & ¬m
smask2 <- ([signbit2]*XLEN) & ¬m
RT <- (res1 | smask1)
RS <- (res2 | smask2)
```

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
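
The Scalar behaviour, including the implicit `RS = RT+1` destination, can be modelled in a few lines of C. This is a minimal sketch rather than a reference implementation: `regfile_t`, `round_shift` and `maddsubrs_model` are hypothetical names, the operand roles (`RT` holds `a`, `RA` holds `b`, `RB` holds `c`) are assumed from the rationale and the worked register example in this document, and XLEN-bit truncation and overflow are ignored.

```c
#include <stdint.h>

/* Hypothetical stand-in for the 32-entry GPR file. */
typedef struct {
    int64_t gpr[32];
} regfile_t;

/* Round-to-nearest right shift; sh == 0 means "no shift, no rounding". */
int64_t round_shift(int64_t v, unsigned sh)
{
    if (sh == 0)
        return v;
    return (v + ((int64_t)1 << (sh - 1))) >> sh;
}

/* Assumed Scalar semantics:
 *   RT <- round_shift((RT + RA) * RB, SH)
 *   RS <- round_shift((RT - RA) * RB, SH), with RS = RT + 1.
 * All sources are read before any destination is written.
 * The caller must ensure rt + 1 < 32. */
void maddsubrs_model(regfile_t *rf, unsigned rt, unsigned ra,
                     unsigned sh, unsigned rb)
{
    int64_t a = rf->gpr[rt];
    int64_t b = rf->gpr[ra];
    int64_t c = rf->gpr[rb];

    rf->gpr[rt]     = round_shift((a + b) * c, sh);
    rf->gpr[rt + 1] = round_shift((a - b) * c, sh);
}
```

With `a = 100` in register 1, `b = 50` in register 10 and a sample coefficient of 11585 in register 11, `maddsubrs_model(&rf, 1, 10, 14, 11)` leaves 106 in register 1 and 35 in register 2.
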

Special Registers Altered:

# [DRAFT] Integer Butterfly Multiply Add/Sub and Accumulate FFT/DCT

```
prod_lo <- prod[XLEN:(XLEN*2) - 1]
prod1 <- MULS(RB, sum)
prod1_lo <- prod1[XLEN:(XLEN*2)-1]
prod2 <- MULS(RB, diff)
prod2_lo <- prod2[XLEN:(XLEN*2)-1]
prod1_lo <- prod1_lo + round
prod2_lo <- prod2_lo + round
m <- MASK(n, (XLEN-1))
res1 <- ROTL64(prod1_lo, XLEN-n) & m
res2 <- ROTL64(prod2_lo, XLEN-n) & m
signbit1 <- prod1_lo[0]
signbit2 <- prod2_lo[0]
smask1 <- ([signbit1]*XLEN) & ¬m
smask2 <- ([signbit2]*XLEN) & ¬m
RT <- (res1 | smask1)
RS <- (res2 | smask2)
```

Special Registers Altered:

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).

This instruction is intended to be used together with `maddsubrs`
to produce the double-coefficient butterfly. For that to work,
`c2-c1` has to be passed as the coefficient instead of `c2`.

In essence, the quantity `a * c1 +/- b * c1` is calculated first, with
`maddsubrs` *without* shifting (so `SH=0`). Then `b * (c2-c1)` is added to
the previous `RT` and subtracted from the previous `RS`, and *then* the
shifting is done.
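
The decomposition works because `(a + b) * c1 + b * (c2 - c1) = a * c1 + b * c2` and `(a - b) * c1 - b * (c2 - c1) = a * c1 - b * c2`. The following C sketch walks through the same two steps on plain integers. The function names are hypothetical, 64-bit arithmetic without overflow is assumed, and the shift amount of 14 is taken from the `fdct_round_shift` definition earlier in this document.

```c
#include <stdint.h>

/* Round-to-nearest right shift by n bits (n >= 1). */
int64_t round_shift_n(int64_t v, unsigned n)
{
    return (v + ((int64_t)1 << (n - 1))) >> n;
}

/* Two-step evaluation of a*c1 +/- b*c2, mirroring the text:
 *   step 1 (maddsubrs, SH=0): plus = (a+b)*c1, minus = (a-b)*c1
 *   step 2 (maddrs):          plus += b*(c2-c1), minus -= b*(c2-c1),
 *                             then both are round-shifted by 14. */
void two_coeff_butterfly(int64_t a, int64_t b, int64_t c1, int64_t c2,
                         int64_t *plus, int64_t *minus)
{
    int64_t acc_plus  = (a + b) * c1;   /* unshifted RT after step 1 */
    int64_t acc_minus = (a - b) * c1;   /* unshifted RS after step 1 */
    int64_t delta     = b * (c2 - c1);  /* coefficient passed is c2-c1 */

    *plus  = round_shift_n(acc_plus  + delta, 14);
    *minus = round_shift_n(acc_minus - delta, 14);
}
```

With sample values `a = 100`, `b = 50`, `c1 = 15137` and `c2 = 6270`, the two outputs are 112 and 73, identical to rounding `a * c1 + b * c2` and `a * c1 - b * c2` directly.
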

In the following example, assume `a` in `R1`, `b` in `R10`, `c1` in `R11` and `c2 - c1` in `R12`.
The first instruction will put `a * c1 + b * c1` in `R1` (`RT`) and `a * c1 - b * c1` in `RS`
(here, `RS = RT + 1`, so `R2`).
Then, `maddrs` will add `b * (c2 - c1)` to `R1` (`RT`), subtract it from `R2` (`RS`), and then
round-shift both quantities right by 14 bits:

In scalar code, that would take ~16 instructions for both operations.

# Twin Butterfly Floating-Point DCT and FFT Instruction(s)

**Add the following to Book I Section 4.6.6.3**

## Floating-Point Twin Multiply-Add DCT [Single]

```
|0     |6     |11    |16    |21    |31 |
| PO   | FRT  | FRA  | FRB  | XO   |Rc |
```

* fdmadds FRT,FRA,FRB (Rc=0)

```
FRS <- FPADD32(FRT, FRB)
sub <- FPSUB32(FRT, FRB)
FRT <- FPMUL32(FRA, sub)
```

The two IEEE754-FP32 operations

```
FRS <- [(FRT) + (FRB)]
FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then rounded before being multiplied by FRA to create an
intermediate result that is stored in FRT.
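
The separate rounding of the subtraction before the multiply can be made explicit in C. This is a hedged sketch, not a bit-exact model: `twin_result_t` and `fdmadds_model` are hypothetical names, and it assumes a compiler that evaluates `float` expressions in single precision (no excess intermediate precision).

```c
/* Result pair of the hypothetical model: implicit FRS and the new FRT. */
typedef struct {
    float frs;
    float frt;
} twin_result_t;

twin_result_t fdmadds_model(float frt, float fra, float frb)
{
    twin_result_t r;
    float sub;

    r.frs = frt + frb;   /* FRS <- FPADD32(FRT, FRB) */
    sub   = frt - frb;   /* difference is rounded to FP32 here... */
    r.frt = fra * sub;   /* ...and only then multiplied: FPMUL32(FRA, sub) */
    return r;
}
```

Both results are computed from the original `frt` and `frb` values, which is why the inputs are passed by value and the outputs returned together. For instance, `fdmadds_model(3.0f, 0.5f, 1.0f)` yields `frs = 4.0f` and `frt = 1.0f`.
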

The add into FRS is treated exactly as `fadds`. The creation of the
result FRT is **not** the same as that of `fmsubs`, but is instead as if
`fsubs` were performed first followed by `fmuls`. FRS and FRT are created
by parallel independent operations which occur at exactly the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Multiply-Add FFT [Single]

```
|0     |6     |11    |16    |21    |31 |
| PO   | FRT  | FRA  | FRB  | XO   |Rc |
```

* ffmadds FRT,FRA,FRB (Rc=0)

```
FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
```

```
FRS <- -([(FRT) * (FRA)] - (FRB))
FRT <- [(FRT) * (FRA)] + (FRB)
```

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the intermediate
stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create
FRS, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
intermediate stored in FRT.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.
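
Following the `fmadds`/`fnmsubs` description and the pseudocode above, a rough C model might look as follows. It is an approximation under stated assumptions: `fft_result_t` and `ffmadds_model` are hypothetical names, and a double-precision intermediate stands in for a true fused multiply-add. The product of two FP32 values is exact in double precision, but the sum is rounded twice (to double, then to float), so rare results may differ from a genuinely fused operation such as C99 `fmaf`.

```c
/* Result pair of the hypothetical model: implicit FRS and the new FRT. */
typedef struct {
    float frs;
    float frt;
} fft_result_t;

fft_result_t ffmadds_model(float frt, float fra, float frb)
{
    fft_result_t r;
    /* 24-bit x 24-bit significands fit in a double, so prod is exact. */
    double prod = (double)frt * (double)fra;

    r.frt = (float)(prod + (double)frb);     /* as if fmadds:  (FRT*FRA) + FRB   */
    r.frs = (float)(-(prod - (double)frb));  /* as if fnmsubs: -((FRT*FRA) - FRB) */
    return r;
}
```

Both outputs are derived from the same original operands, matching the requirement that the two operations behave as if performed in parallel. With `frt = 2.0f`, `fra = 3.0f` and `frb = 1.0f`, the model returns `frt = 7.0f` and `frs = -5.0f`.
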

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

## Floating-Point Twin Multiply-Add DCT

```
|0     |6     |11    |16    |21    |31 |
| PO   | FRT  | FRA  | FRB  | XO   |Rc |
```

* fdmadd FRT,FRA,FRB (Rc=0)

```
FRS <- FPADD64(FRT, FRB)
sub <- FPSUB64(FRT, FRB)
FRT <- FPMUL64(FRA, sub)
```

The two IEEE754-FP64 operations

```
FRS <- [(FRT) + (FRB)]
FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then rounded before being multiplied by FRA to create an
intermediate result that is stored in FRT.

The add into FRS is treated exactly as `fadd`. The creation of the
result FRT is **not** the same as that of `fmsub`, but is instead as if
`fsub` were performed first followed by `fmul`. FRS and FRT are created
by parallel independent operations which occur at exactly the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Twin Multiply-Add FFT

```
|0     |6     |11    |16    |21    |31 |
| PO   | FRT  | FRA  | FRB  | XO   |Rc |
```

* ffmadd FRT,FRA,FRB (Rc=0)

```
FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
```

```
FRS <- -([(FRT) * (FRA)] - (FRB))
FRT <- [(FRT) * (FRA)] + (FRB)
```

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the intermediate
stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create
FRS, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
intermediate stored in FRT.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Add FFT/DCT [Single]

```
|0     |6     |11    |16    |21    |26    |31 |
| PO   | FRT  | FRA  | FRB  | /    | XO   |Rc |
```

* ffadds FRT,FRA,FRB (Rc=0)

```
FRT <- FPADD32(FRA, FRB)
FRS <- FPSUB32(FRB, FRA)
```

Special Registers Altered:

## Floating-Point Add FFT/DCT [Double]

```
|0     |6     |11    |16    |21    |26    |31 |
| PO   | FRT  | FRA  | FRB  | /    | XO   |Rc |
```

* ffadd FRT,FRA,FRB (Rc=0)

```
FRT <- FPADD64(FRA, FRB)
FRS <- FPSUB64(FRB, FRA)
```

Special Registers Altered:

## Floating-Point Subtract FFT/DCT [Single]

```
|0     |6     |11    |16    |21    |26    |31 |
| PO   | FRT  | FRA  | FRB  | /    | XO   |Rc |
```

* ffsubs FRT,FRA,FRB (Rc=0)

```
FRT <- FPSUB32(FRB, FRA)
FRS <- FPADD32(FRA, FRB)
```

Special Registers Altered:

## Floating-Point Subtract FFT/DCT [Double]

```
|0     |6     |11    |16    |21    |26    |31 |
| PO   | FRT  | FRA  | FRB  | /    | XO   |Rc |
```

* ffsub FRT,FRA,FRB (Rc=0)

```
FRT <- FPSUB64(FRB, FRA)
FRS <- FPADD64(FRA, FRB)
```

Special Registers Altered: