* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

Although best used with SVP64 REMAP, these instructions may also be used in a
Scalar-only context to save considerably on DCT, DFT and FFT processing. Although
some hardware implementations may not implement them efficiently (for example
through slower micro-coding), savings still come from the reduction in temporary
registers as well as in instruction count.

# Rationale for Twin Butterfly Integer DCT Instruction(s)

The number of general-purpose uses for DCT is huge. The number of
instructions needed in place of one of these Twin-Butterfly instructions
is also large (**eight**), and because it is extremely common to explicitly
loop-unroll them, sequences of hundreds to thousands of instructions are
dismayingly common (for all ISAs).

The goal is to implement instructions that calculate the expression:

    fdct_round_shift((a +/- b) * c)

for the single-coefficient butterfly instruction, and:

    fdct_round_shift(a * c1 +/- b * c2)

for the double-coefficient butterfly instruction.

In a 32-bit context `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))

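As a quick illustration of the rounding behaviour, the macro can be written as a small C function. This is a minimal sketch, not code from libvpx: the name `fdct_round_shift_sketch` is hypothetical, and a 64-bit intermediate is assumed so that adding the rounding constant cannot overflow.

```c
#include <stdint.h>

/* Hypothetical helper: round-to-nearest right shift by 14 bits,
 * i.e. ROUND_POWER_OF_TWO(x, 14) as defined in the text. */
static int64_t fdct_round_shift_sketch(int64_t x) {
    const unsigned n = 14;
    /* Adding half of 2^n before shifting rounds exact halves upwards. */
    return (x + ((int64_t)1 << (n - 1))) >> n;
}
```

With this definition an input of 8191 yields 0 and an input of 8192 yields 1. Negative inputs rely on `>>` being an arithmetic shift, which C leaves implementation-defined.
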
These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including (but not limited to) VP8/VP9 and AV1.
ARM includes special instructions to optimize these operations, although
they are limited in precision: see the NEON intrinsics
`vqrdmulhq_s16`/`vqrdmulhq_s32`.

The suggestion is to have a single instruction that calculates both values,
`((a + b) * c) >> N` and `((a - b) * c) >> N`. The instruction will
run in accumulate mode, so calculating the 2-coefficient version only
requires calling the same instruction again with `a` and `b` in a different
order and with a different constant `c`.

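The pair of values described above can be sketched as a small reference model in C. The type `twin_result` and function `twin_butterfly` are hypothetical names introduced only for illustration, and the model assumes the rounding of `fdct_round_shift` is applied before the shift.

```c
#include <stdint.h>

/* Both outputs of one twin butterfly step. */
typedef struct {
    int64_t sum_term;   /* ((a + b) * c), rounded and shifted */
    int64_t diff_term;  /* ((a - b) * c), rounded and shifted */
} twin_result;

/* Hypothetical reference model; n is the shift amount N (0 means no shift). */
static twin_result twin_butterfly(int64_t a, int64_t b, int64_t c, unsigned n) {
    /* Rounding constant: half of 2^n, or nothing when no shift is requested. */
    int64_t round = (n > 0) ? ((int64_t)1 << (n - 1)) : 0;
    twin_result r;
    r.sum_term  = ((a + b) * c + round) >> n;
    r.diff_term = ((a - b) * c + round) >> n;
    return r;
}
```

For example, with `a = 100`, `b = 28`, `c = 11585` (an illustrative 14-bit fixed-point coefficient) and `n = 14`, the model returns 91 and 51.
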
Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/tags/v1.13.0/vpx_dsp/fwd_txfm.c#132>:

    #include <stdint.h>

    /* Round-to-nearest right shift by n bits (n >= 1). */
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))

    /* t[0] receives the rounded sum term, t[1] the rounded difference term. */
    void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
        t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
        t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
    }

Eight instructions are required for this; they are replaced by just one (`maddsubrs`).

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

| 0  | 6  | 11 | 16 | 21 | 26 | 31 |
|----|----|----|----|----|----|----|
| PO | RT | RA | RB | SH | XO | Rc |

* maddsubrs RT,RA,SH,RB

Pseudo-code:

    prod1 <- MULS(RB, sum)
    prod2 <- MULS(RB, diff)
    prod1_lo <- prod1[XLEN:(XLEN*2) - 1]
    prod2_lo <- prod2[XLEN:(XLEN*2) - 1]
    round <- [0]*(XLEN*2)
    round[XLEN*2 - n] <- 1
    prod1 <- prod1 + round
    prod2 <- prod2 + round
    m <- MASK(XLEN - n - 2, XLEN - 1)
    res1 <- prod1[XLEN - n:XLEN*2 - n - 1]
    res2 <- prod2[XLEN - n:XLEN*2 - n - 1]
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    RT <- (res1 | smask1)
    RS <- (res2 | smask2)

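The rounding and slicing steps of the pseudo-code use MSB0 bit numbering, which can obscure what they compute. The following C sketch, assuming `XLEN = 32` (so the product is 64 bits wide) and a shift of `1 <= n <= 31`, shows that `round[XLEN*2 - n] <- 1` sets the bit worth `2^(n-1)` and that the slice `prod[XLEN - n:XLEN*2 - n - 1]` is the product shifted right by `n` and truncated to `XLEN` bits. The helper names are hypothetical, and the sign-mask step (`smask1`/`smask2`) is deliberately left out.

```c
#include <stdint.h>

/* Extract bits first..last (inclusive) of a 64-bit value using MSB0
 * numbering, where bit 0 is the most significant bit. */
static uint64_t msb0_slice(uint64_t v, unsigned first, unsigned last) {
    unsigned width = last - first + 1;
    uint64_t shifted = v >> (63 - last);
    return (width >= 64) ? shifted : (shifted & ((UINT64_C(1) << width) - 1));
}

/* Rounding and slicing for an assumed XLEN of 32 and 1 <= n <= 31. */
static uint32_t round_and_slice(int64_t prod, unsigned n) {
    const unsigned xlen = 32;
    /* MSB0 bit (XLEN*2 - n) of a 64-bit value is LSB0 bit (n - 1). */
    uint64_t round = UINT64_C(1) << (63 - (2 * xlen - n));
    uint64_t p = (uint64_t)prod + round;
    /* MSB0 bits (XLEN - n)..(XLEN*2 - n - 1) are LSB0 bits (XLEN+n-1)..n. */
    return (uint32_t)msb0_slice(p, xlen - n, 2 * xlen - n - 1);
}
```

For a product of `128 * 11585 = 1482880` and `n = 14` this returns 91, the same value as `ROUND_POWER_OF_TWO(1482880, 14)`.
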
Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).

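The rule for locating the implicit `RS` register can be written as a one-line helper. This is only a sketch of the paragraph above: the function `implicit_rs` and its parameters are hypothetical, and it returns the register number at which `RS` starts, nothing more.

```c
#include <stdbool.h>

/* Hypothetical helper: register number of the implicit RS result.
 * Scalar RT: RS = RT + 1.
 * Vector RT (SVP64): RS starts right after the MAXVL registers
 * occupied by the RT vector. */
static unsigned implicit_rs(unsigned rt, bool rt_is_vector, unsigned maxvl) {
    return rt_is_vector ? rt + maxvl : rt + 1;
}
```

For instance, a scalar `RT` of 1 gives `RS = 2`, while a vector `RT` starting at register 8 with `MAXVL = 4` gives an `RS` vector starting at register 12.
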
Special Registers Altered:

# [DRAFT] Integer Butterfly Multiply Add/Sub and Accumulate FFT/DCT

Pseudo-code:

    prod_lo <- prod[XLEN:(XLEN*2) - 1]
    res1[0:XLEN*2-1] <- (EXTSXL((RT)[0], 1) || (RT)) + prod
    res2[0:XLEN*2-1] <- (EXTSXL((RS)[0], 1) || (RS)) - prod
    round[XLEN*2 - n] <- 1
    m <- MASK(XLEN - n - 2, XLEN - 1)
    res1 <- res1[XLEN - n:XLEN*2 - n - 1]
    res2 <- res2[XLEN - n:XLEN*2 - n - 1]
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    RT <- (res1 | smask1)
    RS <- (res2 | smask2)

Special Registers Altered:

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).

This instruction is intended to be used together with `maddsubrs`
to produce the double-coefficient butterfly. For that to work, `c2 - c1`
must be passed as the coefficient instead of `c2`.

In essence, the quantity `a * c1 +/- b * c1` is calculated first, with
`maddsubrs` *without* shifting (so `SH=0`). Then `b * (c2 - c1)` is added to
the previous `RT` and subtracted from the previous `RS`, and only *then* is
the shifting done.

In the following example, assume `a` in `R1`, `b` in `R10`, `c1` in `R11` and `c2 - c1` in `R12`.
The first instruction will put `a * c1 + b * c1` in `R1` (`RT`) and `a * c1 - b * c1` in `RS`
(here, `RS = RT + 1`, so `R2`).
Then, `maddrs` will add `b * (c2 - c1)` to `R1` (`RT`), subtract it from `R2` (`RS`), and then
round-shift both quantities right by 14 bits.

In scalar code, that would take ~16 instructions for both operations.

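The two-step scheme described above can be checked numerically against a direct evaluation of `fdct_round_shift(a * c1 +/- b * c2)`. The following C sketch is a value-level model only, not an emulation of the instructions: the function names are hypothetical, 64-bit intermediates are assumed so that nothing overflows, and the coefficients used in the usage note are arbitrary 14-bit fixed-point values.

```c
#include <stdint.h>

/* Round-to-nearest right shift; n == 0 leaves the value unchanged. */
static int64_t round_shift(int64_t x, unsigned n) {
    return n ? (x + ((int64_t)1 << (n - 1))) >> n : x;
}

/* Direct evaluation of fdct_round_shift(a*c1 +/- b*c2). */
static void direct(int64_t a, int64_t b, int64_t c1, int64_t c2,
                   int64_t *plus, int64_t *minus) {
    *plus  = round_shift(a * c1 + b * c2, 14);
    *minus = round_shift(a * c1 - b * c2, 14);
}

/* Two-step evaluation following the text. */
static void two_step(int64_t a, int64_t b, int64_t c1, int64_t c2,
                     int64_t *plus, int64_t *minus) {
    /* Step 1 (maddsubrs with SH=0): (a +/- b) * c1, no shift. */
    int64_t rt = (a + b) * c1;
    int64_t rs = (a - b) * c1;
    /* Step 2 (maddrs with SH=14): accumulate b * (c2 - c1), then round-shift. */
    int64_t prod = b * (c2 - c1);
    *plus  = round_shift(rt + prod, 14);
    *minus = round_shift(rs - prod, 14);
}
```

With `a = 100`, `b = 28`, `c1 = 15137` and `c2 = 6270`, both functions produce 103 and 82. The intermediate terms `b * c1` cancel exactly, which is why passing `c2 - c1` works even when it is negative.
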
# Twin Butterfly Floating-Point DCT and FFT Instruction(s)

**Add the following to Book I Section 4.6.6.3**

## Floating-Point Twin Multiply-Add DCT [Single]

| 0  | 6   | 11  | 16  | 21 | 31 |
|----|-----|-----|-----|----|----|
| PO | FRT | FRA | FRB | XO | Rc |

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)

The two IEEE754-FP32 operations

    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT.
The result is rounded and then multiplied by FRA to create an
intermediate result that is stored in FRT.

The add into FRS is treated exactly as `fadds`. The creation of the
result FRT is **not** the same as that of `fmsubs`, but is instead as if
`fsubs` were performed first followed by `fmuls`. The creation of FRS
and FRT are treated as parallel independent operations which occur at
exactly the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

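The data flow of `fdmadds` can be sketched at value level in C. This is an illustration only: the function `fdmadds_model` is a hypothetical name, exception flags and FPSCR effects are ignored, and the C implementation is assumed to evaluate `float` expressions in single precision.

```c
/* Hypothetical value-level model of fdmadds. On entry *frt holds the FRT
 * operand; on return *frt and *frs hold the two results. */
static void fdmadds_model(float *frt, float *frs, float fra, float frb) {
    float a = *frt;          /* read FRT once so both results use the same input */
    float sum = a + frb;     /* as fadds */
    float sub = a - frb;     /* as fsubs: rounded to single precision here ... */
    *frs = sum;
    *frt = fra * sub;        /* ... and only then multiplied by FRA, as fmuls */
}
```

Starting from `FRT = 3.0`, `FRA = 0.5` and `FRB = 1.0`, the model leaves `FRS = 4.0` and `FRT = 1.0`. Because the subtraction is rounded before the multiply, the result can differ from a fused operation such as `fmsubs`.
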
Special Registers Altered:

## Floating-Point Twin Multiply-Add FFT [Single]

| 0  | 6   | 11  | 16  | 21 | 31 |
|----|-----|-----|-----|----|----|
| PO | FRT | FRA | FRB | XO | Rc |

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)

The two IEEE754-FP32 operations

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)

are simultaneously performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same input values of FRT, FRA and FRB, the
floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

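The signs and the Read-Modify-Write behaviour of `ffmadds` can be sketched at value level in C, following the pseudo-code `FRT <- (FRT * FRA) + FRB` and `FRS <- -((FRT * FRA) - FRB)`. The function `ffmadds_model` is a hypothetical name. The sketch evaluates the product in double precision, where it is exact for single-precision inputs, but it is an approximation: the final conversion to `float` adds a second rounding, so it does not reproduce the single rounding of a true fused multiply-add in every case (the standard `fmaf` function from `<math.h>` would).

```c
/* Hypothetical value-level model of ffmadds (dataflow and signs only). */
static void ffmadds_model(float *frt, float *frs, float fra, float frb) {
    /* The product of two floats is exactly representable as a double. */
    double prod = (double)*frt * (double)fra;
    /* FRS <- -((FRT * FRA) - FRB), in the manner of fnmsubs. */
    *frs = (float)-(prod - (double)frb);
    /* FRT <- (FRT * FRA) + FRB, in the manner of fmadds.
     * Written last because FRT is also an input (Read-Modify-Write). */
    *frt = (float)(prod + (double)frb);
}
```

Starting from `FRT = 2.0`, `FRA = 3.0` and `FRB = 1.0`, the model leaves `FRT = 7.0` and `FRS = -5.0`.
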
Special Registers Altered:

## Floating-Point Twin Multiply-Add DCT [Double]

| 0  | 6   | 11  | 16  | 21 | 31 |
|----|-----|-----|-----|----|----|
| PO | FRT | FRA | FRB | XO | Rc |

* fdmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPADD64(FRT, FRB)
    sub <- FPSUB64(FRT, FRB)
    FRT <- FPMUL64(FRA, sub)

The two IEEE754-FP64 operations

    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT.
The result is rounded and then multiplied by FRA to create an
intermediate result that is stored in FRT.

The add into FRS is treated exactly as `fadd`. The creation of the
result FRT is **not** the same as that of `fmsub`, but is instead as if
`fsub` were performed first followed by `fmul`. The creation of FRS
and FRT are treated as parallel independent operations which occur at
exactly the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Twin Multiply-Add FFT [Double]

| 0  | 6   | 11  | 16  | 21 | 31 |
|----|-----|-----|-----|----|----|
| PO | FRT | FRA | FRB | XO | Rc |

* ffmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)

The two IEEE754-FP64 operations

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)

are simultaneously performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same input values of FRT, FRA and FRB, the
floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Add FFT/DCT [Single]

| 0  | 6   | 11  | 16  | 21 | 26 | 31 |
|----|-----|-----|-----|----|----|----|
| PO | FRT | FRA | FRB | /  | XO | Rc |

* ffadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)

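The pseudo-code for `ffadds`, and for its mirror image `ffsubs` defined later in this section, can be sketched at value level in C. The type `fp_pair` and the function names are hypothetical, and exception flags and FPSCR effects are ignored.

```c
/* Result pair: frt is the named destination, frs the implicit one. */
typedef struct {
    float frt;
    float frs;
} fp_pair;

/* ffadds: FRT <- FRA + FRB, FRS <- FRB - FRA. */
static fp_pair ffadds_model(float fra, float frb) {
    fp_pair r = { fra + frb, frb - fra };
    return r;
}

/* ffsubs: FRT <- FRB - FRA, FRS <- FRA + FRB. */
static fp_pair ffsubs_model(float fra, float frb) {
    fp_pair r = { frb - fra, fra + frb };
    return r;
}
```

With `FRA = 1.5` and `FRB = 4.0`, `ffadds_model` returns `FRT = 5.5`, `FRS = 2.5`, and `ffsubs_model` returns the same two values with the destinations swapped. Note that the subtraction is always `FRB - FRA`.
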
Special Registers Altered:

## Floating-Point Add FFT/DCT [Double]

| 0  | 6   | 11  | 16  | 21 | 26 | 31 |
|----|-----|-----|-----|----|----|----|
| PO | FRT | FRA | FRB | /  | XO | Rc |

* ffadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)

Special Registers Altered:

## Floating-Point Subtract FFT/DCT [Single]

| 0  | 6   | 11  | 16  | 21 | 26 | 31 |
|----|-----|-----|-----|----|----|----|
| PO | FRT | FRA | FRB | /  | XO | Rc |

* ffsubs FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)

Special Registers Altered:

## Floating-Point Subtract FFT/DCT [Double]

| 0  | 6   | 11  | 16  | 21 | 26 | 31 |
|----|-----|-----|-----|----|----|----|
| PO | FRT | FRA | FRB | /  | XO | Rc |

* ffsub FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)

Special Registers Altered: