* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]
# Rationale for Twin Butterfly Integer DCT Instruction(s)

The number of general-purpose uses for DCT is huge. The number of
instructions needed in place of each of these Twin-Butterfly instructions
is also large (**eight**), and because it is extremely common to
explicitly loop-unroll them, sequences of hundreds to thousands of
instructions are dismayingly common (for all ISAs).
The goal is to implement instructions that calculate the expression:

```
    fdct_round_shift((a +/- b) * c)
```

for the single-coefficient butterfly instruction, and:

```
    fdct_round_shift(a * c1 +/- b * c2)
```

for the double-coefficient butterfly instruction.

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

```
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
```
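
To make the rounding behaviour concrete, here is a minimal Python sketch of the macro. The function names are illustrative only. Python's `>>` on negative integers is an arithmetic (floor) shift, which is assumed here to match what the C macro does on typical compilers.

```python
def round_power_of_two(value: int, n: int) -> int:
    """Add half of 2**n, then arithmetic-shift right by n (mirrors the C macro)."""
    return (value + (1 << (n - 1))) >> n


def fdct_round_shift(x: int) -> int:
    """ROUND_POWER_OF_TWO(x, 14), as defined in the text."""
    return round_power_of_two(x, 14)


# 1.5 * 2**14 lies exactly halfway between 1 and 2, and rounds up to 2.
assert fdct_round_shift(3 << 13) == 2
# -1.5 * 2**14 is also a tie, and rounds towards +infinity, to -1.
assert fdct_round_shift(-(3 << 13)) == -1
```

Ties therefore always round upwards, for both positive and negative inputs.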
These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including, but not limited to, VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations, although
they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.

The suggestion is to have a single instruction that calculates both values,
`((a + b) * c) >> N` and `((a - b) * c) >> N`. The instruction will
run in accumulate mode, so to calculate the 2-coeff version
one would just call the same instruction again with a and b in a
different order and with a different constant c.
Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:

```
    #include <stdint.h>
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
    void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
        t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
        t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
    }
```
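
The following Python sketch mirrors `twin_int` one primitive operation per line, which gives one plausible way of counting eight separate operations for a single butterfly. The constant `COSPI_16_64 = 11585` and the helper `to_int16` are assumptions added for illustration; they are not taken from this page.

```python
COSPI_16_64 = 11585  # assumed value: round(cos(pi/4) * 2**14)


def to_int16(v: int) -> int:
    """Wrap to a signed 16-bit value, like the int16_t stores in the C code."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def twin_int(x0: int, x1: int, c: int = COSPI_16_64) -> tuple[int, int]:
    s = x0 + x1      # 1: add
    d = x0 - x1      # 2: subtract
    s *= c           # 3: multiply
    d *= c           # 4: multiply
    s += 1 << 13     # 5: add rounding constant
    d += 1 << 13     # 6: add rounding constant
    s >>= 14         # 7: arithmetic shift right
    d >>= 14         # 8: arithmetic shift right
    return to_int16(s), to_int16(d)


print(twin_int(100, 50))  # (106, 35)
```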
Eight instructions are required, which are replaced by just the one (`maddsubrs`):
## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

```
    |0     |6     |11    |16    |21    |26    |31 |
    | PO   |  RT  |  RA  |  RB  |  SH  |  XO  |Rc |
```

* maddsubrs RT,RA,SH,RB

```
    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1] + 1
    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1] + 1
    res1 <- ROTL64(prod1, XLEN-n)
    res2 <- ROTL64(prod2, XLEN-n)
    m <- MASK(n, (XLEN-1))
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    s64_1 <- [0]*(XLEN-1) || signbit1
    s64_2 <- [0]*(XLEN-1) || signbit2
    RT <- (res1 & m | smask1) + s64_1
    RS <- (res2 & m | smask2) + s64_2
```
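
The rotate, mask and sign-fill steps can be read as an arithmetic right shift by `n`, followed by adding the sign bit. The Python sketch below models only those steps, from `ROTL64` onwards, for one product with XLEN=64. It assumes that `signbit` is the most significant bit of the product, since its definition is not shown in the listing, and that `MASK(n, XLEN-1)` uses MSB0 numbering. The name `scale_product` is hypothetical.

```python
XLEN = 64
ALL_ONES = (1 << XLEN) - 1


def rotl64(x: int, k: int) -> int:
    """Rotate a 64-bit value left by k bits."""
    k %= XLEN
    return ((x << k) | (x >> (XLEN - k))) & ALL_ONES


def to_signed(x: int) -> int:
    """Interpret a 64-bit pattern as a two's-complement integer."""
    return x - (1 << XLEN) if x >> (XLEN - 1) else x


def scale_product(prod: int, n: int) -> int:
    prod &= ALL_ONES
    signbit = prod >> (XLEN - 1)    # assumption: MSB of the product
    res = rotl64(prod, XLEN - n)    # rotate left by XLEN-n == rotate right by n
    m = ALL_ONES >> n               # MASK(n, XLEN-1): ones in MSB0 bits n..63
    smask = (ALL_ONES if signbit else 0) & ~m & ALL_ONES  # sign fill, top n bits
    return ((res & m | smask) + signbit) & ALL_ONES


# Same as an arithmetic shift right, plus one when the product is negative.
for v in (-5, 5, -16384, 12345, -1):
    for n in (0, 1, 14):
        expected = (v >> n) + (1 if v < 0 else 0)
        assert to_signed(scale_product(v & ALL_ONES, n)) == expected
```

Under these assumptions the final step equals `(prod >> n) + signbit`, so a negative product is nudged up by one after the shift.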

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64, if
`RT` is a Vector, `RS` begins immediately after the Vector `RT`, where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
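
The register numbering rule for the implicit result can be sketched as follows. This is an illustration only: the function names are hypothetical, elements are assumed to step by one register each, and no register-file bounds checking is modelled.

```python
def implicit_rs(rt: int, maxvl: int, rt_is_vector: bool) -> int:
    """Register number where the implicit RS result starts."""
    return rt + maxvl if rt_is_vector else rt + 1


def element_pairs(rt: int, vl: int, maxvl: int) -> list[tuple[int, int]]:
    """(RT, RS) register numbers for each element when RT is a Vector."""
    rs = implicit_rs(rt, maxvl, rt_is_vector=True)
    return [(rt + i, rs + i) for i in range(vl)]


print(implicit_rs(4, maxvl=8, rt_is_vector=False))  # 5
print(element_pairs(4, vl=2, maxvl=8))              # [(4, 12), (5, 13)]
```

The same rule applies to the implicit `FRS` result of the floating-point instructions.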

Special Registers Altered:

# Twin Butterfly Floating-Point DCT Instruction(s)

**Add the following to Book I Section 4.6.6.3**

## Floating-Point Twin Multiply-Add DCT [Single]

```
    |0     |6     |11    |16    |21    |31 |
    | PO   |  FRT |  FRA |  FRB |  XO  |Rc |
```

* fdmadds FRT,FRA,FRB (Rc=0)

```
    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)
```

The two IEEE754-FP32 operations

```
    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then rounded before being multiplied by FRA to create an
intermediate result that is stored in FRT.

The add into FRS is treated exactly as `fadds`. The creation of the
result FRT is **not** the same as that of `fmsubs`, but is instead as if
`fsubs` were performed first followed by `fmuls`. The creation of FRS
and FRT are treated as parallel independent operations which occur at
the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.
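
The separate rounding of the subtraction and the multiplication can be modelled as below. This is a rough sketch, not the reference implementation: it rounds to FP32 through `struct`, assumes the inputs are already FP32 values, and ignores FPSCR flags, exceptions and overflow handling. The Python function names are illustrative.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fdmadds(frt: float, fra: float, frb: float) -> tuple[float, float]:
    """Return the new (FRT, FRS) pair, both computed from the same inputs."""
    frs = f32(frt + frb)       # as fadds
    sub = f32(frt - frb)       # as fsubs: rounded before the multiply
    new_frt = f32(fra * sub)   # as fmuls on the rounded difference
    return new_frt, frs


print(fdmadds(3.0, 0.5, 1.0))  # (1.0, 4.0)
```

Both results are derived from the original FRT and FRB, which is why the model computes `frs` and `sub` before anything is overwritten.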

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64, if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`,
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Multiply-Add FFT [Single]

```
    |0     |6     |11    |16    |21    |31 |
    | PO   |  FRT |  FRA |  FRB |  XO  |Rc |
```

* ffmadds FRT,FRA,FRB (Rc=0)

```
    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
```

```
    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)
```

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
FRT, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64, if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT`, where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

## Floating-Point Twin Multiply-Add DCT

```
    |0     |6     |11    |16    |21    |31 |
    | PO   |  FRT |  FRA |  FRB |  XO  |Rc |
```

* fdmadd FRT,FRA,FRB (Rc=0)

```
    FRS <- FPADD64(FRT, FRB)
    sub <- FPSUB64(FRT, FRB)
    FRT <- FPMUL64(FRA, sub)
```

The two IEEE754-FP64 operations

```
    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then rounded before being multiplied by FRA to create an
intermediate result that is stored in FRT.

The add into FRS is treated exactly as `fadd`. The creation of the
result FRT is **not** the same as that of `fmsub`, but is instead as if
`fsub` were performed first followed by `fmul`. The creation of FRS
and FRT are treated as parallel independent operations which occur at
the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64, if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`,
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Twin Multiply-Add FFT

```
    |0     |6     |11    |16    |21    |31 |
    | PO   |  FRT |  FRA |  FRB |  XO  |Rc |
```

* ffmadd FRT,FRA,FRB (Rc=0)

```
    FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
```

```
    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)
```

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
FRT, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.
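
A small model of the two fused results may help. It is only a sketch: it uses exact rational arithmetic with a single final rounding to FP64 to imitate fused behaviour, handles finite inputs only, and ignores signed zeros, NaNs, infinities and FPSCR state. The helper `fused` is a hypothetical name.

```python
from fractions import Fraction


def fused(a: float, b: float, c: float, negate: bool = False) -> float:
    """Compute a*b + c exactly, optionally negate, then round once to FP64."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(-exact if negate else exact)


def ffmadd(frt: float, fra: float, frb: float) -> tuple[float, float]:
    """Return the new (FRT, FRS) pair, both computed from the same inputs."""
    new_frt = fused(frt, fra, frb)             # as fmadd:  frt*fra + frb
    frs = fused(frt, fra, -frb, negate=True)   # as fnmsub: -(frt*fra - frb)
    return new_frt, frs


print(ffmadd(2.0, 3.0, 1.0))  # (7.0, -5.0)

# A single rounding keeps the low-order product bits that a separate
# multiply followed by an add would lose.
a, b = 1 + 2**-30, 1 - 2**-30
assert ffmadd(a, b, -1.0)[0] == -(2.0**-60)
assert a * b + -1.0 == 0.0
```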

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64, if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`,
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## [DRAFT] Floating-Point Add FFT/DCT [Single]

* ffadds FRT,FRA,FRB (Rc=0)
* ffadds. FRT,FRA,FRB (Rc=1)

```
    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)
```

Special Registers Altered:

## [DRAFT] Floating-Point Add FFT/DCT [Double]

* ffadd FRT,FRA,FRB (Rc=0)
* ffadd. FRT,FRA,FRB (Rc=1)

```
    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)
```

Special Registers Altered:

## [DRAFT] Floating-Point Subtract FFT/DCT [Single]

* ffsubs FRT,FRA,FRB (Rc=0)
* ffsubs. FRT,FRA,FRB (Rc=1)

```
    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)
```

Special Registers Altered:

## [DRAFT] Floating-Point Subtract FFT/DCT [Double]

* ffsub FRT,FRA,FRB (Rc=0)
* ffsub. FRT,FRA,FRB (Rc=1)

```
    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)
```
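
As written, the draft double-precision add and subtract forms produce the same two values and differ only in which register receives which. A minimal sketch using Python floats (which are FP64), ignoring FPSCR and Rc=1 behaviour, with illustrative function names:

```python
def ffadd(fra: float, frb: float) -> tuple[float, float]:
    """Return (FRT, FRS) = (FRA + FRB, FRB - FRA)."""
    return fra + frb, frb - fra


def ffsub(fra: float, frb: float) -> tuple[float, float]:
    """Return (FRT, FRS) = (FRB - FRA, FRA + FRB): ffadd with outputs swapped."""
    return frb - fra, fra + frb


print(ffadd(1.5, 4.0))  # (5.5, 2.5)
print(ffsub(1.5, 4.0))  # (2.5, 5.5)
```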

Special Registers Altered: