* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

# Rationale for Twin Butterfly Integer DCT Instruction(s)

DCT has a huge number of general-purpose uses. Without these
Twin-Butterfly instructions each butterfly also needs a large number of
instructions (**eight**), and because it is extremely common to
explicitly loop-unroll them, sequences of hundreds to thousands of
instructions are dismayingly common (for all ISAs).

The goal is to implement instructions that calculate the expression:

    fdct_round_shift((a +/- b) * c)

for the single-coefficient butterfly instruction, and:

    fdct_round_shift(a * c1 +/- b * c2)

for the double-coefficient butterfly instruction.

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
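
As a quick check of the rounding behaviour, the macro can be modelled in
Python. This is a minimal sketch rather than part of the specification:
the function names simply mirror the C macro, and Python's `>>` on
negative integers is an arithmetic (flooring) shift, which is assumed
here to match what the C compiler does for signed values.

```python
def round_power_of_two(value, n):
    """Add half of 2**n, then shift right by n (round half up)."""
    return (value + (1 << (n - 1))) >> n


def fdct_round_shift(x):
    """ROUND_POWER_OF_TWO(x, 14), as in the definition above."""
    return round_power_of_two(x, 14)


if __name__ == "__main__":
    # values either side of the rounding threshold, positive and negative
    for x in (8191, 8192, 16384, -8192, -8193):
        print(x, fdct_round_shift(x))
```

Inputs of exactly half a unit (`8192`, `-8192`) round upwards, towards
positive infinity, under this definition.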

These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including but not limited to VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations, although
they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`
(NEON intrinsics).

The suggestion is to have a single instruction that calculates both
values, `((a + b) * c) >> N` and `((a - b) * c) >> N`. The instruction
will run in accumulate mode, so to calculate the two-coefficient version
one would just call the same instruction again with `a` and `b` in a
different order and a different constant `c`.
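
The pair of values can be sketched in Python as follows. This is an
illustrative model only, under several assumptions: `twin_butterfly` and
`to_int16` are hypothetical helper names, the rounding term comes from
the `fdct_round_shift` definition (the `>> N` formulas in the text omit
it), results are wrapped to 16 bits as a store to `int16_t` typically
would, and `11585` is the libvpx value of `cospi_16_64`.

```python
COSPI_16_64 = 11585  # round(cos(pi/4) * 2**14)


def to_int16(v):
    """Wrap an integer to signed 16 bits (assumed int16_t store)."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def twin_butterfly(a, b, c, n=14):
    """Return rounded ((a + b) * c) >> n and ((a - b) * c) >> n."""
    half = 1 << (n - 1)                    # rounding term
    plus = ((a + b) * c + half) >> n
    minus = ((a - b) * c + half) >> n
    return to_int16(plus), to_int16(minus)


if __name__ == "__main__":
    print(twin_butterfly(100, 50, COSPI_16_64))
    print(twin_butterfly(50, 100, COSPI_16_64))
```

Both outputs come from the same `a`, `b` and `c`, which is what allows
one instruction to produce them together.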

Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:

    #include <stdint.h>

    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))

    /* twin butterfly: rounded (x0 + x1) * c and (x0 - x1) * c */
    void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
        t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
        t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
    }

Eight instructions are required, replaced by just the one (`maddsubrs`):

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   |  RT  |  RA  |  RB  |  SH  |  XO  |Rc |

* maddsubrs RT,RA,SH,RB

    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1] + 1
    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1] + 1
    res1 <- ROTL64(prod1, XLEN-n)
    res2 <- ROTL64(prod2, XLEN-n)
    m <- MASK(n, (XLEN-1))
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    s64_1 <- [0]*(XLEN-1) || signbit1
    s64_2 <- [0]*(XLEN-1) || signbit2
    RT <- (res1 & m | smask1) + s64_1
    RS <- (res2 & m | smask2) + s64_2
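
The `ROTL64` / `MASK` / `smask` lines together act as an arithmetic
right shift by `n`. The Python sketch below models only that idiom and is
not the full instruction: it assumes `XLEN=64` and that the sign bit is
the most significant bit of the value being shifted (the lines defining
`signbit1`/`signbit2` are not shown here), and it leaves out both the
`+ 1` on the products and the final `+ s64` adjustment. All helper names
are hypothetical.

```python
XLEN = 64
ALL_ONES = (1 << XLEN) - 1


def rotl64(x, sh):
    """Rotate a 64-bit value left by sh bits."""
    sh %= XLEN
    return ((x << sh) | (x >> (XLEN - sh))) & ALL_ONES


def mask(mstart, mstop):
    """MASK(mstart, mstop) in MSB0 bit numbering, mstart <= mstop."""
    return (ALL_ONES >> mstart) & (ALL_ONES << (XLEN - 1 - mstop)) & ALL_ONES


def shift_right_signed(prod, n):
    """Rotate, mask and sign-fill, following the res/m/smask lines."""
    res = rotl64(prod, XLEN - n)           # rotate right by n
    m = mask(n, XLEN - 1)                  # keep the low XLEN-n bits
    signbit = (prod >> (XLEN - 1)) & 1     # assumption: MSB of prod
    smask = (ALL_ONES if signbit else 0) & ~m & ALL_ONES
    return (res & m) | smask


def to_signed(x):
    """Interpret a 64-bit pattern as a two's complement integer."""
    return x - (1 << XLEN) if x >> (XLEN - 1) else x
```

For any 64-bit two's complement value `v` and shift `0 <= n < 64`, the
result matches Python's `v >> n`.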

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
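
The register-numbering rule for the implicit result can be written out
as a small Python sketch. `implicit_rs` is a hypothetical helper, not a
function from the simulator, and it deliberately ignores element-width
overrides and register-file bounds checking.

```python
def implicit_rs(rt, rt_is_vector=False, maxvl=1):
    """Return the register number of the implicit second result RS."""
    if rt_is_vector:
        # RS starts immediately after the MAXVL-long vector at RT
        return rt + maxvl
    # scalar case: RS is simply the next register
    return rt + 1


if __name__ == "__main__":
    print(implicit_rs(4))            # scalar RT=4
    print(implicit_rs(8, True, 4))   # vector RT=8, MAXVL=4
```

The same rule applies to `FRS` relative to `FRT` in the floating-point
instructions.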

Special Registers Altered:

# Twin Butterfly Floating-Point DCT Instruction(s)

## Floating-Point Twin Multiply-Add DCT [Single]

**Add the following to Book I Section 4.6.6.3**

    |0     |6     |11    |16    |21    |31 |
    | PO   |  FRT |  FRA |  FRB |  XO  |Rc |

* fdmadds FRT,FRA,FRB (Rc=0)

    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)

The two IEEE754-FP32 operations

    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then multiplied by FRA to create an intermediate result that
is stored in FRT.

The add into FRS is treated exactly as `fadds`. The creation of the
result FRT is **not** the same as that of `fmsubs`.
The creation of FRS and FRT is treated as two parallel, independent
operations which occur at the same time.
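
The data flow can be illustrated with a short Python model. This is a
sketch under stated assumptions, not the reference implementation:
`f32` and `fdmadds` are hypothetical helpers, inputs are assumed to be
finite values already representable in FP32, and no FPSCR status,
exception or NaN handling is modelled. For such inputs, doing one add,
subtract or multiply in binary64 and then rounding to binary32 is
understood to give the correctly rounded FP32 result.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fdmadds(frt, fra, frb):
    """Return (new FRT, FRS), both computed from the same inputs."""
    frs = f32(frt + frb)        # FPADD32(FRT, FRB)
    sub = f32(frt - frb)        # FPSUB32(FRT, FRB), rounded to FP32
    new_frt = f32(fra * sub)    # FPMUL32(FRA, sub)
    return new_frt, frs


if __name__ == "__main__":
    print(fdmadds(3.0, 0.5, 1.0))
```

Because `sub` is rounded before the multiply, the sequence has two
rounding steps on the FRT path, in line with the statement that it is
not the same as `fmsubs`.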

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Multiply-Add FFT [Single]

**Add the following to Book I Section 4.6.6.3**

    |0     |6     |11    |16    |21    |31 |
    | PO   |  FRT |  FRA |  FRB |  XO  |Rc |

* ffmadds FRT,FRA,FRB (Rc=0)

    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)

The two IEEE754-FP32 operations

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)

are simultaneously performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result
stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
FRT, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result stored in FRS.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

## Floating-Point Twin Multiply-Add DCT

**Add the following to Book I Section 4.6.6.3**

    |0     |6     |11    |16    |21    |31 |
    | PO   |  FRT |  FRA |  FRB |  XO  |Rc |

* fdmadd FRT,FRA,FRB (Rc=0)

    FRS <- FPADD64(FRT, FRB)
    sub <- FPSUB64(FRT, FRB)
    FRT <- FPMUL64(FRA, sub)

The two IEEE754-FP64 operations

    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then multiplied by FRA to create an intermediate result that
is stored in FRT.

The add into FRS is treated exactly as `fadd`. The creation of the
result FRT is **not** the same as that of `fmsub`.
The creation of FRS and FRT is treated as two parallel, independent
operations which occur at the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Twin Multiply-Add FFT

**Add the following to Book I Section 4.6.6.3**

    |0     |6     |11    |16    |21    |31 |
    | PO   |  FRT |  FRA |  FRB |  XO  |Rc |

* ffmadd FRT,FRA,FRB (Rc=0)

    FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)

The two IEEE754-FP64 operations

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)

are simultaneously performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result
stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
FRT, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result stored in FRS.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.
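
A Python model makes the single-rounding behaviour of the two results
concrete. It is a sketch under stated assumptions: `ffmadd` here is a
hypothetical helper, fused behaviour is imitated with exact rational
arithmetic from the standard `fractions` module followed by one
conversion back to binary64, and only finite inputs are handled (no NaN,
infinity, signed-zero, overflow or FPSCR modelling).

```python
from fractions import Fraction


def ffmadd(frt, fra, frb):
    """Return (new FRT, FRS) for finite FP64 inputs, each rounded once."""
    prod = Fraction(frt) * Fraction(fra)     # exact, unrounded product
    new_frt = float(prod + Fraction(frb))    # as fmadd:   (FRT*FRA) + FRB
    frs = float(-(prod - Fraction(frb)))     # as fnmsub: -((FRT*FRA) - FRB)
    return new_frt, frs


if __name__ == "__main__":
    print(ffmadd(2.0, 3.0, 1.0))
    a = 1.0 + 2.0 ** -30
    c = -(1.0 + 2.0 ** -29)
    print(ffmadd(a, a, c)[0], a * a + c)     # fused vs. separately rounded
```

The second call shows why the product is kept unrounded: rounding
`a * a` first loses the low-order term, whereas the fused form keeps it.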

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## [DRAFT] Floating-Point Add FFT/DCT [Single]

* ffadds FRT,FRA,FRB (Rc=0)
* ffadds. FRT,FRA,FRB (Rc=1)

    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)

Special Registers Altered:

## [DRAFT] Floating-Point Add FFT/DCT [Double]

* ffadd FRT,FRA,FRB (Rc=0)
* ffadd. FRT,FRA,FRB (Rc=1)

    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)

Special Registers Altered:

## [DRAFT] Floating-Point Subtract FFT/DCT [Single]

* ffsubs FRT,FRA,FRB (Rc=0)
* ffsubs. FRT,FRA,FRB (Rc=1)

    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)

Special Registers Altered:

## [DRAFT] Floating-Point Subtract FFT/DCT [Double]

* ffsub FRT,FRA,FRB (Rc=0)
* ffsub. FRT,FRA,FRB (Rc=1)

    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)
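
The two double-precision draft forms can be compared side by side in
Python, whose `float` type is IEEE754 binary64. This is a sketch of the
pseudocode only: the function names are hypothetical models, the default
round-to-nearest mode is assumed, and the Rc=1 behaviour and FPSCR
effects are not modelled.

```python
def ffadd(fra, frb):
    """Draft ffadd: return (FRT, FRS) = (FRA + FRB, FRB - FRA)."""
    return fra + frb, frb - fra


def ffsub(fra, frb):
    """Draft ffsub: return (FRT, FRS) = (FRB - FRA, FRA + FRB)."""
    return frb - fra, fra + frb


if __name__ == "__main__":
    print(ffadd(1.5, 4.0))
    print(ffsub(1.5, 4.0))
```

As written, the two drafts produce the same pair of values and differ
only in which result lands in `FRT` and which in `FRS`.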

Special Registers Altered: