* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

# Rationale for Twin Butterfly Integer DCT Instruction(s)

DCT has a very large number of general-purpose uses. Without these
Twin-Butterfly instructions, each butterfly takes **eight** instructions,
and because the butterflies are almost always explicitly loop-unrolled,
sequences of hundreds to thousands of instructions are dismayingly
common (for all ISAs).

The goal is to implement instructions that calculate the expression:

    fdct_round_shift((a +/- b) * c)

for the single-coefficient butterfly instruction, and:

    fdct_round_shift(a * c1 +/- b * c2)

for the double-coefficient butterfly instruction.

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
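
As a worked example of the rounding, the sketch below wraps the macro in
a function and applies it to one product. The value 11585 is the Q14
constant libvpx uses for `cospi_16_64` (roughly 2^14 * cos(pi/4)); the
helper names `round_shift`, `fdct_round_shift_model` and
`example_scaled_100` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))

/* Hypothetical helper: round-to-nearest right shift by n bits. */
static int32_t round_shift(int32_t x, int n)
{
    assert(n >= 1 && n <= 30); /* keeps (1 << (n-1)) well defined */
    return ROUND_POWER_OF_TWO(x, n);
}

/* fdct_round_shift as described above: a rounding shift by 14 bits. */
static int32_t fdct_round_shift_model(int32_t x)
{
    return round_shift(x, 14);
}

/* Worked example: scale 100 by cos(pi/4) in Q14 fixed point.
 * 100 * 11585 = 1158500; adding the rounding constant 8192 gives
 * 1166692; shifting right by 14 gives 71 (the exact value is ~70.71). */
static int32_t example_scaled_100(void)
{
    const int32_t cospi_16_64 = 11585; /* ~ 2^14 * cos(pi/4) */
    return fdct_round_shift_model(100 * cospi_16_64);
}
```

The sketch only exercises non-negative inputs; right-shifting a negative
value is implementation-defined in C, so a bit-exact model of the
instruction would need to handle the sign explicitly.
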

These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including, but not limited to, VP8/VP9 and AV1.
Arm NEON includes special instructions to optimize these operations,
although they are limited in precision: see the intrinsics
`vqrdmulhq_s16`/`vqrdmulhq_s32`.

The proposal is a single instruction that calculates both values,
`((a + b) * c) >> N` and `((a - b) * c) >> N`. The instruction runs
in accumulate mode, so the two-coefficient version can be calculated
by issuing the same instruction again with `a` and `b` in a different
order and a different constant `c`.

Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:

    #include <stdint.h>

    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))

    /* Twin butterfly: rounded (x0 +/- x1) * cospi_16_64, scaled by 2^-14 */
    void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
        t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
        t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
    }

Without the new instruction, eight instructions are required; all of
them are replaced by a single `maddsubrs`.

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   |  RT  |  RA  |  RB  |  SH  |  XO  |Rc |

* maddsubrs RT,RA,SH,RB

    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1] + 1
    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1] + 1
    res1 <- ROTL64(prod1, XLEN-n)
    res2 <- ROTL64(prod2, XLEN-n)
    m <- MASK(n, (XLEN-1))
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    s64_1 <- [0]*(XLEN-1) || signbit1
    s64_2 <- [0]*(XLEN-1) || signbit2
    RT <- (res1 & m | smask1) + s64_1
    RS <- (res2 & m | smask2) + s64_2

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64, if
`RT` is a Vector, `RS` begins immediately after the Vector `RT`, where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
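
The placement rule for the implicit result can be sketched as a small
C function. This is only an illustration of the paragraph above: the
function name `implicit_rs` and its parameters are hypothetical, and
register-file bounds are deliberately not checked.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative only: returns the register number at which the implicit
 * RS result starts, following the rule stated above.
 *   rt        - register number of RT
 *   rt_vector - true if RT is a Vector under SVP64
 *   maxvl     - SVSTATE.MAXVL (ignored when RT is Scalar) */
static unsigned implicit_rs(unsigned rt, bool rt_vector, unsigned maxvl)
{
    if (!rt_vector)
        return rt + 1;   /* Scalar: RS is RT+1 */
    assert(maxvl >= 1);  /* a Vector needs a non-zero maximum length */
    return rt + maxvl;   /* Vector: RS begins right after RT's MAXVL elements */
}
```

For example, a Scalar `RT` of 4 gives an `RS` of 5, while a Vector `RT`
starting at 8 with `MAXVL` of 4 gives an `RS` starting at 12.
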

Special Registers Altered:

# Twin Butterfly Floating-Point DCT Instruction(s)

**Add the following to Book I Section 4.6.6.3**

## Floating-Point Twin Multiply-Add DCT [Single]

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  |  XO  |Rc |

* fdmadds FRT,FRA,FRB (Rc=0)

    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)

The two IEEE754-FP32 operations

    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result is stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result is then multiplied by FRA to create an intermediate result
that is stored in FRT.

The add into FRS is treated exactly as `fadds`. The creation of the
result FRT is **not** the same as that of `fmsubs`.
FRS and FRT are created by parallel, independent operations
that occur at the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64, if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`,
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
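
To make the data flow concrete, here is a minimal C sketch of the
`fdmadds` pseudocode above. It is a behavioural model only: the struct
`twin_f32` and the function names are hypothetical, each C operation is
assumed to round once to single precision, and status flags, exceptions
and rounding-mode control are not modelled.

```c
#include <assert.h>

/* Hypothetical result pair: frt is the explicit result, frs the implicit one. */
typedef struct {
    float frt;
    float frs;
} twin_f32;

static twin_f32 fdmadds_model(float frt, float fra, float frb)
{
    twin_f32 r;
    float sub;

    /* Both results are computed from the original FRT and FRB inputs. */
    r.frs = frt + frb;   /* FRS <- FPADD32(FRT, FRB) */
    sub   = frt - frb;   /* sub <- FPSUB32(FRT, FRB) */
    r.frt = fra * sub;   /* FRT <- FPMUL32(FRA, sub) */
    return r;
}

/* Usage example with exactly representable values. */
static void fdmadds_example(void)
{
    twin_f32 r = fdmadds_model(5.0f, 0.5f, 3.0f);
    assert(r.frs == 8.0f); /* 5 + 3 */
    assert(r.frt == 1.0f); /* 0.5 * (5 - 3) */
}
```

Passing the inputs by value mirrors the requirement that both results
use the same operand values, even though FRT is also a destination.
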

Special Registers Altered:

## Floating-Point Multiply-Add FFT [Single]

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  |  XO  |Rc |

* ffmadds FRT,FRA,FRB (Rc=0)

    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
FRT, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64, if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT`, where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).
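
The `ffmadds` pseudocode above can be sketched in C as follows. This is
a behavioural illustration only: `twin_f32` and the function names are
hypothetical, and the product is rounded before the add or subtract,
whereas the instruction is described in terms of the fused `fmadds` and
`fnmsubs` operations. Results therefore agree only where no rounding
occurs in the product.

```c
#include <assert.h>

/* Hypothetical result pair: frt is the explicit result, frs the implicit one. */
typedef struct {
    float frt;
    float frs;
} twin_f32;

static twin_f32 ffmadds_model(float frt, float fra, float frb)
{
    twin_f32 r;
    /* Shared product of the original FRT and FRA (not fused in this sketch). */
    float prod = frt * fra;

    r.frs = -(prod - frb); /* FRS <- -((FRT * FRA) - FRB), as for fnmsubs */
    r.frt = prod + frb;    /* FRT <-   (FRT * FRA) + FRB,  as for fmadds  */
    return r;
}

/* Usage example with exactly representable values. */
static void ffmadds_example(void)
{
    twin_f32 r = ffmadds_model(2.0f, 3.0f, 1.0f);
    assert(r.frt == 7.0f);  /* 2*3 + 1 */
    assert(r.frs == -5.0f); /* -(2*3 - 1) */
}
```
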

Special Registers Altered:

## Floating-Point Twin Multiply-Add DCT

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  |  XO  |Rc |

* fdmadd FRT,FRA,FRB (Rc=0)

    FRS <- FPADD64(FRT, FRB)
    sub <- FPSUB64(FRT, FRB)
    FRT <- FPMUL64(FRA, sub)

The two IEEE754-FP64 operations

    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result is stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result is then multiplied by FRA to create an intermediate result
that is stored in FRT.

The add into FRS is treated exactly as `fadd`. The creation of the
result FRT is **not** the same as that of `fmsub`.
FRS and FRT are created by parallel, independent operations
that occur at the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64, if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`,
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
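
When `FRT` is a Vector, the operation can be pictured as an element-wise
loop. The sketch below is a rough illustration under several assumptions
that are not stated in the text above: every operand is treated as a
Vector, elements are processed in order, and the implicit `FRS` Vector
is modelled as a separate array rather than as registers following
`FRT`. The function name and the `vl` parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative element-wise model of fdmadd over vl elements.
 * frt[] is updated in place; frs[] receives the implicit results. */
static void fdmadd_vec_model(double *frt, double *frs,
                             const double *fra, const double *frb,
                             size_t vl)
{
    assert(frt != frs); /* the two result Vectors must not alias */
    for (size_t i = 0; i < vl; i++) {
        double t = frt[i];              /* original FRT, read before overwrite */
        frs[i] = t + frb[i];            /* FRS <- FPADD64(FRT, FRB) */
        frt[i] = fra[i] * (t - frb[i]); /* FRT <- FPMUL64(FRA, FPSUB64(FRT, FRB)) */
    }
}
```

Reading `frt[i]` into a temporary before either store reflects the
requirement that both results are computed from the same input values.
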

Special Registers Altered:

## Floating-Point Twin Multiply-Add FFT

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  |  XO  |Rc |

* ffmadd FRT,FRA,FRB (Rc=0)

    FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
FRT, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64, if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`,
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## [DRAFT] Floating-Point Add FFT/DCT [Single]

* ffadds FRT,FRA,FRB (Rc=0)
* ffadds. FRT,FRA,FRB (Rc=1)

    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)
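
A minimal C sketch of the `ffadds` pseudocode above follows. The struct
`twin_f32` and the function names are hypothetical; the model covers
only the two arithmetic results and does not attempt to describe what
the Rc=1 form records.

```c
#include <assert.h>

/* Hypothetical result pair: frt is the explicit result, frs the implicit one. */
typedef struct {
    float frt;
    float frs;
} twin_f32;

static twin_f32 ffadds_model(float fra, float frb)
{
    twin_f32 r;
    r.frt = fra + frb; /* FRT <- FPADD32(FRA, FRB) */
    r.frs = frb - fra; /* FRS <- FPSUB32(FRB, FRA): note the operand order */
    return r;
}

/* Usage example with exactly representable values. */
static void ffadds_example(void)
{
    twin_f32 r = ffadds_model(1.5f, 4.0f);
    assert(r.frt == 5.5f); /* 1.5 + 4.0 */
    assert(r.frs == 2.5f); /* 4.0 - 1.5 */
}
```
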

Special Registers Altered:

## [DRAFT] Floating-Point Add FFT/DCT [Double]

* ffadd FRT,FRA,FRB (Rc=0)
* ffadd. FRT,FRA,FRB (Rc=1)

    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)

Special Registers Altered:

## [DRAFT] Floating-Point Subtract FFT/DCT [Single]

* ffsubs FRT,FRA,FRB (Rc=0)
* ffsubs. FRT,FRA,FRB (Rc=1)

    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)

Special Registers Altered:

## [DRAFT] Floating-Point Subtract FFT/DCT [Double]

* ffsub FRT,FRA,FRB (Rc=0)
* ffsub. FRT,FRA,FRB (Rc=1)

    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)
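
Comparing the `ffsub` pseudocode with that of `ffadd` shows that the two
draft instructions compute the same pair of values and differ only in
which register receives which. The C sketch below illustrates this; the
struct `twin_f64` and the function names are hypothetical, and only the
arithmetic results are modelled.

```c
#include <assert.h>

/* Hypothetical result pair: frt is the explicit result, frs the implicit one. */
typedef struct {
    double frt;
    double frs;
} twin_f64;

/* ffadd: FRT <- FRA + FRB, FRS <- FRB - FRA */
static twin_f64 ffadd_model(double fra, double frb)
{
    twin_f64 r = { fra + frb, frb - fra };
    return r;
}

/* ffsub: FRT <- FRB - FRA, FRS <- FRA + FRB */
static twin_f64 ffsub_model(double fra, double frb)
{
    twin_f64 r = { frb - fra, fra + frb };
    return r;
}

/* Usage example: the same inputs give the same two values,
 * with the destinations exchanged. */
static void ffsub_example(void)
{
    twin_f64 a = ffadd_model(1.5, 4.0);
    twin_f64 s = ffsub_model(1.5, 4.0);
    assert(s.frt == 2.5 && s.frs == 5.5);
    assert(s.frt == a.frs && s.frs == a.frt);
}
```
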

Special Registers Altered: