* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

Although best used with SVP64 REMAP, these instructions may be used in a
Scalar-only context to save considerably on DCT, DFT and FFT processing.

# Rationale for Twin Butterfly Integer DCT Instruction(s)

The number of general-purpose uses for DCT is huge. The number of
instructions needed instead of these Twin-Butterfly instructions is also
huge (**eight**) and, given that it is extremely common to explicitly
loop-unroll them, instruction counts in the hundreds to thousands are
dismayingly common (for all ISAs).

The goal is to implement instructions that calculate the expression:

    fdct_round_shift((a +/- b) * c)

for the single-coefficient butterfly instruction, and:

    fdct_round_shift(a * c1 +/- b * c2)

for the double-coefficient butterfly instruction.

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))

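As a rough illustration of the rounding step, the macro can be restated
as a C function. This is only a sketch: the function names
`round_power_of_two` and `fdct_round_shift` are introduced here for
illustration, and the code assumes that `>>` on a negative signed value
is an arithmetic shift (implementation-defined in C, though common in
practice).

```c
#include <assert.h>
#include <stdint.h>

/* Round-to-nearest right shift, mirroring ROUND_POWER_OF_TWO(value, n).
 * Assumes 1 <= n <= 30 and that value + (1 << (n-1)) does not overflow. */
int32_t round_power_of_two(int32_t value, int n)
{
    assert(n >= 1 && n <= 30);
    return (value + (1 << (n - 1))) >> n;
}

/* fdct_round_shift as described in the text: a fixed shift of 14 bits. */
int32_t fdct_round_shift(int32_t x)
{
    return round_power_of_two(x, 14);
}
```

With a shift of 14, an input of 8192 (exactly one half of `1 << 14`)
rounds up to 1, whereas 8191 rounds down to 0.
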
These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including, but not limited to, VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations (exposed
as the intrinsics `vqrdmulhq_s16`/`vqrdmulhq_s32`), although they are
limited in precision.

The suggestion is to have a single instruction that calculates both
values, `((a + b) * c) >> N` and `((a - b) * c) >> N`. The instruction
will run in accumulate mode, so in order to calculate the 2-coefficient
version one would just have to call the same instruction with a and b in
a different order and with a different constant c.

Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:

    #include <stdint.h>

    /* Round-to-nearest right shift by n bits */
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))

    /* Twin butterfly: scaled sum and scaled difference of x0 and x1 */
    void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
        t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
        t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
    }

Eight instructions are required, replaced by just the one (`maddsubrs`).

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   |  RT  |  RA  |  RB  |  SH  |  XO  |Rc |

* maddsubrs RT,RA,SH,RB

Pseudo-code:

    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1]
    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1]
    #round <- EXTS([0]*(XLEN-1) || [1]*1)
    #prod1 <- ROTL64(prod1, 1)
    #prod2 <- ROTL64(prod2, 1)
    #prod1 <- prod1 + round
    #prod2 <- prod2 + round
    #res1 <- ROTL64(prod1, XLEN-1)
    #res2 <- ROTL64(prod2, XLEN-1)
    #m <- MASK(1, (XLEN-1))
    round <- EXTS([0]*(XLEN -n -1) || [1]*1 || [0]*(n-1))
    prod1 <- prod1 + round
    prod2 <- prod2 + round
    res1 <- ROTL64(prod1, XLEN-n)
    res2 <- ROTL64(prod2, XLEN-n)
    m <- MASK(n, (XLEN-1))
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    RT <- (res1 & m | smask1)
    RS <- (res2 & m | smask2)

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64, if
`RT` is a Vector, `RS` begins immediately after the Vector `RT`, where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).

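The intended arithmetic, `((a +/- b) * c) >> N` with rounding, can be
sketched in C as follows. This is a simplified model and not the
normative pseudocode: `maddsubrs_model` and `twin_result` are
hypothetical names, the mapping of `a`, `b` and `c` onto registers is
left open, the XLEN-bit truncation of the product is ignored, and `>>`
is assumed to be an arithmetic shift on negative values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result pair: rt models the explicit destination,
 * rs the implicit second destination. */
typedef struct {
    int64_t rt;
    int64_t rs;
} twin_result;

/* Sketch of the arithmetic only: ((a +/- b) * c), rounded to nearest
 * and shifted right by sh bits. Assumes the products fit in 64 bits. */
twin_result maddsubrs_model(int64_t a, int64_t b, int64_t c, unsigned sh)
{
    assert(sh < 63);
    /* No rounding constant is added when the shift amount is zero. */
    int64_t round = (sh == 0) ? 0 : ((int64_t)1 << (sh - 1));
    twin_result r;
    r.rt = ((a + b) * c + round) >> sh;
    r.rs = ((a - b) * c + round) >> sh;
    return r;
}
```

For `a = 100`, `b = 28`, `c = 11585` and a shift of 14, this model gives
91 and 51 for the two results.
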
Special Registers Altered:

# Twin Butterfly Floating-Point DCT and FFT Instruction(s)

**Add the following to Book I Section 4.6.6.3**

## Floating-Point Twin Multiply-Add DCT [Single]

    |0     |6     |11    |16    |21      |31 |
    | PO   | FRT  | FRA  | FRB  |   XO   |Rc |

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)

The two IEEE754-FP32 operations

    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then rounded before being multiplied by FRA to create the
result that is stored in FRT.

The add into FRS is treated exactly as `fadds`. The creation of the
result FRT is **not** the same as that of `fmsubs`, but is instead as if
`fsubs` were performed first followed by `fmuls`. The creation of FRS
and FRT are treated as parallel independent operations which occur at
the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64, if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`,
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

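The "subtract, round, then multiply" ordering described above can be
sketched in C. This is an illustrative model only: `fdmadds_model` and
`fp32_pair` are hypothetical names, exception flags and rounding-mode
control are ignored, and `volatile` is used merely to force the
difference to be rounded to single precision before the multiply.

```c
/* Hypothetical result pair: frt is the explicit destination,
 * frs the implicit second destination. */
typedef struct {
    float frt;
    float frs;
} fp32_pair;

/* Both results are computed from the same input values of frt and frb. */
fp32_pair fdmadds_model(float frt, float fra, float frb)
{
    fp32_pair r;
    volatile float sub = frt - frb;  /* as if fsubs: rounded first      */
    r.frs = frt + frb;               /* as if fadds                     */
    r.frt = fra * sub;               /* as if fmuls on the rounded diff */
    return r;
}
```

For inputs `frt = 3.0`, `fra = 2.0`, `frb = 1.0` the model yields
`frs = 4.0` and `frt = 4.0`.
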
Special Registers Altered:

## Floating-Point Twin Multiply-Add FFT [Single]

    |0     |6     |11    |16    |21      |31 |
    | PO   | FRT  | FRA  | FRB  |   XO   |Rc |

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)

The two IEEE754-FP32 operations

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)

are simultaneously performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same input values of FRT, FRA and FRB as used to create
the new FRT, the floating-point operand in register FRT is multiplied by
the floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64, if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT`, where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

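The `fmadds`/`fnmsubs` pairing can be modelled approximately in C. This
is a sketch under stated assumptions: `ffmadds_model` and `fft_pair` are
hypothetical names, and a double-precision intermediate stands in for
the fused operation. The product of two single-precision values is exact
in double precision, but the sum is rounded to double before the final
rounding to single, so rare differences from a true fused multiply-add
remain possible.

```c
/* Hypothetical result pair for the two simultaneous operations. */
typedef struct {
    float frt;  /* as if fmadds:  (FRT * FRA) + FRB    */
    float frs;  /* as if fnmsubs: -((FRT * FRA) - FRB) */
} fft_pair;

/* Both results use the same input values of frt, fra and frb. */
fft_pair ffmadds_model(float frt, float fra, float frb)
{
    /* The product of two floats is exactly representable as a double. */
    double prod = (double)frt * (double)fra;
    fft_pair r;
    r.frt = (float)(prod + (double)frb);
    r.frs = (float)(-(prod - (double)frb));
    return r;
}
```

For inputs `frt = 2.0`, `fra = 3.0`, `frb = 1.0` the model yields a new
`frt` of 7.0 and an `frs` of -5.0.
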
Special Registers Altered:

## Floating-Point Twin Multiply-Add DCT

    |0     |6     |11    |16    |21      |31 |
    | PO   | FRT  | FRA  | FRB  |   XO   |Rc |

* fdmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPADD64(FRT, FRB)
    sub <- FPSUB64(FRT, FRB)
    FRT <- FPMUL64(FRA, sub)

The two IEEE754-FP64 operations

    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then rounded before being multiplied by FRA to create the
result that is stored in FRT.

The add into FRS is treated exactly as `fadd`. The creation of the
result FRT is **not** the same as that of `fmsub`, but is instead as if
`fsub` were performed first followed by `fmul`. The creation of FRS
and FRT are treated as parallel independent operations which occur at
the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64, if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`,
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Twin Multiply-Add FFT

    |0     |6     |11    |16    |21      |31 |
    | PO   | FRT  | FRA  | FRB  |   XO   |Rc |

* ffmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)

The two IEEE754-FP64 operations

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)

are simultaneously performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same input values of FRT, FRA and FRB as used to create
the new FRT, the floating-point operand in register FRT is multiplied by
the floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64, if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`,
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Add FFT/DCT [Single]

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   | FRT  | FRA  | FRB  |  /   |  XO  |Rc |

* ffadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)

Special Registers Altered:

## Floating-Point Add FFT/DCT [Double]

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   | FRT  | FRA  | FRB  |  /   |  XO  |Rc |

* ffadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)

Special Registers Altered:

## Floating-Point Subtract FFT/DCT [Single]

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   | FRT  | FRA  | FRB  |  /   |  XO  |Rc |

* ffsubs FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)

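Going by the pseudocode, `ffadds` and `ffsubs` compute the same two
values, `FRA + FRB` and `FRB - FRA`, and differ only in which
destination receives which. A minimal C sketch of that relationship
follows; `ffadds_model`, `ffsubs_model` and `ffadd_pair` are
hypothetical names, and exception flags are not modelled.

```c
/* Hypothetical result pair: frt is the explicit destination,
 * frs the implicit second destination. */
typedef struct {
    float frt;
    float frs;
} ffadd_pair;

ffadd_pair ffadds_model(float fra, float frb)
{
    ffadd_pair r;
    r.frt = fra + frb;  /* FRT <- FPADD32(FRA, FRB) */
    r.frs = frb - fra;  /* FRS <- FPSUB32(FRB, FRA) */
    return r;
}

ffadd_pair ffsubs_model(float fra, float frb)
{
    ffadd_pair r;
    r.frt = frb - fra;  /* FRT <- FPSUB32(FRB, FRA) */
    r.frs = fra + frb;  /* FRS <- FPADD32(FRA, FRB) */
    return r;
}
```

For `fra = 1.0` and `frb = 3.0`, the `ffadds` model places 4.0 in `frt`
and 2.0 in `frs`, while the `ffsubs` model places them the other way
round.
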
Special Registers Altered:

## Floating-Point Subtract FFT/DCT [Double]

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   | FRT  | FRA  | FRB  |  /   |  XO  |Rc |

* ffsub FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)

Special Registers Altered: