* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

Although best used with SVP64 REMAP, these instructions may be used in a Scalar-only
context to save considerably on DCT, DFT and FFT processing. Whilst some hardware
implementations may not implement them efficiently (for example through slower
Micro-coding), savings still come from the reduction in temporary registers as well
as instruction count.

# Rationale for Twin Butterfly Integer DCT Instruction(s)

The number of general-purpose uses for DCT is huge. The number of
instructions needed in place of these Twin-Butterfly instructions is also
large (**eight**), and because it is extremely common to explicitly
loop-unroll them, sequences of hundreds to thousands of instructions are
dismayingly common (for all ISAs).

The goal is to implement instructions that calculate the expression:

```
fdct_round_shift((a +/- b) * c)
```

for the single-coefficient butterfly instruction, and:

```
fdct_round_shift(a * c1 +/- b * c2)
```

for the double-coefficient butterfly instruction.

In a 32-bit context `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

```
#define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))
```
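
As an illustration only, the two target expressions can be written as plain C reference functions. This is a minimal sketch, not text from any codec: the names `fdct_round_shift`, `butterfly_single` and `butterfly_double` are hypothetical helpers, and the sketch assumes that `>>` on a negative `int32_t` is an arithmetic shift (implementation-defined in C, but true of common compilers).

```c
#include <stdint.h>

/* Rounding right shift: add half of 2^n, then shift right by n. */
#define ROUND_POWER_OF_TWO(value, n) (((value) + (1 << ((n) - 1))) >> (n))

/* Hypothetical helper: the 14-bit rounding shift used in the 32-bit context.
 * Assumes >> on a negative int32_t is an arithmetic shift. */
static int32_t fdct_round_shift(int32_t x) {
    return ROUND_POWER_OF_TWO(x, 14);
}

/* Single-coefficient butterfly: sum and difference share coefficient c. */
static void butterfly_single(int32_t a, int32_t b, int32_t c,
                             int32_t *sum_out, int32_t *diff_out) {
    *sum_out = fdct_round_shift((a + b) * c);
    *diff_out = fdct_round_shift((a - b) * c);
}

/* Double-coefficient butterfly: a is scaled by c1 and b by c2. */
static void butterfly_double(int32_t a, int32_t b, int32_t c1, int32_t c2,
                             int32_t *plus_out, int32_t *minus_out) {
    *plus_out = fdct_round_shift(a * c1 + b * c2);
    *minus_out = fdct_round_shift(a * c1 - b * c2);
}
```

With `a = 100`, `b = 50` and `c = 11585`, `butterfly_single` yields 106 and 35: the sum and difference are scaled by `c / 2^14` and rounded to the nearest integer.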

These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including but not limited to VP8/VP9 and AV1.
ARM includes special instructions to optimize these operations, although
they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.

The suggestion is to have a single instruction that calculates both values,
`((a + b) * c) >> N` and `((a - b) * c) >> N`. The instruction will
run in accumulate mode, so to calculate the 2-coeff version
one would just call the same instruction with a and b in a different
order and with a different constant c.

Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:

```
#include <stdint.h>
#define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))
void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
    t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
    t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
}
```

Eight instructions are required, replaced by just the one (`maddsubrs`):

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

| 0  | 6  | 11 | 16 | 21 | 26 | 31 |
|----|----|----|----|----|----|----|
| PO | RT | RA | RB | SH | XO | Rc |

* maddsubrs RT,RA,SH,RB

```
prod1 <- MULS(RB, sum)
prod1_lo <- prod1[XLEN:(XLEN*2)-1]
prod2 <- MULS(RB, diff)
prod2_lo <- prod2[XLEN:(XLEN*2)-1]
prod1_lo <- prod1_lo + round
prod2_lo <- prod2_lo + round
m <- MASK(n, (XLEN-1))
res1 <- ROTL64(prod1_lo, XLEN-n) & m
res2 <- ROTL64(prod2_lo, XLEN-n) & m
signbit1 <- prod1_lo[0]
signbit2 <- prod2_lo[0]
smask1 <- ([signbit1]*XLEN) & ¬m
smask2 <- ([signbit2]*XLEN) & ¬m
RT <- (res1 | smask1)
RS <- (res2 | smask2)
```
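
The rotate, mask and sign-mask steps above amount to a rounded arithmetic right shift of the low 64 bits of each product. The following C model is a rough sketch for XLEN=64 only. It rests on assumptions not shown in the pseudo-code above: that `n` is the `SH` field, that `sum` and `diff` are `(RT) + (RA)` and `(RT) - (RA)`, and that `round` is `1 << (n-1)` for nonzero `n` and zero otherwise. The function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: rounded arithmetic right shift by n (0 <= n < 64),
 * built from the same steps as the pseudo-code (round, shift, sign mask). */
static int64_t round_shift_signed(uint64_t prod_lo, unsigned n) {
    if (n == 0)
        return (int64_t)prod_lo;              /* assumed: round is 0 when n = 0 */
    prod_lo += (uint64_t)1 << (n - 1);        /* prod_lo <- prod_lo + round */
    uint64_t m = ~(uint64_t)0 >> n;           /* MASK(n, XLEN-1): low 64-n bits */
    uint64_t res = (prod_lo >> n) & m;        /* ROTL64(prod_lo, XLEN-n) & m */
    uint64_t smask = (prod_lo >> 63) ? ~m : 0;  /* sign bit copied into top n bits */
    return (int64_t)(res | smask);            /* assumes two's complement conversion */
}

/* Sketch of maddsubrs: *rt is read as an input and overwritten with the
 * "sum" result; *rs receives the "diff" result. */
static void maddsubrs_model(int64_t *rt, int64_t *rs,
                            int64_t ra, int64_t rb, unsigned sh) {
    uint64_t sum = (uint64_t)*rt + (uint64_t)ra;    /* assumed: sum  <- (RT) + (RA) */
    uint64_t diff = (uint64_t)*rt - (uint64_t)ra;   /* assumed: diff <- (RT) - (RA) */
    /* The low 64 bits of a signed product equal those of the unsigned product. */
    uint64_t prod1_lo = (uint64_t)rb * sum;
    uint64_t prod2_lo = (uint64_t)rb * diff;
    *rt = round_shift_signed(prod1_lo, sh);
    *rs = round_shift_signed(prod2_lo, sh);
}
```

Unsigned arithmetic is used throughout so that wrap-around is well defined in C; only the final conversion back to `int64_t` relies on two's complement behaviour.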

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
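
The register-number rule for the implicit result can be sketched as a small helper. This is illustrative only: `implicit_rs` is a hypothetical name, and the sketch assumes the default element width, so that each Vector element occupies one register.

```c
#include <stdbool.h>

/* Hypothetical helper: register number of the implicit RS result.
 * Scalar RT: RS is RT+1.
 * Vector RT: RS starts immediately after the MAXVL registers of RT
 * (assumes one register per element). */
static unsigned implicit_rs(unsigned rt, bool rt_is_vector, unsigned maxvl) {
    return rt_is_vector ? rt + maxvl : rt + 1;
}
```

For example, with `RT=4` the Scalar form writes `RS` to register 5, whereas a Vector `RT` with `MAXVL=8` places `RS` at register 12.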

Special Registers Altered:

# Twin Butterfly Floating-Point DCT and FFT Instruction(s)

**Add the following to Book I Section 4.6.6.3**

## Floating-Point Twin Multiply-Add DCT [Single]

| 0  | 6   | 11  | 16  | 21 | 31 |
|----|-----|-----|-----|----|----|
| PO | FRT | FRA | FRB | XO | Rc |

* fdmadds FRT,FRA,FRB (Rc=0)

```
FRS <- FPADD32(FRT, FRB)
sub <- FPSUB32(FRT, FRB)
FRT <- FPMUL32(FRA, sub)
```

The two IEEE754-FP32 operations

```
FRS <- [(FRT) + (FRB)]
FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result is stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result is then rounded before being multiplied by FRA to create an
intermediate result that is stored in FRT.
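
A minimal C sketch of this behaviour follows. It is a model under stated assumptions rather than a definitive implementation: `fdmadds_model` is a hypothetical name, `float` arithmetic is assumed to be evaluated in IEEE754 single precision without excess precision, and status flags and exceptions are ignored.

```c
/* Sketch of fdmadds: both results are computed from the original FRT and FRB. */
static void fdmadds_model(float *frt, float *frs, float fra, float frb) {
    float t = *frt;        /* read FRT once, before it is overwritten */
    float sum = t + frb;   /* the add into FRS */
    float sub = t - frb;   /* difference is rounded to single precision first */
    *frs = sum;
    *frt = fra * sub;      /* separate multiply on the rounded difference */
}
```

Copying `*frt` into `t` first is what keeps the two results independent: the subtraction sees the same input value that the addition did.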

The add into FRS is treated exactly as `fadds`. The creation of the
result FRT is **not** the same as that of `fmsubs`, but is instead as if
`fsubs` were performed first followed by `fmuls`. The creation of FRS
and FRT are treated as parallel independent operations which occur at
exactly the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Multiply-Add FFT [Single]

| 0  | 6   | 11  | 16  | 21 | 31 |
|----|-----|-----|-----|----|----|
| PO | FRT | FRA | FRB | XO | Rc |

* ffmadds FRT,FRA,FRB (Rc=0)

```
FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
```

```
FRS <- -([(FRT) * (FRA)] - (FRB))
FRT <- [(FRT) * (FRA)] + (FRB)
```

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
FRT, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

## Floating-Point Twin Multiply-Add DCT

| 0  | 6   | 11  | 16  | 21 | 31 |
|----|-----|-----|-----|----|----|
| PO | FRT | FRA | FRB | XO | Rc |

* fdmadd FRT,FRA,FRB (Rc=0)

```
FRS <- FPADD64(FRT, FRB)
sub <- FPSUB64(FRT, FRB)
FRT <- FPMUL64(FRA, sub)
```

The two IEEE754-FP64 operations

```
FRS <- [(FRT) + (FRB)]
FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result is stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result is then rounded before being multiplied by FRA to create an
intermediate result that is stored in FRT.

The add into FRS is treated exactly as `fadd`. The creation of the
result FRT is **not** the same as that of `fmsub`, but is instead as if
`fsub` were performed first followed by `fmul`. The creation of FRS
and FRT are treated as parallel independent operations which occur at
exactly the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Twin Multiply-Add FFT

| 0  | 6   | 11  | 16  | 21 | 31 |
|----|-----|-----|-----|----|----|
| PO | FRT | FRA | FRB | XO | Rc |

* ffmadd FRT,FRA,FRB (Rc=0)

```
FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
```

```
FRS <- -([(FRT) * (FRA)] - (FRB))
FRT <- [(FRT) * (FRA)] + (FRB)
```

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
FRT, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Add FFT/DCT [Single]

| 0  | 6   | 11  | 16  | 21 | 26 | 31 |
|----|-----|-----|-----|----|----|----|
| PO | FRT | FRA | FRB | /  | XO | Rc |

* ffadds FRT,FRA,FRB (Rc=0)

```
FRT <- FPADD32(FRA, FRB)
FRS <- FPSUB32(FRB, FRA)
```

Special Registers Altered:

## Floating-Point Add FFT/DCT [Double]

| 0  | 6   | 11  | 16  | 21 | 26 | 31 |
|----|-----|-----|-----|----|----|----|
| PO | FRT | FRA | FRB | /  | XO | Rc |

* ffadd FRT,FRA,FRB (Rc=0)

```
FRT <- FPADD64(FRA, FRB)
FRS <- FPSUB64(FRB, FRA)
```

Special Registers Altered:

## Floating-Point Subtract FFT/DCT [Single]

| 0  | 6   | 11  | 16  | 21 | 26 | 31 |
|----|-----|-----|-----|----|----|----|
| PO | FRT | FRA | FRB | /  | XO | Rc |

* ffsubs FRT,FRA,FRB (Rc=0)

```
FRT <- FPSUB32(FRB, FRA)
FRS <- FPADD32(FRA, FRB)
```
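
Read side by side, `ffadds` and `ffsubs` compute the same two values and differ only in which result goes to `FRT` and which to the implicit `FRS`. The sketch below models both in C. The function names are hypothetical, and `float` arithmetic is assumed to be IEEE754 single precision with flags and exceptions ignored.

```c
/* Sketch of ffadds: FRT <- FRA + FRB, FRS <- FRB - FRA. */
static void ffadds_model(float fra, float frb, float *frt, float *frs) {
    *frt = fra + frb;
    *frs = frb - fra;
}

/* Sketch of ffsubs: the same two values with the destinations swapped. */
static void ffsubs_model(float fra, float frb, float *frt, float *frs) {
    *frt = frb - fra;
    *frs = fra + frb;
}
```

Note that the subtraction is `FRB - FRA` in both cases, not `FRA - FRB`.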

Special Registers Altered:

## Floating-Point Subtract FFT/DCT [Double]

| 0  | 6   | 11  | 16  | 21 | 26 | 31 |
|----|-----|-----|-----|----|----|----|
| PO | FRT | FRA | FRB | /  | XO | Rc |

* ffsub FRT,FRA,FRB (Rc=0)

```
FRT <- FPSUB64(FRB, FRA)
FRS <- FPADD64(FRA, FRB)
```

Special Registers Altered: