* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]
# Rationale for Twin Butterfly Integer DCT Instruction(s)

The number of general-purpose uses for DCT is huge. Without these
Twin-Butterfly instructions, each butterfly needs **eight** instructions,
and because it is extremely common to explicitly loop-unroll them,
sequences of hundreds to thousands of instructions are dismayingly
common (for all ISAs).
The goal is to implement instructions that calculate the expression:

    fdct_round_shift((a +/- b) * c)

for the single-coefficient butterfly instruction, and:

    fdct_round_shift(a * c1 +/- b * c2)

for the double-coefficient butterfly instruction.
`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

    #define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))
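
As a worked illustration of the rounding, the sketch below wraps the macro in small C helpers for both butterfly forms. The helper names (`butterfly1_add`, `butterfly1_sub`, `butterfly2_add`, `butterfly2_sub`) are hypothetical, and the 16-bit operand types are an assumption. Like the macro itself, the sketch relies on `>>` of a negative `int` being an arithmetic shift (implementation-defined in C) and on the intermediate products not overflowing 32 bits.

```c
#include <stdint.h>

#define ROUND_POWER_OF_TWO(value, n) \
    (((value) + (1 << ((n)-1))) >> (n))

/* fdct_round_shift as defined above: add half an LSB, then drop 14 bits. */
static int32_t fdct_round_shift(int32_t x)
{
    return ROUND_POWER_OF_TWO(x, 14);
}

/* Single-coefficient butterfly: fdct_round_shift((a +/- b) * c). */
static int32_t butterfly1_add(int16_t a, int16_t b, int16_t c)
{
    return fdct_round_shift((a + b) * c);
}

static int32_t butterfly1_sub(int16_t a, int16_t b, int16_t c)
{
    return fdct_round_shift((a - b) * c);
}

/* Double-coefficient butterfly: fdct_round_shift(a * c1 +/- b * c2). */
static int32_t butterfly2_add(int16_t a, int16_t b, int16_t c1, int16_t c2)
{
    return fdct_round_shift(a * c1 + b * c2);
}

static int32_t butterfly2_sub(int16_t a, int16_t b, int16_t c1, int16_t c2)
{
    return fdct_round_shift(a * c1 - b * c2);
}
```

For instance, with c = 11585 (approximately 2^14 × cos(π/4)), `butterfly1_add(3, 1, 11585)` evaluates (4 × 11585 + 8192) >> 14 = 54532 >> 14 = 3.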
These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including (but not limited to) VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations, exposed
through the NEON intrinsics `vqrdmulhq_s16`/`vqrdmulhq_s32`, although
they are limited in precision.

The suggestion is to have a single instruction that calculates both
values, `((a + b) * c) >> N` and `((a - b) * c) >> N`. The instruction
will run in accumulate mode, so to calculate the 2-coeff version one
would just call the same instruction again with a and b in a different
order and with a different constant c.
Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:

    #include <stdint.h>

    #define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))

    void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
        /* t[0] is the rounded sum product, t[1] the rounded difference product */
        t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
        t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
    }

Eight instructions are required; they are replaced by just the one
(`maddsubrs`):
## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   | RT   | RA   | RB   | SH   | XO   |Rc |

* maddsubrs RT,RA,SH,RB

Pseudo-code:

    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1]
    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1]

    #round <- EXTS([0]*(XLEN-1) || [1]*1)
    #prod1 <- ROTL64(prod1, 1)
    #prod2 <- ROTL64(prod2, 1)
    #prod1 <- prod1 + round
    #prod2 <- prod2 + round
    #res1 <- ROTL64(prod1, XLEN-1)
    #res2 <- ROTL64(prod2, XLEN-1)
    #m <- MASK(1, (XLEN-1))

    round <- EXTS([0]*(XLEN -n -1) || [1]*1 || [0]*(n-1))
    prod1 <- prod1 + round
    prod2 <- prod2 + round
    res1 <- ROTL64(prod1, XLEN-n)
    res2 <- ROTL64(prod2, XLEN-n)
    m <- MASK(n, (XLEN-1))

    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    RT <- (res1 & m | smask1)
    RS <- (res2 & m | smask2)
Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
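
The rounding, rotate and sign-mask steps of the pseudo-code amount to an arithmetic right shift of the rounded product. The C sketch below models that for XLEN=64 on raw 64-bit register values. It rests on several assumptions: that `sum` and `diff` are `RT + RA` and `RT - RA`, that `n` is the SH field with 1 <= n < 64, and that the products wrap modulo 2^64. The names `sra64`, `twin_result`, `maddsubrs_model` and `implicit_rs` are hypothetical, chosen for illustration only, and `implicit_rs` ignores any SVP64 element-width overrides.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic shift right of a 64-bit two's-complement bit pattern:
 * logical shift, then replicate the sign bit into the top n bits. */
static uint64_t sra64(uint64_t x, unsigned n)
{
    uint64_t shifted = x >> n;
    if ((x >> 63) != 0 && n > 0)
        shifted |= ~UINT64_C(0) << (64 - n);
    return shifted;
}

/* Hypothetical container for the explicit (RT) and implicit (RS) results. */
typedef struct {
    uint64_t rt;
    uint64_t rs;
} twin_result;

/* Model of the twin butterfly: both results use the same input values. */
static twin_result maddsubrs_model(uint64_t rt, uint64_t ra,
                                   uint64_t rb, unsigned sh)
{
    assert(sh >= 1 && sh < 64);            /* assumption: n = 0 not modelled */
    uint64_t sum   = rt + ra;              /* assumed definition of sum  */
    uint64_t diff  = rt - ra;              /* assumed definition of diff */
    uint64_t prod1 = rb * sum;             /* low 64 bits of the product */
    uint64_t prod2 = rb * diff;
    uint64_t round = UINT64_C(1) << (sh - 1);
    twin_result out;
    out.rt = sra64(prod1 + round, sh);
    out.rs = sra64(prod2 + round, sh);
    return out;
}

/* Register number of the implicit RS: RT+1 when scalar,
 * RT+MAXVL when RT is a vector. */
static unsigned implicit_rs(unsigned rt, int rt_is_vector, unsigned maxvl)
{
    return rt_is_vector ? rt + maxvl : rt + 1;
}
```

With the inputs x0 = 3, x1 = 1 and c = 11585, `maddsubrs_model(3, 1, 11585, 14)` gives RT = 3 and RS = 1, the same values `twin_int` produces for `t[0]` and `t[1]`.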
Special Registers Altered:

# Twin Butterfly Floating-Point DCT Instruction(s)

**Add the following to Book I Section 4.6.6.3**

## Floating-Point Twin Multiply-Add DCT [Single]

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  | XO   |Rc |

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)
The two IEEE754-FP32 operations

    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result is stored in FRS.

Using the exact same input values of FRT and FRB that were used to
create FRS, the floating-point operand in register FRB is subtracted
from the floating-point operand in register FRT. The result is rounded,
then multiplied by FRA, and the product is stored in FRT.
The add into FRS is treated exactly as `fadds`. The creation of the
result FRT is **not** the same as that of `fmsubs`, but is instead as if
`fsubs` were performed first followed by `fmuls`. FRS and FRT are
created by parallel independent operations which occur at exactly the
same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
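
The data flow described above can be sketched in C as follows. This is a minimal model under stated assumptions, not a definitive implementation: `fp32_pair` and `fdmadds_model` are hypothetical names, the host is assumed to use IEEE754 binary32 for `float`, and status flags and exceptions are not modelled. The `volatile` temporaries are there so that each step is rounded to FP32 before the next one uses it, which is what distinguishes this from a fused operation such as `fmsubs`.

```c
/* Hypothetical container for the explicit (FRT) and implicit (FRS) results. */
typedef struct {
    float frt;
    float frs;
} fp32_pair;

/* Model of fdmadds: both results are computed from the original
 * FRT and FRB values, which are passed in by value. */
static fp32_pair fdmadds_model(float frt, float fra, float frb)
{
    fp32_pair out;
    volatile float add = frt + frb;   /* as fadds: rounded once          */
    volatile float sub = frt - frb;   /* as fsubs: rounded before use    */
    volatile float mul = fra * sub;   /* as fmuls: rounded a second time */
    out.frs = add;
    out.frt = mul;
    return out;
}
```

For example, `fdmadds_model(5.0f, 0.5f, 3.0f)` yields FRS = 5 + 3 = 8 and FRT = (5 - 3) × 0.5 = 1.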
Special Registers Altered:

## Floating-Point Multiply-Add FFT [Single]

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  | XO   |Rc |

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)

The two IEEE754-FP32 operations

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)

are simultaneously performed.
The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same input values of FRT, FRA and FRB, the
floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.
Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:
## Floating-Point Twin Multiply-Add DCT

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  | XO   |Rc |

* fdmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPADD64(FRT, FRB)
    sub <- FPSUB64(FRT, FRB)
    FRT <- FPMUL64(FRA, sub)
The two IEEE754-FP64 operations

    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result is stored in FRS.

Using the exact same input values of FRT and FRB that were used to
create FRS, the floating-point operand in register FRB is subtracted
from the floating-point operand in register FRT. The result is rounded,
then multiplied by FRA, and the product is stored in FRT.
The add into FRS is treated exactly as `fadd`. The creation of the
result FRT is **not** the same as that of `fmsub`, but is instead as if
`fsub` were performed first followed by `fmul`. FRS and FRT are
created by parallel independent operations which occur at exactly the
same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:
## Floating-Point Twin Multiply-Add FFT

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  | XO   |Rc |

* ffmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)

The two IEEE754-FP64 operations

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)

are simultaneously performed.
The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same input values of FRT, FRA and FRB, the
floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.
Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:
## [DRAFT] Floating-Point Add FFT/DCT [Single]

* ffadds FRT,FRA,FRB (Rc=0)
* ffadds. FRT,FRA,FRB (Rc=1)

Pseudo-code:

    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)

Special Registers Altered:

## [DRAFT] Floating-Point Add FFT/DCT [Double]

* ffadd FRT,FRA,FRB (Rc=0)
* ffadd. FRT,FRA,FRB (Rc=1)

Pseudo-code:

    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)

Special Registers Altered:
## [DRAFT] Floating-Point Subtract FFT/DCT [Single]

* ffsubs FRT,FRA,FRB (Rc=0)
* ffsubs. FRT,FRA,FRB (Rc=1)

Pseudo-code:

    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)

Special Registers Altered:

## [DRAFT] Floating-Point Subtract FFT/DCT [Double]

* ffsub FRT,FRA,FRB (Rc=0)
* ffsub. FRT,FRA,FRB (Rc=1)

Pseudo-code:

    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)

Special Registers Altered: