* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

# Rationale for Twin Butterfly Integer DCT Instruction(s)

The number of general-purpose uses for DCT is huge. The number of
instructions needed in place of each Twin-Butterfly instruction is also
large (**eight**), and because it is extremely common to explicitly
loop-unroll them, sequences of hundreds to thousands of instructions are
dismayingly common (for all ISAs).

The goal is to implement instructions that calculate the expression:

    fdct_round_shift((a +/- b) * c)

for the single-coefficient butterfly instruction, and:

    fdct_round_shift(a * c1 +/- b * c2)

for the double-coefficient butterfly instruction.

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
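
As a worked illustration of the two expressions above, the following C sketch evaluates both butterfly forms with plain integer arithmetic. The helper names (`single_coeff_butterfly`, `double_coeff_butterfly`) are hypothetical and exist only for this example. The sketch assumes that `>>` on a negative `int32_t` is an arithmetic shift, which C leaves implementation-defined, and that the products fit in 32 bits.

```c
#include <stdint.h>

/* Rounding right shift: add half of the final LSB, then shift by n bits.
 * Assumes >> on negative values is an arithmetic shift. */
static int32_t round_power_of_two(int32_t value, int n)
{
    return (value + (1 << (n - 1))) >> n;
}

/* fdct_round_shift as defined in the text: a rounding shift by 14 bits. */
static int32_t fdct_round_shift(int32_t x)
{
    return round_power_of_two(x, 14);
}

/* Single-coefficient form: out[0] = (a + b) * c, out[1] = (a - b) * c. */
static void single_coeff_butterfly(int32_t out[2], int32_t a, int32_t b,
                                   int32_t c)
{
    out[0] = fdct_round_shift((a + b) * c);
    out[1] = fdct_round_shift((a - b) * c);
}

/* Double-coefficient form: out[0] = a*c1 + b*c2, out[1] = a*c1 - b*c2. */
static void double_coeff_butterfly(int32_t out[2], int32_t a, int32_t b,
                                   int32_t c1, int32_t c2)
{
    out[0] = fdct_round_shift(a * c1 + b * c2);
    out[1] = fdct_round_shift(a * c1 - b * c2);
}
```

With the sample coefficient 11585 (approximately `16384 * cos(pi/4)`), inputs `a = 3`, `b = 1` give 3 and 1 from the single-coefficient form. With sample coefficients 15137 and 6270 (approximately `16384 * cos(pi/8)` and `16384 * cos(3*pi/8)`), inputs `a = 100`, `b = 50` give 112 and 73 from the double-coefficient form.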

These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including (but not limited to) VP8/VP9 and AV1.
Arm provides special instructions to optimize these operations, exposed
as the intrinsics `vqrdmulhq_s16`/`vqrdmulhq_s32`, although they are
limited in precision.
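
To make the precision limitation concrete, here is a scalar C sketch of what one 16-bit lane of `vqrdmulhq_s16` computes: a saturating, rounding, doubling multiply that returns the high half. It is based on Arm's published description of the operation and is not a verified reference model; the function name `sqrdmulh_s16_lane` is hypothetical, and `>>` on negative values is assumed to be an arithmetic shift.

```c
#include <stdint.h>

/* One 16-bit lane of a saturating rounding doubling multiply, high half:
 * result = sat16((2*a*b + 2^15) >> 16).  Hypothetical helper name. */
static int16_t sqrdmulh_s16_lane(int16_t a, int16_t b)
{
    /* 64-bit intermediate so that 2 * (-32768) * (-32768) cannot overflow. */
    int64_t doubled = 2 * (int64_t)a * (int64_t)b;
    int64_t rounded = (doubled + (1 << 15)) >> 16;

    /* Only a == b == -32768 can exceed the int16_t range. */
    if (rounded > INT16_MAX)
        rounded = INT16_MAX;
    return (int16_t)rounded;
}
```

Both operands and the result are confined to the lane width and the effective shift is fixed at 15 bits, so a sum or difference fed into the multiply must already fit in 16 bits. The expressions in this document instead apply a 14-bit rounding shift to a wider intermediate.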

The suggestion is to have a single instruction that calculates both
values, `((a + b) * c) >> N` and `((a - b) * c) >> N`. The instruction
will run in accumulate mode, so to calculate the two-coefficient version
one would just call the same instruction again with `a` and `b` in a
different order and with a different constant `c`.

Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:
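
    #include <stdint.h>

    /* Rounding right shift: add half an LSB, then shift right by n. */
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))

    /* Twin butterfly: t[0] from the sum, t[1] from the difference. */
    void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
        t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
        t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
    }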

Without a dedicated instruction, 8 instructions are required; they are
replaced by just the one (`maddsubrs`).

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   |  RT  |  RA  |  RB  |  SH  |  XO  |Rc |

* maddsubrs RT,RA,SH,RB

Pseudo-code:

    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1] + 1
    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1] + 1
    res1 <- ROTL64(prod1, XLEN-n)
    res2 <- ROTL64(prod2, XLEN-n)
    m <- MASK(n, (XLEN-1))
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    s64_1 <- [0]*(XLEN-1) || signbit1
    s64_2 <- [0]*(XLEN-1) || signbit2
    RT <- (res1 & m | smask1) + s64_1
    RS <- (res2 & m | smask2) + s64_2

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
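
The register-numbering rule in the paragraph above can be sketched in C as follows. The function name and parameters are hypothetical, and the model is deliberately simplified: it covers only the two cases stated here and ignores every other SVP64 feature that can affect register numbering.

```c
/* Index of the implicit second result register (RS or FRS).
 * rt           : register number of the explicit result RT (or FRT)
 * rt_is_vector : nonzero when SVP64 marks RT as a Vector
 * maxvl        : SVSTATE.MAXVL, the maximum vector length
 * Hypothetical helper; a simplified reading of the rule in the text. */
static unsigned implicit_rs(unsigned rt, int rt_is_vector, unsigned maxvl)
{
    if (rt_is_vector)
        return rt + maxvl;   /* RS starts right after the RT vector */
    return rt + 1;           /* scalar: RS is the next register */
}
```

For example, with `RT` = 4 as a Vector and `MAXVL` = 8, the `RT` results occupy registers 4 to 11 and `RS` starts at register 12; in the Scalar case `RS` is simply register 5.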

Special Registers Altered:

# Twin Butterfly Floating-Point DCT Instruction(s)

## Floating-Point Twin Multiply-Add DCT [Single]

**Add the following to Book I Section 4.6.6.3**

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  |  XO  |Rc |

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)

The two IEEE754-FP32 operations

    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)

are simultaneously performed.

The Floating-Point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the Floating-Point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then multiplied by FRA to create an intermediate result that
is stored in FRT.

The add into FRS is treated exactly as `fadds`. The creation of the
result FRT is **not** the same as that of `fmsubs`: the subtraction is
performed first and its result is then multiplied, whereas `fmsubs`
multiplies first and then subtracts.
The creation of FRS and FRT is treated as two parallel independent
operations which occur at the same time.
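
The "same input values, two independent results" behaviour described above can be sketched as a small C model. This is an illustration only: the names `twin_f32` and `fdmadds_model` are hypothetical, floating-point status flags and exceptions are not modelled, and the compiler is assumed to evaluate `float` expressions in single precision.

```c
/* Pair of results produced by one twin instruction. */
typedef struct {
    float frt;   /* explicit result, written back to FRT */
    float frs;   /* implicit result, written to FRS */
} twin_f32;

/* Sketch of fdmadds: both results are computed from the ORIGINAL
 * values of FRT and FRB, so neither write can disturb the other. */
static twin_f32 fdmadds_model(float frt, float fra, float frb)
{
    twin_f32 out;
    float sub;

    out.frs = frt + frb;   /* FPADD32(FRT, FRB) */
    sub = frt - frb;       /* FPSUB32(FRT, FRB), rounded to FP32 */
    out.frt = fra * sub;   /* FPMUL32(FRA, sub) */
    return out;
}
```

Passing the inputs by value is what keeps the two results independent here: `FRT = 3.0`, `FRA = 0.5`, `FRB = 1.0` yields `FRS = 4.0` and a new `FRT = 1.0`.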

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Multiply-Add FFT [Single]

**Add the following to Book I Section 4.6.6.3**

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  |  XO  |Rc |

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)

In equation form:

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
FRT, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.
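
A minimal C sketch of this pairing follows, with the hypothetical names `twin_f32` and `ffmadds_model`. It assumes IEEE754 `float` and `double`. The product of two FP32 values is exact in double precision, so the sketch forms it once and rounds each result to FP32 at the end; this approximates the single rounding of a fused operation but can differ from it in rare double-rounding cases, and status flags are not modelled.

```c
/* Pair of results produced by one twin instruction. */
typedef struct {
    float frt;   /* explicit result, written back to FRT */
    float frs;   /* implicit result, written to FRS */
} twin_f32;

/* Sketch of ffmadds: FRT as if by fmadds, FRS as if by fnmsubs,
 * both using the same original FRT, FRA and FRB values. */
static twin_f32 ffmadds_model(float frt, float fra, float frb)
{
    twin_f32 out;
    /* The product of two FP32 values is exact in double precision. */
    double prod = (double)frt * (double)fra;

    out.frt = (float)(prod + (double)frb);      /* (FRT*FRA) + FRB    */
    out.frs = (float)(-(prod - (double)frb));   /* -((FRT*FRA) - FRB) */
    return out;
}
```

For `FRT = 2.0`, `FRA = 3.0`, `FRB = 1.0` the model returns a new `FRT = 7.0` and `FRS = -5.0`.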

FRT is a Read-Modify-Write operand.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

## Floating-Point Twin Multiply-Add DCT

**Add the following to Book I Section 4.6.6.3**

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  |  XO  |Rc |

* fdmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPADD64(FRT, FRB)
    sub <- FPSUB64(FRT, FRB)
    FRT <- FPMUL64(FRA, sub)

The two IEEE754-FP64 operations

    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)

are simultaneously performed.

The Floating-Point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the Floating-Point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then multiplied by FRA to create an intermediate result that
is stored in FRT.

The add into FRS is treated exactly as `fadd`. The creation of the
result FRT is **not** the same as that of `fmsub`: the subtraction is
performed first and its result is then multiplied, whereas `fmsub`
multiplies first and then subtracts.
The creation of FRS and FRT is treated as two parallel independent
operations which occur at the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Twin Multiply-Add FFT

**Add the following to Book I Section 4.6.6.3**

    |0     |6     |11    |16    |21    |31 |
    | PO   | FRT  | FRA  | FRB  |  XO  |Rc |

* ffmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)

In equation form:

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
FRT, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operand.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## [DRAFT] Floating-Point Add FFT/DCT [Single]

* ffadds FRT,FRA,FRB (Rc=0)
* ffadds. FRT,FRA,FRB (Rc=1)

Pseudo-code:

    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)

Special Registers Altered:

## [DRAFT] Floating-Point Add FFT/DCT [Double]

* ffadd FRT,FRA,FRB (Rc=0)
* ffadd. FRT,FRA,FRB (Rc=1)

Pseudo-code:

    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)

Special Registers Altered:

## [DRAFT] Floating-Point Subtract FFT/DCT [Single]

* ffsubs FRT,FRA,FRB (Rc=0)
* ffsubs. FRT,FRA,FRB (Rc=1)

Pseudo-code:

    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)
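
The draft add and subtract forms compute the same two values and differ only in which register receives which one. The following C sketch shows the single-precision pair side by side; the names `twin_f32`, `ffadds_model` and `ffsubs_model` are hypothetical, and status flags and Rc=1 behaviour are not modelled.

```c
/* Pair of results produced by one twin instruction. */
typedef struct {
    float frt;   /* explicit result, written to FRT */
    float frs;   /* implicit result, written to FRS */
} twin_f32;

/* ffadds: the sum goes to FRT, the difference (FRB - FRA) goes to FRS. */
static twin_f32 ffadds_model(float fra, float frb)
{
    twin_f32 out;
    out.frt = fra + frb;   /* FPADD32(FRA, FRB) */
    out.frs = frb - fra;   /* FPSUB32(FRB, FRA) */
    return out;
}

/* ffsubs: the same two values, with the destinations exchanged. */
static twin_f32 ffsubs_model(float fra, float frb)
{
    twin_f32 out;
    out.frt = frb - fra;   /* FPSUB32(FRB, FRA) */
    out.frs = fra + frb;   /* FPADD32(FRA, FRB) */
    return out;
}
```

With `FRA = 1.0` and `FRB = 4.0`, `ffadds_model` gives `FRT = 5.0`, `FRS = 3.0`, and `ffsubs_model` gives `FRT = 3.0`, `FRS = 5.0`.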

Special Registers Altered:

## [DRAFT] Floating-Point Subtract FFT/DCT [Double]

* ffsub FRT,FRA,FRB (Rc=0)
* ffsub. FRT,FRA,FRB (Rc=1)

Pseudo-code:

    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)

Special Registers Altered: