* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

# Rationale for Twin Butterfly Integer DCT Instruction(s)

DCT has a huge number of general-purpose uses. Without these
Twin-Butterfly instructions, each butterfly needs **eight** instructions,
and because it is extremely common to explicitly loop-unroll them,
sequences of hundreds to thousands of instructions are dismayingly
common (for all ISAs).

The goal is to implement instructions that calculate the expression

    fdct_round_shift((a +/- b) * c)

for the single-coefficient butterfly instruction, and

    fdct_round_shift(a * c1 +/- b * c2)

for the double-coefficient butterfly instruction.

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))

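
As a quick check of the rounding behaviour, the sketch below evaluates the macro's arithmetic for a few inputs. The names `asr32` and `fdct_round_shift_ref` are hypothetical and exist only for this illustration. The arithmetic shift is spelled out by hand because right-shifting a negative value is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Portable arithmetic shift right: when x is negative, ~x is non-negative,
 * so the shift itself never operates on a negative value. */
static int32_t asr32(int32_t x, unsigned n)
{
    assert(n < 32);
    return x < 0 ? ~(~x >> n) : x >> n;
}

/* Hypothetical reference for fdct_round_shift(x): add half of 2^14, then
 * shift right by 14, i.e. ROUND_POWER_OF_TWO(x, 14). The caller must keep
 * x small enough that x + 8192 does not overflow int32_t. */
static int32_t fdct_round_shift_ref(int32_t x)
{
    return asr32(x + (1 << 13), 14);
}
```

With this definition ties round towards positive infinity: an input of 8192 (exactly half of 2^14) gives 1, while -8192 gives 0.
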
These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including, but not limited to, VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations, although
they are limited in precision: see the intrinsics
`vqrdmulhq_s16`/`vqrdmulhq_s32`.

The suggestion is to have a single instruction that calculates both values
`((a + b) * c) >> N` and `((a - b) * c) >> N`. The instruction will
run in accumulate mode, so to calculate the 2-coeff version one would
just have to call the same instruction with `a` and `b` in a different
order and a different constant `c`.

Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:

    #include <stdint.h>

    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))

    /* t[0] and t[1] receive the rounded sum and difference butterflies */
    void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
        t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
        t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
    }

Eight instructions are required; they are replaced by just one (maddsubrs):

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

    |0   |6    |11   |16   |21   |26   |31 |
    | PO | RT  | RA  | RB  | SH  | XO  |Rc |

* maddsubrs RT,RA,SH,RB

Pseudo-code:

    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1]
    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1]
    #round <- EXTS([0]*(XLEN-1) || [1]*1)
    #prod1 <- ROTL64(prod1, 1)
    #prod2 <- ROTL64(prod2, 1)
    #prod1 <- prod1 + round
    #prod2 <- prod2 + round
    #res1 <- ROTL64(prod1, XLEN-1)
    #res2 <- ROTL64(prod2, XLEN-1)
    #m <- MASK(1, (XLEN-1))
    round <- EXTS([0]*(XLEN -n -1) || [1]*1 || [0]*(n-1))
    prod1 <- prod1 + round
    prod2 <- prod2 + round
    res1 <- ROTL64(prod1, XLEN-n)
    res2 <- ROTL64(prod2, XLEN-n)
    m <- MASK(n, (XLEN-1))
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    RT <- (res1 & m | smask1)
    RS <- (res2 & m | smask2)

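
The rounding and masking steps above amount to adding a rounding constant and then shifting right arithmetically by `n`. The following C model is a minimal sketch of that reading, not the reference implementation. The definitions of `sum`, `diff`, `n` and the sign bits are not shown above, so the sketch assumes `sum = RT + RA`, `diff = RT - RA`, `n = SH`, XLEN = 64, and that each sign bit is the most significant bit of the rounded product. The names `twin64`, `round_shift` and `maddsubrs_model` are hypothetical, and the `n = 0` case is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Result pair: rt is the explicit result, rs the implicit one. */
typedef struct { uint64_t rt, rs; } twin64;

/* Add the rounding constant, then shift right arithmetically by n.
 * Mirrors: round = 1 << (n-1); rotate by XLEN-n; keep the low bits with
 * mask m; fill the top n bits with copies of the sign bit (smask). */
static uint64_t round_shift(uint64_t prod, unsigned n)
{
    uint64_t m, res;
    assert(n >= 1 && n <= 63);          /* n = 0 is not modelled here */
    prod += (uint64_t)1 << (n - 1);     /* prod + round */
    m = UINT64_MAX >> n;                /* MASK(n, XLEN-1), MSB0 numbering */
    res = (prod >> n) & m;              /* low XLEN-n bits of the rotation */
    if (prod >> 63)                     /* ASSUMPTION: signbit = MSB of prod */
        res |= ~m;                      /* smask = sign bits & ~m */
    return res;
}

/* ASSUMPTION: sum = RT + RA, diff = RT - RA, n = SH, XLEN = 64. */
static twin64 maddsubrs_model(uint64_t rt, uint64_t ra, unsigned sh,
                              uint64_t rb)
{
    twin64 out;
    uint64_t sum = rt + ra;             /* wraps modulo 2^64 */
    uint64_t diff = rt - ra;
    /* The low 64 bits of a product are identical for signed and unsigned
     * operands, so unsigned multiplication avoids signed-overflow UB. */
    out.rt = round_shift(rb * sum, sh);
    out.rs = round_shift(rb * diff, sh);
    return out;
}
```

For example, `maddsubrs_model(100, (uint64_t)(-50), 14, 11585)` models a butterfly with x0 = 100, x1 = -50 and a coefficient of 11585, producing 35 in `rt` and 106 in `rs`.
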
Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
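
The rule for locating the implicit result can be sketched as a small helper. This is an illustration only: `implicit_rs` is a hypothetical name, and the sketch assumes register numbers are plain indices with any SVP64 register-number extension already applied to `rt`.

```c
#include <assert.h>
#include <stdbool.h>

/* ASSUMPTION: rt is the final (already extended) register index. */
static unsigned implicit_rs(unsigned rt, bool rt_is_vector, unsigned maxvl)
{
    assert(maxvl >= 1);
    /* Scalar: RS = RT+1.
     * Vector: RS starts right after the MAXVL elements reserved for RT. */
    return rt + (rt_is_vector ? maxvl : 1u);
}
```

With `rt = 4` and `maxvl = 8`, the scalar case gives register 5 and the vector case gives register 12.
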

Special Registers Altered:

# Twin Butterfly Floating-Point DCT and FFT Instruction(s)

**Add the following to Book I Section 4.6.6.3**

## Floating-Point Twin Multiply-Add DCT [Single]

    |0   |6    |11   |16   |21   |31 |
    | PO | FRT | FRA | FRB | XO  |Rc |

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)

The two IEEE754-FP32 operations

    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then rounded before being multiplied by FRA to create an
intermediate result that is stored in FRT.

The add into FRS is treated exactly as `fadds`. The creation of the
result FRT is **not** the same as that of `fmsubs`, but is instead as if
`fsubs` were performed first followed by `fmuls`. The creation of FRS
and FRT are treated as parallel independent operations which occur at
exactly the same time.
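
The distinction from a fused operation can be sketched in C. This is a hypothetical model for illustration (`twin32` and `fdmadds_model` are not part of the specification): the difference is rounded to binary32 first and only then multiplied, and both results are computed from the original FRT and FRB values.

```c
/* Result pair: frt is the explicit result, frs the implicit one. */
typedef struct { float frt, frs; } twin32;

/* Hypothetical model: FRS = FRT + FRB and FRT = (FRT - FRB) * FRA,
 * with the difference rounded before the multiply (not fused). */
static twin32 fdmadds_model(float frt, float fra, float frb)
{
    twin32 out;
    /* volatile forces the difference to be stored as a binary32 value, so
     * the compiler cannot contract the subtract and multiply into one step */
    volatile float sub = frt - frb;
    out.frs = frt + frb;    /* as fadds */
    out.frt = fra * sub;    /* as fsubs followed by fmuls */
    return out;
}
```

For FRT = 3.0, FRA = 0.5 and FRB = 1.0 the model returns FRS = 4.0 and FRT = 1.0.
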

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Multiply-Add FFT [Single]

    |0   |6    |11   |16   |21   |31 |
    | PO | FRT | FRA | FRB | XO  |Rc |

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)

The two IEEE754-FP32 operations

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <-   [(FRT) * (FRA)] + (FRB)

are simultaneously performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
FRT, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.
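
A minimal C sketch of this pair of results follows. It is an approximation rather than a bit-exact model: the product of two binary32 values is exact in binary64, but the sum is rounded once in double and again when converted to float, which can differ from a true fused operation in rare cases. The names `twin32` and `ffmadds_model` are hypothetical.

```c
/* Result pair: frt is the explicit result, frs the implicit one. */
typedef struct { float frt, frs; } twin32;

/* Hypothetical model of
 *     FRT <-   (FRT * FRA) + FRB      (as fmadds)
 *     FRS <- -((FRT * FRA) - FRB)     (as fnmsubs)
 * Both results are computed from the ORIGINAL value of FRT. */
static twin32 ffmadds_model(float frt, float fra, float frb)
{
    twin32 out;
    /* exact: two 24-bit significands multiply into at most 48 bits */
    double prod = (double)frt * (double)fra;
    out.frt = (float)(prod + (double)frb);
    out.frs = (float)-(prod - (double)frb);
    return out;
}
```

For FRT = 2.0, FRA = 3.0 and FRB = 1.0 the model returns FRT = 7.0 and FRS = -5.0.
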

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

## Floating-Point Twin Multiply-Add DCT

    |0   |6    |11   |16   |21   |31 |
    | PO | FRT | FRA | FRB | XO  |Rc |

* fdmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPADD64(FRT, FRB)
    sub <- FPSUB64(FRT, FRB)
    FRT <- FPMUL64(FRA, sub)

The two IEEE754-FP64 operations

    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then rounded before being multiplied by FRA to create an
intermediate result that is stored in FRT.

The add into FRS is treated exactly as `fadd`. The creation of the
result FRT is **not** the same as that of `fmsub`, but is instead as if
`fsub` were performed first followed by `fmul`. The creation of FRS
and FRT are treated as parallel independent operations which occur at
exactly the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Twin Multiply-Add FFT

    |0   |6    |11   |16   |21   |31 |
    | PO | FRT | FRA | FRB | XO  |Rc |

* ffmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)

The two IEEE754-FP64 operations

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <-   [(FRT) * (FRA)] + (FRB)

are simultaneously performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
FRT, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

## Floating-Point Add FFT/DCT [Single]

    |0   |6    |11   |16   |21   |26   |31 |
    | PO | FRT | FRA | FRB | /   | XO  |Rc |

* ffadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)

Special Registers Altered:

## Floating-Point Add FFT/DCT [Double]

    |0   |6    |11   |16   |21   |26   |31 |
    | PO | FRT | FRA | FRB | /   | XO  |Rc |

* ffadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)

Special Registers Altered:

## Floating-Point Subtract FFT/DCT [Single]

    |0   |6    |11   |16   |21   |26   |31 |
    | PO | FRT | FRA | FRB | /   | XO  |Rc |

* ffsubs FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)
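
Going by the pseudo-code alone, `ffadds` and `ffsubs` compute the same sum and difference and differ only in which result lands in FRT and which in FRS. The sketch below illustrates that reading; `twin32`, `ffadds_model` and `ffsubs_model` are hypothetical names, not part of the specification.

```c
/* Result pair: frt is the explicit result, frs the implicit one. */
typedef struct { float frt, frs; } twin32;

/* ffadds: FRT = FRA + FRB, FRS = FRB - FRA */
static twin32 ffadds_model(float fra, float frb)
{
    twin32 out;
    out.frt = fra + frb;
    out.frs = frb - fra;
    return out;
}

/* ffsubs: FRT = FRB - FRA, FRS = FRA + FRB */
static twin32 ffsubs_model(float fra, float frb)
{
    twin32 out;
    out.frt = frb - fra;
    out.frs = fra + frb;
    return out;
}
```

For FRA = 1.5 and FRB = 4.0, `ffadds_model` returns FRT = 5.5 and FRS = 2.5, and `ffsubs_model` returns the same two values with the destinations swapped.
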

Special Registers Altered:

## Floating-Point Subtract FFT/DCT [Double]

    |0   |6    |11   |16   |21   |26   |31 |
    | PO | FRT | FRA | FRB | /   | XO  |Rc |

* ffsub FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)

Special Registers Altered: