* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

# Rationale for Twin Butterfly Integer DCT Instruction(s)
DCT has a huge number of general-purpose uses. Without these
Twin-Butterfly instructions, the number of instructions needed
in their place is also large (**eight**), and because it is
extremely common to explicitly loop-unroll them, sequences of
hundreds to thousands of instructions are dismayingly common.
The goal is to implement instructions that calculate the expression:

    fdct_round_shift((a +/- b) * c)

For the single-coefficient butterfly instruction, and:

    fdct_round_shift(a * c1 +/- b * c2)

For the double-coefficient butterfly instruction.

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

    #define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))
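As a rough illustration of the expressions above, the following Python sketch mirrors the macro and the two butterfly forms. The function names `butterfly_single` and `butterfly_double` and the constant name `DCT_CONST_BITS` are hypothetical, chosen only for this example. Python integers are unbounded and `>>` floors on negative values, so this does not model C overflow or implementation-defined shift behaviour.

```python
DCT_CONST_BITS = 14  # hypothetical name for the shift amount 14 used above


def round_power_of_two(value: int, n: int) -> int:
    """Python counterpart of the ROUND_POWER_OF_TWO macro (requires n >= 1)."""
    return (value + (1 << (n - 1))) >> n


def fdct_round_shift(x: int) -> int:
    """ROUND_POWER_OF_TWO(x, 14), as defined in the text."""
    return round_power_of_two(x, DCT_CONST_BITS)


def butterfly_single(a: int, b: int, c: int) -> tuple[int, int]:
    """Single-coefficient form: both the '+' and the '-' result."""
    return fdct_round_shift((a + b) * c), fdct_round_shift((a - b) * c)


def butterfly_double(a: int, b: int, c1: int, c2: int) -> tuple[int, int]:
    """Double-coefficient form: both the '+' and the '-' result."""
    return (fdct_round_shift(a * c1 + b * c2),
            fdct_round_shift(a * c1 - b * c2))


# 16384 is 1.0 in a 14-bit fixed-point scale, so the results equal a+b and a-b.
print(butterfly_single(3, 1, 16384))  # (4, 2)
```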
These instructions are at the core of **ALL** FDCT calculations in many major video codecs, including, but not limited to, VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations, although they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.

The suggestion is to have a single instruction to calculate both values: `((a + b) * c) >> N` and `((a - b) * c) >> N`.
The instruction will run in accumulate mode, so to calculate the 2-coeff version one would just have to call the same instruction with a different order of a and b and a different constant c.
## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

|0 |6 |11 |16 |21 |26 |31 |
| PO | RT | RA | RB | SH | XO |Rc |

* maddsubrs RT,RA,SH,RB

    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1]
    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1]
    res1 <- ROTL64(prod1, XLEN-n)
    res2 <- ROTL64(prod2, XLEN-n)
    m <- MASK(n, (XLEN-1))
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    s64_1 <- [0]*(XLEN-1) || signbit1
    s64_2 <- [0]*(XLEN-1) || signbit2
    RT <- (res1 & m | smask1) + s64_1
    RS <- (res2 & m | smask2) + s64_2
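The rotate, mask and sign-fill steps in the lines above can be modelled in Python as a minimal sketch. It covers only the steps from `ROTL64` to the final addition: how `n`, `sum`, `diff` and the sign bits are obtained is not shown above, so the sketch takes the product and its sign bit as parameters. The helper names `rotl`, `mask_msb0` and `shift_result` are hypothetical, and MSB0 bit numbering (bit 0 is the most significant bit) is assumed for `MASK`.

```python
def rotl(value: int, shift: int, xlen: int = 64) -> int:
    """Rotate an xlen-bit value left by shift bits."""
    shift %= xlen
    all_ones = (1 << xlen) - 1
    return ((value << shift) | (value >> (xlen - shift))) & all_ones


def mask_msb0(start: int, stop: int, xlen: int = 64) -> int:
    """Ones in bits start..stop inclusive, MSB0 numbering (assumed)."""
    width = stop - start + 1
    return ((1 << width) - 1) << (xlen - 1 - stop)


def shift_result(prod: int, signbit: int, n: int, xlen: int = 64) -> int:
    """Model of: res <- ROTL64(prod, XLEN-n); m <- MASK(n, XLEN-1);
    smask <- ([signbit]*XLEN) & ~m; result <- (res & m | smask) + signbit."""
    all_ones = (1 << xlen) - 1
    res = rotl(prod & all_ones, xlen - n, xlen)   # rotate right by n
    m = mask_msb0(n, xlen - 1, xlen)              # keeps the low xlen-n bits
    smask = (all_ones if signbit else 0) & ~m & all_ones
    return (((res & m) | smask) + signbit) & all_ones


print(hex(shift_result(0x100, 0, 4)))  # 0x10
```

With `signbit` set, the top `n` bits are filled with ones and one is added after the shift, exactly as the last two pseudo-code lines are written.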
Note that if Rc=1 an Illegal Instruction is raised.

Similar to `RTp`, this instruction produces an implicit result,
`RS`, which under Scalar circumstances is defined as `RT+1`.
For SVP64 if `RT` is a Vector, `RS` begins immediately after the
Vector `RT` where the length of `RT` is set by `SVSTATE.MAXVL`.
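The implicit-register rule just described can be sketched as a small helper. This is an illustration only: the function name `implicit_rs` and its parameters are hypothetical, and no register-file range checking is attempted.

```python
def implicit_rs(rt: int, rt_is_vector: bool = False, maxvl: int = 1) -> int:
    """Register number of the implicit result RS.

    Scalar: RS is RT+1.
    SVP64 Vector RT: RS starts immediately after the RT vector,
    whose length is taken from SVSTATE.MAXVL (passed in as maxvl).
    """
    if rt_is_vector:
        return rt + maxvl
    return rt + 1


print(implicit_rs(4))                               # 5
print(implicit_rs(8, rt_is_vector=True, maxvl=4))   # 12
```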
Special Registers Altered:

# Twin Butterfly Integer DCT Instruction(s)

## Floating Twin Multiply-Add DCT [Single]

**Add the following to Book I Section 4.6.6.3**

|0 |6 |11 |16 |21 |31 |
| PO | FRT | FRA | FRB | XO |Rc |

* fdmadds FRT,FRA,FRB (Rc=0)

    FRS <- FPADD32(FRT, FRB)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, -1)
The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB that
were used to create FRS, the floating-point operand in register FRB
is subtracted from the floating-point operand in register FRT and the
result then multiplied by FRA to create an intermediate result that is
stored in FRT.

The add into FRS is treated exactly as `fadd`. The creation
of the result FRT is exactly that of `fmsub`. FRS and FRT are created
as parallel independent operations which occur at the same time.

Note that if Rc=1 an Illegal Instruction is raised.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`.

Special Registers Altered:
## Floating Multiply-Add FFT [Single]

**Add the following to Book I Section 4.6.6.3**

|0 |6 |11 |16 |21 |31 |
| PO | FRT | FRA | FRB | XO |Rc |

* ffmadds FRT,FRA,FRB (Rc=0)

    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)
The floating-point operand in register FRT is multiplied
by the floating-point operand in register FRA. The
floating-point operand in register FRB is added to
this intermediate result, and the result is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create FRS,
the floating-point operand in register FRT is multiplied
by the floating-point operand in register FRA. The
floating-point operand in register FRB is subtracted from
this intermediate result, and the result is stored in FRT.

FRT is created as if
an `fmadds` operation had been performed. FRS is created as if
an `fnmsubs` operation had simultaneously been performed with
the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`.

Special Registers Altered:
## Floating Twin Multiply-Add DCT

**Add the following to Book I Section 4.6.6.3**

|0 |6 |11 |16 |21 |31 |
| PO | FRT | FRA | FRB | XO |Rc |

* fdmadd FRT,FRA,FRB (Rc=0)

    FRS <- FPADD64(FRT, FRB)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, -1)
The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB that
were used to create FRS, the floating-point operand in register FRB
is subtracted from the floating-point operand in register FRT and the
result then multiplied by FRA to create an intermediate result that is
stored in FRT.

The add into FRS is treated exactly as `fadd`. The creation
of the result FRT is exactly that of `fmsub`. FRS and FRT are created
as parallel independent operations which occur at the same time.

Note that if Rc=1 an Illegal Instruction is raised.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`.

Special Registers Altered:
## Floating Twin Multiply-Add FFT

**Add the following to Book I Section 4.6.6.3**

|0 |6 |11 |16 |21 |31 |
| PO | FRT | FRA | FRB | XO |Rc |

* ffmadd FRT,FRA,FRB (Rc=0)

    FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)
The floating-point operand in register FRT is multiplied
by the floating-point operand in register FRA. The
floating-point operand in register FRB is added to
this intermediate result, and the result is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create FRS,
the floating-point operand in register FRT is multiplied
by the floating-point operand in register FRA. The
floating-point operand in register FRB is subtracted from
this intermediate result, and the result is stored in FRT.

FRT is created as if
an `fmadd` operation had been performed. FRS is created as if
an `fnmsub` operation had simultaneously been performed with
the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.
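The Read-Modify-Write behaviour can be illustrated with a short Python sketch that follows the `ffmadd` pseudo-code lines (`FPMULADD64` with sign arguments `-1, 1` for FRS and `1, 1` for FRT). Several things here are assumptions rather than part of the specification: the helper `fpmuladd` and its reading of the two sign arguments, the use of an unfused Python multiply and add (so rounding can differ from a fused operation), and the omission of FPSCR, NaN and signed-zero handling.

```python
def fpmuladd(a: float, c: float, b: float, mulsign: int, addsign: int) -> float:
    """Assumed reading of FPMULADD64(a, c, b, mulsign, addsign):
    mulsign * (a * c) + addsign * b, computed unfused in this sketch."""
    return mulsign * (a * c) + addsign * b


def ffmadd(frt: float, fra: float, frb: float) -> tuple[float, float]:
    """Return (new FRT, FRS). Both results use the ORIGINAL frt value,
    which is why they are computed before anything is written back."""
    frs_new = fpmuladd(frt, fra, frb, -1, 1)   # -((FRT * FRA) - FRB)
    frt_new = fpmuladd(frt, fra, frb, 1, 1)    # (FRT * FRA) + FRB
    return frt_new, frs_new


print(ffmadd(2.0, 3.0, 1.0))  # (7.0, -5.0)
```

Returning both values from one function, instead of updating `frt` in place and then computing FRS, is what keeps the two results independent of each other.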
Note that if Rc=1 an Illegal Instruction is raised.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`.

Special Registers Altered:

## [DRAFT] Floating Add FFT/DCT [Single]

* ffadds FRT,FRA,FRB (Rc=0)
* ffadds. FRT,FRA,FRB (Rc=1)

    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)
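A minimal Python sketch of the single-precision pair above follows. Python floats are double precision, so single-precision rounding is approximated here by a round trip through `struct`; that approach, the helper name `to_single`, and the reading of `FPSUB32(FRB, FRA)` as `FRB - FRA` are assumptions of this example. FPSCR effects and the Rc=1 form are not modelled.

```python
import struct


def to_single(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def ffadds(fra: float, frb: float) -> tuple[float, float]:
    """Return (FRT, FRS) = (FRA + FRB, FRB - FRA), rounded to single precision."""
    a, b = to_single(fra), to_single(frb)
    frt = to_single(a + b)
    frs = to_single(b - a)
    return frt, frs


print(ffadds(1.5, 2.25))  # (3.75, 0.75)
```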
Special Registers Altered:

## [DRAFT] Floating Add FFT/DCT [Double]

* ffadd FRT,FRA,FRB (Rc=0)
* ffadd. FRT,FRA,FRB (Rc=1)

    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)

Special Registers Altered:

## [DRAFT] Floating Subtract FFT/DCT [Single]

* ffsubs FRT,FRA,FRB (Rc=0)
* ffsubs. FRT,FRA,FRB (Rc=1)

    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)

Special Registers Altered:

## [DRAFT] Floating Subtract FFT/DCT [Double]

* ffsub FRT,FRA,FRB (Rc=0)
* ffsub. FRT,FRA,FRB (Rc=1)

    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)
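Read side by side, the double-precision `ffadd` and `ffsub` pseudo-code compute the same two values and differ only in which of FRT and FRS receives the sum and which the difference. The sketch below illustrates that with Python floats (which are double precision). It assumes `FPSUB64(FRB, FRA)` means `FRB - FRA`, and it ignores FPSCR effects and the Rc=1 forms.

```python
def ffadd(fra: float, frb: float) -> tuple[float, float]:
    """Return (FRT, FRS) for ffadd: sum in FRT, difference in FRS."""
    return fra + frb, frb - fra


def ffsub(fra: float, frb: float) -> tuple[float, float]:
    """Return (FRT, FRS) for ffsub: difference in FRT, sum in FRS."""
    return frb - fra, fra + frb


frt, frs = ffadd(1.5, 2.25)
print(frt, frs)                          # 3.75 0.75
print(ffsub(1.5, 2.25) == (frs, frt))    # True
```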
Special Registers Altered: