* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

# Rationale for Twin Butterfly Integer DCT Instruction(s)

The number of general-purpose uses for DCT is huge. The
number of instructions needed in place of these Twin-Butterfly
instructions is also huge (**eight**), and given that it is
extremely common to explicitly loop-unroll them, quantities of
hundreds to thousands of instructions are dismayingly common.

The goal is to implement instructions that calculate the expression:

    fdct_round_shift((a +/- b) * c)

for the single-coefficient butterfly instruction, and:

    fdct_round_shift(a * c1 +/- b * c2)

for the double-coefficient butterfly instruction.

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

    #define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))

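
As a quick illustration, the macro and the two butterfly expressions above can be written out in Python. This is only a sketch: the function names `butterfly_single` and `butterfly_double` are hypothetical, and Python's `>>` on negative integers is an arithmetic (flooring) shift, which is assumed here to match the C behaviour the macro relies on.

```python
DCT_CONST_BITS = 14  # the shift amount used by fdct_round_shift in the text


def round_power_of_two(value: int, n: int) -> int:
    """Python rendering of ROUND_POWER_OF_TWO: add half of 2**n, then shift."""
    return (value + (1 << (n - 1))) >> n


def fdct_round_shift(x: int) -> int:
    return round_power_of_two(x, DCT_CONST_BITS)


def butterfly_single(a: int, b: int, c: int):
    """Single-coefficient form: the '+' result and the '-' result."""
    return fdct_round_shift((a + b) * c), fdct_round_shift((a - b) * c)


def butterfly_double(a: int, b: int, c1: int, c2: int):
    """Double-coefficient form: the '+' result and the '-' result."""
    return (fdct_round_shift(a * c1 + b * c2),
            fdct_round_shift(a * c1 - b * c2))


# 11585 is roughly 2**14 / sqrt(2), a typical 14-bit fixed-point coefficient.
print(butterfly_single(100, 28, 11585))  # (91, 51)
```
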

These instructions are at the core of **ALL** FDCT calculations in many major video codecs, including (but not limited to) VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations, although they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.

The suggestion is to have a single instruction that calculates both values, `((a + b) * c) >> N` and `((a - b) * c) >> N`.
The instruction will run in accumulate mode, so to calculate the 2-coefficient version one would just call the same instruction with a different order of a and b and a different constant c.
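
The idea of producing both results from one operation can be sketched as follows. This is an illustration under stated assumptions, not the instruction's definition: `twin_butterfly` and `butterfly_stage` are hypothetical names, the plain (non-rounding) shift is taken directly from the two expressions above, and the mirrored pairing of elements in `butterfly_stage` is just one plausible way a DCT stage might feed the operation.

```python
def twin_butterfly(a: int, b: int, c: int, n: int):
    """Return both ((a + b) * c) >> n and ((a - b) * c) >> n."""
    return ((a + b) * c) >> n, ((a - b) * c) >> n


def butterfly_stage(data, c: int, n: int):
    """Apply twin_butterfly to mirrored pairs data[i], data[len - 1 - i].

    The pairing is an assumption made for illustration only.
    """
    out = list(data)
    last = len(data) - 1
    for i in range(len(data) // 2):
        out[i], out[last - i] = twin_butterfly(data[i], data[last - i], c, n)
    return out


print(twin_butterfly(100, 28, 11585, 14))            # (90, 50)
print(butterfly_stage([100, 7, 3, 28], 11585, 14))   # [90, 7, 2, 50]
```
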

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

    |0     |6     |11     |16     |21     |26     |31 |
    | PO   | RT   | RA    | RB    | SH    | XO    |Rc |

* maddsubrs RT,RA,SH,RB

Pseudo-code:

    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1]
    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1]
    res1 <- ROTL64(prod1, XLEN-n)
    res2 <- ROTL64(prod2, XLEN-n)
    m <- MASK(n, (XLEN-1))
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    s64_1 <- [0]*(XLEN-1) || signbit1
    s64_2 <- [0]*(XLEN-1) || signbit2
    RT <- (res1 & m | smask1) + s64_1
    RS <- (res2 & m | smask2) + s64_2


Note that if Rc=1 an Illegal Instruction is raised.

Similar to `RTp`, this instruction produces an implicit result,
`RS`, which under Scalar circumstances is defined as `RT+1`.
For SVP64 if `RT` is a Vector, `RS` begins immediately after the
Vector `RT` where the length of `RT` is set by `SVSTATE.MAXVL`.
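
The implicit-register rule described above amounts to a small calculation. The following is a minimal sketch with a hypothetical helper name, `implicit_rs`; it only encodes the two cases stated in the text and ignores anything else SVP64 may apply (element width, register-number limits and so on).

```python
def implicit_rs(rt: int, rt_is_vector: bool = False, maxvl: int = 1) -> int:
    """Register number of the implicit RS result for a given RT."""
    if rt_is_vector:
        # RS begins immediately after the RT vector, whose length is MAXVL.
        return rt + maxvl
    # Scalar case: RS is defined as RT+1.
    return rt + 1


print(implicit_rs(4))                               # 5
print(implicit_rs(8, rt_is_vector=True, maxvl=4))   # 12
```
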

Special Registers Altered:

# Twin Butterfly Integer DCT Instruction(s)

## Floating Twin Multiply-Add DCT [Single]

**Add the following to Book I Section 4.6.6.3**

    |0     |6     |11     |16     |21     |31 |
    | PO   | FRT  | FRA   | FRB   | XO    |Rc |

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPADD32(FRT, FRB)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, -1)

The Floating-Point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB that
were used to create FRS, the Floating-Point operand in register FRB
is subtracted from the floating-point operand in register FRT and the
result then multiplied by FRA to create an intermediate result that is
stored in FRT.

The add into FRS is treated exactly as `fadd`. The creation
of the result FRT is exactly that of `fmsub`. The creation of FRS and of FRT are
treated as parallel independent operations which occur at the same time.

Note that if Rc=1 an Illegal Instruction is raised.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`.

Special Registers Altered:

## Floating Multiply-Add FFT [Single]

**Add the following to Book I Section 4.6.6.3**

    |0     |6     |11     |16     |21     |31 |
    | PO   | FRT  | FRA   | FRB   | XO    |Rc |

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)

The floating-point operand in register FRT is multiplied
by the floating-point operand in register FRA. The floating-point
operand in register FRB is added to
this intermediate result, and the result stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create FRT,
the floating-point operand in register FRT is multiplied
by the floating-point operand in register FRA. The floating-point
operand in register FRB is subtracted from
this intermediate result, the difference is negated, and the result stored in FRS.

FRT is created as if
a `fmadds` operation had been performed. FRS is created as if
a `fnmsubs` operation had simultaneously been performed with
the exact same register operands, in parallel, independently,
at exactly the same time.
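
The pair of results given in the pseudo-code for `ffmadds` can be modelled in a few lines of Python. This is a rough sketch, not a bit-accurate reference: the register file is a plain dictionary, `FRS` is taken as `FRT+1` (the scalar case), `ffmadds_model` and `to_f32` are hypothetical names, and the result is computed in double precision and then rounded to binary32, which is not guaranteed to match a true fused operation in every case.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def ffmadds_model(fpr: dict, frt: int, fra: int, frb: int) -> dict:
    """Update the register dict fpr in place (scalar case, FRS = FRT+1)."""
    t, a, b = fpr[frt], fpr[fra], fpr[frb]  # read every input before writing
    fpr[frt + 1] = to_f32(-(t * a - b))     # FRS <- -([(FRT) * (FRA)] - (FRB))
    fpr[frt] = to_f32(t * a + b)            # FRT <- [(FRT) * (FRA)] + (FRB)
    return fpr


regs = {0: 3.0, 1: 0.0, 2: 0.5, 3: 1.0}
ffmadds_model(regs, frt=0, fra=2, frb=3)
print(regs[0], regs[1])  # 2.5 -0.5
```

Reading all three inputs before either write is what keeps the two results independent, even though FRT is both a source and a destination.
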

FRT is a Read-Modify-Write operand.

Note that if Rc=1 an Illegal Instruction is raised.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`.

Special Registers Altered:

## [DRAFT] Floating Add FFT/DCT [Single]

* ffadds FRT,FRA,FRB (Rc=0)
* ffadds. FRT,FRA,FRB (Rc=1)

Pseudo-code:

    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)

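
For the draft `ffadds` pseudo-code, the sum and difference pair can be sketched in Python as below. The names `ffadds_model` and `to_f32` are hypothetical, `FPSUB32(FRB, FRA)` is assumed to mean "FRB minus FRA", and rounding to binary32 is approximated by computing in double precision and converting.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def ffadds_model(a: float, b: float):
    """Return (FRT, FRS) for inputs FRA = a and FRB = b."""
    frt = to_f32(a + b)   # FRT <- FPADD32(FRA, FRB)
    frs = to_f32(b - a)   # FRS <- FPSUB32(FRB, FRA), assumed to be FRB - FRA
    return frt, frs


print(ffadds_model(1.5, 4.0))  # (5.5, 2.5)
```
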

Special Registers Altered:

## [DRAFT] Floating Add FFT/DCT [Double]

* ffadd FRT,FRA,FRB (Rc=0)
* ffadd. FRT,FRA,FRB (Rc=1)

Pseudo-code:

    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)

Special Registers Altered:

## [DRAFT] Floating Subtract FFT/DCT [Single]

* ffsubs FRT,FRA,FRB (Rc=0)
* ffsubs. FRT,FRA,FRB (Rc=1)

Pseudo-code:

    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)

Special Registers Altered:

## [DRAFT] Floating Subtract FFT/DCT [Double]

* ffsub FRT,FRA,FRB (Rc=0)
* ffsub. FRT,FRA,FRB (Rc=1)

Pseudo-code:

    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)

Special Registers Altered: