* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]

# Rationale for Twin Butterfly Integer DCT Instruction(s)

DCT has a huge number of general-purpose uses. The number of
instructions needed in place of each of these Twin-Butterfly
instructions is also large (**eight**), and because it is extremely
common to explicitly loop-unroll them, sequences of hundreds to
thousands of instructions are dismayingly common.

The goal is to implement instructions that calculate the expression

    fdct_round_shift((a +/- b) * c)

for the single-coefficient butterfly instruction, and

    fdct_round_shift(a * c1 +/- b * c2)

for the double-coefficient butterfly instruction.

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

    #define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))
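
As a minimal sketch, the rounding shift defined above can be modelled in Python. The names `ROUND_BITS`, `round_power_of_two` and `fdct_round_shift` below are illustrative transcriptions of the macro, not part of any existing API. The sketch relies on Python's `>>` being an arithmetic (flooring) shift for negative integers, which is also what the C macro is usually assumed to do on two's-complement targets.

```python
ROUND_BITS = 14  # shift amount used by fdct_round_shift in the text


def round_power_of_two(value: int, n: int) -> int:
    """Add half of 2**n, then shift right by n bits (requires n >= 1)."""
    return (value + (1 << (n - 1))) >> n


def fdct_round_shift(x: int) -> int:
    """Round x / 2**14 to the nearest integer, ties towards +infinity."""
    return round_power_of_two(x, ROUND_BITS)


if __name__ == "__main__":
    for x in (16384, 8192, 8191, -8192, -8193):
        print(x, fdct_round_shift(x))
```

Values exactly halfway between two integers round towards positive infinity, so `8192` becomes `1` while `-8192` becomes `0`.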

These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including, but not limited to, VP8/VP9 and AV1.
Arm NEON provides the `vqrdmulhq_s16`/`vqrdmulhq_s32` intrinsics to
optimize these operations, although they are limited in precision.

The suggestion is to have a single instruction that calculates both
values, `((a + b) * c) >> N` and `((a - b) * c) >> N`. The instruction
runs in accumulate mode, so the two-coefficient version can be
calculated by calling the same instruction again with `a` and `b` in a
different order and a different constant `c`.
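
The pair of values described above can be sketched as a small reference model. This is an illustration only: the function name `twin_butterfly` is hypothetical, the rounding term is taken from the `fdct_round_shift` definition rather than from the instruction pseudo-code, and register width (overflow, truncation to XLEN bits) is deliberately ignored.

```python
def twin_butterfly(a: int, b: int, c: int, n: int = 14):
    """Return (((a + b) * c) >> n, ((a - b) * c) >> n) with rounding.

    Assumption: rounding adds 2**(n-1) before the shift, as in
    ROUND_POWER_OF_TWO. Unbounded Python integers are used, so no
    overflow or register truncation is modelled.
    """
    half = 1 << (n - 1)
    total = ((a + b) * c + half) >> n
    delta = ((a - b) * c + half) >> n
    return total, delta


if __name__ == "__main__":
    # 11585 is round(2**14 * cos(pi/4)), chosen here purely as an
    # illustrative fixed-point coefficient.
    print(twin_butterfly(100, 28, 11585))
```

With a coefficient of `1 << 14` (fixed-point 1.0) the model reduces to a plain sum and difference, which is a convenient sanity check.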

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

    |0     |6     |11    |16    |21    |26    |31 |
    | PO   | RT   | RA   | RB   | SH   | XO   |Rc |

* maddsubrs RT,RA,SH,RB

Pseudo-code:

    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1]
    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1]
    res1 <- ROTL64(prod1, XLEN-n)
    res2 <- ROTL64(prod2, XLEN-n)
    m <- MASK(n, (XLEN-1))
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    s64_1 <- [0]*(XLEN-1) || signbit1
    s64_2 <- [0]*(XLEN-1) || signbit2
    RT <- (res1 & m | smask1) + s64_1
    RS <- (res2 & m | smask2) + s64_2

Note that if Rc=1 an Illegal Instruction is raised.

Similar to `RTp`, this instruction produces an implicit result,
`RS`, which under Scalar circumstances is defined as `RT+1`.
For SVP64 if `RT` is a Vector, `RS` begins immediately after the
Vector `RT`, where the length of `RT` is set by `SVSTATE.MAXVL`.
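
The implicit-register rule in the paragraph above can be sketched as a one-line calculation. The helper name `implicit_rs` and its parameters are hypothetical, and the sketch ignores register-file bounds and any SVP64 register-number extension; it only restates "`RT+1` for Scalar, immediately after the `MAXVL`-long Vector otherwise".

```python
def implicit_rs(rt: int, rt_is_vector: bool = False, maxvl: int = 1) -> int:
    """Return the register number of the implicit result RS.

    Scalar: RS is RT+1.
    Vector: RS starts right after a Vector RT of length MAXVL.
    Assumption: no bounds checking against the register file size.
    """
    if rt_is_vector:
        return rt + maxvl
    return rt + 1


if __name__ == "__main__":
    print(implicit_rs(4))                               # scalar RT=4
    print(implicit_rs(8, rt_is_vector=True, maxvl=4))   # vector RT=8
```

The same calculation applies to `FRS` relative to `FRT` for the floating-point instructions.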

Special Registers Altered:

# Twin Butterfly Integer DCT Instruction(s)

## Floating Twin Multiply-Add DCT [Single]

**Add the following to Book I Section 4.6.6.3**

    |0     |6     |11    |16    |21         |31 |
    | PO   | FRT  | FRA  | FRB  | XO        |Rc |

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB that
were used to create FRS, the floating-point operand in register FRB
is subtracted from the floating-point operand in register FRT and the
result is then multiplied by FRA to create the result that is stored
in FRT.

The subtraction and multiply are treated as if they were `fsub`
followed by `fmul`, not `fmsub`. The creation of FRS and FRT is
treated as two parallel, independent operations.
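
The behaviour described above can be sketched as a small Python model. The helper `f32` and the function name `fdmadds` are illustrative; rounding to single precision is emulated with the standard `struct` module, and exceptions, NaN handling and FPSCR effects are not modelled. Both results are computed from the original FRT and FRB values, mirroring the "parallel independent operations" wording.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fdmadds(frt: float, fra: float, frb: float):
    """Return (new_frt, frs) for binary32-representable inputs.

    The subtraction is rounded before the multiply (fsub then fmul,
    not a fused fmsub), as stated in the text.
    """
    frs = f32(frt + frb)        # FRS <- FPADD32(FRT, FRB)
    sub = f32(frt - frb)        # sub <- FPSUB32(FRT, FRB)
    new_frt = f32(fra * sub)    # FRT <- FPMUL32(FRA, sub)
    return new_frt, frs


if __name__ == "__main__":
    print(fdmadds(3.0, 0.5, 1.0))
```

Computing each step in binary64 and then rounding to binary32 gives the same value as a native single-precision add, subtract or multiply for finite, in-range inputs.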

Note that if Rc=1 an Illegal Instruction is raised.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT`, where the length of `FRT` is set by `SVSTATE.MAXVL`.

Special Registers Altered:

## Floating Multiply-Add FFT [Single]

**Add the following to Book I Section 4.6.6.3**

    |0     |6     |11    |16    |21         |31 |
    | PO   | FRT  | FRA  | FRB  | XO        |Rc |

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)

Equivalently:

    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <- [(FRT) * (FRA)] + (FRB)

The floating-point operand in register FRT is multiplied
by the floating-point operand in register FRA. The floating-point
operand in register FRB is added to this intermediate result, and
the result stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create FRT,
the floating-point operand in register FRT is multiplied
by the floating-point operand in register FRA. The floating-point
operand in register FRB is subtracted from this intermediate result,
and the negated result stored in FRS.

FRT is created as if
a `fmadds` operation had been performed. FRS is created as if
a `fnmsubs` operation had simultaneously been performed with
the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.
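
A rough Python model of the two results is sketched below. The names `f32` and `ffmadds` are illustrative, and the model is an approximation: the product of two binary32 values is exact in binary64, but the following add or subtract is rounded to binary64 and then to binary32, so in rare cases the last bit may differ from a true fused `fmadds`/`fnmsubs`. Exceptions and FPSCR effects are not modelled.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def ffmadds(frt: float, fra: float, frb: float):
    """Return (new_frt, frs) for binary32-representable inputs.

    Both results use the original FRT, FRA and FRB values, since FRT
    is read before it is overwritten (Read-Modify-Write).
    """
    prod = frt * fra              # exact in binary64 for binary32 inputs
    new_frt = f32(prod + frb)     # FRT <- [(FRT) * (FRA)] + (FRB)
    frs = f32(-(prod - frb))      # FRS <- -([(FRT) * (FRA)] - (FRB))
    return new_frt, frs


if __name__ == "__main__":
    print(ffmadds(2.0, 3.0, 1.0))
```

Returning both values from one call reflects the requirement that the two operations see identical inputs, rather than the second one observing the updated FRT.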

Note that if Rc=1 an Illegal Instruction is raised.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT`, where the length of `FRT` is set by `SVSTATE.MAXVL`.

Special Registers Altered: