# Introduction

<!-- hide -->
* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[ls016]]

<!-- show -->

# Rationale for Twin Butterfly Integer DCT Instruction(s)

DCT has a huge number of general-purpose uses. Without these
Twin-Butterfly instructions, each butterfly needs **eight**
instructions, and because it is extremely common to explicitly
loop-unroll them, sequences of hundreds to thousands of
instructions are dismayingly common (for all ISAs).

The goal is to implement instructions that calculate the expression:

```
fdct_round_shift((a +/- b) * c)
```

for the single-coefficient butterfly instruction, and:

```
fdct_round_shift(a * c1 +/- b * c2)
```

for the double-coefficient butterfly instruction.

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

```
#define ROUND_POWER_OF_TWO(value, n) \
    (((value) + (1 << ((n)-1))) >> (n))
```
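
As a rough illustration, the rounding shift and the two butterfly expressions above can be modelled in Python. This is only a sketch: the helper names `single_coeff` and `double_coeff` are invented for this example, and Python's unbounded integers stand in for fixed-width registers, so overflow is not modelled.

```python
DCT_CONST_BITS = 14  # the shift amount used by fdct_round_shift above


def round_power_of_two(value, n):
    """Mirror of ROUND_POWER_OF_TWO: add half an LSB, then shift right by n."""
    return (value + (1 << (n - 1))) >> n


def fdct_round_shift(x):
    return round_power_of_two(x, DCT_CONST_BITS)


def single_coeff(a, b, c, subtract=False):
    """fdct_round_shift((a +/- b) * c); hypothetical helper name."""
    return fdct_round_shift(((a - b) if subtract else (a + b)) * c)


def double_coeff(a, b, c1, c2, subtract=False):
    """fdct_round_shift(a * c1 +/- b * c2); hypothetical helper name."""
    prod_b = b * c2
    return fdct_round_shift(a * c1 - prod_b if subtract else a * c1 + prod_b)
```

Because half an LSB is added before the shift, an input of exactly half the divisor (`8192`) rounds up to `1`, while `8191` rounds down to `0`.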

These instructions are at the core of **ALL** FDCT calculations in many major video codecs, including (but not limited to) VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations (`vqrdmulhq_s16`/`vqrdmulhq_s32`), although they are limited in precision.

The suggestion is to have a single instruction that calculates both values, `((a + b) * c) >> N` and `((a - b) * c) >> N`.
The instruction will run in accumulate mode, so to calculate the 2-coefficient version one would just call the same instruction again with `a` and `b` in a different order and a different constant `c`.
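
A minimal sketch of the two values named above, assuming plain truncating shifts exactly as written. The function name `twin_butterfly` is hypothetical, registers are modelled as unbounded Python integers, and accumulate mode is not modelled.

```python
def twin_butterfly(a, b, c, n):
    """Return (((a + b) * c) >> n, ((a - b) * c) >> n) from one call.

    Python's >> on negative integers is an arithmetic (floor) shift,
    which is assumed here to match the intended behaviour.
    """
    return ((a + b) * c) >> n, ((a - b) * c) >> n
```

For example, `twin_butterfly(5, 3, 11585, 14)` produces the sum-side and difference-side results together, which is the pairing the single proposed instruction is meant to provide.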

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

A-Form

```
|0     |6     |11    |16    |21    |26    |31 |
| PO   |  RT  |  RA  |  RB  |  SH  |  XO  |Rc |
```

* maddsubrs RT,RA,SH,RB

Pseudo-code:

```
n <- SH
sum <- (RT) + (RA)
diff <- (RT) - (RA)
prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1]
prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1]
res1 <- ROTL64(prod1, XLEN-n)
res2 <- ROTL64(prod2, XLEN-n)
m <- MASK(n, (XLEN-1))
signbit1 <- res1[0]
signbit2 <- res2[0]
smask1 <- ([signbit1]*XLEN) & ¬m
smask2 <- ([signbit2]*XLEN) & ¬m
s64_1 <- [0]*(XLEN-1) || signbit1
s64_2 <- [0]*(XLEN-1) || signbit2
RT <- (res1 & m | smask1) + s64_1
RS <- (res2 & m | smask2) + s64_2
```

Note that if Rc=1 an Illegal Instruction is raised.
Rc=1 is `RESERVED`.

Similar to `RTp`, this instruction produces an implicit result,
`RS`, which under Scalar circumstances is defined as `RT+1`.
For SVP64, if `RT` is a Vector, `RS` begins immediately after the
Vector `RT`, where the length of `RT` is set by `SVSTATE.MAXVL`
(Max Vector Length).
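
The register-numbering rule just described can be sketched as a small helper. The function name and parameters are hypothetical, and the sketch ignores register-file bounds and any SVP64 addressing rules not stated here.

```python
def implicit_rs(rt, rt_is_vector, maxvl):
    """Register number of the implicit second result RS.

    rt           -- register number of RT
    rt_is_vector -- True when SVP64 marks RT as a Vector
    maxvl        -- SVSTATE.MAXVL, only used in the Vector case
    """
    if rt_is_vector:
        return rt + maxvl  # RS starts right after the MAXVL-long vector RT
    return rt + 1          # Scalar case: RS is the next register
```

With `RT` as register 4, the Scalar case gives `RS` as register 5, while a Vector `RT` with `MAXVL` of 8 places `RS` at register 12.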

Special Registers Altered:

```
None
```

# Twin Butterfly Floating-Point DCT and FFT Instruction(s)

## Floating Twin Multiply-Add DCT [Single]

**Add the following to Book I Section 4.6.6.3**

X-Form

```
|0     |6     |11    |16    |21      |31 |
| PO   |  FRT |  FRA |  FRB |   XO   |Rc |
```

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
FRS <- FPADD32(FRT, FRB)
sub <- FPSUB32(FRT, FRB)
FRT <- FPMUL32(FRA, sub)
```

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB that
were used to create FRS, the floating-point operand in register FRB
is subtracted from the floating-point operand in register FRT and the
result then multiplied by FRA to create an intermediate result that is
stored in FRT.

The subtraction and multiply are treated as if they were `fsub`
followed by `fmul`, not `fmsub`. The creation of FRS and the creation
of FRT are treated as parallel independent operations.
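
The data flow described above can be sketched in Python. This is an illustrative model only: `to_f32` and the Python function `fdmadds` are names made up for this example, the FPSCR status bits are not modelled, and `struct.pack` raises `OverflowError` for values outside the single-precision range instead of producing an infinity.

```python
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fdmadds(frt, fra, frb):
    """Return (new_frt, frs) following the fdmadds pseudo-code above."""
    frs = to_f32(frt + frb)      # FRS <- FPADD32(FRT, FRB)
    sub = to_f32(frt - frb)      # sub <- FPSUB32(FRT, FRB), rounded on its own
    new_frt = to_f32(fra * sub)  # FRT <- FPMUL32(FRA, sub), rounded again
    return new_frt, frs
```

Both results are computed from the original `frt` and `frb` inputs, and the difference is rounded before the multiply, matching the "`fsub` followed by `fmul`" wording rather than a fused operation.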

Note that if Rc=1 an Illegal Instruction is raised.
Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64, if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT`, where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI VXIMZ
```

## Floating Multiply-Add FFT [Single]

**Add the following to Book I Section 4.6.6.3**

X-Form

```
|0     |6     |11    |16    |21      |31 |
| PO   |  FRT |  FRA |  FRB |   XO   |Rc |
```

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
```

The two operations

```
FRS <- -([(FRT) * (FRA)] - (FRB))
FRT <- [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied
by the floating-point operand in register FRA. The floating-point
operand in register FRB is subtracted from this intermediate result,
the difference is negated, and the result is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create FRS,
the floating-point operand in register FRT is multiplied
by the floating-point operand in register FRA. The floating-point
operand in register FRB is added to this intermediate result,
and the result is stored in FRT.

FRT is created as if
an `fmadds` operation had been performed. FRS is created as if
an `fnmsubs` operation had simultaneously been performed with
the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.
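
As a hedged illustration of the two results, the following Python sketch computes both from the same inputs. The names `to_f32` and the Python function `ffmadds` are invented for this example. The sketch works in double precision and rounds to single at the end, so it approximates the fused `fmadds`/`fnmsubs` rounding and is not guaranteed bit-exact in every corner case. FPSCR status bits are not modelled.

```python
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def ffmadds(frt, fra, frb):
    """Return (new_frt, frs) following the two operations listed above."""
    prod = frt * fra                 # shared product of the original FRT and FRA
    new_frt = to_f32(prod + frb)     # FRT <- (FRT * FRA) + FRB, as fmadds
    frs = to_f32(-(prod - frb))      # FRS <- -((FRT * FRA) - FRB), as fnmsubs
    return new_frt, frs
```

Returning both values from one call reflects the Read-Modify-Write nature of FRT: the old value of FRT feeds both results before the new value replaces it.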

Note that if Rc=1 an Illegal Instruction is raised.
Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64, if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT`, where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI VXIMZ
```