# Introduction

<!-- hide -->
* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

<!-- show -->

# Rationale for Twin Butterfly Integer DCT Instruction(s)

The number of general-purpose uses for DCT is huge. The number of
instructions needed in place of one of these Twin-Butterfly instructions
is also large (**eight**), and because it is extremely common to
explicitly loop-unroll them, sequences of hundreds to thousands of
instructions are dismayingly common (for all ISAs).

The goal is to implement instructions that calculate the expression:

```
fdct_round_shift((a +/- b) * c)
```

for the single-coefficient butterfly instruction, and:

```
fdct_round_shift(a * c1 +/- b * c2)
```

for the double-coefficient butterfly instruction.

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

```
#define ROUND_POWER_OF_TWO(value, n) \
    (((value) + (1 << ((n)-1))) >> (n))
```

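The rounding step defined above can be sketched as a small C function.
This is a minimal sketch, assuming 32-bit intermediates; the function
name `fdct_round_shift_sketch` is hypothetical. Note that right-shifting
a negative value is implementation-defined in C, and the sketch relies
on the arithmetic shift that mainstream compilers provide. With an input
of `34755` the function adds `8192` and shifts right by 14, giving `2`.

```c
#include <stdint.h>

#define ROUND_POWER_OF_TWO(value, n) \
    (((value) + (1 << ((n)-1))) >> (n))

/* Hypothetical helper: add half of 2^14 (8192) and then shift right
 * by 14, i.e. ROUND_POWER_OF_TWO(x, 14). */
static int32_t fdct_round_shift_sketch(int32_t x)
{
    return ROUND_POWER_OF_TWO(x, 14);
}
```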
These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including, but not limited to, VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations, available
through the intrinsics `vqrdmulhq_s16`/`vqrdmulhq_s32`, although they are
limited in precision.

The suggestion is to have a single instruction that calculates both
values, `((a + b) * c) >> N` and `((a - b) * c) >> N`. The instruction
will run in accumulate mode, so to calculate the 2-coefficient version
one would just call the same instruction again with `a` and `b` in a
different order and with a different constant `c`.

Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:

```
#include <stdint.h>
#define ROUND_POWER_OF_TWO(value, n) \
    (((value) + (1 << ((n)-1))) >> (n))
void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
    /* sum branch of the butterfly */
    t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
    /* difference branch of the butterfly */
    t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
}
```

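Eight instructions are required, all of which are replaced by a single
`maddsubrs`:

```
add 9,5,4        # r9 = r5 + r4 (sum)
subf 5,5,4       # r5 = r4 - r5 (difference)
mullw 9,9,6      # r9 = sum * r6
mullw 5,5,6      # r5 = difference * r6
addi 9,9,8192    # add rounding constant 1 << 13
addi 5,5,8192
srawi 9,9,14     # arithmetic shift right by 14
srawi 5,5,14
```
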
## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

A-Form

```
|0     |6     |11    |16    |21    |26    |31 |
| PO   |  RT  |  RA  |  RB  |  SH  |  XO  |Rc |
```

* maddsubrs RT,RA,SH,RB

Pseudo-code:

```
n <- SH
sum <- (RT) + (RA)
diff <- (RT) - (RA)
prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1] + 1
prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1] + 1
res1 <- ROTL64(prod1, XLEN-n)
res2 <- ROTL64(prod2, XLEN-n)
m <- MASK(n, (XLEN-1))
signbit1 <- res1[0]
signbit2 <- res2[0]
smask1 <- ([signbit1]*XLEN) & ¬m
smask2 <- ([signbit2]*XLEN) & ¬m
s64_1 <- [0]*(XLEN-1) || signbit1
s64_2 <- [0]*(XLEN-1) || signbit2
RT <- (res1 & m | smask1) + s64_1
RS <- (res2 & m | smask2) + s64_2
```

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
None
```

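The placement of the implicit `RS` result described above for
`maddsubrs` can be sketched as a register-number calculation. This is a
minimal sketch, assuming one element per register and ignoring
element-width overrides and register-file wraparound; the function name
and its parameters are hypothetical and not part of the specification.

```c
#include <stdbool.h>

/* Hypothetical helper: register number holding element `i` of the
 * implicit result RS, given the register number of RT.
 * Scalar RT: RS is RT+1 (the element index is ignored).
 * Vector RT: RS starts immediately after the MAXVL-long vector RT. */
static unsigned implicit_rs_regnum(unsigned rt, bool rt_is_vector,
                                   unsigned maxvl, unsigned i)
{
    if (!rt_is_vector)
        return rt + 1;
    return rt + maxvl + i;
}
```

For example, with a scalar `RT` in register 4 the sketch places `RS` in
register 5, while with a Vector `RT` starting at register 4 and
`MAXVL=8` it places the first element of `RS` in register 12.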
# Twin Butterfly Floating-Point DCT Instruction(s)

## Floating-Point Twin Multiply-Add DCT [Single]

**Add the following to Book I Section 4.6.6.3**

X-Form

```
|0     |6     |11    |16    |21       |31 |
| PO   |  FRT |  FRA |  FRB |   XO    |Rc |
```

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
FRS <- FPADD32(FRT, FRB)
sub <- FPSUB32(FRT, FRB)
FRT <- FPMUL32(FRA, sub)
```

The two IEEE754-FP32 operations

```
FRS <- [(FRT) + (FRB)]
FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result is stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT, and
the result is then multiplied by FRA to create the result that is stored
in FRT.

The add into FRS is treated exactly as `fadds`. The creation of the
result FRT is **not** the same as that of `fmsubs`: the subtraction is
performed first and the difference is then multiplied by FRA, whereas
`fmsubs` multiplies first and then subtracts.
FRS and FRT are created by parallel, independent operations that occur
at the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI VXIMZ
```

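The `fdmadds` pseudo-code above can be modelled in C as follows. This is
a minimal sketch of the arithmetic only: it assumes the host `float`
type is IEEE754 binary32 and does not model FPSCR status bits, rounding
modes or NaN handling. The struct and function names are hypothetical.

```c
/* Result pair of the sketch: the new FRT and the implicit FRS. */
struct fdmadds_result {
    float frt;
    float frs;
};

/* Hypothetical model of fdmadds. Both results are computed from the
 * original input values of frt and frb. */
static struct fdmadds_result fdmadds_sketch(float frt, float fra,
                                            float frb)
{
    struct fdmadds_result out;
    float sub = frt - frb;   /* sub <- FPSUB32(FRT, FRB) */
    out.frs = frt + frb;     /* FRS <- FPADD32(FRT, FRB) */
    out.frt = fra * sub;     /* FRT <- FPMUL32(FRA, sub) */
    return out;
}
```

For example, with FRT=3.0, FRA=0.5 and FRB=1.0 the sketch produces
FRS=4.0 and FRT=1.0.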
## Floating-Point Multiply-Add FFT [Single]

**Add the following to Book I Section 4.6.6.3**

X-Form

```
|0     |6     |11    |16    |21       |31 |
| PO   |  FRT |  FRA |  FRB |   XO    |Rc |
```

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
```

The two operations

```
FRS <- -([(FRT) * (FRA)] - (FRB))
FRT <- [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create
FRS, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is added to this intermediate result, and the
result is stored in FRT.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operand.

Note that if Rc=1 an Illegal Instruction is raised.
Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI VXIMZ
```

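The two `ffmadds` results described above can be modelled in C as
follows. This is a minimal sketch of the arithmetic only: it does not
attempt to reproduce the single rounding of a fused multiply-add, nor
FPSCR status bits or NaN handling, so it may differ from the instruction
in the last bit for inputs whose product is not exactly representable.
The struct and function names are hypothetical.

```c
/* Result pair of the sketch: the new FRT and the implicit FRS. */
struct ffmadds_result {
    float frt;
    float frs;
};

/* Hypothetical model of ffmadds. Both results use the original
 * input values of frt, fra and frb. */
static struct ffmadds_result ffmadds_sketch(float frt, float fra,
                                            float frb)
{
    struct ffmadds_result out;
    float prod = frt * fra;    /* (FRT) * (FRA) */
    out.frs = -(prod - frb);   /* as fnmsubs: -([FRT * FRA] - FRB) */
    out.frt = prod + frb;      /* as fmadds:    [FRT * FRA] + FRB  */
    return out;
}
```

For example, with FRT=2.0, FRA=3.0 and FRB=1.0 the sketch produces
FRS=-5.0 and FRT=7.0.
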
## Floating-Point Twin Multiply-Add DCT

**Add the following to Book I Section 4.6.6.3**

X-Form

```
|0     |6     |11    |16    |21       |31 |
| PO   |  FRT |  FRA |  FRB |   XO    |Rc |
```

* fdmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
FRS <- FPADD64(FRT, FRB)
sub <- FPSUB64(FRT, FRB)
FRT <- FPMUL64(FRA, sub)
```

The two IEEE754-FP64 operations

```
FRS <- [(FRT) + (FRB)]
FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result is stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT, and
the result is then multiplied by FRA to create the result that is stored
in FRT.

The add into FRS is treated exactly as `fadd`. The creation of the
result FRT is **not** the same as that of `fmsub`: the subtraction is
performed first and the difference is then multiplied by FRA, whereas
`fmsub` multiplies first and then subtracts.
FRS and FRT are created by parallel, independent operations that occur
at the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI VXIMZ
```

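To illustrate where the `fdmadd` pairing of a sum with a scaled
difference might be used, the sketch below applies it across one
in-place butterfly stage of an array. The pairing of element `i` with
element `n-1-i`, the coefficient table `coeff` and the function name are
assumptions chosen for illustration and are not taken from the
specification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-place butterfly stage over n doubles (n even).
 * For each pair (a, b) = (x[i], x[n-1-i]):
 *   x[i]     <- a + b               (the FRS result)
 *   x[n-1-i] <- (a - b) * coeff[i]  (the FRT result) */
static void butterfly_stage_sketch(double *x, const double *coeff,
                                   size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n / 2; i++) {
        double a = x[i];           /* plays the role of FRT */
        double b = x[n - 1 - i];   /* plays the role of FRB */
        x[i] = a + b;
        x[n - 1 - i] = (a - b) * coeff[i];   /* coeff[i] is FRA */
    }
}
```

For example, with `x = {1, 2, 3, 4}` and `coeff = {0.5, 2.0}` the sketch
leaves `x = {5, 5, -2, -1.5}`.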
## Floating-Point Twin Multiply-Add FFT

**Add the following to Book I Section 4.6.6.3**

X-Form

```
|0     |6     |11    |16    |21       |31 |
| PO   |  FRT |  FRA |  FRB |   XO    |Rc |
```

* ffmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
```

The two operations

```
FRS <- -([(FRT) * (FRA)] - (FRB))
FRT <- [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create
FRS, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is added to this intermediate result, and the
result is stored in FRT.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operand.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI VXIMZ
```

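The pair of results produced by `ffmadd` lines up with a real-valued
radix-2 butterfly, because `-(b*w - a)` equals `a - w*b`. The sketch
below makes that operand assignment explicit. It is an illustration
under the assumption FRT=b, FRA=w, FRB=a; the names are hypothetical,
and the single rounding of the fused operations is not modelled.

```c
/* Outputs of a real-valued radix-2 butterfly. */
struct fft_pair {
    double sum;    /* a + w*b */
    double diff;   /* a - w*b */
};

/* Hypothetical mapping of the butterfly onto the ffmadd results. */
static struct fft_pair fft_butterfly_sketch(double a, double b,
                                            double w)
{
    struct fft_pair out;
    double frt = b, fra = w, frb = a;   /* assumed operand assignment */
    out.diff = -(frt * fra - frb);      /* FRS <- -([FRT * FRA] - FRB) */
    out.sum = frt * fra + frb;          /* FRT <-   [FRT * FRA] + FRB  */
    return out;
}
```

For example, with a=10, b=3 and w=2 the sketch produces a sum of 16 and
a difference of 4.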
## [DRAFT] Floating-Point Add FFT/DCT [Single]

A-Form

* ffadds FRT,FRA,FRB (Rc=0)
* ffadds. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
FRT <- FPADD32(FRA, FRB)
FRS <- FPSUB32(FRB, FRA)
```

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI
CR1 (if Rc=1)
```

## [DRAFT] Floating-Point Add FFT/DCT [Double]

A-Form

* ffadd FRT,FRA,FRB (Rc=0)
* ffadd. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
FRT <- FPADD64(FRA, FRB)
FRS <- FPSUB64(FRB, FRA)
```

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI
CR1 (if Rc=1)
```

## [DRAFT] Floating-Point Subtract FFT/DCT [Single]

A-Form

* ffsubs FRT,FRA,FRB (Rc=0)
* ffsubs. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
FRT <- FPSUB32(FRB, FRA)
FRS <- FPADD32(FRA, FRB)
```

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI
CR1 (if Rc=1)
```

## [DRAFT] Floating-Point Subtract FFT/DCT [Double]

A-Form

* ffsub FRT,FRA,FRB (Rc=0)
* ffsub. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
FRT <- FPSUB64(FRB, FRA)
FRS <- FPADD64(FRA, FRB)
```

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI
CR1 (if Rc=1)
```
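
The draft `ffadd` and `ffsub` pseudo-code above produces the same two
values and differs only in which one lands in FRT and which in FRS.
The following is a minimal sketch of the double-precision arithmetic,
assuming `FPSUB64(x, y)` means `x - y` as in the earlier sections. It
does not model FPSCR status bits or CR1, and the struct and function
names are hypothetical.

```c
/* Result pair of the sketch: FRT and the implicit FRS. */
struct addsub_pair {
    double frt;
    double frs;
};

/* Hypothetical model of ffadd: sum in FRT, difference in FRS. */
static struct addsub_pair ffadd_sketch(double fra, double frb)
{
    struct addsub_pair out;
    out.frt = fra + frb;   /* FRT <- FPADD64(FRA, FRB) */
    out.frs = frb - fra;   /* FRS <- FPSUB64(FRB, FRA) */
    return out;
}

/* Hypothetical model of ffsub: difference in FRT, sum in FRS. */
static struct addsub_pair ffsub_sketch(double fra, double frb)
{
    struct addsub_pair out;
    out.frt = frb - fra;   /* FRT <- FPSUB64(FRB, FRA) */
    out.frs = fra + frb;   /* FRS <- FPADD64(FRA, FRB) */
    return out;
}
```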