# Introduction

<!-- hide -->
* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

<!-- show -->

# Rationale for Twin Butterfly Integer DCT Instruction(s)

The number of general-purpose uses for DCT is huge. The number of
instructions needed in place of one of these Twin-Butterfly instructions
is also large (**eight**), and because it is extremely common to
explicitly loop-unroll them, sequences of hundreds to thousands of
instructions are dismayingly common (for all ISAs).

The goal is to implement instructions that calculate the expression:

```
fdct_round_shift((a +/- b) * c)
```

for the single-coefficient butterfly instruction, and:

```
fdct_round_shift(a * c1 +/- b * c2)
```

for the double-coefficient butterfly instruction.

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

```
#define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))
```
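
As a rough illustration of the two expressions above, the following C
sketch applies `fdct_round_shift` to both butterfly forms. The function
names `butterfly1` and `butterfly2` are made up for this example and are
not part of the proposal; the sketch also assumes that right-shifting a
negative value is an arithmetic shift, which C leaves
implementation-defined.

```c
#include <stdint.h>

#define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))

/* fdct_round_shift as described in the text: rounding shift by 14 bits. */
int32_t fdct_round_shift(int32_t x) {
    return ROUND_POWER_OF_TWO(x, 14);
}

/* Hypothetical helper: single-coefficient butterfly,
 * fdct_round_shift((a +/- b) * c). */
void butterfly1(int32_t a, int32_t b, int32_t c,
                int32_t *sum_out, int32_t *diff_out) {
    *sum_out  = fdct_round_shift((a + b) * c);
    *diff_out = fdct_round_shift((a - b) * c);
}

/* Hypothetical helper: double-coefficient butterfly,
 * fdct_round_shift(a * c1 +/- b * c2). */
void butterfly2(int32_t a, int32_t b, int32_t c1, int32_t c2,
                int32_t *sum_out, int32_t *diff_out) {
    *sum_out  = fdct_round_shift(a * c1 + b * c2);
    *diff_out = fdct_round_shift(a * c1 - b * c2);
}
```

With the sample inputs a = 100, b = 28 and a 14-bit fixed-point
coefficient c = 11585, `butterfly1` yields 91 and 51; the inputs are
illustrative only.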

These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including (but not limited to) VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations (exposed
as the `vqrdmulhq_s16`/`vqrdmulhq_s32` intrinsics), although they are
limited in precision.

The suggestion is to have a single instruction that calculates both
values, `((a + b) * c) >> N` and `((a - b) * c) >> N`. The instruction
will run in accumulate mode, so to calculate the two-coefficient version
one would just call the same instruction with a and b in a different
order and with a different constant c.

Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:

```
#include <stdint.h>
#define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))
void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
    t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
    t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
}
```

Eight instructions are required, replaced by just the one (`maddsubrs`):

```
add 9,5,4       # r9 = r5 + r4
subf 5,5,4      # r5 = r4 - r5
mullw 9,9,6     # r9 = r9 * r6
mullw 5,5,6     # r5 = r5 * r6
addi 9,9,8192   # add rounding constant 1 << 13
addi 5,5,8192
srawi 9,9,14    # arithmetic shift right by 14
srawi 5,5,14
```

-------

\newpage{}

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

A-Form

```
|0     |6     |11     |16     |21     |26     |31 |
| PO   |  RT  |  RA   |  RB   |  SH   |  XO   |Rc |

```

* maddsubrs RT,RA,SH,RB

Pseudo-code:

```
n <- SH
sum <- (RT) + (RA)
diff <- (RT) - (RA)
prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1] + 1
prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1] + 1
res1 <- ROTL64(prod1, XLEN-n)
res2 <- ROTL64(prod2, XLEN-n)
m <- MASK(n, (XLEN-1))
signbit1 <- res1[0]
signbit2 <- res2[0]
smask1 <- ([signbit1]*XLEN) & ¬m
smask2 <- ([signbit2]*XLEN) & ¬m
s64_1 <- [0]*(XLEN-1) || signbit1
s64_2 <- [0]*(XLEN-1) || signbit2
RT <- (res1 & m | smask1) + s64_1
RS <- (res2 & m | smask2) + s64_2
```

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
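
The register-numbering rule in the paragraph above can be sketched in C
as follows. The helper name `implicit_rs` and its parameters are
hypothetical and exist only for this illustration; the sketch assumes
default element widths and ignores any other SVP64 modes.

```c
/* Hypothetical helper (not part of the specification): returns the
 * register number of the implicit second result RS for a given RT. */
unsigned implicit_rs(unsigned rt, int rt_is_vector, unsigned maxvl) {
    if (rt_is_vector) {
        /* Vector case: RS begins immediately after the Vector RT,
         * whose length is set by SVSTATE.MAXVL. */
        return rt + maxvl;
    }
    /* Scalar case: RS is defined as RT+1. */
    return rt + 1;
}
```

For example, a Scalar `RT` of 4 gives `RS` = 5, while a Vector `RT`
starting at 8 with MAXVL = 4 gives `RS` starting at 12.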

Special Registers Altered:

```
None
```

-------

\newpage{}

# Twin Butterfly Floating-Point DCT Instruction(s)

**Add the following to Book I Section 4.6.6.3**

## Floating-Point Twin Multiply-Add DCT [Single]

X-Form

```
|0     |6     |11     |16     |21      |31 |
| PO   |  FRT |  FRA  |  FRB  |   XO   |Rc |
```

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
FRS <- FPADD32(FRT, FRB)
sub <- FPSUB32(FRT, FRB)
FRT <- FPMUL32(FRA, sub)
```

The two IEEE754-FP32 operations

```
FRS <- [(FRT) + (FRB)]
FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then rounded before being multiplied by FRA to create an
intermediate result that is stored in FRT.

The add into FRS is treated exactly as `fadds`. The creation of the
result FRT is **not** the same as that of `fmsubs`, but is instead as if
`fsubs` were performed first followed by `fmuls`. The creation of FRS
and FRT are treated as parallel independent operations which occur at
the same time.
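
The data flow described above can be modelled loosely in C. This is only
a sketch under stated assumptions: the names `fdmadds_model` and
`struct fdmadds_result` are invented for the example, C `float` is
assumed to be IEEE754 FP32, and no FPSCR status bits are modelled. The
`volatile` temporaries are there to force each intermediate to be
rounded to FP32, mirroring the separate `fsubs` then `fmuls` steps.

```c
/* Hypothetical result pair: frt is the new FRT, frs the implicit FRS. */
struct fdmadds_result {
    float frt;
    float frs;
};

/* Sketch of fdmadds: both results use the original FRT and FRB values. */
struct fdmadds_result fdmadds_model(float frt, float fra, float frb) {
    struct fdmadds_result r;
    volatile float sum  = frt + frb;  /* as fadds */
    volatile float sub  = frt - frb;  /* as fsubs: rounded to FP32 */
    volatile float prod = fra * sub;  /* as fmuls: rounded again */
    r.frs = sum;
    r.frt = prod;
    return r;
}
```

Calling `fdmadds_model(3.0f, 0.5f, 1.0f)` gives `frs` = 4.0 and
`frt` = 1.0, since (3 + 1) = 4 and (3 - 1) * 0.5 = 1.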

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI VXIMZ
```

## Floating-Point Multiply-Add FFT [Single]

X-Form

```
|0     |6     |11     |16     |21      |31 |
| PO   |  FRT |  FRA  |  FRB  |   XO   |Rc |
```

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
```

The two operations

```
FRS <- -([(FRT) * (FRA)] - (FRB))
FRT <- [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create
FRS, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is added to this intermediate result, and the
result is stored in FRT.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.
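
A loose C model of the two formulas may help show that both results are
derived from the same original operand values. The names
`ffmadds_model` and `struct ffmadds_result` are invented for this
sketch, and FPSCR effects are not modelled. Note the assumption: the
product is formed in `double`, where the product of two FP32 values is
exact, and the final sum is then rounded to `double` and again to
`float`. That only approximates the single rounding of a true fused
`fmadds`/`fnmsubs` and can differ in rare cases.

```c
/* Hypothetical result pair: frt is the new FRT, frs the implicit FRS. */
struct ffmadds_result {
    float frt;
    float frs;
};

/* Sketch of ffmadds: FRT is read, then overwritten (Read-Modify-Write). */
struct ffmadds_result ffmadds_model(float frt, float fra, float frb) {
    struct ffmadds_result r;
    /* Exact in double: 24 + 24 significand bits fit within 53. */
    double prod = (double)frt * (double)fra;
    /* FRS <- -([(FRT) * (FRA)] - (FRB)), as fnmsubs (approximated). */
    r.frs = (float)(-(prod - (double)frb));
    /* FRT <- [(FRT) * (FRA)] + (FRB), as fmadds (approximated). */
    r.frt = (float)(prod + (double)frb);
    return r;
}
```

With FRT = 2, FRA = 3 and FRB = 1 the model produces a new FRT of 7 and
an FRS of -5.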

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised.
Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI VXIMZ
```
## Floating-Point Twin Multiply-Add DCT

X-Form

```
|0     |6     |11     |16     |21      |31 |
| PO   |  FRT |  FRA  |  FRB  |   XO   |Rc |
```

* fdmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
FRS <- FPADD64(FRT, FRB)
sub <- FPSUB64(FRT, FRB)
FRT <- FPMUL64(FRA, sub)
```

The two IEEE754-FP64 operations

```
FRS <- [(FRT) + (FRB)]
FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then rounded before being multiplied by FRA to create an
intermediate result that is stored in FRT.

The add into FRS is treated exactly as `fadd`. The creation of the
result FRT is **not** the same as that of `fmsub`, but is instead as if
`fsub` were performed first followed by `fmul`. The creation of FRS
and FRT are treated as parallel independent operations which occur at
the same time.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI VXIMZ
```

## Floating-Point Twin Multiply-Add FFT

X-Form

```
|0     |6     |11     |16     |21      |31 |
| PO   |  FRT |  FRA  |  FRB  |   XO   |Rc |
```

* ffmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
```

The two operations

```
FRS <- -([(FRT) * (FRA)] - (FRB))
FRT <- [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create
FRS, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is added to this intermediate result, and the
result is stored in FRT.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI VXIMZ
```

## [DRAFT] Floating-Point Add FFT/DCT [Single]

A-Form

* ffadds FRT,FRA,FRB (Rc=0)
* ffadds. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
FRT <- FPADD32(FRA, FRB)
FRS <- FPSUB32(FRB, FRA)
```
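
The pseudo-code above can be sketched in C as follows. The names
`ffadds_model` and `struct ffadds_result` are invented for this
illustration; the sketch assumes that `FPSUB32(x, y)` computes `x - y`
(consistent with its use in `fdmadds`), that C `float` is IEEE754 FP32,
and it does not model FPSCR or CR1 updates.

```c
/* Hypothetical result pair: frt is the sum, frs the implicit difference. */
struct ffadds_result {
    float frt;
    float frs;
};

/* Sketch of ffadds: one add and one reversed subtract on the same inputs. */
struct ffadds_result ffadds_model(float fra, float frb) {
    struct ffadds_result r;
    r.frt = fra + frb;  /* FRT <- FPADD32(FRA, FRB) */
    r.frs = frb - fra;  /* FRS <- FPSUB32(FRB, FRA) */
    return r;
}
```

For FRA = 1.5 and FRB = 4.0 the model gives FRT = 5.5 and FRS = 2.5.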

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI
CR1 (if Rc=1)
```

## [DRAFT] Floating-Point Add FFT/DCT [Double]

A-Form

* ffadd FRT,FRA,FRB (Rc=0)
* ffadd. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
FRT <- FPADD64(FRA, FRB)
FRS <- FPSUB64(FRB, FRA)
```

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI
CR1 (if Rc=1)
```

## [DRAFT] Floating-Point Subtract FFT/DCT [Single]

A-Form

* ffsubs FRT,FRA,FRB (Rc=0)
* ffsubs. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
FRT <- FPSUB32(FRB, FRA)
FRS <- FPADD32(FRA, FRB)
```

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI
CR1 (if Rc=1)
```

## [DRAFT] Floating-Point Subtract FFT/DCT [Double]

A-Form

* ffsub FRT,FRA,FRB (Rc=0)
* ffsub. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
FRT <- FPSUB64(FRB, FRA)
FRS <- FPADD64(FRA, FRB)
```

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI
CR1 (if Rc=1)
```