# Introduction

<!-- hide -->
* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]
<!-- show -->

Although best used with SVP64 REMAP, these instructions may also be used in a
Scalar-only context to save considerably on the number of instructions needed
for DCT, DFT and FFT processing.

# Rationale for Twin Butterfly Integer DCT Instruction(s)

The number of general-purpose uses for DCT is huge. The number of
instructions needed in place of one of these Twin-Butterfly instructions
is also large (**eight**), and given that it is extremely common to
explicitly loop-unroll them, sequences of hundreds to thousands of
instructions are dismayingly common (for all ISAs).

The goal is to implement instructions that calculate the expression:

```
fdct_round_shift((a +/- b) * c)
```

for the single-coefficient butterfly instruction, and:

```
fdct_round_shift(a * c1 +/- b * c2)
```

for the double-coefficient butterfly instruction.

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`, where:

```
#define ROUND_POWER_OF_TWO(value, n) \
    (((value) + (1 << ((n)-1))) >> (n))
```
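
The two expressions above can be sketched in C as follows. This is only an
illustration: the helper names `butterfly_single` and `butterfly_double` and
the constant name `DCT_CONST_BITS` are assumptions made for this sketch, and
the code relies on `>>` being an arithmetic shift for negative values, which
is implementation-defined in C.

```c
#include <stdint.h>

/* Rounding right shift: add half of 2^n, then shift right by n. */
#define ROUND_POWER_OF_TWO(value, n) \
    (((value) + (1 << ((n)-1))) >> (n))

/* Assumed name for the fixed shift amount of 14 used by fdct_round_shift. */
#define DCT_CONST_BITS 14

static int32_t fdct_round_shift(int32_t x)
{
    return ROUND_POWER_OF_TWO(x, DCT_CONST_BITS);
}

/* Single-coefficient butterfly: both outputs share one constant c. */
static void butterfly_single(int32_t a, int32_t b, int32_t c,
                             int32_t *sum_out, int32_t *diff_out)
{
    *sum_out  = fdct_round_shift((a + b) * c);
    *diff_out = fdct_round_shift((a - b) * c);
}

/* Double-coefficient butterfly: each input has its own constant. */
static void butterfly_double(int32_t a, int32_t b, int32_t c1, int32_t c2,
                             int32_t *sum_out, int32_t *diff_out)
{
    *sum_out  = fdct_round_shift(a * c1 + b * c2);
    *diff_out = fdct_round_shift(a * c1 - b * c2);
}
```

For example, with `a = 100`, `b = 28` and `c = 11585` (roughly cos(pi/4)
scaled by 2^14), the single-coefficient butterfly yields 91 and 51.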

These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including, but not limited to, VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations, although
they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.

The suggestion is to have a single instruction that calculates both values,
`((a + b) * c) >> N` and `((a - b) * c) >> N`. The instruction will
run in accumulate mode, so in order to calculate the 2-coeff version
one would just have to call the same instruction again with `a` and `b`
in a different order and with a different constant `c`.

Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:

```
#include <stdint.h>
/* rounding right shift: add half of 2^n, then shift right by n */
#define ROUND_POWER_OF_TWO(value, n) \
    (((value) + (1 << ((n)-1))) >> (n))
/* t[0] and t[1] receive the sum and difference butterfly outputs */
void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
    t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
    t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
}
```

Eight instructions are required, which are replaced by just one (`maddsubrs`):

```
add 9,5,4
subf 5,5,4
mullw 9,9,6
mullw 5,5,6
addi 9,9,8192
addi 5,5,8192
srawi 9,9,14
srawi 5,5,14
```

-------

\newpage{}

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

A-Form

```
|0    |6    |11   |16   |21   |26   |31 |
| PO  | RT  | RA  | RB  | SH  | XO  |Rc |
```

* maddsubrs RT,RA,SH,RB

Pseudo-code:

```
n <- SH
sum <- (RT) + (RA)
diff <- (RT) - (RA)
prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1]
prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1]
if n = 0 then
    #round <- EXTS([0]*(XLEN-1) || [1]*1)
    #prod1 <- ROTL64(prod1, 1)
    #prod2 <- ROTL64(prod2, 1)
    #prod1 <- prod1 + round
    #prod2 <- prod2 + round
    #res1 <- ROTL64(prod1, XLEN-1)
    #res2 <- ROTL64(prod2, XLEN-1)
    #m <- MASK(1, (XLEN-1))
    RT <- prod1
    RS <- prod2
else
    round <- EXTS([0]*(XLEN -n -1) || [1]*1 || [0]*(n-1))
    prod1 <- prod1 + round
    prod2 <- prod2 + round
    res1 <- ROTL64(prod1, XLEN-n)
    res2 <- ROTL64(prod2, XLEN-n)
    m <- MASK(n, (XLEN-1))
    signbit1 <- prod1[0]
    signbit2 <- prod2[0]
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    RT <- (res1 & m | smask1)
    RS <- (res2 & m | smask2)
```
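
As a reading aid, the pseudo-code above can be modelled in C for XLEN=64.
This is a sketch under stated assumptions, not a reference implementation:
the names `maddsubrs_model`, `maddsubrs_result` and `asr64` are invented
here, the rotate/mask/sign-mask sequence is collapsed into an equivalent
arithmetic right shift, and the final unsigned-to-signed conversion assumes
the usual two's complement behaviour.

```c
#include <stdint.h>

/* Hypothetical result pair: RT and the implicit RS. */
typedef struct {
    int64_t rt;
    int64_t rs;
} maddsubrs_result;

/* Arithmetic (sign-propagating) right shift of a 64-bit pattern. */
static int64_t asr64(uint64_t v, unsigned n)
{
    uint64_t shifted = v >> n;
    if (n != 0 && (v >> 63) != 0)
        shifted |= ~UINT64_C(0) << (64 - n);   /* refill with sign bits */
    return (int64_t)shifted;   /* assumes two's complement conversion */
}

/* Model of maddsubrs for XLEN=64; sh is the 5-bit SH field (0..31). */
static maddsubrs_result maddsubrs_model(int64_t rt, int64_t ra,
                                        int64_t rb, unsigned sh)
{
    /* Unsigned arithmetic wraps modulo 2^64, like XLEN-bit registers. */
    uint64_t sum   = (uint64_t)rt + (uint64_t)ra;
    uint64_t diff  = (uint64_t)rt - (uint64_t)ra;
    uint64_t prod1 = (uint64_t)rb * sum;    /* low 64 bits of RB * sum  */
    uint64_t prod2 = (uint64_t)rb * diff;   /* low 64 bits of RB * diff */
    maddsubrs_result r;

    if (sh != 0) {
        uint64_t round = UINT64_C(1) << (sh - 1);   /* half of 2^sh */
        prod1 += round;
        prod2 += round;
    }
    r.rt = asr64(prod1, sh);   /* sh = 0 returns the product unchanged */
    r.rs = asr64(prod2, sh);
    return r;
}
```

Calling `maddsubrs_model(100, 28, 11585, 14)` gives `rt = 91` and `rs = 51`,
the same values the eight-instruction sequence produces for those inputs.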

Note that if Rc=1 an Illegal Instruction is raised: Rc=1 is `RESERVED`.

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64, if
`RT` is a Vector, `RS` begins immediately after the Vector `RT`, where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
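
The register-numbering rule for the implicit result can be illustrated with
a one-line helper. The function name `implicit_rs` and its parameters are
hypothetical, and the sketch deliberately ignores register-file bounds and
any SVP64 register-number extension.

```c
/* Hypothetical helper: register number of the implicit RS (or FRS). */
static unsigned implicit_rs(unsigned rt, int rt_is_vector, unsigned maxvl)
{
    /* Scalar: RS is the next register after RT.
     * Vector: RS starts right after the MAXVL-long vector at RT. */
    return rt + (rt_is_vector ? maxvl : 1u);
}
```

With `RT = 4`, the scalar case gives `RS = 5`; with `RT` a Vector and
`MAXVL = 8`, `RS` starts at register 12.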

Special Registers Altered:

```
None
```

-------

\newpage{}

# Twin Butterfly Floating-Point DCT and FFT Instruction(s)

**Add the following to Book I Section 4.6.6.3**

## Floating-Point Twin Multiply-Add DCT [Single]

X-Form

```
|0    |6    |11   |16   |21   |31 |
| PO  | FRT | FRA | FRB | XO  |Rc |
```

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
FRS <- FPADD32(FRT, FRB)
sub <- FPSUB32(FRT, FRB)
FRT <- FPMUL32(FRA, sub)
```

The two IEEE754-FP32 operations

```
FRS <- [(FRT) + (FRB)]
FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result is stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT. The
result is rounded and then multiplied by FRA to create the result that
is stored in FRT.

The add into FRS is treated exactly as `fadds`. The creation of the
result FRT is **not** the same as that of `fmsubs`, but is instead as if
`fsubs` were performed first followed by `fmuls`. The creation of FRS
and FRT are treated as parallel independent operations which occur at
the same time.
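
The dataflow described above can be sketched in C. This is an illustrative
model only: the names `fdmadds_model` and `fdmadds_result` are assumptions,
FPSCR status bits are not modelled, and `volatile` is used simply to force
the subtraction to be rounded to FP32 before the multiply, mirroring the
separate `fsubs` and `fmuls` steps.

```c
/* Hypothetical result pair: FRT and the implicit FRS. */
typedef struct {
    float frt;
    float frs;
} fdmadds_result;

/* Both outputs are computed from the original FRT and FRB values. */
static fdmadds_result fdmadds_model(float frt, float fra, float frb)
{
    fdmadds_result r;
    volatile float sub = frt - frb;   /* like fsubs: rounded to FP32 here */

    r.frs = frt + frb;                /* like fadds */
    r.frt = fra * sub;                /* like fmuls: a second rounding */
    return r;
}
```

For instance, `fdmadds_model(5.0f, 0.5f, 3.0f)` returns `frs = 8.0` and
`frt = 1.0`. The double-precision `fdmadd` follows the same pattern with
`double` in place of `float`.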

Note that if Rc=1 an Illegal Instruction is raised: Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64, if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`,
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI VXIMZ
```

## Floating-Point Multiply-Add FFT [Single]

X-Form

```
|0    |6    |11   |16   |21   |31 |
| PO  | FRT | FRA | FRB | XO  |Rc |
```

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
```

The two operations

```
FRS <- -([(FRT) * (FRA)] - (FRB))
FRT <- [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
FRT, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.
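
A rough C model of this pair of operations is given below. The names
`ffmadds_model` and `ffmadds_result` are assumptions for illustration. The
product of two FP32 values is exact in `double`, so the sketch comes close
to fused behaviour, but the final conversion to `float` is a second rounding
step and may differ from a true fused `fmadds`/`fnmsubs` in rare halfway
cases. FPSCR status bits are not modelled.

```c
/* Hypothetical result pair: FRT and the implicit FRS. */
typedef struct {
    float frt;
    float frs;
} ffmadds_result;

/* Both outputs use the same original FRT, FRA and FRB values. */
static ffmadds_result ffmadds_model(float frt, float fra, float frb)
{
    ffmadds_result r;
    /* Exact in double: two 24-bit significands fit within 53 bits. */
    double prod = (double)frt * (double)fra;

    r.frs = (float)(-(prod - (double)frb));   /* like fnmsubs */
    r.frt = (float)(prod + (double)frb);      /* like fmadds  */
    return r;
}
```

For example, `ffmadds_model(2.0f, 3.0f, 1.0f)` returns `frt = 7.0` and
`frs = -5.0`.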

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised: Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64, if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT`, where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI VXIMZ
```

## Floating-Point Twin Multiply-Add DCT

X-Form

```
|0    |6    |11   |16   |21   |31 |
| PO  | FRT | FRA | FRB | XO  |Rc |
```

* fdmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
FRS <- FPADD64(FRT, FRB)
sub <- FPSUB64(FRT, FRB)
FRT <- FPMUL64(FRA, sub)
```

The two IEEE754-FP64 operations

```
FRS <- [(FRT) + (FRB)]
FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result is stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT. The
result is rounded and then multiplied by FRA to create the result that
is stored in FRT.

The add into FRS is treated exactly as `fadd`. The creation of the
result FRT is **not** the same as that of `fmsub`, but is instead as if
`fsub` were performed first followed by `fmul`. The creation of FRS
and FRT are treated as parallel independent operations which occur at
the same time.

Note that if Rc=1 an Illegal Instruction is raised: Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64, if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`,
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI VXIMZ
```

## Floating-Point Twin Multiply-Add FFT

X-Form

```
|0    |6    |11   |16   |21   |31 |
| PO  | FRT | FRA | FRB | XO  |Rc |
```

* ffmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
```

The two operations

```
FRS <- -([(FRT) * (FRA)] - (FRB))
FRT <- [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
FRT, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised: Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`. For SVP64, if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`,
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI VXIMZ
```

## Floating-Point Add FFT/DCT [Single]

A-Form

```
|0    |6    |11   |16   |21   |26   |31 |
| PO  | FRT | FRA | FRB |  /  | XO  |Rc |
```

* ffadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
FRT <- FPADD32(FRA, FRB)
FRS <- FPSUB32(FRB, FRA)
```

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI
```

## Floating-Point Add FFT/DCT [Double]

A-Form

```
|0    |6    |11   |16   |21   |26   |31 |
| PO  | FRT | FRA | FRB |  /  | XO  |Rc |
```

* ffadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
FRT <- FPADD64(FRA, FRB)
FRS <- FPSUB64(FRB, FRA)
```

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI
```

## Floating-Point Subtract FFT/DCT [Single]

A-Form

```
|0    |6    |11   |16   |21   |26   |31 |
| PO  | FRT | FRA | FRB |  /  | XO  |Rc |
```

* ffsubs FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
FRT <- FPSUB32(FRB, FRA)
FRS <- FPADD32(FRA, FRB)
```

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI
```

## Floating-Point Subtract FFT/DCT [Double]

A-Form

```
|0    |6    |11   |16   |21   |26   |31 |
| PO  | FRT | FRA | FRB |  /  | XO  |Rc |
```

* ffsub FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
FRT <- FPSUB64(FRB, FRA)
FRS <- FPADD64(FRA, FRB)
```

Special Registers Altered:

```
FPRF FR FI
FX OX UX XX
VXSNAN VXISI
```
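
The add and subtract pseudo-code above computes the same two values and
differs only in which destination receives which. A minimal C sketch of the
double-precision pair makes this visible. The names `ff_pair`, `ffadd_model`
and `ffsub_model` are assumptions made for illustration, and FPSCR status
bits are not modelled.

```c
/* Hypothetical result pair: FRT and the implicit FRS. */
typedef struct {
    double frt;
    double frs;
} ff_pair;

/* ffadd: the sum goes to FRT, the difference (FRB - FRA) goes to FRS. */
static ff_pair ffadd_model(double fra, double frb)
{
    ff_pair r;
    r.frt = fra + frb;
    r.frs = frb - fra;
    return r;
}

/* ffsub: the same two values with the destinations swapped. */
static ff_pair ffsub_model(double fra, double frb)
{
    ff_pair r;
    r.frt = frb - fra;
    r.frs = fra + frb;
    return r;
}
```

With `FRA = 1.5` and `FRB = 4.0`, `ffadd_model` returns `frt = 5.5` and
`frs = 2.5`, while `ffsub_model` returns `frt = 2.5` and `frs = 5.5`. The
single-precision forms would use `float` in the same way.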