1 # Introduction
2
3 <!-- hide -->
4 * <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
5 * <https://libre-soc.org/openpower/sv/biginteger/> for format and
6 information about implicit RS/FRS
7 * <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
8 * [[openpower/isa/svfparith]]
9 * [[openpower/isa/svfixedarith]]
10 * [[openpower/sv/rfc/ls016]]
11
12 <!-- show -->
13
14 # Rationale for Twin Butterfly Integer DCT Instruction(s)
15
16 The number of general-purpose uses for DCT is huge. The number of
17 instructions needed in place of one of these Twin-Butterfly instructions
18 is also large (**eight**) and, given that it is extremely common to
19 explicitly loop-unroll them, sequences of hundreds to thousands of
20 instructions are dismayingly common (for all ISAs).
21
22 The goal is to implement instructions that calculate the expression:
23
24 ```
25 fdct_round_shift((a +/- b) * c)
26 ```
27
28 For the single-coefficient butterfly instruction, and:
29
30 ```
31 fdct_round_shift(a * c1 +/- b * c2)
32 ```
33
34 For the double-coefficient butterfly instruction.
35
36 `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`, where:
37
38 ```
39 #define ROUND_POWER_OF_TWO(value, n) \
40     (((value) + (1 << ((n)-1))) >> (n))
41 ```
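
As a quick illustration of the rounding behaviour, the macro can be transcribed into Python. This is only a sketch: the function names mirror the C names above, and it assumes the arithmetic (sign-preserving) right shift that C compilers commonly provide for signed operands, which is what Python's `>>` does on integers.

```python
def round_power_of_two(value: int, n: int) -> int:
    """Transcription of ROUND_POWER_OF_TWO(value, n), for n >= 1."""
    # Adding half of the divisor (1 << (n-1)) before shifting turns the
    # floor performed by the arithmetic shift into round-half-up.
    return (value + (1 << (n - 1))) >> n


def fdct_round_shift(x: int) -> int:
    """ROUND_POWER_OF_TWO(x, 14), as defined above."""
    return round_power_of_two(x, 14)
```

With `n = 14` the rounding term is `8192`, so an input of exactly `8192` (one half in this fixed-point format) rounds up to `1`, while `8191` rounds down to `0`.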
42
43 These instructions are at the core of **ALL** FDCT calculations in many
44 major video codecs, including, but not limited to, VP8/VP9 and AV1.
45 Arm includes special instructions to optimize these operations, although
46 they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.
47
48 The suggestion is to have a single instruction that calculates both values,
49 `((a + b) * c) >> N` and `((a - b) * c) >> N`. The instruction will
50 run in accumulate mode, so to calculate the two-coefficient version
51 one would just call the same instruction with `a` and `b` in a different
52 order and with a different constant `c`.
53
54 Example taken from libvpx
55 <https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:
56
57 ```
58 #include <stdint.h>
59 #define ROUND_POWER_OF_TWO(value, n) \
60     (((value) + (1 << ((n)-1))) >> (n))
61 void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
62     t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
63     t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
64 }
65 ```
66
67 Eight instructions are required, which `maddsubrs` replaces with just one:
68
69 ```
70 add 9,5,4
71 subf 5,5,4
72 mullw 9,9,6
73 mullw 5,5,6
74 addi 9,9,8192
75 addi 5,5,8192
76 srawi 9,9,14
77 srawi 5,5,14
78 ```
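
The eight-instruction sequence can be stepped through in Python to see where each intermediate value lives. This is a rough sketch, not a simulator: the mapping `x0` = r4, `x1` = r5, coefficient = r6 is inferred from the sequence itself, the helper name `sraw` is hypothetical, and intermediate values are assumed to fit in 32 bits.

```python
def sraw(value: int, shift: int) -> int:
    """Arithmetic right shift of a value treated as a signed 32-bit word."""
    value &= 0xFFFFFFFF
    if value & 0x80000000:
        value -= 1 << 32
    return value >> shift


def twin_butterfly_8insn(x0: int, x1: int, c: int):
    """Step through the eight instructions; returns the final (r9, r5)."""
    r4, r5, r6 = x0, x1, c   # assumed register assignment
    r9 = r5 + r4             # add   9,5,4
    r5 = r4 - r5             # subf  5,5,4   (subf computes RB - RA)
    r9 = r9 * r6             # mullw 9,9,6
    r5 = r5 * r6             # mullw 5,5,6
    r9 = r9 + 8192           # addi  9,9,8192  (rounding term, 1 << 13)
    r5 = r5 + 8192           # addi  5,5,8192
    r9 = sraw(r9, 14)        # srawi 9,9,14
    r5 = sraw(r5, 14)        # srawi 5,5,14
    return r9, r5
```

For example, `twin_butterfly_8insn(100, 50, 11585)`, where 11585 is roughly cos(pi/4) scaled by 2^14, yields the sum and difference outputs in `r9` and `r5` respectively, matching `t[0]` and `t[1]` of the C function before truncation to `int16_t`.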
79
80 -------
81
82 \newpage{}
83
84 ## Integer Butterfly Multiply Add/Sub FFT/DCT
85
86 **Add the following to Book I Section 3.3.9.1**
87
88 A-Form
89
90 ```
91 |0 |6 |11 |16 |21 |26 |31 |
92 | PO | RT | RA | RB | SH | XO |Rc |
93
94 ```
95
96 * maddsubrs RT,RA,SH,RB
97
98 Pseudo-code:
99
100 ```
101 n <- SH
102 sum <- (RT) + (RA)
103 diff <- (RT) - (RA)
104 prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1]
105 prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1]
106 if n = 0 then
107     #round <- EXTS([0]*(XLEN-1) || [1]*1)
108     #prod1 <- ROTL64(prod1, 1)
109     #prod2 <- ROTL64(prod2, 1)
110     #prod1 <- prod1 + round
111     #prod2 <- prod2 + round
112     #res1 <- ROTL64(prod1, XLEN-1)
113     #res2 <- ROTL64(prod2, XLEN-1)
114     #m <- MASK(1, (XLEN-1))
115     RT <- prod1
116     RS <- prod2
117 else
118     round <- EXTS([0]*(XLEN -n -1) || [1]*1 || [0]*(n-1))
119     prod1 <- prod1 + round
120     prod2 <- prod2 + round
121     res1 <- ROTL64(prod1, XLEN-n)
122     res2 <- ROTL64(prod2, XLEN-n)
123     m <- MASK(n, (XLEN-1))
124     signbit1 <- prod1[0]
125     signbit2 <- prod2[0]
126     smask1 <- ([signbit1]*XLEN) & ¬m
127     smask2 <- ([signbit2]*XLEN) & ¬m
128     RT <- (res1 & m | smask1)
129     RS <- (res2 & m | smask2)
130 ```
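
The pseudo-code can be modelled in a few lines of Python, which may help when checking expected values by hand. This is an unofficial sketch under stated assumptions: register contents are passed in as integers rather than read from a register file, the function name and argument order simply follow the assembler syntax, and the rotate, mask and sign-fill steps are collapsed into the arithmetic right shift they amount to.

```python
def maddsubrs(rt: int, ra: int, sh: int, rb: int, xlen: int = 64):
    """Model of the pseudo-code; returns new (RT, RS) as unsigned values."""
    mask = (1 << xlen) - 1

    def to_signed(v: int) -> int:
        v &= mask
        return v - (1 << xlen) if v >> (xlen - 1) else v

    # Low XLEN bits of the two products. Wrap-around in the sum or
    # difference does not change these bits.
    prod1 = (rb * (rt + ra)) & mask
    prod2 = (rb * (rt - ra)) & mask
    if sh != 0:
        # round is a single 1 bit with (n-1) zero bits below it
        rnd = 1 << (sh - 1)
        prod1 = (prod1 + rnd) & mask
        prod2 = (prod2 + rnd) & mask
    # For SH=0 the products are returned unrounded and unshifted.
    res1 = (to_signed(prod1) >> sh) & mask
    res2 = (to_signed(prod2) >> sh) & mask
    return res1, res2
```

Calling `maddsubrs(100, 50, 14, 11585)` reproduces the two outputs of the libvpx `twin_int` example; negative results come back in two's complement form, since the model returns raw XLEN-bit register contents.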
131
132 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.
133
134 Similar to `RTp`, this instruction produces an implicit result, `RS`,
135 which under Scalar circumstances is defined as `RT+1`. For SVP64 if
136 `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
137 the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
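
The rule for locating the implicit result can be written down as a small helper. This is a sketch only: the function name and parameters are hypothetical, and it deliberately ignores register-file bounds and any SVP64 register-number extension, covering just the two cases described in the paragraph above.

```python
def implicit_rs(rt: int, rt_is_vector: bool = False, maxvl: int = 1) -> int:
    """Register number of the implicit second result RS."""
    if rt_is_vector:
        # RS begins immediately after the MAXVL-long vector starting at RT.
        return rt + maxvl
    # Scalar case: RS is defined as RT+1.
    return rt + 1
```

For instance, a scalar `RT` of 4 gives `RS` = 5, while a vector `RT` starting at 8 with `MAXVL` = 4 gives `RS` starting at 12.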
138
139 Special Registers Altered:
140
141 ```
142 None
143 ```
144
145 -------
146
147 \newpage{}
148
149 # Twin Butterfly Floating-Point DCT Instruction(s)
150
151 **Add the following to Book I Section 4.6.6.3**
152
153 ## Floating-Point Twin Multiply-Add DCT [Single]
154
155 X-Form
156
157 ```
158 |0 |6 |11 |16 |21 |31 |
159 | PO | FRT | FRA | FRB | XO |Rc |
160 ```
161
162 * fdmadds FRT,FRA,FRB (Rc=0)
163
164 Pseudo-code:
165
166 ```
167 FRS <- FPADD32(FRT, FRB)
168 sub <- FPSUB32(FRT, FRB)
169 FRT <- FPMUL32(FRA, sub)
170 ```
171
172 The two IEEE754-FP32 operations
173
174 ```
175 FRS <- [(FRT) + (FRB)]
176 FRT <- [(FRT) - (FRB)] * (FRA)
177 ```
178
179 are simultaneously performed.
180
181 The Floating-Point operand in register FRT is added to the floating-point
182 operand in register FRB and the result stored in FRS.
183
184 Using the exact same operand input register values from FRT and FRB
185 that were used to create FRS, the Floating-Point operand in register
186 FRB is subtracted from the floating-point operand in register FRT and
187 the result then rounded before being multiplied by FRA to create an
188 intermediate result that is stored in FRT.
189
190 The add into FRS is treated exactly as `fadds`. The creation of the
191 result FRT is **not** the same as that of `fmsubs`, but is instead as if
192 `fsubs` were performed first followed by `fmuls`. The creation of FRS
193 and FRT are treated as parallel independent operations which occur at
194 the same time.
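
The difference between "subtract, round, then multiply" and a fused operation is easiest to see with a small model. The following is a sketch under several assumptions: inputs are already representable in FP32, rounding is round-to-nearest-even, and exceptions, NaNs and FPSCR updates are ignored. `round_fp32` is a hypothetical helper that uses Python's `struct` module to round a double to single precision.

```python
import struct


def round_fp32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fdmadds(frt: float, fra: float, frb: float):
    """Model of the pseudo-code; returns the new (FRT, FRS)."""
    frs = round_fp32(frt + frb)      # as fadds
    sub = round_fp32(frt - frb)      # as fsubs: rounded before the multiply
    new_frt = round_fp32(fra * sub)  # as fmuls on the rounded difference
    return new_frt, frs
```

Both outputs are computed from the original `frt` and `frb`, which mirrors the requirement that the two results are independent of each other. With inputs such as `fdmadds(1.0, 2.5, 3 * 2.0**-26)` the intermediate rounding of the difference changes the final `FRT` compared with a single-rounding evaluation of `(FRT - FRB) * FRA`.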
195
196 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.
197
198 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
199 which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
200 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
201 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
202
203 Special Registers Altered:
204
205 ```
206 FPRF FR FI
207 FX OX UX XX
208 VXSNAN VXISI VXIMZ
209 ```
210
211 ## Floating-Point Twin Multiply-Add FFT [Single]
212
213 X-Form
214
215 ```
216 |0 |6 |11 |16 |21 |31 |
217 | PO | FRT | FRA | FRB | XO |Rc |
218 ```
219
220 * ffmadds FRT,FRA,FRB (Rc=0)
221
222 Pseudo-code:
223
224 ```
225 FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
226 FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
227 ```
228
229 The two operations
230
231 ```
232 FRS <- -([(FRT) * (FRA)] - (FRB))
233 FRT <- [(FRT) * (FRA)] + (FRB)
234 ```
235
236 are performed.
237
238 The floating-point operand in register FRT is multiplied by the
239 floating-point operand in register FRA. The floating-point operand in
240 register FRB is added to this intermediate result, and the result
241 stored in FRT.
242
243 Using the exact same values of FRT, FRA and FRB as used to create
244 FRT, the floating-point operand in register FRT is multiplied by the
245 floating-point operand in register FRA. The floating-point operand
246 in register FRB is subtracted from this intermediate result, and the
247 negated result stored in FRS.
248
249 FRT is created as if a `fmadds` operation had been performed. FRS is
250 created as if a `fnmsubs` operation had simultaneously been performed
251 with the exact same register operands, in parallel, independently,
252 at exactly the same time.
253
254 FRT is a Read-Modify-Write operation.
255
256 Note that if Rc=1 an Illegal Instruction is raised.
257 Rc=1 is `RESERVED`.
258
259 Similar to `FRTp`, this instruction produces an implicit result,
260 `FRS`, which under Scalar circumstances is defined as `FRT+1`.
261 For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
262 Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
263 (Max Vector Length).
264
265
266 Special Registers Altered:
267
268 ```
269 FPRF FR FI
270 FX OX UX XX
271 VXSNAN VXISI VXIMZ
272 ```
273 ## Floating-Point Twin Multiply-Add DCT
274
275 X-Form
276
277 ```
278 |0 |6 |11 |16 |21 |31 |
279 | PO | FRT | FRA | FRB | XO |Rc |
280 ```
281
282 * fdmadd FRT,FRA,FRB (Rc=0)
283
284 Pseudo-code:
285
286 ```
287 FRS <- FPADD64(FRT, FRB)
288 sub <- FPSUB64(FRT, FRB)
289 FRT <- FPMUL64(FRA, sub)
290 ```
291
292 The two IEEE754-FP64 operations
293
294 ```
295 FRS <- [(FRT) + (FRB)]
296 FRT <- [(FRT) - (FRB)] * (FRA)
297 ```
298
299 are simultaneously performed.
300
301 The Floating-Point operand in register FRT is added to the floating-point
302 operand in register FRB and the result stored in FRS.
303
304 Using the exact same operand input register values from FRT and FRB
305 that were used to create FRS, the Floating-Point operand in register
306 FRB is subtracted from the floating-point operand in register FRT and
307 the result then rounded before being multiplied by FRA to create an
308 intermediate result that is stored in FRT.
309
310 The add into FRS is treated exactly as `fadd`. The creation of the
311 result FRT is **not** the same as that of `fmsub`, but is instead as if
312 `fsub` were performed first followed by `fmul`. The creation of FRS
313 and FRT are treated as parallel independent operations which occur at
314 the same time.
315
316 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.
317
318 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
319 which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
320 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
321 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
322
323 Special Registers Altered:
324
325 ```
326 FPRF FR FI
327 FX OX UX XX
328 VXSNAN VXISI VXIMZ
329 ```
330
331 ## Floating-Point Twin Multiply-Add FFT
332
333 X-Form
334
335 ```
336 |0 |6 |11 |16 |21 |31 |
337 | PO | FRT | FRA | FRB | XO |Rc |
338 ```
339
340 * ffmadd FRT,FRA,FRB (Rc=0)
341
342 Pseudo-code:
343
344 ```
345 FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
346 FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
347 ```
348
349 The two operations
350
351 ```
352 FRS <- -([(FRT) * (FRA)] - (FRB))
353 FRT <- [(FRT) * (FRA)] + (FRB)
354 ```
355
356 are performed.
357
358 The floating-point operand in register FRT is multiplied by the
359 floating-point operand in register FRA. The floating-point operand in
360 register FRB is added to this intermediate result, and the result
361 stored in FRT.
362
363 Using the exact same values of FRT, FRA and FRB as used to create
364 FRT, the floating-point operand in register FRT is multiplied by the
365 floating-point operand in register FRA. The floating-point operand
366 in register FRB is subtracted from this intermediate result, and the
367 negated result stored in FRS.
368
369 FRT is created as if a `fmadd` operation had been performed. FRS is
370 created as if a `fnmsub` operation had simultaneously been performed
371 with the exact same register operands, in parallel, independently,
372 at exactly the same time.
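
A small model of the two fused operations may make the FFT butterfly behaviour concrete. This is a sketch under stated assumptions: it handles finite inputs only, ignores signed zeros, NaNs, exceptions and FPSCR updates, and uses exact rational arithmetic (`fractions.Fraction`) to obtain a single rounding to FP64. The helper name `fma64` is hypothetical.

```python
from fractions import Fraction


def fma64(a: float, b: float, c: float) -> float:
    """a*b + c with a single rounding to binary64 (finite inputs only)."""
    # The product and sum are exact as rationals; converting the final
    # Fraction to float performs the one and only rounding.
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def ffmadd(frt: float, fra: float, frb: float):
    """Model of the pseudo-code; returns the new (FRT, FRS)."""
    new_frt = fma64(frt, fra, frb)   # as fmadd:   (FRT * FRA) + FRB
    frs = -fma64(frt, fra, -frb)     # as fnmsub: -((FRT * FRA) - FRB)
    return new_frt, frs
```

Both results are derived from the same three input values, so the overwrite of `FRT` cannot influence `FRS`. Inputs such as `ffmadd(1 + 2.0**-52, 1 - 2.0**-52, -1.0)` show the effect of fusing: the tiny residue of the product survives in `FRT`, whereas a separately rounded multiply followed by an add would lose it.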
373
374 FRT is a Read-Modify-Write operation.
375
376 Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`.
377
378 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
379 which under Scalar circumstances is defined as `FRT+1`. For SVP64 if
380 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
381 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
382
383 Special Registers Altered:
384
385 ```
386 FPRF FR FI
387 FX OX UX XX
388 VXSNAN VXISI VXIMZ
389 ```
390
391
392 ## [DRAFT] Floating-Point Add FFT/DCT [Single]
393
394 A-Form
395
396 * ffadds FRT,FRA,FRB (Rc=0)
397 * ffadds. FRT,FRA,FRB (Rc=1)
398
399 Pseudo-code:
400
401 ```
402 FRT <- FPADD32(FRA, FRB)
403 FRS <- FPSUB32(FRB, FRA)
404 ```
405
406 Special Registers Altered:
407
408 ```
409 FPRF FR FI
410 FX OX UX XX
411 VXSNAN VXISI
412 CR1 (if Rc=1)
413 ```
414
415 ## [DRAFT] Floating-Point Add FFT/DCT [Double]
416
417 A-Form
418
419 * ffadd FRT,FRA,FRB (Rc=0)
420 * ffadd. FRT,FRA,FRB (Rc=1)
421
422 Pseudo-code:
423
424 ```
425 FRT <- FPADD64(FRA, FRB)
426 FRS <- FPSUB64(FRB, FRA)
427 ```
428
429 Special Registers Altered:
430
431 ```
432 FPRF FR FI
433 FX OX UX XX
434 VXSNAN VXISI
435 CR1 (if Rc=1)
436 ```
437
438 ## [DRAFT] Floating-Point Subtract FFT/DCT [Single]
439
440 A-Form
441
442 * ffsubs FRT,FRA,FRB (Rc=0)
443 * ffsubs. FRT,FRA,FRB (Rc=1)
444
445 Pseudo-code:
446
447 ```
448 FRT <- FPSUB32(FRB, FRA)
449 FRS <- FPADD32(FRA, FRB)
450 ```
451
452 Special Registers Altered:
453
454 ```
455 FPRF FR FI
456 FX OX UX XX
457 VXSNAN VXISI
458 CR1 (if Rc=1)
459 ```
460
461 ## [DRAFT] Floating-Point Subtract FFT/DCT [Double]
462
463 A-Form
464
465 * ffsub FRT,FRA,FRB (Rc=0)
466 * ffsub. FRT,FRA,FRB (Rc=1)
467
468 Pseudo-code:
469
470 ```
471 FRT <- FPSUB64(FRB, FRA)
472 FRS <- FPADD64(FRA, FRB)
473 ```
474
475 Special Registers Altered:
476
477 ```
478 FPRF FR FI
479 FX OX UX XX
480 VXSNAN VXISI
481 CR1 (if Rc=1)
482 ```