openpower/sv/twin_butterfly.mdwn

   1 # Introduction
   2
   3 <!-- hide -->
   4 * <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
   5 * <https://libre-soc.org/openpower/sv/biginteger/> for format and
   6   information about implicit RS/FRS
   7 * <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
   8 * [[openpower/isa/svfparith]]
   9 * [[openpower/isa/svfixedarith]]
  10 * [[openpower/sv/rfc/ls016]]
  11 <!-- show -->
  12
  13 Although best used with SVP64 REMAP these instructions may be used in a Scalar-only
  14 context to save considerably on DCT, DFT and FFT processing.  Whilst some hardware
  15 implementations may not necessarily implement them efficiently (slower Micro-coding)
  16 savings still come from the reduction in temporary registers as well as instruction
  17 count.
  18
  19 # Rationale for Twin Butterfly Integer DCT Instruction(s)
  20
  21 The number of general-purpose uses for DCT is huge. The number of
  22 instructions needed instead of these Twin-Butterfly instructions is also
  23 huge (**eight**) and given that it is extremely common to explicitly
  24 loop-unroll them quantity hundreds to thousands of instructions are
  25 dismayingly common (for all ISAs).
  26
  27 The goal is to implement instructions that calculate the expression:
  28
  29 ```
  30     fdct_round_shift((a +/- b) * c)
  31 ```
  32
  33 For the single-coefficient butterfly instruction, and:
  34
  35 ```
  36     fdct_round_shift(a * c1  +/- b * c2)
  37 ```
  38
  39 For the double-coefficient butterfly instruction.
  40
  41 In a 32-bit context `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`
  42
  43 ```
  44     #define ROUND_POWER_OF_TWO(value, n) \
  45             (((value) + (1 << ((n)-1))) >> (n))
  46 ```
  47
  48 These instructions are at the core of **ALL** FDCT calculations in many
  49 major video codecs, including -but not limited to- VP8/VP9, AV1, etc.
  50 ARM includes special instructions to optimize these operations, although
  51 they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.
  52
  53 The suggestion is to have a single instruction to calculate both values
  54 `((a + b) * c) >> N`, and `((a - b) * c) >> N`.  The instruction will
  55 run in accumulate mode, so in order to calculate the 2-coeff version
  56 one would just have to call the same instruction with different order a,
  57 b and a different constant c.
  58
  59 Example taken from libvpx
  60 <https://chromium.googlesource.com/webm/libvpx/+/refs/tags/v1.13.0/vpx_dsp/fwd_txfm.c#132>:
  61
  62 ```
  63     #include <stdint.h>
  64     #define ROUND_POWER_OF_TWO(value, n) \
  65             (((value) + (1 << ((n)-1))) >> (n))
  66     void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
  67         t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
  68         t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
  69     }
  70 ```
  71
  72 8 instructions are required  - replaced by just the one (maddsubrs):
  73
  74 ```
  75     add 9,5,4
  76     subf 5,5,4
  77     mullw 9,9,6
  78     mullw 5,5,6
  79     addi 9,9,8192
  80     addi 5,5,8192
  81     srawi 9,9,14
  82     srawi 5,5,14
  83 ```
  84
  85 -------
  86
  87 \newpage{}
  88
  89 ## Integer Butterfly Multiply Add/Sub FFT/DCT
  90
  91 **Add the following to Book I Section 3.3.9.1**
  92
  93 A-Form
  94
  95 ```
  96     |0     |6     |11      |16     |21      |26    |31 |
  97     | PO   |  RT  |   RA   |   RB  |   SH   |   XO |Rc |
  98 ```
  99
 100 * maddsubrs  RT,RA,RB,SH
 101
 102 Pseudo-code:
 103
 104 ```
 105     n <- SH
 106     sum <- (RT[0] || RT) + (RA[0] || RA)
 107     diff <- (RT[0] || RT) - (RA[0] || RA)
 108     prod1 <- MULS(RB, sum)
 109     prod2 <- MULS(RB, diff)
 110     if n = 0 then
 111         prod1_lo <- prod1[XLEN+1:(XLEN*2)]
 112         prod2_lo <- prod2[XLEN+1:(XLEN*2)]
 113         RT <- prod1_lo
 114         RS <- prod2_lo
 115     else
 116         round <- [0]*(XLEN*2 + 1)
 117         round[XLEN*2 - n + 1] <- 1
 118         prod1 <- prod1 + round
 119         prod2 <- prod2 + round
 120         res1 <- prod1[XLEN - n + 1:XLEN*2 - n]
 121         res2 <- prod2[XLEN - n + 1:XLEN*2 - n]
 122         RT <- res1
 123         RS <- res2
 124 ```
 125
 126 Similar to `RTp`, this instruction produces an implicit result, `RS`,
 127 which under Scalar circumstances is defined as `RT+1`.  For SVP64 if
 128 `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
 129 the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 130
 131 Special Registers Altered:
 132
 133 ```
 134     None
 135 ```
 136
 137 # [DRAFT] Integer Butterfly Multiply Add and Round Shift FFT/DCT
 138
 139 A-Form
 140
 141 * maddrs  RT,RA,RB,SH
 142
 143 Pseudo-code:
 144
 145 ```
 146     n <- SH
 147     prod <- MULS(RB, RA)
 148     if n = 0 then
 149         prod_lo <- prod[XLEN:(XLEN*2) - 1]
 150         RT <- (RT) + prod_lo
 151     else
 152         res[0:XLEN*2-1] <- (EXTSXL((RT)[0], 1) || (RT)) + prod
 153         round <- [0]*XLEN*2
 154         round[XLEN*2 - n] <- 1
 155         res <- res + round
 156         RT <- res[XLEN - n:XLEN*2 - n -1]
 157 ```
 158
 159 Special Registers Altered:
 160
 161     None
 162
 163 # [DRAFT] Integer Butterfly Multiply Sub and Round Shift FFT/DCT
 164
 165 A-Form
 166
 167 * msubrs  RT,RA,RB,SH
 168
 169 Pseudo-code:
 170
 171 ```
 172     n <- SH
 173     prod <- MULS(RB, RA)
 174     if n = 0 then
 175         prod_lo <- prod[XLEN:(XLEN*2) - 1]
 176         RT <- (RT) - prod_lo
 177     else
 178         res[0:XLEN*2-1] <- (EXTSXL((RT)[0], 1) || (RT)) - prod
 179         round <- [0]*XLEN*2
 180         round[XLEN*2 - n] <- 1
 181         res <- res + round
 182         RT <- res[XLEN - n:XLEN*2 - n -1]
 183 ```
 184
 185 Special Registers Altered:
 186
 187     None
 188
 189
 190 This pair of instructions is supposed to be used in complement to the maddsubrs
 191 to produce the double-coefficient butterfly instruction. In order for that
 192 to work, instead of passing c2 as coefficient, we have to pass c2-c1 instead.
 193
 194 In essence, we are calculating the quantity `a * c1 +/- b * c1` first, with
 195 `maddsubrs` *without* shifting (so `SH=0`) and then we add/sub `b * (c2-c1)`
 196 from the previous `RT`, and *then* do the shifting.
 197
 198 In the following example, assume `a` in `R1`, `b` in `R10`, `c1` in `R11` and `c2 - c1` in `R12`.
 199 The first instruction will put `a * c1 + b * c1` in `R1` (`RT`), `a * c1 - b * c1` in `RS`
 200 (here, `RS = RT +1`, so `R2`).
 201 Then, `maddrs` will add `b * (c2 - c1)` to `R1` (`RT`), and `msubrs` will subtract it from `R2` (`RS`), and then
 202 round shift right both quantities 14 bits:
 203
 204 ```
 205     maddsubrs 1,10,0,11
 206     maddrs 1,10,12,14
 207     msubrs 2,10,12,14
 208 ```
 209
 210 In scalar code, that would take ~16 instructions for both operations.
 211
 212 -------
 213
 214 \newpage{}
 215
 216 # Twin Butterfly Floating-Point DCT and FFT Instruction(s)
 217
 218 **Add the following to Book I Section 4.6.6.3**
 219
 220 ## Floating-Point Twin Multiply-Add DCT [Single]
 221
 222 X-Form
 223
 224 ```
 225     |0     |6     |11      |16     |21      |31 |
 226     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 227 ```
 228
 229 * fdmadds FRT,FRA,FRB (Rc=0)
 230
 231 Pseudo-code:
 232
 233 ```
 234     FRS <- FPADD32(FRT, FRB)
 235     sub <- FPSUB32(FRT, FRB)
 236     FRT <- FPMUL32(FRA, sub)
 237 ```
 238
 239 The two IEEE754-FP32 operations
 240
 241 ```
 242     FRS <- [(FRT) + (FRB)]
 243     FRT <- [(FRT) - (FRB)] * (FRA)
 244 ```
 245
 246 are simultaneously performed.
 247
 248 The Floating-Point operand in register FRT is added to the floating-point
 249 operand in register FRB and the result stored in FRS.
 250
 251 Using the exact same operand input register values from FRT and FRB
 252 that were used to create FRS, the Floating-Point operand in register
 253 FRB is subtracted from the floating-point operand in register FRT and
 254 the result then rounded before being multiplied by FRA to create an
 255 intermediate result that is stored in FRT.
 256
 257 The add into FRS is treated exactly as `fadds`.  The creation of the
 258 result FRT is **not** the same as that of `fmsubs`, but is instead as if
 259 `fsubs` were performed first followed by `fmuls`.  The creation of FRS
 260 and FRT are treated as parallel independent operations which occur at
 261 the same time.
 262
 263 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
 264
 265 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
 266 which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
 267 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
 268 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 269
 270 Special Registers Altered:
 271
 272 ```
 273     FPRF FR FI
 274     FX OX UX XX
 275     VXSNAN VXISI VXIMZ
 276 ```
 277
 278 ## Floating-Point Multiply-Add FFT [Single]
 279
 280 X-Form
 281
 282 ```
 283     |0     |6     |11      |16     |21      |31 |
 284     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 285 ```
 286
 287 * ffmadds FRT,FRA,FRB (Rc=0)
 288
 289 Pseudo-code:
 290
 291 ```
 292     FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
 293     FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
 294 ```
 295
 296 The two operations
 297
 298 ```
 299     FRS <- -([(FRT) * (FRA)] - (FRB))
 300     FRT <-   [(FRT) * (FRA)] + (FRB)
 301 ```
 302
 303 are performed.
 304
 305 The floating-point operand in register FRT is multiplied by the
 306 floating-point operand in register FRA. The floating-point operand in
 307 register FRB is added to this intermediate result, and the intermediate
 308 stored in FRS.
 309
 310 Using the exact same values of FRT, FRT and FRB as used to create
 311 FRS, the floating-point operand in register FRT is multiplied by the
 312 floating-point operand in register FRA. The floating-point operand
 313 in register FRB is subtracted from this intermediate result, and the
 314 intermediate stored in FRT.
 315
 316 FRT is created as if a `fmadds` operation had been performed. FRS is
 317 created as if a `fnmsubs` operation had simultaneously been performed
 318 with the exact same register operands, in parallel, independently,
 319 at exactly the same time.
 320
 321 FRT is a Read-Modify-Write operation.
 322
 323 Note that if Rc=1 an Illegal Instruction is raised.
 324 Rc=1 is `RESERVED`
 325
 326 Similar to `FRTp`, this instruction produces an implicit result,
 327 `FRS`, which under Scalar circumstances is defined as `FRT+1`.
 328 For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
 329 Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
 330 (Max Vector Length).
 331
 332 Special Registers Altered:
 333
 334 ```
 335     FPRF FR FI
 336     FX OX UX XX
 337     VXSNAN VXISI VXIMZ
 338 ```
 339
 340 ## Floating-Point Twin Multiply-Add DCT
 341
 342 X-Form
 343
 344 ```
 345     |0     |6     |11      |16     |21      |31 |
 346     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 347 ```
 348
 349 * fdmadd FRT,FRA,FRB (Rc=0)
 350
 351 Pseudo-code:
 352
 353 ```
 354     FRS <- FPADD64(FRT, FRB)
 355     sub <- FPSUB64(FRT, FRB)
 356     FRT <- FPMUL64(FRA, sub)
 357 ```
 358
 359 The two IEEE754-FP64 operations
 360
 361 ```
 362     FRS <- [(FRT) + (FRB)]
 363     FRT <- [(FRT) - (FRB)] * (FRA)
 364 ```
 365
 366 are simultaneously performed.
 367
 368 The Floating-Point operand in register FRT is added to the floating-point
 369 operand in register FRB and the result stored in FRS.
 370
 371 Using the exact same operand input register values from FRT and FRB
 372 that were used to create FRS, the Floating-Point operand in register
 373 FRB is subtracted from the floating-point operand in register FRT and
 374 the result then rounded before being multiplied by FRA to create an
 375 intermediate result that is stored in FRT.
 376
 377 The add into FRS is treated exactly as `fadd`.  The creation of the
 378 result FRT is **not** the same as that of `fmsub`, but is instead as if
 379 `fsub` were performed first followed by `fmuls.  The creation of FRS
 380 and FRT are treated as parallel independent operations which occur at
 381 the same time.
 382
 383 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
 384
 385 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
 386 which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
 387 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
 388 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 389
 390 Special Registers Altered:
 391
 392 ```
 393     FPRF FR FI
 394     FX OX UX XX
 395     VXSNAN VXISI VXIMZ
 396 ```
 397
 398 ## Floating-Point Twin Multiply-Add FFT
 399
 400 X-Form
 401
 402 ```
 403     |0     |6     |11      |16     |21      |31 |
 404     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 405 ```
 406
 407 * ffmadd FRT,FRA,FRB (Rc=0)
 408
 409 Pseudo-code:
 410
 411 ```
 412     FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
 413     FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
 414 ```
 415
 416 The two operations
 417
 418 ```
 419     FRS <- -([(FRT) * (FRA)] - (FRB))
 420     FRT <-   [(FRT) * (FRA)] + (FRB)
 421 ```
 422
 423 are performed.
 424
 425 The floating-point operand in register FRT is multiplied by the
 426 floating-point operand in register FRA. The float- ing-point operand in
 427 register FRB is added to this intermediate result, and the intermediate
 428 stored in FRS.
 429
 430 Using the exact same values of FRT, FRT and FRB as used to create
 431 FRS, the floating-point operand in register FRT is multiplied by the
 432 floating-point operand in register FRA. The float- ing-point operand
 433 in register FRB is subtracted from this intermediate result, and the
 434 intermediate stored in FRT.
 435
 436 FRT is created as if a `fmadd` operation had been performed. FRS is
 437 created as if a `fnmsub` operation had simultaneously been performed
 438 with the exact same register operands, in parallel, independently,
 439 at exactly the same time.
 440
 441 FRT is a Read-Modify-Write operation.
 442
 443 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
 444
 445 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
 446 which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
 447 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
 448 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 449
 450 Special Registers Altered:
 451
 452 ```
 453     FPRF FR FI
 454     FX OX UX XX
 455     VXSNAN VXISI VXIMZ
 456 ```
 457
 458
 459 ## Floating-Point Add FFT/DCT [Single]
 460
 461 A-Form
 462
 463 ```
 464     |0     |6     |11      |16     |21      |26    |31 |
 465     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 466 ```
 467
 468 * ffadds FRT,FRA,FRB (Rc=0)
 469
 470 Pseudo-code:
 471
 472 ```
 473     FRT <- FPADD32(FRA, FRB)
 474     FRS <- FPSUB32(FRB, FRA)
 475 ```
 476
 477 Special Registers Altered:
 478
 479 ```
 480     FPRF FR FI
 481     FX OX UX XX
 482     VXSNAN VXISI
 483 ```
 484
 485 ## Floating-Point Add FFT/DCT [Double]
 486
 487 A-Form
 488
 489 ```
 490     |0     |6     |11      |16     |21      |26    |31 |
 491     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 492 ```
 493
 494 * ffadd FRT,FRA,FRB (Rc=0)
 495
 496 Pseudo-code:
 497
 498 ```
 499     FRT <- FPADD64(FRA, FRB)
 500     FRS <- FPSUB64(FRB, FRA)
 501 ```
 502
 503 Special Registers Altered:
 504
 505 ```
 506     FPRF FR FI
 507     FX OX UX XX
 508     VXSNAN VXISI
 509 ```
 510
 511 ## Floating-Point Subtract FFT/DCT [Single]
 512
 513 A-Form
 514
 515 ```
 516     |0     |6     |11      |16     |21      |26    |31 |
 517     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 518 ```
 519
 520 * ffsubs FRT,FRA,FRB (Rc=0)
 521
 522 Pseudo-code:
 523
 524 ```
 525     FRT <- FPSUB32(FRB, FRA)
 526     FRS <- FPADD32(FRA, FRB)
 527 ```
 528
 529 Special Registers Altered:
 530
 531 ```
 532     FPRF FR FI
 533     FX OX UX XX
 534     VXSNAN VXISI
 535 ```
 536
 537 ## Floating-Point Subtract FFT/DCT [Double]
 538
 539 A-Form
 540
 541 ```
 542     |0     |6     |11      |16     |21      |26    |31 |
 543     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 544 ```
 545
 546 * ffsub FRT,FRA,FRB (Rc=0)
 547
 548 Pseudo-code:
 549
 550 ```
 551     FRT <- FPSUB64(FRB, FRA)
 552     FRS <- FPADD64(FRA, FRB)
 553 ```
 554
 555 Special Registers Altered:
 556
 557 ```
 558     FPRF FR FI
 559     FX OX UX XX
 560     VXSNAN VXISI
 561 ```