openpower/sv/twin_butterfly.mdwn

   1 # Introduction
   2
   3 <!-- hide -->
   4 * <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
   5 * <https://libre-soc.org/openpower/sv/biginteger/> for format and
   6   information about implicit RS/FRS
   7 * <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
   8 * [[openpower/isa/svfparith]]
   9 * [[openpower/isa/svfixedarith]]
  10 * [[openpower/sv/rfc/ls016]]
  11 <!-- show -->
  12
  13 Although best used with SVP64 REMAP these instructions may be used in a Scalar-only
  14 context to save considerably on DCT, DFT and FFT processing.  Whilst some hardware
  15 implementations may not necessarily implement them efficiently (slower Micro-coding)
  16 savings still come from the reduction in temporary registers as well as instruction
  17 count.
  18
  19 # Rationale for Twin Butterfly Integer DCT Instruction(s)
  20
  21 The number of general-purpose uses for DCT is huge. The number of
  22 instructions needed instead of these Twin-Butterfly instructions is also
  23 huge (**eight**) and given that it is extremely common to explicitly
  24 loop-unroll them quantity hundreds to thousands of instructions are
  25 dismayingly common (for all ISAs).
  26
  27 The goal is to implement instructions that calculate the expression:
  28
  29 ```
  30     fdct_round_shift((a +/- b) * c)
  31 ```
  32
  33 For the single-coefficient butterfly instruction, and:
  34
  35 ```
  36     fdct_round_shift(a * c1  +/- b * c2)
  37 ```
  38
  39 For the double-coefficient butterfly instruction.
  40
  41 In a 32-bit context `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`
  42
  43 ```
  44     #define ROUND_POWER_OF_TWO(value, n) \
  45             (((value) + (1 << ((n)-1))) >> (n))
  46 ```
  47
  48 These instructions are at the core of **ALL** FDCT calculations in many
  49 major video codecs, including -but not limited to- VP8/VP9, AV1, etc.
  50 ARM includes special instructions to optimize these operations, although
  51 they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.
  52
  53 The suggestion is to have a single instruction to calculate both values
  54 `((a + b) * c) >> N`, and `((a - b) * c) >> N`.  The instruction will
  55 run in accumulate mode, so in order to calculate the 2-coeff version
  56 one would just have to call the same instruction with different order a,
  57 b and a different constant c.
  58
  59 Example taken from libvpx
  60 <https://chromium.googlesource.com/webm/libvpx/+/refs/tags/v1.13.0/vpx_dsp/fwd_txfm.c#132>:
  61
  62 ```
  63     #include <stdint.h>
  64     #define ROUND_POWER_OF_TWO(value, n) \
  65             (((value) + (1 << ((n)-1))) >> (n))
  66     void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
  67         t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
  68         t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
  69     }
  70 ```
  71
  72 8 instructions are required  - replaced by just the one (maddsubrs):
  73
  74 ```
  75     add 9,5,4
  76     subf 5,5,4
  77     mullw 9,9,6
  78     mullw 5,5,6
  79     addi 9,9,8192
  80     addi 5,5,8192
  81     srawi 9,9,14
  82     srawi 5,5,14
  83 ```
  84
  85 -------
  86
  87 \newpage{}
  88
  89 ## Integer Butterfly Multiply Add/Sub FFT/DCT
  90
  91 **Add the following to Book I Section 3.3.9.1**
  92
  93 A-Form
  94
  95 ```
  96     |0     |6     |11      |16     |21      |26    |31 |
  97     | PO   |  RT  |   RA   |   RB  |   SH   |   XO |Rc |
  98 ```
  99
 100 * maddsubrs  RT,RA,SH,RB
 101
 102 Pseudo-code:
 103
 104 ```
 105     n <- SH
 106     sum <- (RT) + (RA)
 107     diff <- (RT) - (RA)
 108     prod1 <- MULS(RB, sum)
 109     prod2 <- MULS(RB, diff)
 110     if n = 0 then
 111         prod1_lo <- prod1[XLEN:(XLEN*2) - 1]
 112         prod2_lo <- prod2[XLEN:(XLEN*2) - 1]
 113         RT <- prod1_lo
 114         RS <- prod2_lo
 115     else
 116         round <- [0]*(XLEN*2)
 117         round[XLEN*2 - n] <- 1
 118         prod1 <- prod1 + round
 119         prod2 <- prod2 + round
 120         m <- MASK(XLEN - n - 2, XLEN - 1)
 121         res1 <- prod1[XLEN - n:XLEN*2 - n - 1]
 122         res2 <- prod2[XLEN - n:XLEN*2 - n - 1]
 123         signbit1 <- prod1[0]
 124         signbit2 <- prod2[0]
 125         smask1 <- ([signbit1]*XLEN) & ¬m
 126         smask2 <- ([signbit2]*XLEN) & ¬m
 127         RT <- (res1 | smask1)
 128         RS <- (res2 | smask2)
 129 ```
 130
 131 Similar to `RTp`, this instruction produces an implicit result, `RS`,
 132 which under Scalar circumstances is defined as `RT+1`.  For SVP64 if
 133 `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
 134 the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 135
 136 Special Registers Altered:
 137
 138 ```
 139     None
 140 ```
 141
 142 # [DRAFT] Integer Butterfly Multiply Add/Sub and Accumulate FFT/DCT
 143
 144 A-Form
 145
 146 * maddrs  RT,RA,SH,RB
 147
 148 Pseudo-code:
 149
 150     n <- SH
 151     prod <- MULS(RB, RA)
 152     if n = 0 then
 153         prod_lo <- prod[XLEN:(XLEN*2) - 1]
 154         RT <- (RT) + prod_lo
 155     n <- SH
 156     sum <- (RT) + (RA)
 157     diff <- (RT) - (RA)
 158     prod1 <- MULS(RB, sum)
 159     prod1_lo <- prod1[XLEN:(XLEN*2)-1]
 160     prod2 <- MULS(RB, diff)
 161     prod2_lo <- prod2[XLEN:(XLEN*2)-1]
 162     if n = 0 then
 163         RT <- prod1_lo
 164         RS <- prod2_lo
 165     else
 166         round <- [0]*XLEN
 167         round[XLEN -n] <- 1
 168         prod1_lo <- prod1_lo + round
 169         prod2_lo <- prod2_lo + round
 170         m <- MASK(n, (XLEN-1))
 171         res1 <- ROTL64(prod1_lo, XLEN-n) & m
 172         res2 <- ROTL64(prod2_lo, XLEN-n) & m
 173         signbit1 <- prod1_lo[0]
 174         signbit2 <- prod2_lo[0]
 175         smask1 <- ([signbit1]*XLEN) & ¬m
 176         smask2 <- ([signbit2]*XLEN) & ¬m
 177         RT <- (res1 | smask1)
 178         RS <- (res2 | smask2)
 179
 180 Special Registers Altered:
 181
 182     None
 183
 184 Similar to `RTp`, this instruction produces an implicit result, `RS`,
 185 which under Scalar circumstances is defined as `RT+1`.  For SVP64 if
 186 `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
 187 the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 188
 189 This instruction is supposed to be used in complement to the maddsubrs
 190 to produce the double-coefficient butterfly instruction. In order for that
 191 to work, instead of passing c2 as coefficient, we have to pass c2-c1 instead.
 192
 193 In essence, we are calculating the quantity `a * c1 +/- b * c1` first, with
 194 `maddsubrs` *without* shifting (so `SH=0`) and then we add/sub `b * (c2-c1)`
 195 from the previous `RT`/`RS`, and *then* do the shifting.
 196
 197 In the following example, assume `a` in `R1`, `b` in `R10`, `c1` in `R11` and `c2 - c1` in `R12`.
 198 The first instruction will put `a * c1 + b * c1` in `R1` (`RT`), `a * c1 - b * c1` in `RS`
 199 (here, `RS = RT +1`, so `R2`).
 200 Then, `maddrs` will add `b * (c2 - c1)` to `R1` (`RT`), and subtract it from `R2` (`RS`), and then
 201 round shift right both quantities 14 bits:
 202
 203 ```
 204     maddsubrs 1,10,0,11
 205     maddrs 1,10,14,12
 206 ```
 207
 208 In scalar code, that would take ~16 instructions for both operations.
 209
 210 -------
 211
 212 \newpage{}
 213
 214 # Twin Butterfly Floating-Point DCT and FFT Instruction(s)
 215
 216 **Add the following to Book I Section 4.6.6.3**
 217
 218 ## Floating-Point Twin Multiply-Add DCT [Single]
 219
 220 X-Form
 221
 222 ```
 223     |0     |6     |11      |16     |21      |31 |
 224     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 225 ```
 226
 227 * fdmadds FRT,FRA,FRB (Rc=0)
 228
 229 Pseudo-code:
 230
 231 ```
 232     FRS <- FPADD32(FRT, FRB)
 233     sub <- FPSUB32(FRT, FRB)
 234     FRT <- FPMUL32(FRA, sub)
 235 ```
 236
 237 The two IEEE754-FP32 operations
 238
 239 ```
 240     FRS <- [(FRT) + (FRB)]
 241     FRT <- [(FRT) - (FRB)] * (FRA)
 242 ```
 243
 244 are simultaneously performed.
 245
 246 The Floating-Point operand in register FRT is added to the floating-point
 247 operand in register FRB and the result stored in FRS.
 248
 249 Using the exact same operand input register values from FRT and FRB
 250 that were used to create FRS, the Floating-Point operand in register
 251 FRB is subtracted from the floating-point operand in register FRT and
 252 the result then rounded before being multiplied by FRA to create an
 253 intermediate result that is stored in FRT.
 254
 255 The add into FRS is treated exactly as `fadds`.  The creation of the
 256 result FRT is **not** the same as that of `fmsubs`, but is instead as if
 257 `fsubs` were performed first followed by `fmuls`.  The creation of FRS
 258 and FRT are treated as parallel independent operations which occur at
 259 the same time.
 260
 261 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
 262
 263 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
 264 which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
 265 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
 266 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 267
 268 Special Registers Altered:
 269
 270 ```
 271     FPRF FR FI
 272     FX OX UX XX
 273     VXSNAN VXISI VXIMZ
 274 ```
 275
 276 ## Floating-Point Multiply-Add FFT [Single]
 277
 278 X-Form
 279
 280 ```
 281     |0     |6     |11      |16     |21      |31 |
 282     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 283 ```
 284
 285 * ffmadds FRT,FRA,FRB (Rc=0)
 286
 287 Pseudo-code:
 288
 289 ```
 290     FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
 291     FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
 292 ```
 293
 294 The two operations
 295
 296 ```
 297     FRS <- -([(FRT) * (FRA)] - (FRB))
 298     FRT <-   [(FRT) * (FRA)] + (FRB)
 299 ```
 300
 301 are performed.
 302
 303 The floating-point operand in register FRT is multiplied by the
 304 floating-point operand in register FRA. The floating-point operand in
 305 register FRB is added to this intermediate result, and the intermediate
 306 stored in FRS.
 307
 308 Using the exact same values of FRT, FRT and FRB as used to create
 309 FRS, the floating-point operand in register FRT is multiplied by the
 310 floating-point operand in register FRA. The floating-point operand
 311 in register FRB is subtracted from this intermediate result, and the
 312 intermediate stored in FRT.
 313
 314 FRT is created as if a `fmadds` operation had been performed. FRS is
 315 created as if a `fnmsubs` operation had simultaneously been performed
 316 with the exact same register operands, in parallel, independently,
 317 at exactly the same time.
 318
 319 FRT is a Read-Modify-Write operation.
 320
 321 Note that if Rc=1 an Illegal Instruction is raised.
 322 Rc=1 is `RESERVED`
 323
 324 Similar to `FRTp`, this instruction produces an implicit result,
 325 `FRS`, which under Scalar circumstances is defined as `FRT+1`.
 326 For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
 327 Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
 328 (Max Vector Length).
 329
 330 Special Registers Altered:
 331
 332 ```
 333     FPRF FR FI
 334     FX OX UX XX
 335     VXSNAN VXISI VXIMZ
 336 ```
 337
 338 ## Floating-Point Twin Multiply-Add DCT
 339
 340 X-Form
 341
 342 ```
 343     |0     |6     |11      |16     |21      |31 |
 344     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 345 ```
 346
 347 * fdmadd FRT,FRA,FRB (Rc=0)
 348
 349 Pseudo-code:
 350
 351 ```
 352     FRS <- FPADD64(FRT, FRB)
 353     sub <- FPSUB64(FRT, FRB)
 354     FRT <- FPMUL64(FRA, sub)
 355 ```
 356
 357 The two IEEE754-FP64 operations
 358
 359 ```
 360     FRS <- [(FRT) + (FRB)]
 361     FRT <- [(FRT) - (FRB)] * (FRA)
 362 ```
 363
 364 are simultaneously performed.
 365
 366 The Floating-Point operand in register FRT is added to the floating-point
 367 operand in register FRB and the result stored in FRS.
 368
 369 Using the exact same operand input register values from FRT and FRB
 370 that were used to create FRS, the Floating-Point operand in register
 371 FRB is subtracted from the floating-point operand in register FRT and
 372 the result then rounded before being multiplied by FRA to create an
 373 intermediate result that is stored in FRT.
 374
 375 The add into FRS is treated exactly as `fadd`.  The creation of the
 376 result FRT is **not** the same as that of `fmsub`, but is instead as if
 377 `fsub` were performed first followed by `fmuls.  The creation of FRS
 378 and FRT are treated as parallel independent operations which occur at
 379 the same time.
 380
 381 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
 382
 383 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
 384 which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
 385 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
 386 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 387
 388 Special Registers Altered:
 389
 390 ```
 391     FPRF FR FI
 392     FX OX UX XX
 393     VXSNAN VXISI VXIMZ
 394 ```
 395
 396 ## Floating-Point Twin Multiply-Add FFT
 397
 398 X-Form
 399
 400 ```
 401     |0     |6     |11      |16     |21      |31 |
 402     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 403 ```
 404
 405 * ffmadd FRT,FRA,FRB (Rc=0)
 406
 407 Pseudo-code:
 408
 409 ```
 410     FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
 411     FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
 412 ```
 413
 414 The two operations
 415
 416 ```
 417     FRS <- -([(FRT) * (FRA)] - (FRB))
 418     FRT <-   [(FRT) * (FRA)] + (FRB)
 419 ```
 420
 421 are performed.
 422
 423 The floating-point operand in register FRT is multiplied by the
 424 floating-point operand in register FRA. The float- ing-point operand in
 425 register FRB is added to this intermediate result, and the intermediate
 426 stored in FRS.
 427
 428 Using the exact same values of FRT, FRT and FRB as used to create
 429 FRS, the floating-point operand in register FRT is multiplied by the
 430 floating-point operand in register FRA. The float- ing-point operand
 431 in register FRB is subtracted from this intermediate result, and the
 432 intermediate stored in FRT.
 433
 434 FRT is created as if a `fmadd` operation had been performed. FRS is
 435 created as if a `fnmsub` operation had simultaneously been performed
 436 with the exact same register operands, in parallel, independently,
 437 at exactly the same time.
 438
 439 FRT is a Read-Modify-Write operation.
 440
 441 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
 442
 443 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
 444 which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
 445 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
 446 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 447
 448 Special Registers Altered:
 449
 450 ```
 451     FPRF FR FI
 452     FX OX UX XX
 453     VXSNAN VXISI VXIMZ
 454 ```
 455
 456
 457 ## Floating-Point Add FFT/DCT [Single]
 458
 459 A-Form
 460
 461 ```
 462     |0     |6     |11      |16     |21      |26    |31 |
 463     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 464 ```
 465
 466 * ffadds FRT,FRA,FRB (Rc=0)
 467
 468 Pseudo-code:
 469
 470 ```
 471     FRT <- FPADD32(FRA, FRB)
 472     FRS <- FPSUB32(FRB, FRA)
 473 ```
 474
 475 Special Registers Altered:
 476
 477 ```
 478     FPRF FR FI
 479     FX OX UX XX
 480     VXSNAN VXISI
 481 ```
 482
 483 ## Floating-Point Add FFT/DCT [Double]
 484
 485 A-Form
 486
 487 ```
 488     |0     |6     |11      |16     |21      |26    |31 |
 489     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 490 ```
 491
 492 * ffadd FRT,FRA,FRB (Rc=0)
 493
 494 Pseudo-code:
 495
 496 ```
 497     FRT <- FPADD64(FRA, FRB)
 498     FRS <- FPSUB64(FRB, FRA)
 499 ```
 500
 501 Special Registers Altered:
 502
 503 ```
 504     FPRF FR FI
 505     FX OX UX XX
 506     VXSNAN VXISI
 507 ```
 508
 509 ## Floating-Point Subtract FFT/DCT [Single]
 510
 511 A-Form
 512
 513 ```
 514     |0     |6     |11      |16     |21      |26    |31 |
 515     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 516 ```
 517
 518 * ffsubs FRT,FRA,FRB (Rc=0)
 519
 520 Pseudo-code:
 521
 522 ```
 523     FRT <- FPSUB32(FRB, FRA)
 524     FRS <- FPADD32(FRA, FRB)
 525 ```
 526
 527 Special Registers Altered:
 528
 529 ```
 530     FPRF FR FI
 531     FX OX UX XX
 532     VXSNAN VXISI
 533 ```
 534
 535 ## Floating-Point Subtract FFT/DCT [Double]
 536
 537 A-Form
 538
 539 ```
 540     |0     |6     |11      |16     |21      |26    |31 |
 541     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 542 ```
 543
 544 * ffsub FRT,FRA,FRB (Rc=0)
 545
 546 Pseudo-code:
 547
 548 ```
 549     FRT <- FPSUB64(FRB, FRA)
 550     FRS <- FPADD64(FRA, FRB)
 551 ```
 552
 553 Special Registers Altered:
 554
 555 ```
 556     FPRF FR FI
 557     FX OX UX XX
 558     VXSNAN VXISI
 559 ```