openpower/sv/twin_butterfly.mdwn

   1 # Introduction
   2
   3 <!-- hide -->
   4 * <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
   5 * <https://libre-soc.org/openpower/sv/biginteger/> for format and
   6   information about implicit RS/FRS
   7 * <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
   8 * [[openpower/isa/svfparith]]
   9 * [[openpower/isa/svfixedarith]]
  10 * [[openpower/sv/rfc/ls016]]
  11 <!-- show -->
  12
  13 Although best used with SVP64 REMAP these instructions may be used in a Scalar-only
  14 context to save considerably on DCT, DFT and FFT processing.  Whilst some hardware
  15 implementations may not necessarily implement them efficiently (slower Micro-coding)
  16 savings still come from the reduction in temporary registers as well as instruction
  17 count.
  18
  19 # Rationale for Twin Butterfly Integer DCT Instruction(s)
  20
  21 The number of general-purpose uses for DCT is huge. The number of
  22 instructions needed instead of these Twin-Butterfly instructions is also
  23 huge (**eight**) and given that it is extremely common to explicitly
  24 loop-unroll them quantity hundreds to thousands of instructions are
  25 dismayingly common (for all ISAs).
  26
  27 The goal is to implement instructions that calculate the expression:
  28
  29 ```
  30     fdct_round_shift((a +/- b) * c)
  31 ```
  32
  33 For the single-coefficient butterfly instruction, and:
  34
  35 ```
  36     fdct_round_shift(a * c1  +/- b * c2)
  37 ```
  38
  39 For the double-coefficient butterfly instruction.
  40
  41 In a 32-bit context `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`
  42
  43 ```
  44     #define ROUND_POWER_OF_TWO(value, n) \
  45             (((value) + (1 << ((n)-1))) >> (n))
  46 ```
  47
  48 These instructions are at the core of **ALL** FDCT calculations in many
  49 major video codecs, including -but not limited to- VP8/VP9, AV1, etc.
  50 ARM includes special instructions to optimize these operations, although
  51 they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.
  52
  53 The suggestion is to have a single instruction to calculate both values
  54 `((a + b) * c) >> N`, and `((a - b) * c) >> N`.  The instruction will
  55 run in accumulate mode, so in order to calculate the 2-coeff version
  56 one would just have to call the same instruction with different order a,
  57 b and a different constant c.
  58
  59 Example taken from libvpx
  60 <https://chromium.googlesource.com/webm/libvpx/+/refs/tags/v1.13.0/vpx_dsp/fwd_txfm.c#132>:
  61
  62 ```
  63     #include <stdint.h>
  64     #define ROUND_POWER_OF_TWO(value, n) \
  65             (((value) + (1 << ((n)-1))) >> (n))
  66     void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
  67         t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
  68         t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
  69     }
  70 ```
  71
  72 8 instructions are required  - replaced by just the one (maddsubrs):
  73
  74 ```
  75     add 9,5,4
  76     subf 5,5,4
  77     mullw 9,9,6
  78     mullw 5,5,6
  79     addi 9,9,8192
  80     addi 5,5,8192
  81     srawi 9,9,14
  82     srawi 5,5,14
  83 ```
  84
  85 -------
  86
  87 \newpage{}
  88
  89 ## Integer Butterfly Multiply Add/Sub FFT/DCT
  90
  91 **Add the following to Book I Section 3.3.9.1**
  92
  93 A-Form
  94
  95 ```
  96     |0     |6     |11      |16     |21      |26    |31 |
  97     | PO   |  RT  |   RA   |   RB  |   SH   |   XO |Rc |
  98 ```
  99
 100 * maddsubrs  RT,RA,SH,RB
 101
 102 Pseudo-code:
 103
 104 ```
 105     n <- SH
 106     sum <- (RT) + (RA)
 107     diff <- (RT) - (RA)
 108     prod1 <- MULS(RB, sum)
 109     prod2 <- MULS(RB, diff)
 110     if n = 0 then
 111         prod1_lo <- prod1[XLEN:(XLEN*2) - 1]
 112         prod2_lo <- prod2[XLEN:(XLEN*2) - 1]
 113         RT <- prod1_lo
 114         RS <- prod2_lo
 115     else
 116         round <- [0]*(XLEN*2)
 117         round[XLEN*2 - n] <- 1
 118         prod1 <- prod1 + round
 119         prod2 <- prod2 + round
 120         m <- MASK(XLEN - n - 2, XLEN - 1)
 121         res1 <- prod1[XLEN - n:XLEN*2 - n - 1]
 122         res2 <- prod2[XLEN - n:XLEN*2 - n - 1]
 123         signbit1 <- prod1[0]
 124         signbit2 <- prod2[0]
 125         smask1 <- ([signbit1]*XLEN) & ¬m
 126         smask2 <- ([signbit2]*XLEN) & ¬m
 127         RT <- (res1 | smask1)
 128         RS <- (res2 | smask2)
 129 ```
 130
 131 Similar to `RTp`, this instruction produces an implicit result, `RS`,
 132 which under Scalar circumstances is defined as `RT+1`.  For SVP64 if
 133 `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
 134 the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 135
 136 Special Registers Altered:
 137
 138 ```
 139     None
 140 ```
 141
 142 # [DRAFT] Integer Butterfly Multiply Add/Sub and Accumulate FFT/DCT
 143
 144 A-Form
 145
 146 * maddrs  RT,RA,SH,RB
 147
 148 Pseudo-code:
 149
 150 ```
 151     n <- SH
 152     prod <- MULS(RB, RA)
 153     if n = 0 then
 154         prod_lo <- prod[XLEN:(XLEN*2) - 1]
 155         RT <- (RT) + prod_lo
 156         RS <- (RS) - prod_lo
 157     else
 158         res1[0:XLEN*2-1] <- (EXTSXL((RT)[0], 1) || (RT)) + prod
 159         res2[0:XLEN*2-1] <- (EXTSXL((RS)[0], 1) || (RS)) - prod
 160         round <- [0]*XLEN*2
 161         round[XLEN*2 - n] <- 1
 162         res1 <- res1 + round
 163         res2 <- res2 + round
 164         signbit1 <- res1[0]
 165         signbit2 <- res2[0]
 166         m <- MASK(XLEN -n - 2, XLEN - 1)
 167         res1 <- res1[XLEN - n:XLEN*2 - n -1]
 168         res2 <- res2[XLEN - n:XLEN*2 - n -1]
 169         smask1 <- ([signbit1]*XLEN) & ¬m
 170         smask2 <- ([signbit2]*XLEN) & ¬m
 171         RT <- (res1 | smask1)
 172         RS <- (res2 | smask2)
 173 ```
 174
 175 Special Registers Altered:
 176
 177     None
 178
 179 Similar to `RTp`, this instruction produces an implicit result, `RS`,
 180 which under Scalar circumstances is defined as `RT+1`.  For SVP64 if
 181 `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
 182 the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 183
 184 This instruction is supposed to be used in complement to the maddsubrs
 185 to produce the double-coefficient butterfly instruction. In order for that
 186 to work, instead of passing c2 as coefficient, we have to pass c2-c1 instead.
 187
 188 In essence, we are calculating the quantity `a * c1 +/- b * c1` first, with
 189 `maddsubrs` *without* shifting (so `SH=0`) and then we add/sub `b * (c2-c1)`
 190 from the previous `RT`/`RS`, and *then* do the shifting.
 191
 192 In the following example, assume `a` in `R1`, `b` in `R10`, `c1` in `R11` and `c2 - c1` in `R12`.
 193 The first instruction will put `a * c1 + b * c1` in `R1` (`RT`), `a * c1 - b * c1` in `RS`
 194 (here, `RS = RT +1`, so `R2`).
 195 Then, `maddrs` will add `b * (c2 - c1)` to `R1` (`RT`), and subtract it from `R2` (`RS`), and then
 196 round shift right both quantities 14 bits:
 197
 198 ```
 199     maddsubrs 1,10,0,11
 200     maddrs 1,10,14,12
 201 ```
 202
 203 In scalar code, that would take ~16 instructions for both operations.
 204
 205 -------
 206
 207 \newpage{}
 208
 209 # Twin Butterfly Floating-Point DCT and FFT Instruction(s)
 210
 211 **Add the following to Book I Section 4.6.6.3**
 212
 213 ## Floating-Point Twin Multiply-Add DCT [Single]
 214
 215 X-Form
 216
 217 ```
 218     |0     |6     |11      |16     |21      |31 |
 219     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 220 ```
 221
 222 * fdmadds FRT,FRA,FRB (Rc=0)
 223
 224 Pseudo-code:
 225
 226 ```
 227     FRS <- FPADD32(FRT, FRB)
 228     sub <- FPSUB32(FRT, FRB)
 229     FRT <- FPMUL32(FRA, sub)
 230 ```
 231
 232 The two IEEE754-FP32 operations
 233
 234 ```
 235     FRS <- [(FRT) + (FRB)]
 236     FRT <- [(FRT) - (FRB)] * (FRA)
 237 ```
 238
 239 are simultaneously performed.
 240
 241 The Floating-Point operand in register FRT is added to the floating-point
 242 operand in register FRB and the result stored in FRS.
 243
 244 Using the exact same operand input register values from FRT and FRB
 245 that were used to create FRS, the Floating-Point operand in register
 246 FRB is subtracted from the floating-point operand in register FRT and
 247 the result then rounded before being multiplied by FRA to create an
 248 intermediate result that is stored in FRT.
 249
 250 The add into FRS is treated exactly as `fadds`.  The creation of the
 251 result FRT is **not** the same as that of `fmsubs`, but is instead as if
 252 `fsubs` were performed first followed by `fmuls`.  The creation of FRS
 253 and FRT are treated as parallel independent operations which occur at
 254 the same time.
 255
 256 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
 257
 258 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
 259 which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
 260 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
 261 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 262
 263 Special Registers Altered:
 264
 265 ```
 266     FPRF FR FI
 267     FX OX UX XX
 268     VXSNAN VXISI VXIMZ
 269 ```
 270
 271 ## Floating-Point Multiply-Add FFT [Single]
 272
 273 X-Form
 274
 275 ```
 276     |0     |6     |11      |16     |21      |31 |
 277     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 278 ```
 279
 280 * ffmadds FRT,FRA,FRB (Rc=0)
 281
 282 Pseudo-code:
 283
 284 ```
 285     FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
 286     FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
 287 ```
 288
 289 The two operations
 290
 291 ```
 292     FRS <- -([(FRT) * (FRA)] - (FRB))
 293     FRT <-   [(FRT) * (FRA)] + (FRB)
 294 ```
 295
 296 are performed.
 297
 298 The floating-point operand in register FRT is multiplied by the
 299 floating-point operand in register FRA. The floating-point operand in
 300 register FRB is added to this intermediate result, and the intermediate
 301 stored in FRS.
 302
 303 Using the exact same values of FRT, FRT and FRB as used to create
 304 FRS, the floating-point operand in register FRT is multiplied by the
 305 floating-point operand in register FRA. The floating-point operand
 306 in register FRB is subtracted from this intermediate result, and the
 307 intermediate stored in FRT.
 308
 309 FRT is created as if a `fmadds` operation had been performed. FRS is
 310 created as if a `fnmsubs` operation had simultaneously been performed
 311 with the exact same register operands, in parallel, independently,
 312 at exactly the same time.
 313
 314 FRT is a Read-Modify-Write operation.
 315
 316 Note that if Rc=1 an Illegal Instruction is raised.
 317 Rc=1 is `RESERVED`
 318
 319 Similar to `FRTp`, this instruction produces an implicit result,
 320 `FRS`, which under Scalar circumstances is defined as `FRT+1`.
 321 For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
 322 Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
 323 (Max Vector Length).
 324
 325 Special Registers Altered:
 326
 327 ```
 328     FPRF FR FI
 329     FX OX UX XX
 330     VXSNAN VXISI VXIMZ
 331 ```
 332
 333 ## Floating-Point Twin Multiply-Add DCT
 334
 335 X-Form
 336
 337 ```
 338     |0     |6     |11      |16     |21      |31 |
 339     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 340 ```
 341
 342 * fdmadd FRT,FRA,FRB (Rc=0)
 343
 344 Pseudo-code:
 345
 346 ```
 347     FRS <- FPADD64(FRT, FRB)
 348     sub <- FPSUB64(FRT, FRB)
 349     FRT <- FPMUL64(FRA, sub)
 350 ```
 351
 352 The two IEEE754-FP64 operations
 353
 354 ```
 355     FRS <- [(FRT) + (FRB)]
 356     FRT <- [(FRT) - (FRB)] * (FRA)
 357 ```
 358
 359 are simultaneously performed.
 360
 361 The Floating-Point operand in register FRT is added to the floating-point
 362 operand in register FRB and the result stored in FRS.
 363
 364 Using the exact same operand input register values from FRT and FRB
 365 that were used to create FRS, the Floating-Point operand in register
 366 FRB is subtracted from the floating-point operand in register FRT and
 367 the result then rounded before being multiplied by FRA to create an
 368 intermediate result that is stored in FRT.
 369
 370 The add into FRS is treated exactly as `fadd`.  The creation of the
 371 result FRT is **not** the same as that of `fmsub`, but is instead as if
 372 `fsub` were performed first followed by `fmuls.  The creation of FRS
 373 and FRT are treated as parallel independent operations which occur at
 374 the same time.
 375
 376 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
 377
 378 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
 379 which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
 380 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
 381 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 382
 383 Special Registers Altered:
 384
 385 ```
 386     FPRF FR FI
 387     FX OX UX XX
 388     VXSNAN VXISI VXIMZ
 389 ```
 390
 391 ## Floating-Point Twin Multiply-Add FFT
 392
 393 X-Form
 394
 395 ```
 396     |0     |6     |11      |16     |21      |31 |
 397     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 398 ```
 399
 400 * ffmadd FRT,FRA,FRB (Rc=0)
 401
 402 Pseudo-code:
 403
 404 ```
 405     FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
 406     FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
 407 ```
 408
 409 The two operations
 410
 411 ```
 412     FRS <- -([(FRT) * (FRA)] - (FRB))
 413     FRT <-   [(FRT) * (FRA)] + (FRB)
 414 ```
 415
 416 are performed.
 417
 418 The floating-point operand in register FRT is multiplied by the
 419 floating-point operand in register FRA. The float- ing-point operand in
 420 register FRB is added to this intermediate result, and the intermediate
 421 stored in FRS.
 422
 423 Using the exact same values of FRT, FRT and FRB as used to create
 424 FRS, the floating-point operand in register FRT is multiplied by the
 425 floating-point operand in register FRA. The float- ing-point operand
 426 in register FRB is subtracted from this intermediate result, and the
 427 intermediate stored in FRT.
 428
 429 FRT is created as if a `fmadd` operation had been performed. FRS is
 430 created as if a `fnmsub` operation had simultaneously been performed
 431 with the exact same register operands, in parallel, independently,
 432 at exactly the same time.
 433
 434 FRT is a Read-Modify-Write operation.
 435
 436 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
 437
 438 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
 439 which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
 440 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
 441 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 442
 443 Special Registers Altered:
 444
 445 ```
 446     FPRF FR FI
 447     FX OX UX XX
 448     VXSNAN VXISI VXIMZ
 449 ```
 450
 451
 452 ## Floating-Point Add FFT/DCT [Single]
 453
 454 A-Form
 455
 456 ```
 457     |0     |6     |11      |16     |21      |26    |31 |
 458     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 459 ```
 460
 461 * ffadds FRT,FRA,FRB (Rc=0)
 462
 463 Pseudo-code:
 464
 465 ```
 466     FRT <- FPADD32(FRA, FRB)
 467     FRS <- FPSUB32(FRB, FRA)
 468 ```
 469
 470 Special Registers Altered:
 471
 472 ```
 473     FPRF FR FI
 474     FX OX UX XX
 475     VXSNAN VXISI
 476 ```
 477
 478 ## Floating-Point Add FFT/DCT [Double]
 479
 480 A-Form
 481
 482 ```
 483     |0     |6     |11      |16     |21      |26    |31 |
 484     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 485 ```
 486
 487 * ffadd FRT,FRA,FRB (Rc=0)
 488
 489 Pseudo-code:
 490
 491 ```
 492     FRT <- FPADD64(FRA, FRB)
 493     FRS <- FPSUB64(FRB, FRA)
 494 ```
 495
 496 Special Registers Altered:
 497
 498 ```
 499     FPRF FR FI
 500     FX OX UX XX
 501     VXSNAN VXISI
 502 ```
 503
 504 ## Floating-Point Subtract FFT/DCT [Single]
 505
 506 A-Form
 507
 508 ```
 509     |0     |6     |11      |16     |21      |26    |31 |
 510     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 511 ```
 512
 513 * ffsubs FRT,FRA,FRB (Rc=0)
 514
 515 Pseudo-code:
 516
 517 ```
 518     FRT <- FPSUB32(FRB, FRA)
 519     FRS <- FPADD32(FRA, FRB)
 520 ```
 521
 522 Special Registers Altered:
 523
 524 ```
 525     FPRF FR FI
 526     FX OX UX XX
 527     VXSNAN VXISI
 528 ```
 529
 530 ## Floating-Point Subtract FFT/DCT [Double]
 531
 532 A-Form
 533
 534 ```
 535     |0     |6     |11      |16     |21      |26    |31 |
 536     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 537 ```
 538
 539 * ffsub FRT,FRA,FRB (Rc=0)
 540
 541 Pseudo-code:
 542
 543 ```
 544     FRT <- FPSUB64(FRB, FRA)
 545     FRS <- FPADD64(FRA, FRB)
 546 ```
 547
 548 Special Registers Altered:
 549
 550 ```
 551     FPRF FR FI
 552     FX OX UX XX
 553     VXSNAN VXISI
 554 ```