openpower/sv/twin_butterfly.mdwn

   1 # Introduction
   2
   3 <!-- hide -->
   4 * <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
   5 * <https://libre-soc.org/openpower/sv/biginteger/> for format and
   6   information about implicit RS/FRS
   7 * <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
   8 * [[openpower/isa/svfparith]]
   9 * [[openpower/isa/svfixedarith]]
  10 * [[openpower/sv/rfc/ls016]]
  11 <!-- show -->
  12
  13 Although best used with SVP64 REMAP these instructions may be used in a Scalar-only
  14 context to save considerably on DCT, DFT and FFT processing.  Whilst some hardware
  15 implementations may not necessarily implement them efficiently (slower Micro-coding)
  16 savings still come from the reduction in temporary registers as well as instruction
  17 count.
  18
  19 # Rationale for Twin Butterfly Integer DCT Instruction(s)
  20
  21 The number of general-purpose uses for DCT is huge. The number of
  22 instructions needed instead of these Twin-Butterfly instructions is also
  23 huge (**eight**) and given that it is extremely common to explicitly
  24 loop-unroll them quantity hundreds to thousands of instructions are
  25 dismayingly common (for all ISAs).
  26
  27 The goal is to implement instructions that calculate the expression:
  28
  29 ```
  30     fdct_round_shift((a +/- b) * c)
  31 ```
  32
  33 For the single-coefficient butterfly instruction, and:
  34
  35 ```
  36     fdct_round_shift(a * c1  +/- b * c2)
  37 ```
  38
  39 For the double-coefficient butterfly instruction.
  40
  41 In a 32-bit context `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`
  42
  43 ```
  44     #define ROUND_POWER_OF_TWO(value, n) \
  45             (((value) + (1 << ((n)-1))) >> (n))
  46 ```
  47
  48 These instructions are at the core of **ALL** FDCT calculations in many
  49 major video codecs, including -but not limited to- VP8/VP9, AV1, etc.
  50 ARM includes special instructions to optimize these operations, although
  51 they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.
  52
  53 The suggestion is to have a single instruction to calculate both values
  54 `((a + b) * c) >> N`, and `((a - b) * c) >> N`.  The instruction will
  55 run in accumulate mode, so in order to calculate the 2-coeff version
  56 one would just have to call the same instruction with different order a,
  57 b and a different constant c.
  58
  59 Example taken from libvpx
  60 <https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:
  61
  62 ```
  63     #include <stdint.h>
  64     #define ROUND_POWER_OF_TWO(value, n) \
  65             (((value) + (1 << ((n)-1))) >> (n))
  66     void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
  67         t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
  68         t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
  69     }
  70 ```
  71
  72 8 instructions are required  - replaced by just the one (maddsubrs):
  73
  74 ```
  75     add 9,5,4
  76     subf 5,5,4
  77     mullw 9,9,6
  78     mullw 5,5,6
  79     addi 9,9,8192
  80     addi 5,5,8192
  81     srawi 9,9,14
  82     srawi 5,5,14
  83 ```
  84
  85 -------
  86
  87 \newpage{}
  88
  89 ## Integer Butterfly Multiply Add/Sub FFT/DCT
  90
  91 **Add the following to Book I Section 3.3.9.1**
  92
  93 A-Form
  94
  95 ```
  96     |0     |6     |11      |16     |21      |26    |31 |
  97     | PO   |  RT  |   RA   |   RB  |   SH   |   XO |Rc |
  98 ```
  99
 100 * maddsubrs  RT,RA,SH,RB
 101
 102 Pseudo-code:
 103
 104 ```
 105     n <- SH
 106     sum <- (RT) + (RA)
 107     diff <- (RT) - (RA)
 108     prod1 <- MULS(RB, sum)
 109     prod1_lo <- prod1[XLEN:(XLEN*2)-1]
 110     prod2 <- MULS(RB, diff)
 111     prod2_lo <- prod2[XLEN:(XLEN*2)-1]
 112     if n = 0 then
 113         RT <- prod1_lo
 114         RS <- prod2_lo
 115     else
 116         round <- [0]*XLEN
 117         round[XLEN -n] <- 1
 118         prod1_lo <- prod1_lo + round
 119         prod2_lo <- prod2_lo + round
 120         m <- MASK(n, (XLEN-1))
 121         res1 <- ROTL64(prod1_lo, XLEN-n) & m
 122         res2 <- ROTL64(prod2_lo, XLEN-n) & m
 123         signbit1 <- prod1_lo[0]
 124         signbit2 <- prod2_lo[0]
 125         smask1 <- ([signbit1]*XLEN) & ¬m
 126         smask2 <- ([signbit2]*XLEN) & ¬m
 127         RT <- (res1 | smask1)
 128         RS <- (res2 | smask2)
 129 ```
 130
 131 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
 132
 133 Similar to `RTp`, this instruction produces an implicit result, `RS`,
 134 which under Scalar circumstances is defined as `RT+1`.  For SVP64 if
 135 `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
 136 the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 137
 138 Special Registers Altered:
 139
 140 ```
 141     None
 142 ```
 143
 144 -------
 145
 146 \newpage{}
 147
 148 # Twin Butterfly Floating-Point DCT and FFT Instruction(s)
 149
 150 **Add the following to Book I Section 4.6.6.3**
 151
 152 ## Floating-Point Twin Multiply-Add DCT [Single]
 153
 154 X-Form
 155
 156 ```
 157     |0     |6     |11      |16     |21      |31 |
 158     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 159 ```
 160
 161 * fdmadds FRT,FRA,FRB (Rc=0)
 162
 163 Pseudo-code:
 164
 165 ```
 166     FRS <- FPADD32(FRT, FRB)
 167     sub <- FPSUB32(FRT, FRB)
 168     FRT <- FPMUL32(FRA, sub)
 169 ```
 170
 171 The two IEEE754-FP32 operations
 172
 173 ```
 174     FRS <- [(FRT) + (FRB)]
 175     FRT <- [(FRT) - (FRB)] * (FRA)
 176 ```
 177
 178 are simultaneously performed.
 179
 180 The Floating-Point operand in register FRT is added to the floating-point
 181 operand in register FRB and the result stored in FRS.
 182
 183 Using the exact same operand input register values from FRT and FRB
 184 that were used to create FRS, the Floating-Point operand in register
 185 FRB is subtracted from the floating-point operand in register FRT and
 186 the result then rounded before being multiplied by FRA to create an
 187 intermediate result that is stored in FRT.
 188
 189 The add into FRS is treated exactly as `fadds`.  The creation of the
 190 result FRT is **not** the same as that of `fmsubs`, but is instead as if
 191 `fsubs` were performed first followed by `fmuls`.  The creation of FRS
 192 and FRT are treated as parallel independent operations which occur at
 193 the same time.
 194
 195 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
 196
 197 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
 198 which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
 199 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
 200 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 201
 202 Special Registers Altered:
 203
 204 ```
 205     FPRF FR FI
 206     FX OX UX XX
 207     VXSNAN VXISI VXIMZ
 208 ```
 209
 210 ## Floating-Point Multiply-Add FFT [Single]
 211
 212 X-Form
 213
 214 ```
 215     |0     |6     |11      |16     |21      |31 |
 216     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 217 ```
 218
 219 * ffmadds FRT,FRA,FRB (Rc=0)
 220
 221 Pseudo-code:
 222
 223 ```
 224     FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
 225     FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
 226 ```
 227
 228 The two operations
 229
 230 ```
 231     FRS <- -([(FRT) * (FRA)] - (FRB))
 232     FRT <-   [(FRT) * (FRA)] + (FRB)
 233 ```
 234
 235 are performed.
 236
 237 The floating-point operand in register FRT is multiplied by the
 238 floating-point operand in register FRA. The floating-point operand in
 239 register FRB is added to this intermediate result, and the intermediate
 240 stored in FRS.
 241
 242 Using the exact same values of FRT, FRT and FRB as used to create
 243 FRS, the floating-point operand in register FRT is multiplied by the
 244 floating-point operand in register FRA. The floating-point operand
 245 in register FRB is subtracted from this intermediate result, and the
 246 intermediate stored in FRT.
 247
 248 FRT is created as if a `fmadds` operation had been performed. FRS is
 249 created as if a `fnmsubs` operation had simultaneously been performed
 250 with the exact same register operands, in parallel, independently,
 251 at exactly the same time.
 252
 253 FRT is a Read-Modify-Write operation.
 254
 255 Note that if Rc=1 an Illegal Instruction is raised.
 256 Rc=1 is `RESERVED`
 257
 258 Similar to `FRTp`, this instruction produces an implicit result,
 259 `FRS`, which under Scalar circumstances is defined as `FRT+1`.
 260 For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
 261 Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
 262 (Max Vector Length).
 263
 264 Special Registers Altered:
 265
 266 ```
 267     FPRF FR FI
 268     FX OX UX XX
 269     VXSNAN VXISI VXIMZ
 270 ```
 271
 272 ## Floating-Point Twin Multiply-Add DCT
 273
 274 X-Form
 275
 276 ```
 277     |0     |6     |11      |16     |21      |31 |
 278     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 279 ```
 280
 281 * fdmadd FRT,FRA,FRB (Rc=0)
 282
 283 Pseudo-code:
 284
 285 ```
 286     FRS <- FPADD64(FRT, FRB)
 287     sub <- FPSUB64(FRT, FRB)
 288     FRT <- FPMUL64(FRA, sub)
 289 ```
 290
 291 The two IEEE754-FP64 operations
 292
 293 ```
 294     FRS <- [(FRT) + (FRB)]
 295     FRT <- [(FRT) - (FRB)] * (FRA)
 296 ```
 297
 298 are simultaneously performed.
 299
 300 The Floating-Point operand in register FRT is added to the floating-point
 301 operand in register FRB and the result stored in FRS.
 302
 303 Using the exact same operand input register values from FRT and FRB
 304 that were used to create FRS, the Floating-Point operand in register
 305 FRB is subtracted from the floating-point operand in register FRT and
 306 the result then rounded before being multiplied by FRA to create an
 307 intermediate result that is stored in FRT.
 308
 309 The add into FRS is treated exactly as `fadd`.  The creation of the
 310 result FRT is **not** the same as that of `fmsub`, but is instead as if
 311 `fsub` were performed first followed by `fmuls.  The creation of FRS
 312 and FRT are treated as parallel independent operations which occur at
 313 the same time.
 314
 315 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
 316
 317 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
 318 which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
 319 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
 320 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 321
 322 Special Registers Altered:
 323
 324 ```
 325     FPRF FR FI
 326     FX OX UX XX
 327     VXSNAN VXISI VXIMZ
 328 ```
 329
 330 ## Floating-Point Twin Multiply-Add FFT
 331
 332 X-Form
 333
 334 ```
 335     |0     |6     |11      |16     |21      |31 |
 336     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 337 ```
 338
 339 * ffmadd FRT,FRA,FRB (Rc=0)
 340
 341 Pseudo-code:
 342
 343 ```
 344     FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
 345     FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
 346 ```
 347
 348 The two operations
 349
 350 ```
 351     FRS <- -([(FRT) * (FRA)] - (FRB))
 352     FRT <-   [(FRT) * (FRA)] + (FRB)
 353 ```
 354
 355 are performed.
 356
 357 The floating-point operand in register FRT is multiplied by the
 358 floating-point operand in register FRA. The float- ing-point operand in
 359 register FRB is added to this intermediate result, and the intermediate
 360 stored in FRS.
 361
 362 Using the exact same values of FRT, FRT and FRB as used to create
 363 FRS, the floating-point operand in register FRT is multiplied by the
 364 floating-point operand in register FRA. The float- ing-point operand
 365 in register FRB is subtracted from this intermediate result, and the
 366 intermediate stored in FRT.
 367
 368 FRT is created as if a `fmadd` operation had been performed. FRS is
 369 created as if a `fnmsub` operation had simultaneously been performed
 370 with the exact same register operands, in parallel, independently,
 371 at exactly the same time.
 372
 373 FRT is a Read-Modify-Write operation.
 374
 375 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
 376
 377 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
 378 which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
 379 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
 380 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 381
 382 Special Registers Altered:
 383
 384 ```
 385     FPRF FR FI
 386     FX OX UX XX
 387     VXSNAN VXISI VXIMZ
 388 ```
 389
 390
 391 ## Floating-Point Add FFT/DCT [Single]
 392
 393 A-Form
 394
 395 ```
 396     |0     |6     |11      |16     |21      |26    |31 |
 397     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 398 ```
 399
 400 * ffadds FRT,FRA,FRB (Rc=0)
 401
 402 Pseudo-code:
 403
 404 ```
 405     FRT <- FPADD32(FRA, FRB)
 406     FRS <- FPSUB32(FRB, FRA)
 407 ```
 408
 409 Special Registers Altered:
 410
 411 ```
 412     FPRF FR FI
 413     FX OX UX XX
 414     VXSNAN VXISI
 415 ```
 416
 417 ## Floating-Point Add FFT/DCT [Double]
 418
 419 A-Form
 420
 421 ```
 422     |0     |6     |11      |16     |21      |26    |31 |
 423     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 424 ```
 425
 426 * ffadd FRT,FRA,FRB (Rc=0)
 427
 428 Pseudo-code:
 429
 430 ```
 431     FRT <- FPADD64(FRA, FRB)
 432     FRS <- FPSUB64(FRB, FRA)
 433 ```
 434
 435 Special Registers Altered:
 436
 437 ```
 438     FPRF FR FI
 439     FX OX UX XX
 440     VXSNAN VXISI
 441 ```
 442
 443 ## Floating-Point Subtract FFT/DCT [Single]
 444
 445 A-Form
 446
 447 ```
 448     |0     |6     |11      |16     |21      |26    |31 |
 449     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 450 ```
 451
 452 * ffsubs FRT,FRA,FRB (Rc=0)
 453
 454 Pseudo-code:
 455
 456 ```
 457     FRT <- FPSUB32(FRB, FRA)
 458     FRS <- FPADD32(FRA, FRB)
 459 ```
 460
 461 Special Registers Altered:
 462
 463 ```
 464     FPRF FR FI
 465     FX OX UX XX
 466     VXSNAN VXISI
 467 ```
 468
 469 ## Floating-Point Subtract FFT/DCT [Double]
 470
 471 A-Form
 472
 473 ```
 474     |0     |6     |11      |16     |21      |26    |31 |
 475     | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
 476 ```
 477
 478 * ffsub FRT,FRA,FRB (Rc=0)
 479
 480 Pseudo-code:
 481
 482 ```
 483     FRT <- FPSUB64(FRB, FRA)
 484     FRS <- FPADD64(FRA, FRB)
 485 ```
 486
 487 Special Registers Altered:
 488
 489 ```
 490     FPRF FR FI
 491     FX OX UX XX
 492     VXSNAN VXISI
 493 ```