# Introduction

<!-- hide -->
* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]
<!-- show -->

Although best used with SVP64 REMAP, these instructions may also be used
in a Scalar-only context to save considerably on DCT, DFT and FFT processing.

# Rationale for Twin Butterfly Integer DCT Instruction(s)

The number of general-purpose uses for DCT is huge. The number of
instructions needed in place of a single Twin-Butterfly instruction is
also large (**eight**), and because it is extremely common to explicitly
loop-unroll them, sequences of hundreds to thousands of instructions are
dismayingly common (for all ISAs).

The goal is to implement instructions that calculate the expression:

```
    fdct_round_shift((a +/- b) * c)
```

for the single-coefficient butterfly instruction, and:

```
    fdct_round_shift(a * c1  +/- b * c2)
```

for the double-coefficient butterfly instruction.

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

```
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
```
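
To make the rounding step concrete, the following is a minimal C sketch
of `ROUND_POWER_OF_TWO(x, 14)`. The helper name `round_shift_14` is
hypothetical (it is not a libvpx function), and the sketch assumes that
the compiler implements `>>` on a negative signed value as an arithmetic
shift, which C leaves implementation-defined.

```c
#include <stdint.h>

/* Hypothetical helper (illustrative name only): adds half of 2^14 and
 * then shifts right by 14, i.e. ROUND_POWER_OF_TWO(x, 14).
 * Assumes >> on a negative int32_t is an arithmetic shift. */
static int32_t round_shift_14(int32_t x)
{
    const int n = 14;
    return (x + (INT32_C(1) << (n - 1))) >> n;
}
```

For example, `round_shift_14(8191)` gives 0 and `round_shift_14(8192)`
gives 1: the input is treated as a fixed-point value with 14 fractional
bits and rounded to the nearest integer.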

These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including but not limited to VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations, although
they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.

The suggestion is to have a single instruction that calculates both
values, `((a + b) * c) >> N` and `((a - b) * c) >> N`.  The instruction
will run in accumulate mode, so to calculate the two-coefficient version
one would just call the same instruction again with `a` and `b` in a
different order and a different constant `c`.

Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:

```
    #include <stdint.h>
    /* Round-to-nearest right shift: add half of 2^n, then shift by n. */
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
    /* Both butterfly outputs share the single coefficient cospi_16_64. */
    void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
        t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
        t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
    }
```
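
Eight instructions are required, all of which are replaced by a single
`maddsubrs`:

```
    add 9,5,4        # r9 = r5 + r4   (sum)
    subf 5,5,4       # r5 = r4 - r5   (difference)
    mullw 9,9,6      # r9 = sum * r6  (coefficient)
    mullw 5,5,6      # r5 = difference * r6
    addi 9,9,8192    # add rounding constant 1 << 13
    addi 5,5,8192
    srawi 9,9,14     # arithmetic shift right by 14
    srawi 5,5,14
```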

-------

\newpage{}

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

A-Form

```
    |0     |6     |11      |16     |21      |26    |31 |
    | PO   |  RT  |   RA   |   RB  |   SH   |   XO |Rc |
```

* maddsubrs  RT,RA,SH,RB

Pseudo-code:

```
    n <- SH
    sum <- (RT) + (RA)
    diff <- (RT) - (RA)
    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1]
    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1]
    if n = 0 then
        #round <- EXTS([0]*(XLEN-1) || [1]*1)
        #prod1 <- ROTL64(prod1, 1)
        #prod2 <- ROTL64(prod2, 1)
        #prod1 <- prod1 + round
        #prod2 <- prod2 + round
        #res1 <- ROTL64(prod1, XLEN-1)
        #res2 <- ROTL64(prod2, XLEN-1)
        #m <- MASK(1, (XLEN-1))
        RT <- prod1
        RS <- prod2
    else
        round <- EXTS([0]*(XLEN -n -1) || [1]*1 || [0]*(n-1))
        prod1 <- prod1 + round
        prod2 <- prod2 + round
        res1 <- ROTL64(prod1, XLEN-n)
        res2 <- ROTL64(prod2, XLEN-n)
        m <- MASK(n, (XLEN-1))
        signbit1 <- prod1[0]
        signbit2 <- prod2[0]
        smask1 <- ([signbit1]*XLEN) & ¬m
        smask2 <- ([signbit2]*XLEN) & ¬m
        RT <- (res1 & m | smask1)
        RS <- (res2 & m | smask2)
```
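
The pseudo-code above can be sketched as a small C reference model. This
is an illustrative sketch only, assuming XLEN=64: the names
`maddsubrs_model`, `maddsubrs_result` and `asr64` are hypothetical, the
rotate-and-mask sequence is expressed as an equivalent arithmetic shift
right, and the conversion of a `uint64_t` bit pattern to `int64_t` is
assumed to be two's complement.

```c
#include <stdint.h>

/* Pair of results: rt is the explicit destination, rs the implicit one. */
typedef struct {
    int64_t rt;
    int64_t rs;
} maddsubrs_result;

/* Arithmetic shift right of a 64-bit pattern, valid for n in 1..63.
 * Equivalent to the ROTL64 / MASK / sign-mask steps of the pseudo-code. */
static int64_t asr64(uint64_t v, unsigned n)
{
    uint64_t shifted = v >> n;
    if (v >> 63)                             /* sign bit set: fill with ones */
        shifted |= ~UINT64_C(0) << (64 - n);
    return (int64_t)shifted;                 /* assumes two's complement */
}

/* Illustrative model for XLEN=64; sh corresponds to the SH field. */
static maddsubrs_result maddsubrs_model(int64_t rt, int64_t ra,
                                        int64_t rb, unsigned sh)
{
    /* Unsigned arithmetic wraps modulo 2^64, which matches keeping only
     * the low XLEN bits of the sum, the difference and the products. */
    uint64_t sum   = (uint64_t)rt + (uint64_t)ra;
    uint64_t diff  = (uint64_t)rt - (uint64_t)ra;
    uint64_t prod1 = (uint64_t)rb * sum;
    uint64_t prod2 = (uint64_t)rb * diff;
    maddsubrs_result r;

    if (sh == 0) {
        /* n = 0: no rounding and no shift */
        r.rt = (int64_t)prod1;
        r.rs = (int64_t)prod2;
    } else {
        uint64_t round = UINT64_C(1) << (sh - 1);   /* half of 2^sh */
        r.rt = asr64(prod1 + round, sh);
        r.rs = asr64(prod2 + round, sh);
    }
    return r;
}
```

With `rt=100`, `ra=50`, `rb=11585` and `sh=14` the model returns 106 in
`rt` and 35 in `rs`, which is what the `twin_int` C example produces for
`x0=100`, `x1=50` and a coefficient of 11585.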

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`.  For SVP64, if
`RT` is a Vector, `RS` begins immediately after the Vector `RT`, where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
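
As a rough illustration of the rule above, the register number of the
implicit result could be computed as in the following sketch. The helper
`implicit_rs` and its parameters are hypothetical stand-ins for decoder
state, and the sketch deliberately ignores register-file limits and any
SVP64 register-number extension.

```c
/* Hypothetical helper: register number of the implicit RS result.
 * rt_is_vector and maxvl stand in for SVP64 state (SVSTATE.MAXVL). */
static unsigned implicit_rs(unsigned rt, int rt_is_vector, unsigned maxvl)
{
    /* Scalar: RS is RT+1.  Vector: RS starts right after the MAXVL
     * elements reserved for the Vector RT. */
    return rt_is_vector ? rt + maxvl : rt + 1;
}
```

For instance, a Scalar `RT` of 4 gives `RS` = 5, whereas a Vector `RT`
starting at 8 with `MAXVL` = 4 gives `RS` starting at 12.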

Special Registers Altered:

```
    None
```

-------

\newpage{}

# Twin Butterfly Floating-Point DCT and FFT Instruction(s)

**Add the following to Book I Section 4.6.6.3**

## Floating-Point Twin Multiply-Add DCT [Single]

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)
```

The two IEEE754-FP32 operations

```
    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result is then rounded before being multiplied by FRA to create
the result that is stored in FRT.

The add into FRS is treated exactly as `fadds`.  The creation of the
result FRT is **not** the same as that of `fmsubs`, but is instead as if
`fsubs` were performed first followed by `fmuls`.  The creation of FRS
and FRT are treated as parallel independent operations which occur at
the same time.
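
The behaviour described above can be sketched in C as follows. This is
an illustrative model only: the names `fdmadds_model` and
`fdmadds_result` are hypothetical, FPSCR status flags are not modelled,
and `volatile` is used as a simple way to make the compiler round the
difference to single precision before the multiply rather than fusing
the two steps.

```c
/* Result pair: frt is the explicit destination, frs the implicit one. */
typedef struct {
    float frt;
    float frs;
} fdmadds_result;

/* Illustrative model of the two-step (subtract, round, multiply) path. */
static fdmadds_result fdmadds_model(float frt, float fra, float frb)
{
    fdmadds_result r;
    /* volatile forces each intermediate to be rounded to single
     * precision, as if fsubs were followed by fmuls. */
    volatile float sub  = frt - frb;
    volatile float prod = fra * sub;

    r.frs = frt + frb;   /* as fadds, using the original FRT and FRB */
    r.frt = prod;
    return r;
}
```

Both results are computed from the input values of `frt` and `frb`, so
the order of the two assignments does not matter in this model.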

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`.  For SVP64, if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`,
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## Floating-Point Multiply-Add FFT [Single]

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
```

The two operations

```
    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <-   [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is subtracted from this intermediate result, and the
negated difference is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create
FRS, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is added to this intermediate result, and the
sum is stored in FRT.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.
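
A rough C sketch of the two results is given below. The names
`ffmadds_model` and `ffmadds_result` are hypothetical, and FPSCR flags
are not modelled. To stay free of library dependencies the sketch forms
the product in `double`, where the product of two single-precision
values is exact, and then rounds the final sum to `float`. This only
approximates the single rounding of `fmadds`/`fnmsubs`, because the sum
is rounded once to double and once more to single; the standard C
function `fmaf` from `<math.h>` is the usual way to obtain a true single
rounding.

```c
/* Result pair: frt is the explicit destination, frs the implicit one. */
typedef struct {
    float frt;
    float frs;
} ffmadds_result;

/* Illustrative model only: approximates the fused operations by working
 * in double precision (see the caveat in the accompanying text). */
static ffmadds_result ffmadds_model(float frt, float fra, float frb)
{
    /* The product of two floats is exact in double:
     * 24 + 24 significand bits fit into 53. */
    double prod = (double)frt * (double)fra;
    ffmadds_result r;

    r.frt = (float)(prod + (double)frb);      /* as fmadds:   (FRT*FRA) + FRB  */
    r.frs = (float)(-(prod - (double)frb));   /* as fnmsubs: -((FRT*FRA) - FRB) */
    return r;
}
```

For `frt=3`, `fra=2` and `frb=1` the model yields 7 for `frt` and -5 for
`frs`, both computed from the same three input values.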

FRT is a Read-Modify-Write operand.

Note that if Rc=1 an Illegal Instruction is raised.
Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64, if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT`, where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## Floating-Point Twin Multiply-Add DCT

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* fdmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPADD64(FRT, FRB)
    sub <- FPSUB64(FRT, FRB)
    FRT <- FPMUL64(FRA, sub)
```

The two IEEE754-FP64 operations

```
    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result is then rounded before being multiplied by FRA to create
the result that is stored in FRT.

The add into FRS is treated exactly as `fadd`.  The creation of the
result FRT is **not** the same as that of `fmsub`, but is instead as if
`fsub` were performed first followed by `fmul`.  The creation of FRS
and FRT are treated as parallel independent operations which occur at
the same time.

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`.  For SVP64, if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`,
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## Floating-Point Twin Multiply-Add FFT

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* ffmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
```

The two operations

```
    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <-   [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is subtracted from this intermediate result, and the
negated difference is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create
FRS, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is added to this intermediate result, and the
sum is stored in FRT.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operand.

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`.  For SVP64, if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`,
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## Floating-Point Add FFT/DCT [Single]

A-Form

```
    |0     |6     |11      |16     |21      |26    |31 |
    | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
```

* ffadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
```
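
The `ffadds` pseudo-code maps directly onto two single-precision
operations, as in the following sketch. The names `ffadds_model` and
`ffadds_result` are hypothetical, and FPSCR status flags are not
modelled.

```c
/* Result pair: frt is the explicit destination, frs the implicit one. */
typedef struct {
    float frt;
    float frs;
} ffadds_result;

/* Illustrative model: sum and difference from the same two inputs. */
static ffadds_result ffadds_model(float fra, float frb)
{
    ffadds_result r;
    r.frt = fra + frb;   /* FRT <- FPADD32(FRA, FRB) */
    r.frs = frb - fra;   /* FRS <- FPSUB32(FRB, FRA) */
    return r;
}
```

Note the operand order of the subtraction: the implicit result is
`FRB - FRA`, not `FRA - FRB`.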

## Floating-Point Add FFT/DCT [Double]

A-Form

```
    |0     |6     |11      |16     |21      |26    |31 |
    | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
```

* ffadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
```

## Floating-Point Subtract FFT/DCT [Single]

A-Form

```
    |0     |6     |11      |16     |21      |26    |31 |
    | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
```

* ffsubs FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
```

## Floating-Point Subtract FFT/DCT [Double]

A-Form

```
    |0     |6     |11      |16     |21      |26    |31 |
    | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
```

* ffsub FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
```
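
As a final illustration, the double-precision `ffsub` pseudo-code can be
sketched in C as below. The names `ffsub_model` and `ffsub_result` are
hypothetical, and FPSCR status flags are not modelled.

```c
/* Result pair: frt is the explicit destination, frs the implicit one. */
typedef struct {
    double frt;
    double frs;
} ffsub_result;

/* Illustrative model: difference in FRT, sum in the implicit FRS. */
static ffsub_result ffsub_model(double fra, double frb)
{
    ffsub_result r;
    r.frt = frb - fra;   /* FRT <- FPSUB64(FRB, FRA) */
    r.frs = fra + frb;   /* FRS <- FPADD64(FRA, FRB) */
    return r;
}
```

Compared with the pseudo-code for `ffadd`, the same two values are
produced but the explicit and implicit destinations are exchanged.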