openpower/sv/twin_butterfly.mdwn

# Introduction

<!-- hide -->
* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

<!-- show -->

# Rationale for Twin Butterfly Integer DCT Instruction(s)

DCT has a huge number of general-purpose uses. Without these
Twin-Butterfly instructions, each butterfly needs **eight** instructions,
and because explicit loop-unrolling is extremely common, sequences of
hundreds to thousands of instructions are dismayingly common (for
all ISAs).

The goal is to implement instructions that calculate the expression

```
    fdct_round_shift((a +/- b) * c)
```

for the single-coefficient butterfly instruction, and

```
    fdct_round_shift(a * c1  +/- b * c2)
```

for the double-coefficient butterfly instruction.

`fdct_round_shift(x)` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

```
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
```
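
As a worked illustration of the rounding above, the sketch below wraps
the macro in a function and applies it to a product with a 14-bit
fixed-point coefficient. The names `DCT_CONST_BITS` and `scale_q14`, and
the coefficient value 11585 (approximately cos(pi/4) * 2^14), are
assumptions made for this example and are not part of the proposal. Only
non-negative inputs are exercised, because right-shifting a negative
value is implementation-defined in C.

```c
#include <stdint.h>

#define DCT_CONST_BITS 14   /* assumed name for the shift amount of 14 */
#define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))

/* Round-to-nearest division by 2^14: adds half (8192) before shifting. */
int32_t fdct_round_shift(int32_t x) {
    return ROUND_POWER_OF_TWO(x, DCT_CONST_BITS);
}

/* Multiply a sample by a coefficient scaled by 2^14, then scale back. */
int32_t scale_q14(int32_t sample, int32_t coeff_q14) {
    return fdct_round_shift(sample * coeff_q14);
}

/* fdct_round_shift(24575) is 1  (24575 / 16384 is just under 1.5)
   fdct_round_shift(24576) is 2  (24576 / 16384 is exactly 1.5)
   scale_q14(150, 11585)   is 106 (150 * 0.7071 is about 106.07) */
```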

These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including (but not limited to) VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations
(`vqrdmulhq_s16`/`vqrdmulhq_s32`), although they are limited in precision.

The suggestion is to have a single instruction that calculates both
values, `((a + b) * c) >> N` and `((a - b) * c) >> N`.  The instruction
will run in accumulate mode, so to calculate the 2-coeff version one
would just call the same instruction again with a and b in a different
order and with a different constant c.

Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:

```
    #include <stdint.h>
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
    void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
        t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
        t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
    }
```

Eight instructions are required, all of which are replaced by a single
`maddsubrs`:

```
    add 9,5,4
    subf 5,5,4
    mullw 9,9,6
    mullw 5,5,6
    addi 9,9,8192
    addi 5,5,8192
    srawi 9,9,14
    srawi 5,5,14
```
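
To make the correspondence with `twin_int` explicit, the following
sketch steps through the eight instructions one at a time in C. The
register assignment (x0 in r4, x1 in r5, the coefficient in r6) is
inferred from the operand order of `subf` and is an assumption, as are
the helper names `sra32`, `mullw` and `twin_eight_insns`. The carry bit
(CA) set by `srawi` is ignored.

```c
#include <stdint.h>

/* Arithmetic shift right of a 32-bit value, written so that it does not
   rely on implementation-defined shifts of negative numbers.
   Models the value produced by srawi; CA is not modelled. */
int32_t sra32(int32_t v, unsigned n) {
    return (v >= 0) ? (v >> n) : ~((~v) >> n);
}

/* mullw: signed product of the low 32 bits of each operand. */
int64_t mullw(int64_t a, int64_t b) {
    return (int64_t)(int32_t)a * (int64_t)(int32_t)b;
}

void twin_eight_insns(int16_t *t, int16_t x0, int16_t x1, int16_t c) {
    int64_t r4 = x0, r5 = x1, r6 = c, r9;  /* assumed register assignment */
    r9 = r5 + r4;                  /* add   9,5,4                      */
    r5 = r4 - r5;                  /* subf  5,5,4    (r5 = r4 - r5)    */
    r9 = mullw(r9, r6);            /* mullw 9,9,6                      */
    r5 = mullw(r5, r6);            /* mullw 5,5,6                      */
    r9 = r9 + 8192;                /* addi  9,9,8192 (8192 = 1 << 13)  */
    r5 = r5 + 8192;                /* addi  5,5,8192                   */
    r9 = sra32((int32_t)r9, 14);   /* srawi 9,9,14                     */
    r5 = sra32((int32_t)r5, 14);   /* srawi 5,5,14                     */
    t[0] = (int16_t)r9;            /* (x0 + x1) * c, rounded, shifted  */
    t[1] = (int16_t)r5;            /* (x0 - x1) * c, rounded, shifted  */
}
```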

-------

\newpage{}

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

A-Form

```
    |0     |6     |11      |16     |21      |26    |31 |
    | PO   |  RT  |   RA   |   RB  |   SH   |   XO |Rc |

```

* maddsubrs  RT,RA,SH,RB

Pseudo-code:

```
    n <- SH
    sum <- (RT) + (RA)
    diff <- (RT) - (RA)
    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1]
    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1]
    if n = 0 then
        #round <- EXTS([0]*(XLEN-1) || [1]*1)
        #prod1 <- ROTL64(prod1, 1)
        #prod2 <- ROTL64(prod2, 1)
        #prod1 <- prod1 + round
        #prod2 <- prod2 + round
        #res1 <- ROTL64(prod1, XLEN-1)
        #res2 <- ROTL64(prod2, XLEN-1)
        #m <- MASK(1, (XLEN-1))
        RT <- prod1
        RS <- prod2
    else
        round <- EXTS([0]*(XLEN -n -1) || [1]*1 || [0]*(n-1))
        prod1 <- prod1 + round
        prod2 <- prod2 + round
        res1 <- ROTL64(prod1, XLEN-n)
        res2 <- ROTL64(prod2, XLEN-n)
        m <- MASK(n, (XLEN-1))
        signbit1 <- prod1[0]
        signbit2 <- prod2[0]
        smask1 <- ([signbit1]*XLEN) & ¬m
        smask2 <- ([signbit2]*XLEN) & ¬m
        RT <- (res1 & m | smask1)
        RS <- (res2 & m | smask2)
```
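
The pseudo-code above can be read as "add half, then arithmetic shift
right by SH" applied to the low XLEN bits of each product. The C sketch
below models that reading for XLEN=64 only. The names `maddsubrs_model`,
`maddsubrs_result` and `round_sar` are hypothetical, SH is assumed to be
in the range 0 to 31 (a 5-bit field), and the implicit RS result is
returned in a struct instead of a second register. It is an informal
model, not the reference implementation.

```c
#include <stdint.h>

typedef struct {
    uint64_t rt;   /* result written to RT */
    uint64_t rs;   /* result written to the implicit RS */
} maddsubrs_result;

/* Add the rounding constant, then shift right by n with the sign bit
   replicated into the top n bits (the rotate, mask and smask steps). */
uint64_t round_sar(uint64_t prod, unsigned n) {
    if (n == 0)
        return prod;                      /* n = 0: product passed through */
    prod += (uint64_t)1 << (n - 1);       /* round <- 1 << (n-1) */
    uint64_t res = prod >> n;             /* ROTL64(prod, XLEN-n) & m */
    if (prod >> 63)                       /* signbit <- prod[0] (MSB0) */
        res |= ~(UINT64_MAX >> n);        /* smask: top n bits set */
    return res;
}

maddsubrs_result maddsubrs_model(uint64_t rt, uint64_t ra,
                                 uint64_t rb, unsigned sh) {
    uint64_t sum  = rt + ra;              /* wraps modulo 2^64 */
    uint64_t diff = rt - ra;
    maddsubrs_result r;
    /* The low 64 bits of a signed product equal those of the
       unsigned product, so plain uint64_t multiplication suffices. */
    r.rt = round_sar(rb * sum, sh);
    r.rs = round_sar(rb * diff, sh);
    return r;
}

/* maddsubrs_model(100, 50, 11585, 14) gives rt = 106, rs = 35,
   matching twin_int() with x0 = 100, x1 = 50, cospi_16_64 = 11585. */
```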

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`.  For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
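
The register-numbering rule above can be sketched as a small helper.
The function name `implicit_rs_reg` and its parameters are hypothetical,
and the sketch assumes one element per register (no element-width
override) and does not check that the result is a valid register
number. It only restates the `RT+1` and `RT+MAXVL` rule from the text.

```c
/* Register number holding element i of the implicit result RS.
   rt           : register number of RT
   rt_is_vector : nonzero if RT is marked as a Vector under SVP64
   maxvl        : SVSTATE.MAXVL
   i            : element index (ignored in the scalar case) */
unsigned implicit_rs_reg(unsigned rt, int rt_is_vector,
                         unsigned maxvl, unsigned i) {
    if (!rt_is_vector)
        return rt + 1;           /* scalar: RS is the register after RT */
    return rt + maxvl + i;       /* vector: RS starts right after the
                                    MAXVL registers reserved for RT */
}

/* Scalar RT = r8 puts RS in r9.  Vector RT = r8 with MAXVL = 4 puts
   the RS elements in r12, r13, r14, r15. */
```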

Special Registers Altered:

```
    None
```

-------

\newpage{}

# Twin Butterfly Floating-Point DCT Instruction(s)

**Add the following to Book I Section 4.6.6.3**

## Floating-Point Twin Multiply-Add DCT [Single]

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)
```

The two IEEE754-FP32 operations

```
    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT. The
result is then rounded before being multiplied by FRA to create an
intermediate result that is stored in FRT.

The add into FRS is treated exactly as `fadds`.  The creation of the
result FRT is **not** the same as that of `fmsubs`, but is instead as if
`fsubs` were performed first followed by `fmuls`.  FRS and FRT are
created by parallel, independent operations which occur at
the same time.
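
The dataflow described above, in particular that both results are
computed from the original FRT and that the difference is rounded
before the multiply, can be sketched in C as follows. The names
`fdmadds_model` and `fd_pair` are hypothetical, FPSCR status flags and
rounding modes are not modelled, and the sketch assumes the compiler
evaluates `float` arithmetic in FP32 (the `volatile` store is there to
force the intermediate to be rounded to FP32).

```c
typedef struct {
    float frt;   /* new value of FRT */
    float frs;   /* implicit result FRS */
} fd_pair;

fd_pair fdmadds_model(float frt, float fra, float frb) {
    fd_pair r;
    /* Both operations read the original FRT and FRB. */
    volatile float sub = frt - frb;   /* as fsubs: rounded to FP32 here */
    r.frs = frt + frb;                /* as fadds */
    r.frt = fra * sub;                /* as fmuls on the rounded difference */
    return r;
}

/* fdmadds_model(3.0f, 0.5f, 1.0f) gives frs = 4.0 and frt = 1.0 */
```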

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## Floating-Point Multiply-Add FFT [Single]

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
```

The two operations

```
    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <-   [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create
FRS, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is added to this intermediate result, and the
result is stored in FRT.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.
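
A minimal C sketch of the two results in the pseudo-code above is given
below. The names `ffmadds_model` and `ff_pair` are hypothetical, and
FPSCR flags and rounding modes are not modelled. To stay free of
library dependencies it forms the product in `double` (exact for two
FP32 operands) and rounds once more on conversion to `float`. This is an
approximation of a fused multiply-add and may differ from the true
`fmadds`/`fnmsubs` result in rare double-rounding cases.

```c
typedef struct {
    float frt;   /* new value of FRT */
    float frs;   /* implicit result FRS */
} ff_pair;

ff_pair ffmadds_model(float frt, float fra, float frb) {
    /* Both results use the same, original FRT, FRA and FRB.
       A 24-bit by 24-bit significand product fits in a double. */
    const double prod = (double)frt * (double)fra;
    ff_pair r;
    r.frt = (float)(prod + (double)frb);      /* as fmadds:   (FRT*FRA) + FRB  */
    r.frs = (float)(-(prod - (double)frb));   /* as fnmsubs: -((FRT*FRA) - FRB) */
    return r;
}

/* ffmadds_model(2.0f, 3.0f, 1.0f) gives frt = 7.0 and frs = -5.0 */
```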

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised.
Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## Floating-Point Twin Multiply-Add DCT

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* fdmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPADD64(FRT, FRB)
    sub <- FPSUB64(FRT, FRB)
    FRT <- FPMUL64(FRA, sub)
```

The two IEEE754-FP64 operations

```
    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT. The
result is then rounded before being multiplied by FRA to create an
intermediate result that is stored in FRT.

The add into FRS is treated exactly as `fadd`.  The creation of the
result FRT is **not** the same as that of `fmsub`, but is instead as if
`fsub` were performed first followed by `fmul`.  FRS and FRT are
created by parallel, independent operations which occur at
the same time.

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## Floating-Point Twin Multiply-Add FFT

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* ffmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
```

The two operations

```
    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <-   [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create
FRS, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is added to this intermediate result, and the
result is stored in FRT.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## [DRAFT] Floating-Point Add FFT/DCT [Single]

A-Form

* ffadds FRT,FRA,FRB (Rc=0)
* ffadds. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```

## [DRAFT] Floating-Point Add FFT/DCT [Double]

A-Form

* ffadd FRT,FRA,FRB (Rc=0)
* ffadd. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```

## [DRAFT] Floating-Point Subtract FFT/DCT [Single]

A-Form

* ffsubs FRT,FRA,FRB (Rc=0)
* ffsubs. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```

## [DRAFT] Floating-Point Subtract FFT/DCT [Double]

A-Form

* ffsub FRT,FRA,FRB (Rc=0)
* ffsub. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```