openpower/sv/twin_butterfly.mdwn

   1 # Introduction
   2
   3 <!-- hide -->
   4 * <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
   5 * <https://libre-soc.org/openpower/sv/biginteger/> for format and
   6   information about implicit RS/FRS
   7 * <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
   8 * [[openpower/isa/svfparith]]
   9 * [[openpower/isa/svfixedarith]]
  10 * [[openpower/sv/rfc/ls016]]
  11
  12 <!-- show -->
  13
  14 # Rationale for Twin Butterfly Integer DCT Instruction(s)
  15
  16 The number of general-purpose uses for DCT is huge. The number of
  17 instructions needed instead of these Twin-Butterfly instructions is also
   18 large (**eight**), and given that it is extremely common to explicitly
   19 loop-unroll them, sequences of hundreds to thousands of instructions are
   20 dismayingly common (for all ISAs).
  21
  22 The goal is to implement instructions that calculate the expression:
  23
  24 ```
  25     fdct_round_shift((a +/- b) * c)
  26 ```
  27
   28 for the single-coefficient butterfly instruction, and:
  29
  30 ```
  31     fdct_round_shift(a * c1  +/- b * c2)
  32 ```
  33
   34 for the double-coefficient butterfly instruction.
  35
   36 `fdct_round_shift(x)` is defined as `ROUND_POWER_OF_TWO(x, 14)`, where:
  37
  38 ```
  39     #define ROUND_POWER_OF_TWO(value, n) \
  40             (((value) + (1 << ((n)-1))) >> (n))
  41 ```
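
As a minimal sketch, the two expressions above can be written as plain C
helpers. The names `butterfly_1coeff`, `butterfly_2coeff` and `ROUND_BITS`
are hypothetical, chosen only for this illustration. The sketch assumes
inputs small enough that the 32-bit products do not overflow, and that
`>>` on a negative signed value is an arithmetic shift (which is
implementation-defined in C).

```c
#include <stdint.h>

/* Rounding right-shift, as defined above. */
#define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))

/* Hypothetical name: the shift amount used by fdct_round_shift. */
#define ROUND_BITS 14

/* Single-coefficient form: out[0] uses (a + b), out[1] uses (a - b). */
static void butterfly_1coeff(int32_t out[2], int32_t a, int32_t b, int32_t c)
{
    out[0] = ROUND_POWER_OF_TWO((a + b) * c, ROUND_BITS);
    out[1] = ROUND_POWER_OF_TWO((a - b) * c, ROUND_BITS);
}

/* Double-coefficient form: out[0] is the "+" case, out[1] the "-" case. */
static void butterfly_2coeff(int32_t out[2], int32_t a, int32_t b,
                             int32_t c1, int32_t c2)
{
    out[0] = ROUND_POWER_OF_TWO(a * c1 + b * c2, ROUND_BITS);
    out[1] = ROUND_POWER_OF_TWO(a * c1 - b * c2, ROUND_BITS);
}
```

For example, with `a = 100`, `b = 50` and `c = 11585` (approximately
cos(pi/4) scaled by 2^14), the single-coefficient form yields 106 and 35.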
  42
  43 These instructions are at the core of **ALL** FDCT calculations in many
   44 major video codecs, including (but not limited to) VP8/VP9 and AV1.
  45 Arm includes special instructions to optimize these operations, although
  46 they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.
  47
  48 The suggestion is to have a single instruction to calculate both values
  49 `((a + b) * c) >> N`, and `((a - b) * c) >> N`.  The instruction will
  50 run in accumulate mode, so in order to calculate the 2-coeff version
   51 one would just have to call the same instruction with a and b in a
   52 different order and with a different constant c.
  53
  54 Example taken from libvpx
  55 <https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:
  56
  57 ```
  58     #include <stdint.h>
  59     #define ROUND_POWER_OF_TWO(value, n) \
  60             (((value) + (1 << ((n)-1))) >> (n))
  61     void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
  62         t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
  63         t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
  64     }
  65 ```
  66
   67 Eight instructions are required, replaced by just the one (`maddsubrs`):
  68
  69 ```
  70     add 9,5,4
  71     subf 5,5,4
  72     mullw 9,9,6
  73     mullw 5,5,6
  74     addi 9,9,8192
  75     addi 5,5,8192
  76     srawi 9,9,14
  77     srawi 5,5,14
  78 ```
  79
  80 -------
  81
  82 \newpage{}
  83
  84 ## Integer Butterfly Multiply Add/Sub FFT/DCT
  85
  86 **Add the following to Book I Section 3.3.9.1**
  87
  88 A-Form
  89
  90 ```
  91     |0     |6     |11      |16     |21      |26    |31 |
  92     | PO   |  RT  |   RA   |   RB  |   SH   |   XO |Rc |
  93
  94 ```
  95
  96 * maddsubrs  RT,RA,SH,RB
  97
  98 Pseudo-code:
  99
 100 ```
 101     n <- SH
 102     sum <- (RT) + (RA)
 103     diff <- (RT) - (RA)
 104     prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1] + 1
 105     prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1] + 1
 106     res1 <- ROTL64(prod1, XLEN-n)
 107     res2 <- ROTL64(prod2, XLEN-n)
 108     m <- MASK(n, (XLEN-1))
 109     signbit1 <- res1[0]
 110     signbit2 <- res2[0]
 111     smask1 <- ([signbit1]*XLEN) & ¬m
 112     smask2 <- ([signbit2]*XLEN) & ¬m
 113     s64_1 <- [0]*(XLEN-1) || signbit1
 114     s64_2 <- [0]*(XLEN-1) || signbit2
 115     RT <- (res1 & m | smask1) + s64_1
 116     RS <- (res2 & m | smask2) + s64_2
 117 ```
 118
  119 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.
 120
 121 Similar to `RTp`, this instruction produces an implicit result, `RS`,
 122 which under Scalar circumstances is defined as `RT+1`.  For SVP64 if
 123 `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
 124 the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
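
The rule for locating the implicit result can be sketched as a small C
helper. The function name `implicit_rs` and its parameters are
hypothetical, introduced only for illustration. The sketch covers just
the two cases described above and deliberately ignores anything else
(for example register-number limits). The same rule is stated for `FRS`
in the floating-point instructions.

```c
#include <stdbool.h>

/* Hypothetical helper: register number of the implicit second result. */
static unsigned implicit_rs(unsigned rt, bool rt_is_vector, unsigned maxvl)
{
    /* Scalar: RS is the register immediately after RT.
       Vector: RS begins right after the MAXVL-element vector at RT. */
    return rt_is_vector ? rt + maxvl : rt + 1;
}
```

With `RT=4`, a scalar operation would place `RS` in register 5, whereas
a Vector operation with `MAXVL=8` would start `RS` at register 12.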
 125
 126 Special Registers Altered:
 127
 128 ```
 129     None
 130 ```
 131
 132 -------
 133
 134 \newpage{}
 135
 136 # Twin Butterfly Floating-Point DCT Instruction(s)
 137
 138 ## Floating-Point Twin Multiply-Add DCT [Single]
 139
 140 **Add the following to Book I Section 4.6.6.3**
 141
 142 X-Form
 143
 144 ```
 145     |0     |6     |11      |16     |21      |31 |
 146     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 147 ```
 148
 149 * fdmadds FRT,FRA,FRB (Rc=0)
 150
 151 Pseudo-code:
 152
 153 ```
 154     FRS <- FPADD32(FRT, FRB)
 155     sub <- FPSUB32(FRT, FRB)
 156     FRT <- FPMUL32(FRA, sub)
 157 ```
 158
 159 The two IEEE754-FP32 operations
 160
 161 ```
 162     FRS <- [(FRT) + (FRB)]
 163     FRT <- [(FRT) - (FRB)] * (FRA)
 164 ```
 165
 166 are simultaneously performed.
 167
  168 The floating-point operand in register FRT is added to the floating-point
  169 operand in register FRB and the result stored in FRS.
  170
  171 Using the exact same operand input register values from FRT and FRB
  172 that were used to create FRS, the floating-point operand in register
  173 FRB is subtracted from the floating-point operand in register FRT and
  174 the result then multiplied by FRA to create an intermediate result that
  175 is stored in FRT.
 176
 177 The add into FRS is treated exactly as `fadds`.  The creation of the
 178 result FRT is **not** the same as that of `fmsubs`.
  179 The creation of FRS and the creation of FRT are treated as parallel,
  180 independent operations which occur at the same time.
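
A minimal C sketch of the data flow described above follows. The names
`fdmadds_result` and `fdmadds_model` are hypothetical. The sketch models
only the arithmetic on plain `float` values: it does not model FPSCR
status bits or rounding modes, and assumes the compiler evaluates
`float` expressions in single precision.

```c
/* Hypothetical result pair: both outputs of the twin instruction. */
struct fdmadds_result { float frt; float frs; };

static struct fdmadds_result fdmadds_model(float frt, float fra, float frb)
{
    struct fdmadds_result r;
    float sub;

    /* Both results are computed from the original input values. */
    r.frs = frt + frb;   /* FRS <- FPADD32(FRT, FRB) */
    sub   = frt - frb;   /* sub <- FPSUB32(FRT, FRB) */
    r.frt = fra * sub;   /* FRT <- FPMUL32(FRA, sub) */
    return r;
}
```

Because the inputs are passed by value, overwriting `FRT` cannot affect
the computation of `FRS`, which mirrors the "parallel independent
operations" wording.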
 181
  182 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.
 183
 184 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
 185 which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
 186 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
 187 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 188
 189 Special Registers Altered:
 190
 191 ```
 192     FPRF FR FI
 193     FX OX UX XX
 194     VXSNAN VXISI VXIMZ
 195 ```
 196
  197 ## Floating-Point Twin Multiply-Add FFT [Single]
 198
 199 **Add the following to Book I Section 4.6.6.3**
 200
 201 X-Form
 202
 203 ```
 204     |0     |6     |11      |16     |21      |31 |
 205     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 206 ```
 207
 208 * ffmadds FRT,FRA,FRB (Rc=0)
 209
 210 Pseudo-code:
 211
 212 ```
 213     FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
 214     FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
 215 ```
 216
 217 The two operations
 218
 219 ```
 220     FRS <- -([(FRT) * (FRA)] - (FRB))
 221     FRT <-   [(FRT) * (FRA)] + (FRB)
 222 ```
 223
 224 are performed.
 225
  226 The floating-point operand in register FRT is multiplied by the
  227 floating-point operand in register FRA. The floating-point operand in
  228 register FRB is subtracted from this intermediate result, and the
  229 negated result is stored in FRS.
  230
  231 Using the exact same values of FRT, FRA and FRB as used to create
  232 FRS, the floating-point operand in register FRT is multiplied by the
  233 floating-point operand in register FRA. The floating-point operand
  234 in register FRB is added to this intermediate result, and the
  235 result is stored in FRT.
 236
 237 FRT is created as if a `fmadds` operation had been performed. FRS is
 238 created as if a `fnmsubs` operation had simultaneously been performed
 239 with the exact same register operands, in parallel, independently,
 240 at exactly the same time.
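
The pairing of an `fmadds`-style and an `fnmsubs`-style result can be
sketched in C as below. The names `ffmadds_result` and `ffmadds_model`
are hypothetical. To avoid depending on a fused multiply-add library
call, the sketch forms the product in `double`, where the product of two
`float` values is exact. The final conversion back to `float` may
therefore differ in rare cases from a true single-rounding fused
operation, and FPSCR effects are not modelled.

```c
/* Hypothetical result pair: both outputs of the twin instruction. */
struct ffmadds_result { float frt; float frs; };

static struct ffmadds_result ffmadds_model(float frt, float fra, float frb)
{
    struct ffmadds_result r;

    /* float * float is exact in double (24 + 24 significand bits <= 53). */
    double prod = (double)frt * (double)fra;

    r.frs = (float)-(prod - (double)frb); /* as fnmsubs: -((FRT*FRA) - FRB) */
    r.frt = (float)(prod + (double)frb);  /* as fmadds:    (FRT*FRA) + FRB  */
    return r;
}
```

Both results share the same product, so the two outputs are the sum and
the difference `FRB + FRT*FRA` and `FRB - FRT*FRA`.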
 241
 242 FRT is a Read-Modify-Write operation.
 243
  244 Note that if Rc=1 an Illegal Instruction is raised.
  245 Rc=1 is `RESERVED`.
 246
 247 Similar to `FRTp`, this instruction produces an implicit result,
 248 `FRS`, which under Scalar circumstances is defined as `FRT+1`.
 249 For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
 250 Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
 251 (Max Vector Length).
 252
 253
 254 Special Registers Altered:
 255
 256 ```
 257     FPRF FR FI
 258     FX OX UX XX
 259     VXSNAN VXISI VXIMZ
 260 ```
 261 ## Floating-Point Twin Multiply-Add DCT
 262
 263 **Add the following to Book I Section 4.6.6.3**
 264
 265 X-Form
 266
 267 ```
 268     |0     |6     |11      |16     |21      |31 |
 269     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 270 ```
 271
 272 * fdmadd FRT,FRA,FRB (Rc=0)
 273
 274 Pseudo-code:
 275
 276 ```
 277     FRS <- FPADD64(FRT, FRB)
 278     sub <- FPSUB64(FRT, FRB)
 279     FRT <- FPMUL64(FRA, sub)
 280 ```
 281
 282 The two IEEE754-FP64 operations
 283
 284 ```
 285     FRS <- [(FRT) + (FRB)]
 286     FRT <- [(FRT) - (FRB)] * (FRA)
 287 ```
 288
 289 are simultaneously performed.
 290
  291 The floating-point operand in register FRT is added to the floating-point
  292 operand in register FRB and the result stored in FRS.
  293
  294 Using the exact same operand input register values from FRT and FRB
  295 that were used to create FRS, the floating-point operand in register
  296 FRB is subtracted from the floating-point operand in register FRT and
  297 the result then multiplied by FRA to create an intermediate result that
  298 is stored in FRT.
 299
 300 The add into FRS is treated exactly as `fadd`.  The creation of the
 301 result FRT is **not** the same as that of `fmsub`.
  302 The creation of FRS and the creation of FRT are treated as parallel,
  303 independent operations which occur at the same time.
 304
  305 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.
 306
 307 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
 308 which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
 309 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
 310 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 311
 312 Special Registers Altered:
 313
 314 ```
 315     FPRF FR FI
 316     FX OX UX XX
 317     VXSNAN VXISI VXIMZ
 318 ```
 319
 320 ## Floating-Point Twin Multiply-Add FFT
 321
 322 **Add the following to Book I Section 4.6.6.3**
 323
 324 X-Form
 325
 326 ```
 327     |0     |6     |11      |16     |21      |31 |
 328     | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
 329 ```
 330
 331 * ffmadd FRT,FRA,FRB (Rc=0)
 332
 333 Pseudo-code:
 334
 335 ```
 336     FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
 337     FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
 338 ```
 339
 340 The two operations
 341
 342 ```
 343     FRS <- -([(FRT) * (FRA)] - (FRB))
 344     FRT <-   [(FRT) * (FRA)] + (FRB)
 345 ```
 346
 347 are performed.
 348
  349 The floating-point operand in register FRT is multiplied by the
  350 floating-point operand in register FRA. The floating-point operand in
  351 register FRB is subtracted from this intermediate result, and the
  352 negated result is stored in FRS.
  353
  354 Using the exact same values of FRT, FRA and FRB as used to create
  355 FRS, the floating-point operand in register FRT is multiplied by the
  356 floating-point operand in register FRA. The floating-point operand
  357 in register FRB is added to this intermediate result, and the
  358 result is stored in FRT.
 359
 360 FRT is created as if a `fmadd` operation had been performed. FRS is
 361 created as if a `fnmsub` operation had simultaneously been performed
 362 with the exact same register operands, in parallel, independently,
 363 at exactly the same time.
 364
 365 FRT is a Read-Modify-Write operation.
 366
  367 Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.
 368
 369 Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
 370 which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
 371 `FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
 372 where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
 373
 374 Special Registers Altered:
 375
 376 ```
 377     FPRF FR FI
 378     FX OX UX XX
 379     VXSNAN VXISI VXIMZ
 380 ```
 381
 382
 383 ## [DRAFT] Floating-Point Add FFT/DCT [Single]
 384
 385 A-Form
 386
 387 * ffadds FRT,FRA,FRB (Rc=0)
 388 * ffadds. FRT,FRA,FRB (Rc=1)
 389
 390 Pseudo-code:
 391
 392 ```
 393     FRT <- FPADD32(FRA, FRB)
 394     FRS <- FPSUB32(FRB, FRA)
 395 ```
 396
 397 Special Registers Altered:
 398
 399 ```
 400     FPRF FR FI
 401     FX OX UX XX
 402     VXSNAN VXISI
 403     CR1          (if Rc=1)
 404 ```
 405
 406 ## [DRAFT] Floating-Point Add FFT/DCT [Double]
 407
 408 A-Form
 409
 410 * ffadd FRT,FRA,FRB (Rc=0)
 411 * ffadd. FRT,FRA,FRB (Rc=1)
 412
 413 Pseudo-code:
 414
 415 ```
 416     FRT <- FPADD64(FRA, FRB)
 417     FRS <- FPSUB64(FRB, FRA)
 418 ```
 419
 420 Special Registers Altered:
 421
 422 ```
 423     FPRF FR FI
 424     FX OX UX XX
 425     VXSNAN VXISI
 426     CR1          (if Rc=1)
 427 ```
 428
 429 ## [DRAFT] Floating-Point Subtract FFT/DCT [Single]
 430
 431 A-Form
 432
 433 * ffsubs FRT,FRA,FRB (Rc=0)
 434 * ffsubs. FRT,FRA,FRB (Rc=1)
 435
 436 Pseudo-code:
 437
 438 ```
 439     FRT <- FPSUB32(FRB, FRA)
 440     FRS <- FPADD32(FRA, FRB)
 441 ```
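
Comparing this pseudo-code with that of `ffadds`, the two draft
instructions compute the same pair of values, `FRA + FRB` and
`FRB - FRA`, with the destinations swapped. A minimal C sketch of both
follows. The names `ff_pair`, `ffadds_model` and `ffsubs_model` are
hypothetical, and FPSCR and CR1 effects are not modelled.

```c
/* Hypothetical result pair: explicit FRT and implicit FRS. */
struct ff_pair { float frt; float frs; };

/* ffadds: the sum goes to FRT, the difference to FRS. */
static struct ff_pair ffadds_model(float fra, float frb)
{
    struct ff_pair r;
    r.frt = fra + frb;   /* FRT <- FPADD32(FRA, FRB) */
    r.frs = frb - fra;   /* FRS <- FPSUB32(FRB, FRA) */
    return r;
}

/* ffsubs: the difference goes to FRT, the sum to FRS. */
static struct ff_pair ffsubs_model(float fra, float frb)
{
    struct ff_pair r;
    r.frt = frb - fra;   /* FRT <- FPSUB32(FRB, FRA) */
    r.frs = fra + frb;   /* FRS <- FPADD32(FRA, FRB) */
    return r;
}
```

Note that the subtraction is `FRB - FRA` in both cases, not `FRA - FRB`.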
 442
 443 Special Registers Altered:
 444
 445 ```
 446     FPRF FR FI
 447     FX OX UX XX
 448     VXSNAN VXISI
 449     CR1          (if Rc=1)
 450 ```
 451
 452 ## [DRAFT] Floating-Point Subtract FFT/DCT [Double]
 453
 454 A-Form
 455
 456 * ffsub FRT,FRA,FRB (Rc=0)
 457 * ffsub. FRT,FRA,FRB (Rc=1)
 458
 459 Pseudo-code:
 460
 461 ```
 462     FRT <- FPSUB64(FRB, FRA)
 463     FRS <- FPADD64(FRA, FRB)
 464 ```
 465
 466 Special Registers Altered:
 467
 468 ```
 469     FPRF FR FI
 470     FX OX UX XX
 471     VXSNAN VXISI
 472     CR1          (if Rc=1)
 473 ```