openpower/sv/twin_butterfly.mdwn

# Introduction

<!-- hide -->
* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

<!-- show -->

# Rationale for Twin Butterfly Integer DCT Instruction(s)

The number of general-purpose uses for DCT is huge. Without these
Twin-Butterfly instructions, each butterfly needs **eight** instructions,
and because it is extremely common to unroll the loops explicitly,
sequences of hundreds to thousands of instructions are dismayingly
common (for all ISAs).

The goal is to implement instructions that calculate the expression:

```
    fdct_round_shift((a +/- b) * c)
```

For the single-coefficient butterfly instruction, and:

```
    fdct_round_shift(a * c1  +/- b * c2)
```

For the double-coefficient butterfly instruction.
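
As a hedged illustration of the double-coefficient expression, the sketch
below evaluates both the `+` and the `-` form in C. The helper name
`two_coeff_butterfly` and the 64-bit intermediates are assumptions made
for this example and are not part of the proposal; the rounding shift by
14 follows the `fdct_round_shift` used by libvpx.

```c
#include <stdint.h>

/* Round-to-nearest right shift by 14 bits (fdct_round_shift in libvpx).
 * Assumes an arithmetic right shift for negative values, as provided by
 * mainstream compilers. */
static int64_t fdct_round_shift(int64_t x) {
    return (x + (1 << 13)) >> 14;
}

/* Hypothetical helper: double-coefficient butterfly.
 * out[0] = round((a*c1 + b*c2) / 2^14)
 * out[1] = round((a*c1 - b*c2) / 2^14) */
static void two_coeff_butterfly(int16_t out[2], int16_t a, int16_t b,
                                int16_t c1, int16_t c2) {
    int64_t p1 = (int64_t)a * c1;   /* widened so the sum cannot overflow */
    int64_t p2 = (int64_t)b * c2;
    out[0] = (int16_t)fdct_round_shift(p1 + p2);
    out[1] = (int16_t)fdct_round_shift(p1 - p2);
}
```

With `a = 100`, `b = 50`, `c1 = 15137` and `c2 = 6270` (approximately
`2^14 * cos(pi/8)` and `2^14 * cos(3*pi/8)`), the two outputs are 112
and 73.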

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

```
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
```
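
To make the effect of the rounding term concrete, the following sketch
compares `ROUND_POWER_OF_TWO(x, 14)` with a plain truncating shift. The
function names and the sample value `100 * 11585` (where 11585 is
approximately `2^14 * cos(pi/4)`) are illustrative assumptions.

```c
#include <stdint.h>

#define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))

/* Round to nearest: add half of 2^14 before shifting. */
static int32_t fdct_round_shift(int32_t x) {
    return ROUND_POWER_OF_TWO(x, 14);
}

/* Plain shift, shown for comparison only: always rounds down. */
static int32_t truncating_shift(int32_t x) {
    return x >> 14;
}
```

For `x = 100 * 11585 = 1158500` the exact quotient is about 70.71:
`fdct_round_shift(x)` returns 71, whereas `truncating_shift(x)` returns 70.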

These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including (but not limited to) VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations, although
they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.

The suggestion is to have a single instruction that calculates both
values, `((a + b) * c) >> N` and `((a - b) * c) >> N`.  The instruction
will run in accumulate mode, so the 2-coefficient version can be
calculated by calling the same instruction again with `a` and `b` in a
different order and with a different constant `c`.

Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:

```
    #include <stdint.h>
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
    void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
        t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
        t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
    }
```

Eight instructions are required, which are replaced by just one
(`maddsubrs`):

```
    add 9,5,4
    subf 5,5,4
    mullw 9,9,6
    mullw 5,5,6
    addi 9,9,8192
    addi 5,5,8192
    srawi 9,9,14
    srawi 5,5,14
```

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

A-Form

```
    |0     |6     |11      |16     |21      |26    |31 |
    | PO   |  RT  |   RA   |   RB  |   SH   |   XO |Rc |
```

* maddsubrs  RT,RA,SH,RB

Pseudo-code:

```
    n <- SH
    sum <- (RT) + (RA)
    diff <- (RT) - (RA)
    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1] + 1
    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1] + 1
    res1 <- ROTL64(prod1, XLEN-n)
    res2 <- ROTL64(prod2, XLEN-n)
    m <- MASK(n, (XLEN-1))
    signbit1 <- res1[0]
    signbit2 <- res2[0]
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    s64_1 <- [0]*(XLEN-1) || signbit1
    s64_2 <- [0]*(XLEN-1) || signbit2
    RT <- (res1 & m | smask1) + s64_1
    RS <- (res2 & m | smask2) + s64_2
```

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`.  For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
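
The following C sketch models the arithmetic that the Rationale targets,
using the operand roles of the pseudo-code: `RT` and `RA` are the inputs,
`RB` is the coefficient, `SH` is the shift, and `RT` together with the
implicit `RS = RT+1` receive the results. The function name
`maddsubrs_model`, the `gpr[]` array and the restriction to a nonzero
shift are assumptions of this example. It is not a bit-exact
transcription of the pseudo-code above, which remains the reference for
rounding details.

```c
#include <stdint.h>

/* Hypothetical scalar model of the intended maddsubrs arithmetic.
 * gpr[] stands in for the integer register file; rt, ra and rb are
 * register numbers, sh is the shift amount (assumed 1..63).
 * Assumptions: round-to-nearest as in ROUND_POWER_OF_TWO, arithmetic
 * right shift, and inputs small enough that the products do not
 * overflow 64 bits. */
static void maddsubrs_model(int64_t gpr[], unsigned rt, unsigned ra,
                            unsigned rb, unsigned sh) {
    unsigned rs = rt + 1;                 /* implicit result, scalar case */
    int64_t a = gpr[rt];                  /* read every input before     */
    int64_t b = gpr[ra];                  /* any result is written back  */
    int64_t c = gpr[rb];
    int64_t round = (int64_t)1 << (sh - 1);
    gpr[rt] = ((a + b) * c + round) >> sh;
    gpr[rs] = ((a - b) * c + round) >> sh;
}
```

With `gpr[3] = 100`, `gpr[5] = 50`, `gpr[6] = 11585` and a call of
`maddsubrs_model(gpr, 3, 5, 6, 14)`, register 3 ends up holding 106 and
register 4 (the implicit `RS`) holds 35.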

Special Registers Altered:

```
    None
```

# Twin Butterfly Floating-Point DCT Instruction(s)

## Floating-Point Twin Multiply-Add DCT [Single]

**Add the following to Book I Section 4.6.6.3**

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)
```

The two IEEE754-FP32 operations

```
    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then multiplied by FRA to create an intermediate result that
is stored in FRT.

The add into FRS is treated exactly as `fadds`.  The creation of the
result FRT is **not** the same as that of `fmsubs`: the subtraction is
performed first and its result is then multiplied by FRA.
FRS and FRT are created by parallel independent operations
which occur at the same time.
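
A minimal C sketch of the data flow described above, assuming that C
`float` is IEEE754 binary32. The function name `fdmadds_model` and the
pointer-based interface are hypothetical; FPSCR status flags and NaN
handling are deliberately left out.

```c
/* Hypothetical scalar model of fdmadds.
 * Both results are derived from the original FRT and FRB values. */
static void fdmadds_model(float *frt, float *frs, float fra, float frb) {
    float t = *frt;           /* snapshot FRT before it is overwritten  */
    float sum = t + frb;      /* FRS <- FPADD32(FRT, FRB)               */
    float diff = t - frb;     /* sub <- FPSUB32(FRT, FRB)               */
    *frs = sum;
    *frt = fra * diff;        /* FRT <- FPMUL32(FRA, sub), a separate   */
                              /* multiply after the subtract            */
}
```

Starting from `FRT = 3.0`, `FRA = 0.5` and `FRB = 1.0`, the model leaves
`FRS = 4.0` and `FRT = 1.0`.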

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## Floating-Point Twin Multiply-Add FFT [Single]

**Add the following to Book I Section 4.6.6.3**

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
```

The two operations

```
    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <-   [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
FRT, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, the
result is negated, and it is stored in FRS.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.
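
The pair of results can be sketched in C as follows. The function name
`ffmadds_model` is hypothetical, and the use of a `double` intermediate
is an assumption of this example: the product of two binary32 values is
exact in binary64, but the final conversion back to `float` introduces a
second rounding, so in rare cases the result may differ from a true
single-rounded `fmadds`/`fnmsubs`. FPSCR flags are not modelled.

```c
/* Hypothetical scalar model of ffmadds.
 * Both results are derived from the original FRT, FRA and FRB values. */
static void ffmadds_model(float *frt, float *frs, float fra, float frb) {
    /* 24-bit x 24-bit significands fit in the 53 bits of a double,
     * so this product carries no rounding error. */
    double prod = (double)*frt * (double)fra;
    *frs = (float)(-(prod - (double)frb));  /* FRS <- -((FRT)*(FRA) - (FRB)) */
    *frt = (float)(prod + (double)frb);     /* FRT <-   (FRT)*(FRA) + (FRB)  */
}
```

Starting from `FRT = 2.0`, `FRA = 3.0` and `FRB = 1.0`, the model leaves
`FRT = 7.0` and `FRS = -5.0`.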

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## Floating-Point Twin Multiply-Add DCT [Double]

**Add the following to Book I Section 4.6.6.3**

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* fdmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPADD64(FRT, FRB)
    sub <- FPSUB64(FRT, FRB)
    FRT <- FPMUL64(FRA, sub)
```

The two IEEE754-FP64 operations

```
    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then multiplied by FRA to create an intermediate result that
is stored in FRT.

The add into FRS is treated exactly as `fadd`.  The creation of the
result FRT is **not** the same as that of `fmsub`: the subtraction is
performed first and its result is then multiplied by FRA.
FRS and FRT are created by parallel independent operations
which occur at the same time.

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
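
To illustrate where the implicit `FRS` Vector lands, the sketch below
runs `fdmadd` as an element loop over a flat register array. The function
name `fdmadd_vector_model`, the `fpr[]` array and the parameters `vl` and
`maxvl` are hypothetical; the example also assumes that `FRA` and `FRB`
are Vectors and that there is no predication or element-width override.

```c
/* Hypothetical element loop for a Vectorised fdmadd.
 * fpr[] stands in for the FP register file; frt, fra and frb are the
 * first register numbers of each Vector; maxvl plays the role of
 * SVSTATE.MAXVL and vl is the number of elements processed. */
static void fdmadd_vector_model(double fpr[], unsigned frt, unsigned fra,
                                unsigned frb, unsigned vl, unsigned maxvl) {
    unsigned frs = frt + maxvl;       /* FRS begins right after Vector FRT */
    for (unsigned i = 0; i < vl; i++) {
        double t = fpr[frt + i];      /* read inputs before writing back   */
        double a = fpr[fra + i];
        double b = fpr[frb + i];
        double diff = t - b;          /* sub <- FPSUB64(FRT, FRB)          */
        fpr[frs + i] = t + b;         /* FRS <- FPADD64(FRT, FRB)          */
        fpr[frt + i] = a * diff;      /* FRT <- FPMUL64(FRA, sub)          */
    }
}
```

With `frt = 0` and `maxvl = 4`, the sums are written to registers 4
onwards while the products overwrite registers 0 onwards.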

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## Floating-Point Twin Multiply-Add FFT [Double]

**Add the following to Book I Section 4.6.6.3**

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* ffmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
```

The two operations

```
    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <-   [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is added to this intermediate result, and the result is
stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
FRT, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is subtracted from this intermediate result, the
result is negated, and it is stored in FRS.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## [DRAFT] Floating-Point Add FFT/DCT [Single]

A-Form

* ffadds FRT,FRA,FRB (Rc=0)
* ffadds. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```
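
The draft gives only pseudo-code for this instruction, so the following C
sketch simply transcribes it, assuming that C `float` is IEEE754
binary32. The function name `ffadds_model` and the pointer-based outputs
are hypothetical, and neither FPSCR flags nor CR1 are modelled.

```c
/* Hypothetical scalar model of the draft ffadds.
 * Unlike the twin multiply-add forms, FRT is not an input here. */
static void ffadds_model(float *frt, float *frs, float fra, float frb) {
    *frt = fra + frb;   /* FRT <- FPADD32(FRA, FRB) */
    *frs = frb - fra;   /* FRS <- FPSUB32(FRB, FRA) */
}
```

With `FRA = 1.5` and `FRB = 4.0` the model produces `FRT = 5.5` and
`FRS = 2.5`. Going by its pseudo-code, `ffsubs` computes the same two
values but swaps which register receives the sum and which the difference.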

## [DRAFT] Floating-Point Add FFT/DCT [Double]

A-Form

* ffadd FRT,FRA,FRB (Rc=0)
* ffadd. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```

## [DRAFT] Floating-Point Subtract FFT/DCT [Single]

A-Form

* ffsubs FRT,FRA,FRB (Rc=0)
* ffsubs. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```

## [DRAFT] Floating-Point Subtract FFT/DCT [Double]

A-Form

* ffsub FRT,FRA,FRB (Rc=0)
* ffsub. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```