# Introduction

<!-- hide -->
* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

<!-- show -->

# Rationale for Twin Butterfly Integer DCT Instruction(s)

DCT has a huge number of general-purpose uses. Without these
Twin-Butterfly instructions each butterfly takes **eight** instructions,
and because it is extremely common to explicitly loop-unroll them,
sequences of hundreds to thousands of instructions are dismayingly
common (for all ISAs).

The goal is to implement instructions that calculate the expression:

```
    fdct_round_shift((a +/- b) * c)
```

for the single-coefficient butterfly instruction, and:

```
    fdct_round_shift(a * c1  +/- b * c2)
```

for the double-coefficient butterfly instruction.

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

```
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
```
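
The effect of the rounding constant can be checked with a few values.
The following is a minimal sketch that wraps the macro above in a
function. The 32-bit operand type is an assumption made for
illustration, and the sketch assumes that `>>` on a negative value is
an arithmetic shift, which is implementation-defined in C.

```c
    #include <stdint.h>

    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))

    /* Rounding right shift by 14 bits: half of 2^14 is added before
     * the shift, so exact halves round upwards. */
    int32_t fdct_round_shift(int32_t x)
    {
        return ROUND_POWER_OF_TWO(x, 14);
    }
```

With this definition `fdct_round_shift(8191)` is 0, `fdct_round_shift(8192)`
is 1, and `fdct_round_shift(-8193)` is -1.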

These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including (but not limited to) VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations, exposed
through the NEON intrinsics `vqrdmulhq_s16`/`vqrdmulhq_s32`, although
they are limited in precision.

The suggestion is to have a single instruction that calculates both
values, `((a + b) * c) >> N` and `((a - b) * c) >> N`.  The instruction
will run in accumulate mode, so to calculate the 2-coeff version one
would just call the same instruction with `a` and `b` in a different
order and with a different constant `c`.

Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:

```
    #include <stdint.h>
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
    void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
        t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
        t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
    }
```
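
The libvpx example covers the single-coefficient form. For comparison,
the two-coefficient expression `fdct_round_shift(a * c1 +/- b * c2)`
can be written as two multiply-accumulate steps followed by one
rounding shift. This is only a sketch of the arithmetic that has to be
reproduced: the function names and the 64-bit accumulator are
assumptions, and how the instruction's accumulate mode maps onto these
steps is not specified here.

```c
    #include <stdint.h>

    /* Rounding arithmetic shift right by 14 bits. */
    static int32_t fdct_round_shift(int64_t x)
    {
        return (int32_t)((x + (1 << 13)) >> 14);
    }

    /* Hypothetical helper: fdct_round_shift(a * c1 + b * c2). */
    int32_t butterfly2_add(int16_t a, int16_t b, int16_t c1, int16_t c2)
    {
        int64_t acc = (int64_t)a * c1;   /* first product */
        acc += (int64_t)b * c2;          /* accumulate the second product */
        return fdct_round_shift(acc);
    }

    /* Hypothetical helper: fdct_round_shift(a * c1 - b * c2). */
    int32_t butterfly2_sub(int16_t a, int16_t b, int16_t c1, int16_t c2)
    {
        int64_t acc = (int64_t)a * c1;   /* first product */
        acc -= (int64_t)b * c2;          /* subtract the second product */
        return fdct_round_shift(acc);
    }
```

With the coefficients 15137 and 6270 (cos(pi/8) and cos(3*pi/8) scaled
by 2^14), `butterfly2_add(100, 50, 15137, 6270)` returns 112 and
`butterfly2_sub(100, 50, 15137, 6270)` returns 73.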
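
Eight instructions are required, all replaced by a single `maddsubrs`:

```
    # r4 = x0, r5 = x1, r6 = cospi_16_64
    add 9,5,4        # r9 = x1 + x0
    subf 5,5,4       # r5 = x0 - x1
    mullw 9,9,6      # r9 = (x0 + x1) * c
    mullw 5,5,6      # r5 = (x0 - x1) * c
    addi 9,9,8192    # add the rounding constant 1 << 13
    addi 5,5,8192
    srawi 9,9,14     # arithmetic shift right by 14
    srawi 5,5,14
```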

-------

\newpage{}

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

A-Form

```
    |0     |6     |11      |16     |21      |26    |31 |
    | PO   |  RT  |   RA   |   RB  |   SH   |   XO |Rc |
```
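
The field layout above uses MSB0 bit numbering, so bit 0 is the most
significant bit of the 32-bit instruction word. A minimal sketch of
packing the fields under that numbering follows. The function name is
illustrative, and the `PO` and `XO` values are left as parameters
because they are not given in this section.

```c
    #include <stdint.h>

    /* Pack A-Form fields into a 32-bit word.  A field whose last bit is
     * MSB0 bit k is shifted left by (31 - k). */
    uint32_t encode_a_form(uint32_t po, uint32_t rt, uint32_t ra,
                           uint32_t rb, uint32_t sh, uint32_t xo,
                           uint32_t rc)
    {
        return ((po & 0x3f) << 26)   /* bits  0..5  */
             | ((rt & 0x1f) << 21)   /* bits  6..10 */
             | ((ra & 0x1f) << 16)   /* bits 11..15 */
             | ((rb & 0x1f) << 11)   /* bits 16..20 */
             | ((sh & 0x1f) << 6)    /* bits 21..25 */
             | ((xo & 0x1f) << 1)    /* bits 26..30 */
             | (rc & 1);             /* bit  31     */
    }
```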

* maddsubrs RT,RA,SH,RB

Pseudo-code:

```
    n <- SH
    sum <- (RT) + (RA)
    diff <- (RT) - (RA)
    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1] + 1
    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1] + 1
    res1 <- ROTL64(prod1, XLEN-n)
    res2 <- ROTL64(prod2, XLEN-n)
    m <- MASK(n, (XLEN-1))
    signbit1 <- res1[0]
    signbit2 <- res2[0]
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    s64_1 <- [0]*(XLEN-1) || signbit1
    s64_2 <- [0]*(XLEN-1) || signbit2
    RT <- (res1 & m | smask1) + s64_1
    RS <- (res2 & m | smask2) + s64_2
```
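
For readers who want something to test against, the following C sketch
models the target expression from the rationale,
`ROUND_POWER_OF_TWO((RT +/- RA) * RB, SH)`. It does not transliterate
the rotate-and-mask steps of the pseudo-code above, so it should not be
read as a bit-exact reference for the draft. The type and function
names are hypothetical, and the operand widths are narrowed so that the
C arithmetic cannot overflow.

```c
    #include <assert.h>
    #include <stdint.h>

    /* Pair of results: rt is the explicit result, rs the implicit one. */
    typedef struct { int64_t rt; int64_t rs; } twin_result;

    /* Rounding arithmetic shift right by n bits (no rounding when n is 0).
     * Relies on >> being an arithmetic shift for negative values. */
    static int64_t round_shift(int64_t value, unsigned n)
    {
        if (n == 0)
            return value;
        return (value + ((int64_t)1 << (n - 1))) >> n;
    }

    /* Hypothetical model of the intended result of maddsubrs RT,RA,SH,RB.
     * Inputs are 32-bit (RT, RA) and 16-bit (RB) so that the products
     * fit in 64 bits; the instruction itself works on XLEN-bit registers. */
    twin_result maddsubrs_model(int32_t rt, int32_t ra, int16_t rb,
                                unsigned sh)
    {
        assert(sh < 32);                   /* SH is a 5-bit field */
        int64_t sum  = (int64_t)rt + ra;
        int64_t diff = (int64_t)rt - ra;
        twin_result out = { round_shift(sum * rb, sh),
                            round_shift(diff * rb, sh) };
        return out;
    }
```

For example, `maddsubrs_model(100, 50, 11585, 14)` yields `rt` = 106 and
`rs` = 35, the same values that `twin_int` produces for `x0` = 100,
`x1` = 50 and a coefficient of 11585.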

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`.  For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    None
```
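
The implicit `RS` numbering described above can be sketched as a small
helper. It assumes plain register numbers and a caller that already
knows whether `RT` is a Vector. The function name and parameters are
illustrative and are not part of the specification.

```c
    /* Register number of the implicit RS result for a given RT.
     * Scalar: RS = RT+1.  Vector: RS starts immediately after the MAXVL
     * registers occupied by the Vector RT. */
    unsigned implicit_rs(unsigned rt, int rt_is_vector, unsigned maxvl)
    {
        return rt_is_vector ? rt + maxvl : rt + 1;
    }
```

For example, with `RT` = 8 and `MAXVL` = 4 a Vector `RT` occupies
registers 8 to 11 and `RS` starts at register 12, while a Scalar `RT` = 8
gives `RS` = 9.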

-------

\newpage{}

# Twin Butterfly Floating-Point DCT Instruction(s)

**Add the following to Book I Section 4.6.6.3**

## Floating-Point Twin Multiply-Add DCT [Single]

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)
```

The two IEEE754-FP32 operations

```
    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT. The
difference is rounded, then multiplied by FRA, and the product is
stored in FRT.

The add into FRS is treated exactly as `fadds`.  The creation of the
result FRT is **not** the same as that of `fmsubs`, but is instead as if
`fsubs` were performed first followed by `fmuls`.  The creation of FRS
and FRT are treated as parallel independent operations which occur at
the same time.

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```
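
The separate rounding of the difference in `fdmadds` is observable in
the result. The following C sketch models the data flow only. The
function name is hypothetical, FPSCR status bits and exceptions are not
modelled, and it assumes a C implementation that evaluates `float`
expressions in single precision.

```c
    /* Hypothetical C model of fdmadds FRT,FRA,FRB.
     * frt is read and then overwritten; frs receives the implicit result.
     * Each float assignment rounds to single precision, which gives the
     * separate fsubs-then-fmuls roundings described above. */
    void fdmadds_model(float *frt, float *frs, float fra, float frb)
    {
        float t   = *frt;      /* both results use the same FRT input */
        float sub = t - frb;   /* rounded difference, as fsubs */
        *frs = t + frb;        /* as fadds */
        *frt = fra * sub;      /* as fmuls on the rounded difference */
    }
```

With FRT = 1, FRB = -2^-24 and FRA = 1 + 2^-23 the difference 1 + 2^-24
rounds to 1, so the model returns FRT = 1 + 2^-23. Rounding the exact
product only once would instead give 1 + 2^-22.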

## Floating-Point Twin Multiply-Add FFT [Single]

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
```

The two operations

```
    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <-   [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create
FRS, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is added to this intermediate result, and the
result is stored in FRT.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised.
Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```
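
The data flow of `ffmadds` can be sketched in C without a fused
multiply-add primitive by forming the product in double precision,
where the product of two single-precision values is exact. This is an
approximation of the fused behaviour rather than a reference model: the
function name is hypothetical, FPSCR status bits are not modelled, and
the final conversion to `float` can in rare cases round differently
from a true fused operation.

```c
    /* Hypothetical C model of ffmadds FRT,FRA,FRB.
     * The product is computed once, before either result is written,
     * so both results see the same FRT, FRA and FRB inputs. */
    void ffmadds_model(float *frt, float *frs, float fra, float frb)
    {
        double prod = (double)*frt * (double)fra;   /* exact product */
        *frs = (float)-(prod - (double)frb);        /* as fnmsubs */
        *frt = (float)(prod + (double)frb);         /* as fmadds */
    }
```

With FRT = FRA = 1 + 2^-12 and FRB = -1 the model returns
FRT = 2^-11 + 2^-24, whereas a separately rounded multiply would lose
the 2^-24 term before the add.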

## Floating-Point Twin Multiply-Add DCT

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* fdmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPADD64(FRT, FRB)
    sub <- FPSUB64(FRT, FRB)
    FRT <- FPMUL64(FRA, sub)
```

The two IEEE754-FP64 operations

```
    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT. The
difference is rounded, then multiplied by FRA, and the product is
stored in FRT.

The add into FRS is treated exactly as `fadd`.  The creation of the
result FRT is **not** the same as that of `fmsub`, but is instead as if
`fsub` were performed first followed by `fmul`.  The creation of FRS
and FRT are treated as parallel independent operations which occur at
the same time.

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## Floating-Point Twin Multiply-Add FFT

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* ffmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
```

The two operations

```
    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <-   [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create
FRS, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is added to this intermediate result, and the
result is stored in FRT.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## [DRAFT] Floating-Point Add FFT/DCT [Single]

A-Form

* ffadds FRT,FRA,FRB (Rc=0)
* ffadds. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```

## [DRAFT] Floating-Point Add FFT/DCT [Double]

A-Form

* ffadd FRT,FRA,FRB (Rc=0)
* ffadd. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```

## [DRAFT] Floating-Point Subtract FFT/DCT [Single]

A-Form

* ffsubs FRT,FRA,FRB (Rc=0)
* ffsubs. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)
```
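
Comparing this pseudo-code with that of `ffadds` shows that the two
draft instructions compute the same sum and difference and differ only
in which result goes to `FRT` and which to the implicit `FRS`. A minimal
C sketch of both follows. The function names are hypothetical, and
neither FPSCR status bits nor the CR1 update of the Rc=1 forms are
modelled.

```c
    /* Hypothetical C model of the draft ffadds pseudo-code. */
    void ffadds_model(float fra, float frb, float *frt, float *frs)
    {
        *frt = fra + frb;    /* FRT <- FPADD32(FRA, FRB) */
        *frs = frb - fra;    /* FRS <- FPSUB32(FRB, FRA) */
    }

    /* Hypothetical C model of the draft ffsubs pseudo-code. */
    void ffsubs_model(float fra, float frb, float *frt, float *frs)
    {
        *frt = frb - fra;    /* FRT <- FPSUB32(FRB, FRA) */
        *frs = fra + frb;    /* FRS <- FPADD32(FRA, FRB) */
    }
```

With FRA = 1.5 and FRB = 4.0, `ffadds_model` gives FRT = 5.5 and
FRS = 2.5, and `ffsubs_model` gives FRT = 2.5 and FRS = 5.5.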

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```

## [DRAFT] Floating-Point Subtract FFT/DCT [Double]

A-Form

* ffsub FRT,FRA,FRB (Rc=0)
* ffsub. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```