openpower/sv/twin_butterfly.mdwn

# Introduction

<!-- hide -->
* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

<!-- show -->

# Rationale for Twin Butterfly Integer DCT Instruction(s)

The number of general-purpose uses for DCT is huge. The number of
instructions needed in place of a single Twin-Butterfly instruction is
also high (**eight**), and because it is extremely common to explicitly
loop-unroll them, sequences of hundreds to thousands of instructions
are dismayingly common (for all ISAs).

The goal is to implement instructions that calculate the expression:

```
    fdct_round_shift((a +/- b) * c)
```

for the single-coefficient butterfly instruction, and:

```
    fdct_round_shift(a * c1  +/- b * c2)
```

for the double-coefficient butterfly instruction.

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

```
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
```
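As a minimal sketch of the two expressions above, the following C fragment wraps `ROUND_POWER_OF_TWO` in a helper and applies it to both butterfly forms. The names `butterfly_1coeff` and `butterfly_2coeff`, the 32-bit intermediate width, and the reliance on arithmetic right shift for negative values are assumptions made for illustration; they are not taken from libvpx or from the instruction definitions.

```c
#include <stdint.h>

#define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))

/* Rounding shift by 14 bits, as described in the text. */
static int32_t fdct_round_shift(int32_t x) {
    return ROUND_POWER_OF_TWO(x, 14);
}

/* Single-coefficient butterfly: (a +/- b) * c, rounded and shifted.
 * Hypothetical helper name, for illustration only. */
static void butterfly_1coeff(int32_t a, int32_t b, int32_t c,
                             int32_t *sum, int32_t *diff) {
    *sum  = fdct_round_shift((a + b) * c);
    *diff = fdct_round_shift((a - b) * c);
}

/* Double-coefficient butterfly: a*c1 +/- b*c2, rounded and shifted.
 * Hypothetical helper name, for illustration only. */
static void butterfly_2coeff(int32_t a, int32_t b, int32_t c1, int32_t c2,
                             int32_t *sum, int32_t *diff) {
    *sum  = fdct_round_shift(a * c1 + b * c2);
    *diff = fdct_round_shift(a * c1 - b * c2);
}
```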

These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including (but not limited to) VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations, exposed
through the NEON intrinsics `vqrdmulhq_s16`/`vqrdmulhq_s32`, although
they are limited in precision.

The suggestion is to have a single instruction that calculates both
values, `((a + b) * c) >> N` and `((a - b) * c) >> N`.  The instruction
will run in accumulate mode, so to calculate the 2-coeff version one
would just call the same instruction again with `a` and `b` in a
different order and with a different constant `c`.

Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:

```
    #include <stdint.h>
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
    void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
        t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
        t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
    }
```

Eight instructions are required, replaced by just the one (`maddsubrs`):

```
    add 9,5,4
    subf 5,5,4
    mullw 9,9,6
    mullw 5,5,6
    addi 9,9,8192
    addi 5,5,8192
    srawi 9,9,14
    srawi 5,5,14
```

-------

\newpage{}

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

A-Form

```
    |0     |6     |11      |16     |21      |26    |31 |
    | PO   |  RT  |   RA   |   RB  |   SH   |   XO |Rc |
```

* maddsubrs  RT,RA,SH,RB

Pseudo-code:

```
    n <- SH
    sum <- (RT) + (RA)
    diff <- (RT) - (RA)
    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1] + 1
    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1] + 1
    res1 <- ROTL64(prod1, XLEN-n)
    res2 <- ROTL64(prod2, XLEN-n)
    m <- MASK(n, (XLEN-1))
    signbit1 <- res1[0]
    signbit2 <- res2[0]
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    s64_1 <- [0]*(XLEN-1) || signbit1
    s64_2 <- [0]*(XLEN-1) || signbit2
    RT <- (res1 & m | smask1) + s64_1
    RS <- (res2 & m | smask2) + s64_2
```

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`.  For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    None
```
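To illustrate the operand roles and the implicit `RS = RT+1` result in the scalar case, here is a hedged C sketch. `gpr[]`, `round_shift` and `maddsubrs_sketch` are hypothetical names, and the arithmetic uses the rationale's `ROUND_POWER_OF_TWO` formula as a stand-in: the pseudo-code above remains the definition, and this sketch does not claim to match it bit for bit. It also assumes arithmetic right shift of negative values and no overflow of the 64-bit product.

```c
#include <stdint.h>

/* Rounded right shift, following the rationale's ROUND_POWER_OF_TWO.
 * Assumption: this stands in for the rotate/mask pseudo-code. */
static int64_t round_shift(int64_t value, unsigned n) {
    if (n == 0)
        return value;                 /* nothing to round or shift */
    return (value + ((int64_t)1 << (n - 1))) >> n;
}

/* maddsubrs RT,RA,SH,RB with scalar operands.
 * gpr[] is a hypothetical model of the integer register file;
 * rt < 31 is assumed so that RS = RT+1 exists. */
static void maddsubrs_sketch(int64_t gpr[32], unsigned rt, unsigned ra,
                             unsigned sh, unsigned rb) {
    /* Read every input before writing either result. */
    int64_t sum  = gpr[rt] + gpr[ra];
    int64_t diff = gpr[rt] - gpr[ra];
    int64_t coef = gpr[rb];
    gpr[rt]     = round_shift(sum * coef, sh);   /* explicit result RT */
    gpr[rt + 1] = round_shift(diff * coef, sh);  /* implicit result RS */
}
```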

-------

\newpage{}

# Twin Butterfly Floating-Point DCT Instruction(s)

**Add the following to Book I Section 4.6.6.3**

## Floating-Point Twin Multiply-Add DCT [Single]

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)
```

The two IEEE754-FP32 operations

```
    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then multiplied by FRA to create the result that is stored
in FRT.

The add into FRS is treated exactly as `fadds`.  The creation of the
result FRT is **not** the same as that of `fmsubs`.
FRS and FRT are created by parallel independent operations
which occur at the same time.

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```
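The scalar behaviour described above can be sketched in C as follows. `fpr[]` and `fdmadds_sketch` are hypothetical names standing in for the floating-point register file and the instruction; FPSCR side effects and exceptions are not modelled, and `frt < 31` is assumed so that `FRS = FRT+1` exists.

```c
/* Hypothetical scalar model of fdmadds FRT,FRA,FRB.
 * All three inputs are read before either result is written, matching
 * the "exact same operand input register values" wording. */
static void fdmadds_sketch(float fpr[32], unsigned frt, unsigned fra,
                           unsigned frb) {
    float t = fpr[frt];
    float a = fpr[fra];
    float b = fpr[frb];
    float sub = t - b;           /* FPSUB32: rounded to FP32 */
    fpr[frt]     = a * sub;      /* FPMUL32: a second, separate rounding */
    fpr[frt + 1] = t + b;        /* FPADD32 into the implicit FRS */
}
```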

## Floating-Point Multiply-Add FFT [Single]

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
```

The two operations

```
    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <-   [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create
FRS, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is added to this intermediate result, and the
result is stored in FRT.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised.
Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```
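A hedged C sketch of the two results follows. `fpr[]` and `ffmadds_sketch` are hypothetical names. Real `fmadds`/`fnmsubs` are fused and round once; this sketch only approximates that by widening to `double`, where the product of two FP32 values is exact, and then rounding back to FP32. FPSCR side effects are not modelled.

```c
/* Hypothetical scalar model of ffmadds FRT,FRA,FRB (frt < 31 assumed).
 * Assumption: double-precision intermediates stand in for the fused
 * multiply-add, so the last bit may differ from real hardware. */
static void ffmadds_sketch(float fpr[32], unsigned frt, unsigned fra,
                           unsigned frb) {
    /* Read every input before writing either result. */
    double prod = (double)fpr[frt] * (double)fpr[fra];
    double b    = (double)fpr[frb];
    fpr[frt]     = (float)(prod + b);     /* as fmadds:    (FRT*FRA) + FRB  */
    fpr[frt + 1] = (float)(-(prod - b));  /* as fnmsubs: -((FRT*FRA) - FRB) */
}
```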

## Floating-Point Twin Multiply-Add DCT

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* fdmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPADD64(FRT, FRB)
    sub <- FPSUB64(FRT, FRB)
    FRT <- FPMUL64(FRA, sub)
```

The two IEEE754-FP64 operations

```
    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT and
the result then multiplied by FRA to create the result that is stored
in FRT.

The add into FRS is treated exactly as `fadd`.  The creation of the
result FRT is **not** the same as that of `fmsub`.
FRS and FRT are created by parallel independent operations
which occur at the same time.

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```
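One way the sum and scaled-difference pair might be used is a DCT-style butterfly stage that pairs element `i` with element `n-1-i`. The sketch below is an assumption about usage, not part of the specification: `fdmadd_op` and `butterfly_stage` are hypothetical names, the coefficients are supplied by the caller, and FPSCR side effects are not modelled.

```c
#include <stddef.h>

/* One fdmadd-style operation on doubles: returns the sum (the FRS
 * result) and overwrites *t with (t - b) * a (the FRT result). */
static double fdmadd_op(double *t, double a, double b) {
    double sum = *t + b;        /* FRS <- FRT + FRB          */
    *t = (*t - b) * a;          /* FRT <- (FRT - FRB) * FRA  */
    return sum;
}

/* Hypothetical butterfly stage over n inputs (n even).
 * On return lo[i] holds x[i] + x[n-1-i] and hi[i] holds
 * (x[i] - x[n-1-i]) * coef[i], for i in [0, n/2). */
static void butterfly_stage(const double *x, const double *coef,
                            double *lo, double *hi, size_t n) {
    for (size_t i = 0; i < n / 2; i++) {
        hi[i] = x[i];                                   /* acts as FRT */
        lo[i] = fdmadd_op(&hi[i], coef[i], x[n - 1 - i]);
    }
}
```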

## Floating-Point Twin Multiply-Add FFT

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* ffmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
```

The two operations

```
    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <-   [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create
FRS, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is added to this intermediate result, and the
result is stored in FRT.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```
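As an illustration of why the pair `(FRT*FRA) + FRB` and `-((FRT*FRA) - FRB)` is useful, the sketch below applies it to a real-valued radix-2 butterfly, producing `even + w*odd` and `even - w*odd`. This is a simplification (a full FFT butterfly is complex-valued), `ffmadd_op` and `real_butterfly` are hypothetical names, and the arithmetic is unfused, so it may differ in the last bit from real `fmadd`/`fnmsub`.

```c
/* One ffmadd-style operation on doubles (unfused approximation).
 * Overwrites *t with t*a + b (the FRT result) and returns
 * -(t*a - b) (the FRS result). */
static double ffmadd_op(double *t, double a, double b) {
    double prod = *t * a;
    *t = prod + b;              /* FRT <-   (FRT*FRA) + FRB   */
    return -(prod - b);         /* FRS <- -((FRT*FRA) - FRB)  */
}

/* Hypothetical real-valued butterfly: FRT = odd, FRA = w, FRB = even.
 * On return *even = even + w*odd and *odd = even - w*odd. */
static void real_butterfly(double *even, double *odd, double w) {
    double t = *odd;
    double diff = ffmadd_op(&t, w, *even);
    *even = t;
    *odd  = diff;
}
```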

## [DRAFT] Floating-Point Add FFT/DCT [Single]

A-Form

* ffadds FRT,FRA,FRB (Rc=0)
* ffadds. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```

## [DRAFT] Floating-Point Add FFT/DCT [Double]

A-Form

* ffadd FRT,FRA,FRB (Rc=0)
* ffadd. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```

## [DRAFT] Floating-Point Subtract FFT/DCT [Single]

A-Form

* ffsubs FRT,FRA,FRB (Rc=0)
* ffsubs. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```

## [DRAFT] Floating-Point Subtract FFT/DCT [Double]

A-Form

* ffsub FRT,FRA,FRB (Rc=0)
* ffsub. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```
 466     CR1          (if Rc=1)
 467 ```