openpower/sv/twin_butterfly.mdwn

# Introduction

<!-- hide -->
* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[ls016]]

<!-- show -->

# Rationale for Twin Butterfly Integer DCT Instruction(s)

The number of general-purpose uses for DCT is huge. The number of
instructions needed in place of a single Twin-Butterfly instruction
is also large (**eight**), and because it is extremely common to
explicitly loop-unroll them, sequences of hundreds to thousands of
instructions are dismayingly common (for all ISAs).

The goal is to implement instructions that calculate the expression:

```
    fdct_round_shift((a +/- b) * c)
```

for the single-coefficient butterfly instruction, and:

```
    fdct_round_shift(a * c1  +/- b * c2)
```

for the double-coefficient butterfly instruction.

`fdct_round_shift(x)` is defined as `ROUND_POWER_OF_TWO(x, 14)`, where:

```
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
```
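As a worked illustration, the two expressions can be written out in Python. This is a sketch rather than codec source: the function names `butterfly_single` and `butterfly_double` are hypothetical, and the constant 11585 (roughly cos(pi/4) scaled by 2^14) is an assumed example coefficient. Python's `>>` on negative integers is an arithmetic (floor) shift, which is the behaviour assumed here for the C macro.

```python
SHIFT_BITS = 14  # the shift amount used by fdct_round_shift


def round_power_of_two(value, n):
    """Python equivalent of ROUND_POWER_OF_TWO: add half, then shift right."""
    return (value + (1 << (n - 1))) >> n


def fdct_round_shift(x):
    return round_power_of_two(x, SHIFT_BITS)


def butterfly_single(a, b, c):
    """Single-coefficient form: (a + b) * c and (a - b) * c, both rounded."""
    return fdct_round_shift((a + b) * c), fdct_round_shift((a - b) * c)


def butterfly_double(a, b, c1, c2):
    """Double-coefficient form: a * c1 + b * c2 and a * c1 - b * c2, rounded."""
    return (fdct_round_shift(a * c1 + b * c2),
            fdct_round_shift(a * c1 - b * c2))


C = 11585  # assumed example coefficient, roughly cos(pi/4) * 2**14
print(butterfly_single(100, 50, C))  # (106, 35)
```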

These instructions are at the core of **all** FDCT calculations in many
major video codecs, including but not limited to VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations (exposed
as the NEON intrinsics `vqrdmulhq_s16`/`vqrdmulhq_s32`), although they
are limited in precision.

The suggestion is to have a single instruction that calculates both
values, `((a + b) * c) >> N` and `((a - b) * c) >> N`.
The instruction will run in accumulate mode, so to calculate the
two-coefficient version one would just call the same instruction again
with `a` and `b` in a different order and a different constant `c`.

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

A-Form

```
    |0     |6     |11      |16     |21      |26    |31 |
    | PO   |  RT  |   RA   |   RB  |   SH   |   XO |Rc |
```

* maddsubrs  RT,RA,SH,RB

Pseudo-code:

```
    n <- SH
    sum <- (RT) + (RA)
    diff <- (RT) - (RA)
    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1]
    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1]
    res1 <- ROTL64(prod1, XLEN-n)
    res2 <- ROTL64(prod2, XLEN-n)
    m <- MASK(n, (XLEN-1))
    signbit1 <- res1[0]
    signbit2 <- res2[0]
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    s64_1 <- [0]*(XLEN-1) || signbit1
    s64_2 <- [0]*(XLEN-1) || signbit2
    RT <- (res1 & m | smask1) + s64_1
    RS <- (res2 & m | smask2) + s64_2
```
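The pseudo-code uses the Power ISA's MSB0 bit numbering, in which bit 0 is the most significant bit. A minimal Python sketch of the building blocks it relies on may help when reading it. The names `low_half`, `rotl64`, `mask` and `msb0_bit` are hypothetical stand-ins for the `[XLEN:(XLEN*2)-1]` slice, `ROTL64`, `MASK` and single-bit indexing; XLEN is assumed to be 64, and only the non-wrapping form of `MASK` is modelled. It does not model the full instruction.

```python
XLEN = 64
ALL_ONES = (1 << XLEN) - 1


def low_half(product):
    """Model of x[XLEN:(XLEN*2)-1]: the low XLEN bits of a 2*XLEN-bit value."""
    return product & ALL_ONES


def rotl64(value, amount):
    """Model of ROTL64: rotate a 64-bit value left by `amount` bits."""
    amount %= XLEN
    return ((value << amount) | (value >> (XLEN - amount))) & ALL_ONES


def mask(mstart, mstop):
    """Model of MASK(mstart, mstop) for mstart <= mstop (MSB0 numbering):
    ones in bit positions mstart..mstop inclusive, zeros elsewhere."""
    assert 0 <= mstart <= mstop < XLEN
    width = mstop - mstart + 1
    return ((1 << width) - 1) << (XLEN - 1 - mstop)


def msb0_bit(value, index):
    """Model of value[index] for a 64-bit value in MSB0 numbering."""
    return (value >> (XLEN - 1 - index)) & 1


n = 14
prod = low_half(-3 * 11585)     # low 64 bits of a negative product
res = rotl64(prod, XLEN - n)    # rotating left by XLEN-n rotates right by n
m = mask(n, XLEN - 1)           # keeps the low XLEN-n bits

# (res & m) is the logical right shift of prod by n
assert res & m == prod >> n
# res[0] is the last bit rotated out of the bottom: bit n-1 (LSB0) of prod
assert msb0_bit(res, 0) == (prod >> (n - 1)) & 1
```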

Note that if Rc=1 an Illegal Instruction is raised.
Rc=1 is `RESERVED`.

Similar to `RTp`, this instruction produces an implicit result,
`RS`, which under Scalar circumstances is defined as `RT+1`.
For SVP64 if `RT` is a Vector, `RS` begins immediately after the
Vector `RT` where the length of `RT` is set by `SVSTATE.MAXVL`
(Max Vector Length).
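The implicit-destination rule can be sketched in a few lines of Python. The helper name `implicit_rs` and its parameters are hypothetical, and the sketch assumes plain register numbers, ignoring any SVP64 register-number extension.

```python
def implicit_rs(rt, maxvl, rt_is_vector):
    """Return the register number of the implicit second result RS.

    Scalar: RS is RT+1.  Vector: RS starts immediately after the
    MAXVL registers occupied by the Vector RT.
    """
    if rt_is_vector:
        return rt + maxvl
    return rt + 1


print(implicit_rs(4, maxvl=8, rt_is_vector=False))  # 5
print(implicit_rs(8, maxvl=4, rt_is_vector=True))   # 12
```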

Special Registers Altered:

```
    None
```

# Twin Butterfly Floating-Point DCT and FFT Instruction(s)

## Floating Twin Multiply-Add DCT [Single]

**Add the following to Book I Section 4.6.6.3**

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)
```

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB that
were used to create FRS, the floating-point operand in register FRB
is subtracted from the floating-point operand in register FRT and the
result then multiplied by FRA to create an intermediate result that is
stored in FRT.

The subtraction and multiply are treated as if they were `fsub`
followed by `fmul`, not `fmsub`.  The creation of FRS and FRT are
treated as parallel independent operations.
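The data flow described above can be approximated in Python. This is a sketch under stated assumptions: the helper `to_f32` (a hypothetical name) emulates single-precision rounding by a round trip through `struct`, each operation is rounded separately to mirror `fsub` followed by `fmul`, and overflow, NaN handling and FPSCR status bits are not modelled.

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fdmadds(frt, fra, frb):
    """Return (new_FRT, FRS), both computed from the original operands."""
    frt, fra, frb = to_f32(frt), to_f32(fra), to_f32(frb)
    frs = to_f32(frt + frb)       # FRS <- FPADD32(FRT, FRB)
    sub = to_f32(frt - frb)       # sub <- FPSUB32(FRT, FRB), rounded on its own
    new_frt = to_f32(fra * sub)   # FRT <- FPMUL32(FRA, sub), rounded again
    return new_frt, frs


print(fdmadds(3.0, 0.5, 1.0))  # (1.0, 4.0)
```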

Note that if Rc=1 an Illegal Instruction is raised.
Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## Floating Multiply-Add FFT [Single]

**Add the following to Book I Section 4.6.6.3**

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
```

The two operations

```
    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <-   [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied
by the floating-point operand in register FRA. The floating-point
operand in register FRB is added to this intermediate result,
and the sum is stored in FRT.

Using the exact same values of FRT, FRA and FRB as used to create
the new FRT, the floating-point operand in register FRT is multiplied
by the floating-point operand in register FRA. The floating-point
operand in register FRB is subtracted from this intermediate result,
and the negated difference is stored in FRS.

FRT is created as if
a `fmadds` operation had been performed. FRS is created as if
a `fnmsubs` operation had simultaneously been performed with
the exact same register operands, in parallel, independently,
at exactly the same time.
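A rough Python model of the two results follows. It is a sketch under stated assumptions, not a bit-exact reference: `to_f32` is a hypothetical helper that emulates single-precision rounding through `struct`, the sum is formed in binary64 before being rounded to binary32 (so it can differ in rare cases from the single rounding of a true fused multiply-add), and overflow, NaN handling and FPSCR status bits are ignored.

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def ffmadds(frt, fra, frb):
    """Return (new_FRT, FRS), both computed from the original operands."""
    frt, fra, frb = to_f32(frt), to_f32(fra), to_f32(frb)
    prod = frt * fra                # exact in binary64 for finite binary32 inputs
    new_frt = to_f32(prod + frb)    # as fmadds:    (FRT * FRA) + FRB
    frs = to_f32(-(prod - frb))     # as fnmsubs: -((FRT * FRA) - FRB)
    return new_frt, frs


print(ffmadds(3.0, 0.5, 1.0))  # (2.5, -0.5)
```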

FRT is a Read-Modify-Write operation.

Note that if Rc=1 an Illegal Instruction is raised.
Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```