(no commit message)

diff --git a/openpower/sv/twin_butterfly.mdwn b/openpower/sv/twin_butterfly.mdwn

index c8571bd549ad4ab9953f495f751368f304804463..a2a7b8d7f3e964fd63b350cb78a82c8e6dc77c52 100644
--- a/openpower/sv/twin_butterfly.mdwn
+++ b/openpower/sv/twin_butterfly.mdwn
@@ -8,17 +8,21 @@
  * [[openpower/isa/svfparith]]
  * [[openpower/isa/svfixedarith]]
  * [[openpower/sv/rfc/ls016]]
-
  <!-- show -->
  
+Although best used with SVP64 REMAP, these instructions may also be used in a
+Scalar-only context to save considerably on DCT, DFT and FFT processing.  Some
+hardware implementations may not implement them efficiently (for example with
+slower Micro-coding), but savings still come from the reduction in temporary
+registers as well as in instruction count.
+
  # Rationale for Twin Butterfly Integer DCT Instruction(s)
  
-The number of general-purpose uses for DCT is huge. The
-number of instructions needed instead of these Twin-Butterfly
-instructions is also huge (**eight**) and given that it is
-extremely common to explicitly loop-unroll them quantity
-hundreds to thousands of instructions are dismayingly common
-(for all ISAs).
+The number of general-purpose uses for DCT is huge. The number of
+instructions needed in place of each Twin-Butterfly instruction is also
+large (**eight**), and since it is extremely common to explicitly
+loop-unroll them, sequences of hundreds to thousands of instructions are
+dismayingly common (for all ISAs).
  
  The goal is to implement instructions that calculate the expression:
  
@@ -34,18 +38,53 @@ For the single-coefficient butterfly instruction, and:
  
  For the double-coefficient butterfly instruction.
  
-`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`
+In a 32-bit context `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`
  
  ```
      #define ROUND_POWER_OF_TWO(value, n) \
              (((value) + (1 << ((n)-1))) >> (n))
  ```
  
-These instructions are at the core of **ALL** FDCT calculations in many major video codecs, including -but not limited to- VP8/VP9, AV1, etc.
-Arm includes special instructions to optimize these operations, although they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.
+These instructions are at the core of **ALL** FDCT calculations in many
+major video codecs, including, but not limited to, VP8/VP9 and AV1.
+ARM includes special instructions to optimize these operations, although
+they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.
+
+The suggestion is to have a single instruction that calculates both values,
+`((a + b) * c) >> N` and `((a - b) * c) >> N`.  The instruction will
+run in accumulate mode, so to calculate the 2-coeff version one would
+just call the same instruction again with a and b in a different order
+and with a different constant c.
  
-The suggestion is to have a single instruction to calculate both values `((a + b) * c) >> N`, and `((a - b) * c) >> N`.
-The instruction will run in accumulate mode, so in order to calculate the 2-coeff version one would just have to call the same instruction with different order a, b and a different constant c.
+Example taken from libvpx
+<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:
+
+```
+    #include <stdint.h>
+    #define ROUND_POWER_OF_TWO(value, n) \
+            (((value) + (1 << ((n)-1))) >> (n))
+    void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
+        t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
+        t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
+    }
+```
+
+Eight instructions are required, replaced by just the one (`maddsubrs`):
+
+```
+    add 9,5,4
+    subf 5,5,4
+    mullw 9,9,6
+    mullw 5,5,6
+    addi 9,9,8192
+    addi 5,5,8192
+    srawi 9,9,14
+    srawi 5,5,14
+```
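
As a cross-check of the C example and of the eight-instruction sequence, here is a minimal Python sketch of the same arithmetic. The names `round_power_of_two` and `twin_int` simply mirror the C above. The default shift of 14 and the sample coefficient 11585 (the value assumed here for libvpx's `cospi_16_64`) are illustrative assumptions, and the truncation to `int16_t` performed by the C stores is deliberately left out.

```python
def round_power_of_two(value, n):
    """Add half of 2**n, then shift right by n (Python's >> is arithmetic)."""
    return (value + (1 << (n - 1))) >> n


def twin_int(x0, x1, coeff, n=14):
    """Return both butterfly outputs: the rounded (x0+x1)*c and (x0-x1)*c."""
    t0 = round_power_of_two((x0 + x1) * coeff, n)
    t1 = round_power_of_two((x0 - x1) * coeff, n)
    return t0, t1


# Assumed sample coefficient: 11585 is roughly 2**14 * cos(pi/4).
COSPI_16_64 = 11585
print(twin_int(100, 50, COSPI_16_64))
print(twin_int(50, 100, COSPI_16_64))
```

Each output needs one add or subtract, one multiply, one rounding add and one shift, which is where the count of eight scalar instructions comes from.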
+
+-------
+
+\newpage{}
  
  ## Integer Butterfly Multiply Add/Sub FFT/DCT
  
@@ -56,7 +95,6 @@ A-Form
  ```
      |0     |6     |11      |16     |21      |26    |31 |
      | PO   |  RT  |   RA   |   RB  |   SH   |   XO |Rc |
-
  ```
  
  * maddsubrs  RT,RA,SH,RB
@@ -67,29 +105,35 @@ Pseudo-code:
      n <- SH
      sum <- (RT) + (RA)
      diff <- (RT) - (RA)
-    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1]
-    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1]
-    res1 <- ROTL64(prod1, XLEN-n)
-    res2 <- ROTL64(prod2, XLEN-n)
-    m <- MASK(n, (XLEN-1))
-    signbit1 <- res1[0]
-    signbit2 <- res2[0]
-    smask1 <- ([signbit1]*XLEN) & ¬m
-    smask2 <- ([signbit2]*XLEN) & ¬m
-    s64_1 <- [0]*(XLEN-1) || signbit1
-    s64_2 <- [0]*(XLEN-1) || signbit2
-    RT <- (res1 & m | smask1) + s64_1
-    RS <- (res2 & m | smask2) + s64_2
-```
-
-Note that if Rc=1 an Illegal Instruction is raised.
-Rc=1 is `RESERVED`
-
-Similar to `RTp`, this instruction produces an implicit result,
-`RS`, which under Scalar circumstances is defined as `RT+1`.
-For SVP64 if `RT` is a Vector, `RS` begins immediately after the
-Vector `RT` where the length of `RT` is set by `SVSTATE.MAXVL`
-(Max Vector Length).
+    prod1 <- MULS(RB, sum)
+    prod1_lo <- prod1[XLEN:(XLEN*2)-1]
+    prod2 <- MULS(RB, diff)
+    prod2_lo <- prod2[XLEN:(XLEN*2)-1]
+    if n = 0 then
+        RT <- prod1_lo
+        RS <- prod2_lo
+    else
+        round <- [0]*XLEN
+        round[XLEN-n] <- 1
+        prod1_lo <- prod1_lo + round
+        prod2_lo <- prod2_lo + round
+        m <- MASK(n, (XLEN-1))
+        res1 <- ROTL64(prod1_lo, XLEN-n) & m
+        res2 <- ROTL64(prod2_lo, XLEN-n) & m
+        signbit1 <- prod1_lo[0]
+        signbit2 <- prod2_lo[0]
+        smask1 <- ([signbit1]*XLEN) & ¬m
+        smask2 <- ([signbit2]*XLEN) & ¬m
+        RT <- (res1 | smask1)
+        RS <- (res2 | smask2)
+```
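
The pseudo-code above can be read as a rounding arithmetic shift right applied to the low XLEN bits of each product. The following Python sketch follows it step by step, assuming XLEN=64 and the MSB0 bit numbering of the Power ISA. The helper names `rotl`, `mask_msb0`, `as_signed` and `maddsubrs_model` are hypothetical and are not taken from any simulator.

```python
XLEN = 64
ALL_ONES = (1 << XLEN) - 1


def rotl(value, amount):
    """Rotate an XLEN-bit value left by amount bits."""
    amount %= XLEN
    return ((value << amount) | (value >> (XLEN - amount))) & ALL_ONES


def mask_msb0(start, stop):
    """Ones in MSB0-numbered bit positions start..stop inclusive."""
    width = stop - start + 1
    return ((1 << width) - 1) << (XLEN - 1 - stop)


def as_signed(reg):
    """Interpret an XLEN-bit register image as a two's complement integer."""
    return reg - (1 << XLEN) if reg >> (XLEN - 1) else reg


def maddsubrs_model(rt, ra, rb, sh):
    """Return the (RT, RS) register images for signed inputs rt, ra, rb."""
    # Only the low XLEN bits of each signed product are kept.
    prods = [(rb * (rt + ra)) & ALL_ONES, (rb * (rt - ra)) & ALL_ONES]
    if sh == 0:
        return prods[0], prods[1]
    rnd = 1 << (sh - 1)               # MSB0 bit XLEN-sh set to 1
    m = mask_msb0(sh, XLEN - 1)       # keeps the low XLEN-sh bits
    out = []
    for p in prods:
        p = (p + rnd) & ALL_ONES      # add the rounding constant
        res = rotl(p, XLEN - sh) & m  # logical shift right by sh
        signbit = p >> (XLEN - 1)
        smask = (ALL_ONES if signbit else 0) & ~m & ALL_ONES
        out.append(res | smask)       # refill the top sh bits with the sign
    return out[0], out[1]


rt, rs = maddsubrs_model(50, 100, 11585, 14)
print(as_signed(rt), as_signed(rs))
```

In this model the rotate, mask and sign-mask steps together are equivalent to `as_signed(p) >> sh` after the rounding add.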
+
+Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.
+
+Similar to `RTp`, this instruction produces an implicit result, `RS`,
+which under Scalar circumstances is defined as `RT+1`.  For SVP64 if
+`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
+the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
  
  Special Registers Altered:
  
@@ -97,12 +141,16 @@ Special Registers Altered:
      None
  ```
  
-# Twin Butterfly Integer DCT Instruction(s)
+-------
  
-## Floating Twin Multiply-Add DCT [Single]
+\newpage{}
+
+# Twin Butterfly Floating-Point DCT and FFT Instruction(s)
  
  **Add the following to Book I Section 4.6.6.3**
  
+## Floating-Point Twin Multiply-Add DCT [Single]
+
  X-Form
  
  ```
@@ -116,30 +164,40 @@ Pseudo-code:
  
  ```
      FRS <- FPADD32(FRT, FRB)
-    FRT <- FPMULADD32(FRT, FRA, FRB, 1, -1)
+    sub <- FPSUB32(FRT, FRB)
+    FRT <- FPMUL32(FRA, sub)
+```
+
+The two IEEE754-FP32 operations
+
  ```
+    FRS <- [(FRT) + (FRB)]
+    FRT <- [(FRT) - (FRB)] * (FRA)
+```
+
+are simultaneously performed.
  
  The Floating-Point operand in register FRT is added to the floating-point
  operand in register FRB and the result stored in FRS.
  
-Using the exact same operand input register values from FRT and FRB that
-were used to create FRS, the Floating-Point operand in register FRB
-is subtracted from the floating-point operand in register FRT and the
-result then multiplied by FRA to create an intermediate result that is
-stored in FRT.
+Using the exact same operand input register values from FRT and FRB
+that were used to create FRS, the Floating-Point operand in register
+FRB is subtracted from the floating-point operand in register FRT and
+the result then rounded before being multiplied by FRA to create an
+intermediate result that is stored in FRT.
  
-The add into FRS is treated exactly as `fadd`.  The creation
-of the result FRT is exact!y that of `fmsub`.  The creation of FRS and FRT are
-treated as parallel independent operations which occur at the same time.
+The add into FRS is treated exactly as `fadds`.  The creation of the
+result FRT is **not** the same as that of `fmsubs`, but is instead as if
+`fsubs` were performed first followed by `fmuls`.  The creation of FRS
+and FRT are treated as parallel independent operations which occur at
+the same time.
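
The ordering described above (subtract, round, then multiply) can be sketched in Python for the single-precision case. This is an illustrative model only: `round_fp32` and `twin_dct_fp32_model` are hypothetical names, the inputs are assumed to be values exactly representable in FP32, and rounding an FP64 result to FP32 through `struct` is assumed to match a native FP32 operation for these inputs.

```python
import struct


def round_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    # struct raises OverflowError for magnitudes beyond the FP32 range,
    # so this sketch only covers finite, in-range results.
    return struct.unpack("<f", struct.pack("<f", x))[0]


def twin_dct_fp32_model(frt, fra, frb):
    """Return (new FRT, FRS), both computed from the original operands."""
    frs = round_fp32(frt + frb)      # treated as fadds
    sub = round_fp32(frt - frb)      # as if fsubs: rounded before the multiply
    new_frt = round_fp32(fra * sub)  # as if fmuls on the rounded difference
    return new_frt, frs


print(twin_dct_fp32_model(3.0, 0.5, 1.0))
```

With `frt=1.0` and `frb=-2.0**-24` the difference `1 + 2**-24` is not representable in FP32 and rounds to `1.0` before the multiply, which is the intermediate rounding step the text calls out.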
  
-Note that if Rc=1 an Illegal Instruction is raised.
-Rc=1 is `RESERVED`
+Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.
  
-Similar to `FRTp`, this instruction produces an implicit result,
-`FRS`, which under Scalar circumstances is defined as `FRT+1`.
-For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
-Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
-(Max Vector Length).
+Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
+which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
+`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
+where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
  
  Special Registers Altered:
  
@@ -149,9 +207,7 @@ Special Registers Altered:
      VXSNAN VXISI VXIMZ
  ```
  
-## Floating Multiply-Add FFT [Single]
-
-**Add the following to Book I Section 4.6.6.3**
+## Floating-Point Multiply-Add FFT [Single]
  
  X-Form
  
@@ -178,21 +234,20 @@ The two operations
  
  are performed.
  
-The floating-point operand in register FRT is multiplied
-by the floating-point operand in register FRA. The float-
-ing-point operand in register FRB is added to
-this intermediate result, and the intermediate stored in FRS.
-
-Using the exact same values of FRT, FRT and FRB as used to create FRS,
-the floating-point operand in register FRT is multiplied
-by the floating-point operand in register FRA. The float-
-ing-point operand in register FRB is subtracted from
-this intermediate result, and the intermediate stored in FRT.
-
-FRT is created as if
-a `fmadds` operation had been performed. FRS is created as if
-a `fnmsubs` operation had simultaneously been performed with
-the exact same register operands, in parallel, independently,
+The floating-point operand in register FRT is multiplied by the
+floating-point operand in register FRA. The floating-point operand in
+register FRB is added to this intermediate result, and the intermediate
+stored in FRS.
+
+Using the exact same values of FRT, FRA and FRB as used to create
+FRS, the floating-point operand in register FRT is multiplied by the
+floating-point operand in register FRA. The floating-point operand
+in register FRB is subtracted from this intermediate result, and the
+intermediate stored in FRT.
+
+FRT is created as if a `fmadds` operation had been performed. FRS is
+created as if a `fnmsubs` operation had simultaneously been performed
+with the exact same register operands, in parallel, independently,
  at exactly the same time.
  
  FRT is a Read-Modify-Write operation.  
@@ -206,7 +261,6 @@ For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
  Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
  (Max Vector Length).
  
-
  Special Registers Altered:
  
  ```
@@ -214,9 +268,8 @@ Special Registers Altered:
      FX OX UX XX
      VXSNAN VXISI VXIMZ
  ```
-## Floating Twin Multiply-Add DCT
  
-**Add the following to Book I Section 4.6.6.3**
+## Floating-Point Twin Multiply-Add DCT
  
  X-Form
  
@@ -231,30 +284,40 @@ Pseudo-code:
  
  ```
      FRS <- FPADD64(FRT, FRB)
-    FRT <- FPMULADD64(FRT, FRA, FRB, 1, -1)
+    sub <- FPSUB64(FRT, FRB)
+    FRT <- FPMUL64(FRA, sub)
  ```
  
+The two IEEE754-FP64 operations
+
+```
+    FRS <- [(FRT) + (FRB)]
+    FRT <- [(FRT) - (FRB)] * (FRA)
+```
+
+are simultaneously performed.
+
  The Floating-Point operand in register FRT is added to the floating-point
  operand in register FRB and the result stored in FRS.
  
-Using the exact same operand input register values from FRT and FRB that
-were used to create FRS, the Floating-Point operand in register FRB
-is subtracted from the floating-point operand in register FRT and the
-result then multiplied by FRA to create an intermediate result that is
-stored in FRT.
+Using the exact same operand input register values from FRT and FRB
+that were used to create FRS, the Floating-Point operand in register
+FRB is subtracted from the floating-point operand in register FRT and
+the result then rounded before being multiplied by FRA to create an
+intermediate result that is stored in FRT.
  
-The add into FRS is treated exactly as `fadd`.  The creation
-of the result FRT is exact!y that of `fmsub`.  The creation of FRS and FRT are
-treated as parallel independent operations which occur at the same time.
+The add into FRS is treated exactly as `fadd`.  The creation of the
+result FRT is **not** the same as that of `fmsub`, but is instead as if
+`fsub` were performed first followed by `fmul`.  The creation of FRS
+and FRT are treated as parallel independent operations which occur at
+the same time.
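
To illustrate why the intermediate rounding matters, the sketch below compares the two-step double-precision behaviour described above with a hypothetical evaluation of `(FRT - FRB) * FRA` that rounds only once. Python floats are IEEE754 FP64, so each operator already rounds its result. The function names are assumptions made for illustration, and the single-rounded variant does not correspond to any existing instruction.

```python
from fractions import Fraction


def twin_dct_fp64_model(frt, fra, frb):
    """Return (new FRT, FRS), rounding after the subtract and after the multiply."""
    frs = frt + frb   # treated as fadd
    sub = frt - frb   # rounded to FP64 here, as if fsub
    return fra * sub, frs


def single_rounded_product(frt, fra, frb):
    """Hypothetical: compute (frt - frb) * fra exactly, then round once."""
    exact = (Fraction(frt) - Fraction(frb)) * Fraction(fra)
    return float(exact)


frt, fra, frb = 1.0, 3.0, -2.0**-53
print(twin_dct_fp64_model(frt, fra, frb)[0])
print(single_rounded_product(frt, fra, frb))
```

For these operands the difference `1 + 2**-53` rounds to `1.0` first, so the two-step result is exactly `3.0`, whereas rounding once gives the next FP64 value above `3.0`.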
  
-Note that if Rc=1 an Illegal Instruction is raised.
-Rc=1 is `RESERVED`
+Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.
  
-Similar to `FRTp`, this instruction produces an implicit result,
-`FRS`, which under Scalar circumstances is defined as `FRT+1`.
-For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
-Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
-(Max Vector Length).
+Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
+which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
+`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
+where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
  
  Special Registers Altered:
  
@@ -264,9 +327,7 @@ Special Registers Altered:
      VXSNAN VXISI VXIMZ
  ```
  
-## Floating Twin Multiply-Add FFT
-
-**Add the following to Book I Section 4.6.6.3**
+## Floating-Point Twin Multiply-Add FFT
  
  X-Form
  
@@ -293,33 +354,30 @@ The two operations
  
  are performed.
  
-The floating-point operand in register FRT is multiplied
-by the floating-point operand in register FRA. The float-
-ing-point operand in register FRB is added to
-this intermediate result, and the intermediate stored in FRS.
-
-Using the exact same values of FRT, FRT and FRB as used to create FRS,
-the floating-point operand in register FRT is multiplied
-by the floating-point operand in register FRA. The float-
-ing-point operand in register FRB is subtracted from
-this intermediate result, and the intermediate stored in FRT.
-
-FRT is created as if
-a `fmadd` operation had been performed. FRS is created as if
-a `fnmsub` operation had simultaneously been performed with
-the exact same register operands, in parallel, independently,
+The floating-point operand in register FRT is multiplied by the
+floating-point operand in register FRA. The floating-point operand in
+register FRB is added to this intermediate result, and the intermediate
+stored in FRS.
+
+Using the exact same values of FRT, FRA and FRB as used to create
+FRS, the floating-point operand in register FRT is multiplied by the
+floating-point operand in register FRA. The floating-point operand
+in register FRB is subtracted from this intermediate result, and the
+intermediate stored in FRT.
+
+FRT is created as if a `fmadd` operation had been performed. FRS is
+created as if a `fnmsub` operation had simultaneously been performed
+with the exact same register operands, in parallel, independently,
  at exactly the same time.
  
-FRT is a Read-Modify-Write operation.  
+FRT is a Read-Modify-Write operation.
  
-Note that if Rc=1 an Illegal Instruction is raised.
-Rc=1 is `RESERVED`
+Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`.
  
-Similar to `FRTp`, this instruction produces an implicit result,
-`FRS`, which under Scalar circumstances is defined as `FRT+1`.
-For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
-Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
-(Max Vector Length).
+Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
+which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
+`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
+where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).
  
  Special Registers Altered:
  
@@ -330,12 +388,16 @@ Special Registers Altered:
  ```
  
  
-## [DRAFT] Floating Add FFT/DCT [Single]
+## Floating-Point Add FFT/DCT [Single]
  
  A-Form
  
+```
+    |0     |6     |11      |16     |21      |26    |31 |
+    | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
+```
+
  * ffadds FRT,FRA,FRB (Rc=0)
-* ffadds. FRT,FRA,FRB (Rc=1)
  
  Pseudo-code:
  
@@ -350,15 +412,18 @@ Special Registers Altered:
      FPRF FR FI
      FX OX UX XX
      VXSNAN VXISI
-    CR1          (if Rc=1)
  ```
  
-## [DRAFT] Floating Add FFT/DCT [Double]
+## Floating-Point Add FFT/DCT [Double]
  
  A-Form
  
+```
+    |0     |6     |11      |16     |21      |26    |31 |
+    | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
+```
+
  * ffadd FRT,FRA,FRB (Rc=0)
-* ffadd. FRT,FRA,FRB (Rc=1)
  
  Pseudo-code:
  
@@ -373,15 +438,18 @@ Special Registers Altered:
      FPRF FR FI
      FX OX UX XX
      VXSNAN VXISI
-    CR1          (if Rc=1)
  ```
  
-## [DRAFT] Floating Subtract FFT/DCT [Single]
+## Floating-Point Subtract FFT/DCT [Single]
  
  A-Form
  
+```
+    |0     |6     |11      |16     |21      |26    |31 |
+    | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
+```
+
  * ffsubs FRT,FRA,FRB (Rc=0)
-* ffsubs. FRT,FRA,FRB (Rc=1)
  
  Pseudo-code:
  
@@ -396,15 +464,18 @@ Special Registers Altered:
      FPRF FR FI
      FX OX UX XX
      VXSNAN VXISI
-    CR1          (if Rc=1)
  ```
  
-## [DRAFT] Floating Subtract FFT/DCT [Double]
+## Floating-Point Subtract FFT/DCT [Double]
  
  A-Form
  
+```
+    |0     |6     |11      |16     |21      |26    |31 |
+    | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
+```
+
  * ffsub FRT,FRA,FRB (Rc=0)
-* ffsub. FRT,FRA,FRB (Rc=1)
  
  Pseudo-code:
  
@@ -419,5 +490,4 @@ Special Registers Altered:
      FPRF FR FI
      FX OX UX XX
      VXSNAN VXISI
-    CR1          (if Rc=1)
  ```