https://bugs.libre-soc.org/show_bug.cgi?id=985

[libreriscv.git] / openpower / sv / twin_butterfly.mdwn
diff --git a/openpower/sv/twin_butterfly.mdwn b/openpower/sv/twin_butterfly.mdwn

index a125cea0011706b498d6aa38cc54fe918b57b69d..516f601f3c2fd8b9df945a95ac76e896842ea898 100644
--- a/openpower/sv/twin_butterfly.mdwn
+++ b/openpower/sv/twin_butterfly.mdwn
@@ -8,9 +8,14 @@
  * [[openpower/isa/svfparith]]
  * [[openpower/isa/svfixedarith]]
  * [[openpower/sv/rfc/ls016]]
-
  <!-- show -->
  
+Although best used with SVP64 REMAP, these instructions may also be used in a
+Scalar-only context to save considerably on DCT, DFT and FFT processing.  Some
+hardware implementations may not implement them efficiently (for example, through
+slower micro-coding), but savings still come from the reduction in temporary
+registers as well as in instruction count.
+
  # Rationale for Twin Butterfly Integer DCT Instruction(s)
  
  The number of general-purpose uses for DCT is huge. The number of
@@ -33,7 +38,7 @@ For the single-coefficient butterfly instruction, and:
  
  For the double-coefficient butterfly instruction.
  
-`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`
+In a 32-bit context `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`
  
  ```
      #define ROUND_POWER_OF_TWO(value, n) \
@@ -42,7 +47,7 @@ For the double-coefficient butterfly instruction.
  
  These instructions are at the core of **ALL** FDCT calculations in many
  major video codecs, including -but not limited to- VP8/VP9, AV1, etc.
-Arm includes special instructions to optimize these operations, although
+ARM includes special instructions to optimize these operations, although
  they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.
  
  The suggestion is to have a single instruction to calculate both values
@@ -52,7 +57,7 @@ one would just have to call the same instruction with different order a,
  b and a different constant c.
  
  Example taken from libvpx
-<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:
+<https://chromium.googlesource.com/webm/libvpx/+/refs/tags/v1.13.0/vpx_dsp/fwd_txfm.c#132>:
  
  ```
      #include <stdint.h>
@@ -90,34 +95,34 @@ A-Form
  ```
      |0     |6     |11      |16     |21      |26    |31 |
      | PO   |  RT  |   RA   |   RB  |   SH   |   XO |Rc |
-
  ```
  
-* maddsubrs  RT,RA,SH,RB
+* maddsubrs  RT,RA,RB,SH
  
  Pseudo-code:
  
  ```
      n <- SH
-    sum <- (RT) + (RA)
-    diff <- (RT) - (RA)
-    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1] + 1
-    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1] + 1
-    res1 <- ROTL64(prod1, XLEN-n)
-    res2 <- ROTL64(prod2, XLEN-n)
-    m <- MASK(n, (XLEN-1))
-    signbit1 <- res1[0]
-    signbit2 <- res2[0]
-    smask1 <- ([signbit1]*XLEN) & ¬m
-    smask2 <- ([signbit2]*XLEN) & ¬m
-    s64_1 <- [0]*(XLEN-1) || signbit1
-    s64_2 <- [0]*(XLEN-1) || signbit2
-    RT <- (res1 & m | smask1) + s64_1
-    RS <- (res2 & m | smask2) + s64_2
+    sum <- (RT[0] || RT) + (RA[0] || RA)
+    diff <- (RT[0] || RT) - (RA[0] || RA)
+    prod1 <- MULS(RB, sum)
+    prod2 <- MULS(RB, diff)
+    if n = 0 then
+        prod1_lo <- prod1[XLEN+1:(XLEN*2)]
+        prod2_lo <- prod2[XLEN+1:(XLEN*2)]
+        RT <- prod1_lo
+        RS <- prod2_lo
+    else
+        round <- [0]*(XLEN*2 + 1)
+        round[XLEN*2 - n + 1] <- 1
+        prod1 <- prod1 + round
+        prod2 <- prod2 + round
+        res1 <- prod1[XLEN - n + 1:XLEN*2 - n]
+        res2 <- prod2[XLEN - n + 1:XLEN*2 - n]
+        RT <- res1
+        RS <- res2
  ```
  
-Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
-
  Similar to `RTp`, this instruction produces an implicit result, `RS`,
  which under Scalar circumstances is defined as `RT+1`.  For SVP64 if
  `RT` is a Vector, `RS` begins immediately after the Vector `RT` where
@@ -129,11 +134,86 @@ Special Registers Altered:
      None
  ```
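
The rounding and bit selection in the `maddsubrs` pseudo-code above can be cross-checked against plain arithmetic. The following is a minimal Python sketch and is not part of the patch. The helper names (`msb0_slice`, `maddsubrs_bits`, `maddsubrs_arith`) are hypothetical, XLEN is assumed to be 64, and register inputs are taken as signed Python integers. It also assumes MSB0 bit numbering and a signed `MULS` product that is `2*XLEN+1` bits wide, which is how the slices in the pseudo-code appear to be sized.

```python
XLEN = 64  # assumed register width


def msb0_slice(value, width, start, end):
    """Bits start..end (inclusive, MSB0 numbering) of a width-bit value."""
    value &= (1 << width) - 1          # two's complement view of the value
    shift = (width - 1) - end          # distance of bit `end` from the LSB
    return (value >> shift) & ((1 << (end - start + 1)) - 1)


def maddsubrs_bits(rt, ra, rb, sh):
    """Follow the pseudo-code literally; returns (RT, RS) as unsigned values."""
    width = 2 * XLEN + 1
    prod1 = rb * (rt + ra)
    prod2 = rb * (rt - ra)
    if sh == 0:
        return (msb0_slice(prod1, width, XLEN + 1, 2 * XLEN),
                msb0_slice(prod2, width, XLEN + 1, 2 * XLEN))
    # round[XLEN*2 - n + 1] <- 1 : a single bit of weight 2**(n-1)
    rnd = 1 << ((width - 1) - (2 * XLEN - sh + 1))
    return (msb0_slice(prod1 + rnd, width, XLEN - sh + 1, 2 * XLEN - sh),
            msb0_slice(prod2 + rnd, width, XLEN - sh + 1, 2 * XLEN - sh))


def maddsubrs_arith(rt, ra, rb, sh):
    """Same result expressed as add-half-then-shift arithmetic."""
    mask = (1 << XLEN) - 1
    rnd = (1 << (sh - 1)) if sh else 0
    return (((rb * (rt + ra) + rnd) >> sh) & mask,
            ((rb * (rt - ra) + rnd) >> sh) & mask)
```

With `rb = 1 << 14` and `sh = 14` the multiply and the shift cancel, so `maddsubrs_arith(1000, -300, 1 << 14, 14)` returns the plain sum and difference, `(700, 1300)`.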
  
+# [DRAFT] Integer Butterfly Multiply Add and Round Shift FFT/DCT
+
+A-Form
+
+* maddrs  RT,RA,RB,SH
+
+Pseudo-code:
+
+```
+    n <- SH
+    prod <- MULS(RB, RA)
+    if n = 0 then
+        prod_lo <- prod[XLEN:(XLEN*2) - 1]
+        RT <- (RT) + prod_lo
+    else
+        res[0:XLEN*2-1] <- (EXTSXL((RT)[0], 1) || (RT)) + prod
+        round <- [0]*XLEN*2
+        round[XLEN*2 - n] <- 1
+        res <- res + round
+        RT <- res[XLEN - n:XLEN*2 - n -1]
+```
+
+Special Registers Altered:
+
+    None
+
+# [DRAFT] Integer Butterfly Multiply Sub and Round Shift FFT/DCT
+
+A-Form
+
+* msubrs  RT,RA,RB,SH
+
+Pseudo-code:
+
+```
+    n <- SH
+    prod <- MULS(RB, RA)
+    if n = 0 then
+        prod_lo <- prod[XLEN:(XLEN*2) - 1]
+        RT <- (RT) - prod_lo
+    else
+        res[0:XLEN*2-1] <- (EXTSXL((RT)[0], 1) || (RT)) - prod
+        round <- [0]*XLEN*2
+        round[XLEN*2 - n] <- 1
+        res <- res + round
+        RT <- res[XLEN - n:XLEN*2 - n -1]
+```
+
+Special Registers Altered:
+
+    None
+
+
+This pair of instructions is intended to be used together with `maddsubrs`
+to produce the double-coefficient butterfly. For that to work, the coefficient
+passed to `maddrs` and `msubrs` must be `c2-c1` rather than `c2`.
+
+In essence, the quantity `a * c1 +/- b * c1` is calculated first, with
+`maddsubrs` *without* shifting (so `SH=0`); `b * (c2-c1)` is then added to or
+subtracted from the previous result, and only *then* is the shift performed.
+
+In the following example, assume `a` is in `R1`, `b` in `R10`, `c1` in `R11` and
+`c2 - c1` in `R12`.  The first instruction puts `a * c1 + b * c1` in `R1` (`RT`)
+and `a * c1 - b * c1` in `RS` (here `RS = RT+1`, so `R2`).  Then `maddrs` adds
+`b * (c2 - c1)` to `R1` (`RT`), `msubrs` subtracts it from `R2` (`RS`), and each
+result is rounded and shifted right by 14 bits:
+
+```
+    maddsubrs 1,10,11,0
+    maddrs 1,10,12,14
+    msubrs 2,10,12,14
+```
+
+In scalar code, that would take ~16 instructions for both operations.
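
The algebra behind the three-instruction sequence can be checked with a small arithmetic model. This is a sketch under stated assumptions rather than a reference implementation: XLEN is assumed to be 64, registers are modelled as signed Python integers wrapped to XLEN bits, and `to_signed`, `round_shift` and `double_butterfly` are hypothetical helper names. The Python functions `maddsubrs`, `maddrs` and `msubrs` only mirror the arithmetic of the pseudo-code, with each result reduced to XLEN bits.

```python
XLEN = 64  # assumed register width


def to_signed(value):
    """Wrap an integer to a signed XLEN-bit register value."""
    value &= (1 << XLEN) - 1
    return value - (1 << XLEN) if value >> (XLEN - 1) else value


def round_shift(value, n):
    """Add half of 2**n, then arithmetic shift right by n (no-op for n == 0)."""
    return value if n == 0 else (value + (1 << (n - 1))) >> n


def maddsubrs(rt, ra, rb, sh):
    """Returns the (RT, RS) pair: round_shift((RT +/- RA) * RB)."""
    return (to_signed(round_shift(rb * (rt + ra), sh)),
            to_signed(round_shift(rb * (rt - ra), sh)))


def maddrs(rt, ra, rb, sh):
    """RT <- round_shift(RT + RA * RB)."""
    return to_signed(round_shift(rt + rb * ra, sh))


def msubrs(rt, ra, rb, sh):
    """RT <- round_shift(RT - RA * RB)."""
    return to_signed(round_shift(rt - rb * ra, sh))


def double_butterfly(a, b, c1, c2, sh=14):
    """Model of the three-instruction sequence from the example."""
    r1, r2 = maddsubrs(a, b, c1, 0)      # a*c1 + b*c1 and a*c1 - b*c1, unshifted
    r1 = maddrs(r1, b, c2 - c1, sh)      # a*c1 + b*c2, rounded and shifted
    r2 = msubrs(r2, b, c2 - c1, sh)      # a*c1 - b*c2, rounded and shifted
    return r1, r2
```

For inputs whose unshifted intermediate `a * c1 +/- b * c1` fits in XLEN bits, `double_butterfly(a, b, c1, c2)` matches `round_shift(a*c1 + b*c2, 14)` and `round_shift(a*c1 - b*c2, 14)`. Because the `SH=0` step keeps only the low XLEN bits, the equivalence is not expected to hold once that intermediate overflows.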
+
  -------
  
  \newpage{}
  
-# Twin Butterfly Floating-Point DCT Instruction(s)
+# Twin Butterfly Floating-Point DCT and FFT Instruction(s)
  
  **Add the following to Book I Section 4.6.6.3**
  
@@ -171,13 +251,14 @@ operand in register FRB and the result stored in FRS.
  Using the exact same operand input register values from FRT and FRB
  that were used to create FRS, the Floating-Point operand in register
  FRB is subtracted from the floating-point operand in register FRT and
-the result then multiplied by FRA to create an intermediate result that
-is stored in FRT.
+the result then rounded before being multiplied by FRA to create an
+intermediate result that is stored in FRT.
  
  The add into FRS is treated exactly as `fadds`.  The creation of the
-result FRT is **not** the same as that of `fmsubs`.
-The creation of FRS and FRT are treated as parallel independent operations
-which occur at the same time.
+result FRT is **not** the same as that of `fmsubs`, but is instead as if
+`fsubs` were performed first followed by `fmuls`.  The creation of FRS
+and FRT are treated as parallel independent operations which occur at
+the same time.
  
  Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
  
@@ -228,7 +309,7 @@ stored in FRS.
  
  Using the exact same values of FRT, FRA and FRB as used to create
  FRS, the floating-point operand in register FRT is multiplied by the
-floating-point operand in register FRA. The float- ing-point operand
+floating-point operand in register FRA. The floating-point operand
  in register FRB is subtracted from this intermediate result, and the
  intermediate stored in FRT.
  
@@ -248,7 +329,6 @@ For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
  Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
  (Max Vector Length).
  
-
  Special Registers Altered:
  
  ```
@@ -256,6 +336,7 @@ Special Registers Altered:
      FX OX UX XX
      VXSNAN VXISI VXIMZ
  ```
+
  ## Floating-Point Twin Multiply-Add DCT
  
  X-Form
@@ -290,13 +371,14 @@ operand in register FRB and the result stored in FRS.
  Using the exact same operand input register values from FRT and FRB
  that were used to create FRS, the Floating-Point operand in register
  FRB is subtracted from the floating-point operand in register FRT and
-the result then multiplied by FRA to create an intermediate result that
-is stored in FRT.
+the result then rounded before being multiplied by FRA to create an
+intermediate result that is stored in FRT.
  
  The add into FRS is treated exactly as `fadd`.  The creation of the
-result FRT is **not** the same as that of `fmsub`.
-The creation of FRS and FRT are treated as parallel independent operations
-which occur at the same time.
+result FRT is **not** the same as that of `fmsub`, but is instead as if
+`fsub` were performed first followed by `fmuls`.  The creation of FRS
+and FRT are treated as parallel independent operations which occur at
+the same time.
  
  Note that if Rc=1 an Illegal Instruction is raised.  Rc=1 is `RESERVED`
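
The difference between the two-step rounding described above and a fused operation can be seen with ordinary double-precision arithmetic. The sketch below is illustrative only: `twin_madd_dct` and `fused_sub_mul` are hypothetical names, Python `float` is assumed to be an IEEE 754 double with round-to-nearest-even, and the fused variant is emulated with exact rational arithmetic purely for comparison.

```python
from fractions import Fraction


def twin_madd_dct(frt, fra, frb):
    """Returns (FRS, FRT): FRS = FRT + FRB, FRT = round(FRT - FRB) * FRA."""
    frs = frt + frb            # one rounding, as for an add
    diff = frt - frb           # rounded here, as for a separate subtract
    return frs, diff * fra     # rounded again, as for a separate multiply


def fused_sub_mul(frt, fra, frb):
    """Hypothetical single-rounding alternative, for comparison only."""
    exact = (Fraction(frt) - Fraction(frb)) * Fraction(fra)
    return float(exact)        # one correctly rounded conversion


# Inputs chosen so that FRT - FRB (2**53 + 3) is not representable as a double.
frt, fra, frb = float(2**53 + 2), 3.0, -1.0
two_step = twin_madd_dct(frt, fra, frb)[1]
fused = fused_sub_mul(frt, fra, frb)
```

Here the subtraction rounds `2**53 + 3` up to `2**53 + 4` before the multiply, so `two_step` and `fused` differ in the last place. The two-step value is the behaviour the text describes.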
  
@@ -374,12 +456,16 @@ Special Registers Altered:
  ```
  
  
-## [DRAFT] Floating-Point Add FFT/DCT [Single]
+## Floating-Point Add FFT/DCT [Single]
  
  A-Form
  
+```
+    |0     |6     |11      |16     |21      |26    |31 |
+    | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
+```
+
  * ffadds FRT,FRA,FRB (Rc=0)
-* ffadds. FRT,FRA,FRB (Rc=1)
  
  Pseudo-code:
  
@@ -394,15 +480,18 @@ Special Registers Altered:
      FPRF FR FI
      FX OX UX XX
      VXSNAN VXISI
-    CR1          (if Rc=1)
  ```
  
-## [DRAFT] Floating-Point Add FFT/DCT [Double]
+## Floating-Point Add FFT/DCT [Double]
  
  A-Form
  
+```
+    |0     |6     |11      |16     |21      |26    |31 |
+    | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
+```
+
  * ffadd FRT,FRA,FRB (Rc=0)
-* ffadd. FRT,FRA,FRB (Rc=1)
  
  Pseudo-code:
  
@@ -417,15 +506,18 @@ Special Registers Altered:
      FPRF FR FI
      FX OX UX XX
      VXSNAN VXISI
-    CR1          (if Rc=1)
  ```
  
-## [DRAFT] Floating-Point Subtract FFT/DCT [Single]
+## Floating-Point Subtract FFT/DCT [Single]
  
  A-Form
  
+```
+    |0     |6     |11      |16     |21      |26    |31 |
+    | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
+```
+
  * ffsubs FRT,FRA,FRB (Rc=0)
-* ffsubs. FRT,FRA,FRB (Rc=1)
  
  Pseudo-code:
  
@@ -440,15 +532,18 @@ Special Registers Altered:
      FPRF FR FI
      FX OX UX XX
      VXSNAN VXISI
-    CR1          (if Rc=1)
  ```
  
-## [DRAFT] Floating-Point Subtract FFT/DCT [Double]
+## Floating-Point Subtract FFT/DCT [Double]
  
  A-Form
  
+```
+    |0     |6     |11      |16     |21      |26    |31 |
+    | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
+```
+
  * ffsub FRT,FRA,FRB (Rc=0)
-* ffsub. FRT,FRA,FRB (Rc=1)
  
  Pseudo-code:
  
@@ -463,5 +558,4 @@ Special Registers Altered:
      FPRF FR FI
      FX OX UX XX
      VXSNAN VXISI
-    CR1          (if Rc=1)
  ```