<!-- show -->
Although best used with SVP64 REMAP these instructions may be used in a Scalar-only
-context to save considerably on DCT, DFT and FFT processing.
+context to save considerably on DCT, DFT and FFT processing. Although some hardware
+implementations may not implement them efficiently (for example, via slower micro-coding),
+savings still come from the reduction in temporary registers as well as in instruction
+count.
# Rationale for Twin Butterfly Integer DCT Instruction(s)
For the double-coefficient butterfly instruction.
-`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`
+In a 32-bit context `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`
```
#define ROUND_POWER_OF_TWO(value, n) \
These instructions are at the core of **ALL** FDCT calculations in many
major video codecs, including -but not limited to- VP8/VP9, AV1, etc.
-Arm includes special instructions to optimize these operations, although
+ARM includes special instructions to optimize these operations, although
they are limited in precision: `vqrdmulhq_s16`/`vqrdmulhq_s32`.
The suggestion is to have a single instruction to calculate both values
b and a different constant c.
Example taken from libvpx
-<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:
+<https://chromium.googlesource.com/webm/libvpx/+/refs/tags/v1.13.0/vpx_dsp/fwd_txfm.c#132>:
```
#include <stdint.h>
| PO | RT | RA | RB | SH | XO |Rc |
```
-* maddsubrs RT,RA,SH,RB
+* maddsubrs RT,RA,RB,SH
Pseudo-code:
```
n <- SH
- sum <- (RT) + (RA)
- diff <- (RT) - (RA)
- prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1]
- prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1]
+ sum <- (RT[0] || RT) + (RA[0] || RA)
+ diff <- (RT[0] || RT) - (RA[0] || RA)
+ prod1 <- MULS(RB, sum)
+ prod2 <- MULS(RB, diff)
if n = 0 then
- #round <- EXTS([0]*(XLEN-1) || [1]*1)
- #prod1 <- ROTL64(prod1, 1)
- #prod2 <- ROTL64(prod2, 1)
- #prod1 <- prod1 + round
- #prod2 <- prod2 + round
- #res1 <- ROTL64(prod1, XLEN-1)
- #res2 <- ROTL64(prod2, XLEN-1)
- #m <- MASK(1, (XLEN-1))
- RT <- prod1
- RS <- prod2
+ prod1_lo <- prod1[XLEN+1:(XLEN*2)]
+ prod2_lo <- prod2[XLEN+1:(XLEN*2)]
+ RT <- prod1_lo
+ RS <- prod2_lo
else
- round <- EXTS([0]*(XLEN -n -1) || [1]*1 || [0]*(n-1))
+ round <- [0]*(XLEN*2 + 1)
+ round[XLEN*2 - n + 1] <- 1
prod1 <- prod1 + round
prod2 <- prod2 + round
- res1 <- ROTL64(prod1, XLEN-n)
- res2 <- ROTL64(prod2, XLEN-n)
- m <- MASK(n, (XLEN-1))
- signbit1 <- prod1[0]
- signbit2 <- prod2[0]
- smask1 <- ([signbit1]*XLEN) & ¬m
- smask2 <- ([signbit2]*XLEN) & ¬m
- RT <- (res1 & m | smask1)
- RS <- (res2 & m | smask2)
+ res1 <- prod1[XLEN - n + 1:XLEN*2 - n]
+ res2 <- prod2[XLEN - n + 1:XLEN*2 - n]
+ RT <- res1
+ RS <- res2
```
-Note that if Rc=1 an Illegal Instruction is raised. Rc=1 is `RESERVED`
-
Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`. For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
None
```
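
As a rough cross-check of the `maddsubrs` pseudo-code above, the following C sketch models it for XLEN=32. It is not a reference implementation: the names `mul_round_shift32`, `butterfly32` and `maddsubrs32` are hypothetical, the `SH` range of 0..31 is an assumption, and any coefficient passed in (such as a 14-bit cosine constant) is purely illustrative. The product is computed modulo 2^64, which is enough here because only bits `n` to `n+31` of the result are kept.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result pair: rt models RT, rs models the implicit RS. */
typedef struct { int32_t rt; int32_t rs; } butterfly32;

/* Low 32 bits of (rb * x + round) >> n, for n in 0..31 (assumed SH range).
 * x is the XLEN+1-bit sum or difference, held in 64 bits.
 * Unsigned arithmetic wraps modulo 2^64, so there is no signed overflow.
 * The final cast to int32_t is implementation-defined for values above
 * INT32_MAX, but wraps as two's complement on common compilers. */
int32_t mul_round_shift32(int32_t rb, int64_t x, unsigned n)
{
    assert(n < 32);
    uint64_t prod = (uint64_t)(int64_t)rb * (uint64_t)x;
    if (n != 0)
        prod += (uint64_t)1 << (n - 1);   /* rounding bit at position n-1 */
    return (int32_t)(uint32_t)(prod >> n);
}

/* Sketch of maddsubrs: RT <- ((RT + RA) * RB) rounded and shifted by SH,
 * RS <- ((RT - RA) * RB) rounded and shifted by SH. */
butterfly32 maddsubrs32(int32_t rt, int32_t ra, int32_t rb, unsigned sh)
{
    int64_t sum  = (int64_t)rt + ra;      /* cannot overflow in 64 bits */
    int64_t diff = (int64_t)rt - ra;
    butterfly32 r = { mul_round_shift32(rb, sum, sh),
                      mul_round_shift32(rb, diff, sh) };
    return r;
}
```

With `SH=14` this corresponds to applying `fdct_round_shift` to `(a + b) * c` and `(a - b) * c` in one step, under the assumptions stated above.
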
+# [DRAFT] Integer Butterfly Multiply Add and Round Shift FFT/DCT
+
+A-Form
+
+* maddrs RT,RA,RB,SH
+
+Pseudo-code:
+
+```
+ n <- SH
+ prod <- MULS(RB, RA)
+ if n = 0 then
+ prod_lo <- prod[XLEN:(XLEN*2) - 1]
+ RT <- (RT) + prod_lo
+ else
+ res[0:XLEN*2-1] <- (EXTSXL((RT)[0], 1) || (RT)) + prod
+ round <- [0]*XLEN*2
+ round[XLEN*2 - n] <- 1
+ res <- res + round
+ RT <- res[XLEN - n:XLEN*2 - n -1]
+```
+
+Special Registers Altered:
+
+ None
+
+# [DRAFT] Integer Butterfly Multiply Sub and Round Shift FFT/DCT
+
+A-Form
+
+* msubrs RT,RA,RB,SH
+
+Pseudo-code:
+
+```
+ n <- SH
+ prod <- MULS(RB, RA)
+ if n = 0 then
+ prod_lo <- prod[XLEN:(XLEN*2) - 1]
+ RT <- (RT) - prod_lo
+ else
+ res[0:XLEN*2-1] <- (EXTSXL((RT)[0], 1) || (RT)) - prod
+ round <- [0]*XLEN*2
+ round[XLEN*2 - n] <- 1
+ res <- res + round
+ RT <- res[XLEN - n:XLEN*2 - n -1]
+```
+
+Special Registers Altered:
+
+ None
+
+
+These two instructions are intended to complement `maddsubrs` in order
+to produce the double-coefficient butterfly. For that
+to work, `c2-c1` has to be passed as the coefficient instead of `c2`.
+
+In essence, `maddsubrs` first calculates `a * c1 +/- b * c1` *without* shifting
+(so `SH=0`). `maddrs` then adds `b * (c2-c1)` to the sum and `msubrs` subtracts it
+from the difference, giving `a * c1 +/- b * c2`, and only *then* is the rounding shift applied.
+
+In the following example, assume `a` in `R1`, `b` in `R10`, `c1` in `R11` and `c2 - c1` in `R12`.
+The first instruction puts `a * c1 + b * c1` in `R1` (`RT`) and `a * c1 - b * c1` in `RS`
+(here, `RS = RT+1`, so `R2`).
+Then `maddrs` adds `b * (c2 - c1)` to `R1` (`RT`) and `msubrs` subtracts it from `R2` (`RS`), and each
+rounds its result and shifts it right by 14 bits:
+
+```
+ maddsubrs 1,10,11,0
+ maddrs 1,10,12,14
+ msubrs 2,10,12,14
+```
+
+In scalar code, that would take ~16 instructions for both operations.
+
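
The arithmetic behind the three-instruction sequence above can be checked with a small C sketch. It works on plain 64-bit values and ignores the XLEN-wide truncation the real instructions perform, so it only holds when the intermediate `(a +/- b) * c1` fits in a register. The names `butterfly_pair`, `round_shift`, `direct` and `three_step` are hypothetical, and the test coefficients are arbitrary 14-bit constants chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result pair: "plus" models RT, "minus" models RS. */
typedef struct { int64_t plus; int64_t minus; } butterfly_pair;

/* Round to nearest and shift right by n bits (1 <= n <= 62).
 * Assumes >> on a negative int64_t is an arithmetic shift, which is
 * implementation-defined in C but holds on common compilers. */
int64_t round_shift(int64_t x, unsigned n)
{
    assert(n >= 1 && n <= 62);
    return (x + ((int64_t)1 << (n - 1))) >> n;
}

/* Direct form: a*c1 +/- b*c2, then the rounding shift. */
butterfly_pair direct(int64_t a, int64_t b, int64_t c1, int64_t c2, unsigned n)
{
    butterfly_pair r = { round_shift(a * c1 + b * c2, n),
                         round_shift(a * c1 - b * c2, n) };
    return r;
}

/* Three-step form mirroring maddsubrs (SH=0), then maddrs and msubrs. */
butterfly_pair three_step(int64_t a, int64_t b, int64_t c1, int64_t c2, unsigned n)
{
    int64_t k  = c2 - c1;           /* coefficient passed to maddrs/msubrs */
    int64_t rt = (a + b) * c1;      /* maddsubrs, no shift: a*c1 + b*c1 */
    int64_t rs = (a - b) * c1;      /* maddsubrs, no shift: a*c1 - b*c1 */
    butterfly_pair r = { round_shift(rt + b * k, n),   /* maddrs */
                         round_shift(rs - b * k, n) }; /* msubrs */
    return r;
}
```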
-------
\newpage{}