# Introduction

<!-- hide -->
* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]
<!-- show -->

Although best used with SVP64 REMAP, these instructions may also be used in a
Scalar-only context to save considerably on DCT, DFT and FFT processing.  Some
hardware implementations may not implement them efficiently (for example through
slower micro-coding), but savings still come from the reduction in temporary
registers and in instruction count.

# Rationale for Twin Butterfly Integer DCT Instruction(s)

The number of general-purpose uses for DCT is huge. Without these
Twin-Butterfly instructions, each butterfly needs **eight** instructions,
and because it is extremely common to explicitly loop-unroll them,
sequences of hundreds to thousands of instructions are dismayingly
common (for all ISAs).

The goal is to implement instructions that calculate the expression:

```
    fdct_round_shift((a +/- b) * c)
```

for the single-coefficient butterfly instruction, and:

```
    fdct_round_shift(a * c1  +/- b * c2)
```

for the double-coefficient butterfly instruction.

In a 32-bit context, `fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`:

```
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
```
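As a worked illustration, the macro can be wrapped in a function and
evaluated at a few points. This is only a minimal sketch: the names
`fdct_round_shift_14` and `fdct_round_shift_demo` and the sample values are
chosen here for illustration, and the negative case assumes that `>>` on a
negative `int32_t` is an arithmetic shift (true for GCC and Clang, but
implementation-defined in ISO C).

```c
#include <assert.h>
#include <stdint.h>

#define ROUND_POWER_OF_TWO(value, n) \
        (((value) + (1 << ((n)-1))) >> (n))

/* Hypothetical wrapper: round-to-nearest division by 2^14. */
int32_t fdct_round_shift_14(int32_t x) {
    return ROUND_POWER_OF_TWO(x, 14);
}

/* Illustrative checks: 2^13 = 8192 is the rounding constant. */
void fdct_round_shift_demo(void) {
    assert(fdct_round_shift_14(8191) == 0);   /* just under one half    */
    assert(fdct_round_shift_14(8192) == 1);   /* exactly one half: up   */
    assert(fdct_round_shift_14(24576) == 2);  /* 1.5 rounds up to 2     */
    assert(fdct_round_shift_14(-8193) == -1); /* assumes arithmetic >>  */
}
```

Adding `1 << (n-1)` before shifting is what turns the truncating shift into
round-to-nearest, with exact halves rounded towards positive infinity.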

These operations are at the core of **ALL** FDCT calculations in many
major video codecs, including (but not limited to) VP8/VP9 and AV1.
ARM includes special instructions to optimize these operations, exposed
through the NEON intrinsics `vqrdmulhq_s16`/`vqrdmulhq_s32`, although
they are limited in precision.

The suggestion is to have a single instruction that calculates both values,
`((a + b) * c) >> N` and `((a - b) * c) >> N`.  The instruction will
run in accumulate mode, so in order to calculate the 2-coeff version
one would just have to call the same instruction with `a` and `b` in a
different order and with a different constant `c`.

Example taken from libvpx
<https://chromium.googlesource.com/webm/libvpx/+/refs/heads/main/vpx_dsp/fwd_txfm.c#132>:

```
    #include <stdint.h>
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
    void twin_int(int16_t *t, int16_t x0, int16_t x1, int16_t cospi_16_64) {
        t[0] = ROUND_POWER_OF_TWO((x0 + x1) * cospi_16_64, 14);
        t[1] = ROUND_POWER_OF_TWO((x0 - x1) * cospi_16_64, 14);
    }
```

Eight instructions are required, replaced by just one (`maddsubrs`):

```
    add 9,5,4
    subf 5,5,4
    mullw 9,9,6
    mullw 5,5,6
    addi 9,9,8192
    addi 5,5,8192
    srawi 9,9,14
    srawi 5,5,14
```

-------

\newpage{}

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

A-Form

```
    |0     |6     |11      |16     |21      |26    |31 |
    | PO   |  RT  |   RA   |   RB  |   SH   |   XO |Rc |
```

* maddsubrs  RT,RA,SH,RB

Pseudo-code:

```
    n <- SH
    sum <- (RT) + (RA)
    diff <- (RT) - (RA)
    prod1 <- MULS(RB, sum)
    prod1_lo <- prod1[XLEN:(XLEN*2)-1]
    prod2 <- MULS(RB, diff)
    prod2_lo <- prod2[XLEN:(XLEN*2)-1]
    if n = 0 then
        RT <- prod1_lo
        RS <- prod2_lo
    else
        round <- [0]*XLEN
        round[XLEN -n] <- 1
        prod1_lo <- prod1_lo + round
        prod2_lo <- prod2_lo + round
        m <- MASK(n, (XLEN-1))
        res1 <- ROTL64(prod1_lo, XLEN-n) & m
        res2 <- ROTL64(prod2_lo, XLEN-n) & m
        signbit1 <- prod1_lo[0]
        signbit2 <- prod2_lo[0]
        smask1 <- ([signbit1]*XLEN) & ¬m
        smask2 <- ([signbit2]*XLEN) & ¬m
        RT <- (res1 | smask1)
        RS <- (res2 | smask2)
```
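The pseudo-code above can be paraphrased in C as follows. This is a sketch
for XLEN=64 only, not the reference implementation: the names `twin_result`,
`round_shift64` and `maddsubrs_model` are hypothetical, `SH` is assumed to be
limited to 0..31 by its 5-bit field, and the final conversion to `int64_t`
assumes two's-complement wrapping (as GCC and Clang provide).

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int64_t rt, rs; } twin_result;

/* Add 2^(n-1), then shift right by n with explicit sign fill,
   mirroring the round / MASK / smask steps of the pseudo-code. */
static int64_t round_shift64(uint64_t v, unsigned n) {
    if (n == 0) return (int64_t)v;
    v += (uint64_t)1 << (n - 1);              /* round[XLEN-n] <- 1      */
    uint64_t m = ~(uint64_t)0 >> n;           /* MASK(n, XLEN-1)         */
    uint64_t res = (v >> n) & m;              /* ROTL64(v, XLEN-n) & m   */
    uint64_t smask = (v >> 63) ? ~m : 0;      /* sign bits above the mask */
    return (int64_t)(res | smask);
}

twin_result maddsubrs_model(int64_t rt, int64_t ra, unsigned sh, int64_t rb) {
    assert(sh < 32);                          /* SH is a 5-bit field */
    uint64_t sum  = (uint64_t)rt + (uint64_t)ra;
    uint64_t diff = (uint64_t)rt - (uint64_t)ra;
    /* The low XLEN bits of the signed product equal the wrapped
       unsigned product, so no 128-bit arithmetic is needed here. */
    twin_result r = { round_shift64((uint64_t)rb * sum, sh),
                      round_shift64((uint64_t)rb * diff, sh) };
    return r;
}
```

For example, `maddsubrs_model(100, 50, 14, 11585)` yields `rt = 106` and
`rs = 35`, matching `ROUND_POWER_OF_TWO((x0 + x1) * c, 14)` and
`ROUND_POWER_OF_TWO((x0 - x1) * c, 14)` for `x0 = 100`, `x1 = 50`,
`c = 11585`.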

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`.  For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).
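The register numbering described above can be sketched as a small helper.
The function name `implicit_rs_reg` and its parameters are hypothetical, and
the sketch deliberately ignores element-width overrides and REMAP, which may
change the mapping in a real SVP64 implementation.

```c
#include <assert.h>

/* Hypothetical helper: register number holding element `i` of the
   implicit RS result, for default element width and no REMAP. */
unsigned implicit_rs_reg(unsigned rt, int rt_is_vector,
                         unsigned maxvl, unsigned i) {
    if (!rt_is_vector) {
        assert(i == 0);        /* a Scalar has a single element   */
        return rt + 1;         /* Scalar: RS = RT+1               */
    }
    assert(i < maxvl);
    return rt + maxvl + i;     /* Vector: RS starts after RT's MAXVL registers */
}
```

With a Scalar `RT` of 1 the helper returns 2; with a Vector `RT` starting at
register 8 and `MAXVL=4`, the `RS` elements occupy registers 12 to 15.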

Special Registers Altered:

```
    None
```

## [DRAFT] Integer Butterfly Multiply Add/Sub and Accumulate FFT/DCT

A-Form

* maddrs  RT,RA,SH,RB

Pseudo-code:

    n <- SH
    prod <- MULS(RB, RA)
    prod_lo <- prod[XLEN:(XLEN*2)-1]
    if n = 0 then
        RT <- (RT) + prod_lo
        RS <- (RS) - prod_lo
    else
        res1 <- (RT) + prod_lo
        res2 <- (RS) - prod_lo
        round <- [0]*XLEN
        round[XLEN -n] <- 1
        res1 <- res1 + round
        res2 <- res2 + round
        signbit1 <- res1[0]
        signbit2 <- res2[0]
        m <- MASK(n, (XLEN-1))
        res1 <- ROTL64(res1, XLEN-n) & m
        res2 <- ROTL64(res2, XLEN-n) & m
        smask1 <- ([signbit1]*XLEN) & ¬m
        smask2 <- ([signbit2]*XLEN) & ¬m
        RT <- (res1 | smask1)
        RS <- (res2 | smask2)

Special Registers Altered:

    None

Similar to `RTp`, this instruction produces an implicit result, `RS`,
which under Scalar circumstances is defined as `RT+1`.  For SVP64 if
`RT` is a Vector, `RS` begins immediately after the Vector `RT` where
the length of `RT` is set by `SVSTATE.MAXVL` (Max Vector Length).

This instruction is intended to complement `maddsubrs` in producing the
double-coefficient butterfly. For that to work, `c2-c1` must be passed as
the coefficient instead of `c2`.

In essence, we first calculate the quantity `a * c1 +/- b * c1` with
`maddsubrs` *without* shifting (so `SH=0`), then add `b * (c2-c1)` to the
previous `RT` and subtract it from the previous `RS`, and *then* do the
shifting.

In the following example, assume `a` in `R1`, `b` in `R10`, `c1` in `R11`
and `c2 - c1` in `R12`.  The first instruction puts `a * c1 + b * c1` in
`R1` (`RT`) and `a * c1 - b * c1` in `RS` (here `RS = RT+1`, so `R2`).
Then `maddrs` adds `b * (c2 - c1)` to `R1` (`RT`), subtracts it from `R2`
(`RS`), and rounds and shifts both quantities right by 14 bits:

```
    maddsubrs 1,10,0,11
    maddrs 1,10,14,12
```

In scalar code, that would take ~16 instructions for both operations.
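The two-instruction sequence can be checked numerically with a small C
sketch. The function names, the `twin_result` type and the sample
coefficients (11585 and 15137, roughly 2^14·cos(π/4) and 2^14·cos(π/8)) are
illustrative assumptions. The sketch is valid only for operands small enough
not to overflow 64 bits, and it assumes `>>` on a negative `int64_t` is an
arithmetic shift.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int64_t rt, rs; } twin_result;

/* Round-to-nearest arithmetic shift right by n (n = 0: no shift). */
static int64_t round_shift(int64_t v, unsigned n) {
    return n == 0 ? v : (v + ((int64_t)1 << (n - 1))) >> n;
}

/* maddsubrs RT,RA,SH,RB: RT <- (RT+RA)*RB, RS <- (RT-RA)*RB */
static twin_result maddsubrs_model(int64_t rt, int64_t ra,
                                   unsigned sh, int64_t rb) {
    twin_result r = { round_shift((rt + ra) * rb, sh),
                      round_shift((rt - ra) * rb, sh) };
    return r;
}

/* maddrs RT,RA,SH,RB: RT <- RT + RA*RB, RS <- RS - RA*RB */
static twin_result maddrs_model(twin_result acc, int64_t ra,
                                unsigned sh, int64_t rb) {
    twin_result r = { round_shift(acc.rt + ra * rb, sh),
                      round_shift(acc.rs - ra * rb, sh) };
    return r;
}

/* a*c1 +/- b*c2, rounded and shifted right by 14 bits. */
twin_result butterfly_2coeff(int64_t a, int64_t b, int64_t c1, int64_t c2) {
    twin_result t = maddsubrs_model(a, b, 0, c1);  /* maddsubrs 1,10,0,11 */
    return maddrs_model(t, b, 14, c2 - c1);        /* maddrs 1,10,14,12   */
}

void butterfly_2coeff_demo(void) {
    twin_result r = butterfly_2coeff(100, 50, 11585, 15137);
    assert(r.rt == 117);  /* (100*11585 + 50*15137 + 8192) >> 14 */
    assert(r.rs == 25);   /* (100*11585 - 50*15137 + 8192) >> 14 */
}
```

The identity being exercised is
`a*c1 +/- b*c1 +/- b*(c2-c1) = a*c1 +/- b*c2`, which is why the second
instruction is given `c2 - c1` rather than `c2`.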

-------

\newpage{}

# Twin Butterfly Floating-Point DCT and FFT Instruction(s)

**Add the following to Book I Section 4.6.6.3**

## Floating-Point Twin Multiply-Add DCT [Single]

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)
```

The two IEEE754-FP32 operations

```
    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result is stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT, and
the result is rounded before being multiplied by FRA to create the
result that is stored in FRT.

The add into FRS is treated exactly as `fadds`.  The creation of the
result FRT is **not** the same as that of `fmsubs`, but is instead as if
`fsubs` were performed first followed by `fmuls`.  The creation of FRS
and FRT are treated as parallel independent operations which occur at
the same time.
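The operand flow described above can be sketched in C. The names
`fp32_twin`, `fdmadds_model` and `fdmadds_demo` are hypothetical, and the
sketch assumes `float` is IEEE754 FP32 evaluated without excess precision.
It does not model FPSCR status bits, rounding-mode selection or NaN
handling.

```c
#include <assert.h>

typedef struct { float frt, frs; } fp32_twin;

/* Sketch of fdmadds: both results use the original FRT and FRB. */
fp32_twin fdmadds_model(float frt, float fra, float frb) {
    fp32_twin r;
    r.frs = frt + frb;               /* as fadds                        */
    volatile float sub = frt - frb;  /* as fsubs: rounded to FP32 first */
    r.frt = fra * sub;               /* then as fmuls (not fused)       */
    return r;
}

void fdmadds_demo(void) {
    fp32_twin r = fdmadds_model(3.0f, 0.5f, 1.0f);
    assert(r.frs == 4.0f);           /* 3 + 1         */
    assert(r.frt == 1.0f);           /* (3 - 1) * 0.5 */
}
```

The `volatile` intermediate is there only to force the difference to be
rounded to FP32 before the multiply, which is the point where this
instruction differs from a fused operation.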

Note that if Rc=1, an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## Floating-Point Twin Multiply-Add FFT [Single]

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
```

The two operations

```
    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <-   [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create
FRS, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is added to this intermediate result, and the
result is stored in FRT.

FRT is created as if a `fmadds` operation had been performed. FRS is
created as if a `fnmsubs` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.
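The sign conventions of the two results can be illustrated with a short C
sketch. The names `fp32_twin`, `ffmadds_model` and `ffmadds_demo` are
hypothetical. A `double` intermediate is used as a stand-in for the fused
multiply-add: the FP32 product is exact in `double`, but the final result is
rounded twice, so in rare cases it may differ in the last bit from a true
`fmadds`/`fnmsubs`. FPSCR effects are not modelled.

```c
#include <assert.h>

typedef struct { float frt, frs; } fp32_twin;

/* Sketch of ffmadds: both results use the original FRT, FRA and FRB. */
fp32_twin ffmadds_model(float frt, float fra, float frb) {
    /* FP32 x FP32 is exact in double (24+24 significand bits <= 53). */
    double prod = (double)frt * (double)fra;
    fp32_twin r;
    r.frs = (float)-(prod - (double)frb);  /* as fnmsubs: -(FRT*FRA - FRB) */
    r.frt = (float)(prod + (double)frb);   /* as fmadds:    FRT*FRA + FRB  */
    return r;
}

void ffmadds_demo(void) {
    fp32_twin r = ffmadds_model(2.0f, 3.0f, 10.0f);
    assert(r.frt == 16.0f);  /* 2*3 + 10    */
    assert(r.frs == 4.0f);   /* -(2*3 - 10) */
}
```

Read as a butterfly, with `FRB` holding one input and `FRT * FRA` the
twiddled other input, `FRT` receives the sum and `FRS` the difference.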

FRT is a Read-Modify-Write operation.  

Note that if Rc=1, an Illegal Instruction is raised.
Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## Floating-Point Twin Multiply-Add DCT

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* fdmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPADD64(FRT, FRB)
    sub <- FPSUB64(FRT, FRB)
    FRT <- FPMUL64(FRA, sub)
```

The two IEEE754-FP64 operations

```
    FRS <- [(FRT) + (FRB)]
    FRT <- [(FRT) - (FRB)] * (FRA)
```

are simultaneously performed.

The floating-point operand in register FRT is added to the floating-point
operand in register FRB and the result is stored in FRS.

Using the exact same operand input register values from FRT and FRB
that were used to create FRS, the floating-point operand in register
FRB is subtracted from the floating-point operand in register FRT, and
the result is rounded before being multiplied by FRA to create the
result that is stored in FRT.

The add into FRS is treated exactly as `fadd`.  The creation of the
result FRT is **not** the same as that of `fmsub`, but is instead as if
`fsub` were performed first followed by `fmul`.  The creation of FRS
and FRT are treated as parallel independent operations which occur at
the same time.

Note that if Rc=1, an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## Floating-Point Twin Multiply-Add FFT

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* ffmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
```

The two operations

```
    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <-   [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand in
register FRB is subtracted from this intermediate result, and the
negated result is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create
FRS, the floating-point operand in register FRT is multiplied by the
floating-point operand in register FRA. The floating-point operand
in register FRB is added to this intermediate result, and the
result is stored in FRT.

FRT is created as if a `fmadd` operation had been performed. FRS is
created as if a `fnmsub` operation had simultaneously been performed
with the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.

Note that if Rc=1, an Illegal Instruction is raised.  Rc=1 is `RESERVED`.

Similar to `FRTp`, this instruction produces an implicit result, `FRS`,
which under Scalar circumstances is defined as `FRT+1`.  For SVP64 if
`FRT` is a Vector, `FRS` begins immediately after the Vector `FRT`
where the length of `FRT` is set by `SVSTATE.MAXVL` (Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```


## Floating-Point Add FFT/DCT [Single]

A-Form

```
    |0     |6     |11      |16     |21      |26    |31 |
    | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
```

* ffadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
```

## Floating-Point Add FFT/DCT [Double]

A-Form

```
    |0     |6     |11      |16     |21      |26    |31 |
    | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
```

* ffadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
```

## Floating-Point Subtract FFT/DCT [Single]

A-Form

```
    |0     |6     |11      |16     |21      |26    |31 |
    | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
```

* ffsubs FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)
```
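Comparing this pseudo-code with that of `ffadds`, the two instructions
compute the same pair of values and differ only in which result goes to
`FRT` and which to `FRS`. A minimal C sketch of both follows; the names
`fp32_twin`, `ffadds_model`, `ffsubs_model` and `ffadds_ffsubs_demo` are
hypothetical, and FPSCR effects are not modelled.

```c
#include <assert.h>

typedef struct { float frt, frs; } fp32_twin;

/* ffadds: FRT <- FRA + FRB, FRS <- FRB - FRA */
fp32_twin ffadds_model(float fra, float frb) {
    fp32_twin r = { fra + frb, frb - fra };
    return r;
}

/* ffsubs: FRT <- FRB - FRA, FRS <- FRA + FRB */
fp32_twin ffsubs_model(float fra, float frb) {
    fp32_twin r = { frb - fra, fra + frb };
    return r;
}

void ffadds_ffsubs_demo(void) {
    fp32_twin a = ffadds_model(1.5f, 4.0f);
    fp32_twin s = ffsubs_model(1.5f, 4.0f);
    assert(a.frt == 5.5f && a.frs == 2.5f);
    assert(s.frt == a.frs && s.frs == a.frt);  /* same pair, swapped */
}
```

Note that the difference is `FRB - FRA`, not `FRA - FRB`, in both
instructions.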

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
```

## Floating-Point Subtract FFT/DCT [Double]

A-Form

```
    |0     |6     |11      |16     |21      |26    |31 |
    | PO   | FRT  |  FRA   |  FRB  |     /  |   XO |Rc |
```

* ffsub FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
```