# Introduction

<!-- hide -->
* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
* [[openpower/sv/rfc/ls016]]

<!-- show -->

# Rationale for Twin Butterfly Integer DCT Instruction(s)

DCT has a huge number of general-purpose uses. Without these
Twin-Butterfly instructions the number of instructions needed is
also huge (**eight**), and because it is extremely common to
explicitly loop-unroll them, sequences of hundreds to thousands
of instructions are dismayingly common (for all ISAs).

The goal is to implement instructions that calculate the expression:

```
    fdct_round_shift((a +/- b) * c)
```

for the single-coefficient butterfly instruction, and:

```
    fdct_round_shift(a * c1  +/- b * c2)
```

for the double-coefficient butterfly instruction.

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`, where:

```
    #define ROUND_POWER_OF_TWO(value, n) \
            (((value) + (1 << ((n)-1))) >> (n))
```
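As an illustration, the macro can be transliterated into Python as below. This is a minimal sketch: the function names mirror the C names above, `DCT_CONST_BITS` is a name assumed here for the constant 14, and Python's `>>` on negative integers is an arithmetic (floor) shift, which C compilers usually, but not necessarily, provide for signed values.

```python
DCT_CONST_BITS = 14  # assumed name for the shift amount used by fdct_round_shift


def round_power_of_two(value: int, n: int) -> int:
    """Add half of 2**n, then shift right by n (Python's >> floors)."""
    return (value + (1 << (n - 1))) >> n


def fdct_round_shift(x: int) -> int:
    """ROUND_POWER_OF_TWO(x, 14), as defined in the text above."""
    return round_power_of_two(x, DCT_CONST_BITS)


# Exactly half (8192 / 16384) rounds up; just below half rounds down.
print(fdct_round_shift(8192), fdct_round_shift(8191))
```

Adding `1 << (n - 1)` before the shift is what turns a truncating shift into round-half-up.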

These instructions are at the core of **ALL** FDCT calculations in many major video codecs, including but not limited to VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations (exposed as the `vqrdmulhq_s16`/`vqrdmulhq_s32` intrinsics), although they are limited in precision.

The suggestion is to have a single instruction that calculates both values, `((a + b) * c) >> N` and `((a - b) * c) >> N`.
The instruction will run in accumulate mode, so to calculate the two-coefficient version one would call the same instruction again with `a` and `b` in a different order and a different constant `c`.
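The pair of values described above can be sketched in Python as follows. This is only an arithmetic illustration, not a model of the instruction: `twin_butterfly` and `round_shift` are hypothetical helper names, the shift is assumed to be the rounding shift of `fdct_round_shift`, and unbounded Python integers are used, so register-width effects are ignored.

```python
def round_shift(x: int, n: int) -> int:
    """Rounding right shift, matching ROUND_POWER_OF_TWO for n >= 1."""
    return (x + (1 << (n - 1))) >> n


def twin_butterfly(a: int, b: int, c: int, n: int = 14) -> tuple[int, int]:
    """Return (round_shift((a + b) * c), round_shift((a - b) * c)).

    Plain unbounded Python integers are used, so register-width
    wrap-around (XLEN) is deliberately not modelled.
    """
    return round_shift((a + b) * c, n), round_shift((a - b) * c, n)


# 11585 is round(cos(pi/4) * 2**14), used purely as an illustrative constant.
plus, minus = twin_butterfly(100, 20, 11585)
print(plus, minus)
```

Both outputs share the same inputs and the same coefficient, which is why a single instruction with two destination registers covers them.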

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

A-Form

```
    |0     |6     |11      |16     |21      |26    |31 |
    | PO   |  RT  |   RA   |   RB  |   SH   |   XO |Rc |

```

* maddsubrs  RT,RA,SH,RB

Pseudo-code:

```
    n <- SH
    sum <- (RT) + (RA)
    diff <- (RT) - (RA)
    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1]
    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1]
    res1 <- ROTL64(prod1, XLEN-n)
    res2 <- ROTL64(prod2, XLEN-n)
    m <- MASK(n, (XLEN-1))
    signbit1 <- res1[0]
    signbit2 <- res2[0]
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    s64_1 <- [0]*(XLEN-1) || signbit1
    s64_2 <- [0]*(XLEN-1) || signbit2
    RT <- (res1 & m | smask1) + s64_1
    RS <- (res2 & m | smask2) + s64_2
```

Note that if Rc=1 an Illegal Instruction is raised.
Rc=1 is `RESERVED`

Similar to `RTp`, this instruction produces an implicit result,
`RS`, which under Scalar circumstances is defined as `RT+1`.
For SVP64 if `RT` is a Vector, `RS` begins immediately after the
Vector `RT` where the length of `RT` is set by `SVSTATE.MAXVL`
(Max Vector Length).
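The register-numbering rule above can be written down as a small helper. This is a sketch under stated assumptions: `implicit_rs` is a hypothetical function, not part of any simulator API, and it ignores SVP64 details such as register-number extension and element-width overrides.

```python
def implicit_rs(rt: int, rt_is_vector: bool = False, maxvl: int = 1) -> int:
    """Return the register number where the implicit RS result starts.

    Scalar: RS = RT + 1.
    Vector: RS starts right after the MAXVL elements of RT,
    i.e. RS = RT + MAXVL.
    """
    if rt_is_vector:
        return rt + maxvl
    return rt + 1


print(implicit_rs(4))                 # scalar RT=4
print(implicit_rs(8, True, maxvl=4))  # vector RT=8, MAXVL=4
```

With these inputs the scalar case yields register 5 and the vector case yields register 12, so the two result vectors occupy adjacent, non-overlapping register ranges.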

Special Registers Altered:

```
    None
```

# Twin Butterfly Floating-Point DCT and FFT Instruction(s)

## Floating Twin Multiply-Add DCT [Single]

**Add the following to Book I Section 4.6.6.3**

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPADD32(FRT, FRB)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, -1)
```

The Floating-Point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB that
were used to create FRS, the Floating-Point operand in register FRB
is subtracted from the floating-point operand in register FRT and the
result then multiplied by FRA to create an intermediate result that is
stored in FRT.

The add into FRS is treated exactly as `fadds`.  The creation
of the result FRT is exactly that of `fmsubs`.  FRS and FRT are created
as parallel independent operations which occur at the same time.

Note that if Rc=1 an Illegal Instruction is raised.
Rc=1 is `RESERVED`

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## Floating Multiply-Add FFT [Single]

**Add the following to Book I Section 4.6.6.3**

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
```

The two operations

```
    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <-   [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied
by the floating-point operand in register FRA. The floating-point
operand in register FRB is subtracted from this intermediate result,
the difference is negated, and the result is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create FRS,
the floating-point operand in register FRT is multiplied
by the floating-point operand in register FRA. The floating-point
operand in register FRB is added to this intermediate result,
and the result is stored in FRT.

FRT is created as if
a `fmadds` operation had been performed. FRS is created as if
a `fnmsubs` operation had simultaneously been performed with
the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.  

Note that if Rc=1 an Illegal Instruction is raised.
Rc=1 is `RESERVED`

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).


Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## Floating Twin Multiply-Add DCT

**Add the following to Book I Section 4.6.6.3**

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* fdmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPADD64(FRT, FRB)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, -1)
```

The Floating-Point operand in register FRT is added to the floating-point
operand in register FRB and the result stored in FRS.

Using the exact same operand input register values from FRT and FRB that
were used to create FRS, the Floating-Point operand in register FRB
is subtracted from the floating-point operand in register FRT and the
result then multiplied by FRA to create an intermediate result that is
stored in FRT.

The add into FRS is treated exactly as `fadd`.  The creation
of the result FRT is exactly that of `fmsub`.  FRS and FRT are created
as parallel independent operations which occur at the same time.

Note that if Rc=1 an Illegal Instruction is raised.
Rc=1 is `RESERVED`

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```

## Floating Twin Multiply-Add FFT

**Add the following to Book I Section 4.6.6.3**

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |Rc |
```

* ffmadd FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPMULADD64(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD64(FRT, FRA, FRB, 1, 1)
```

The two operations

```
    FRS <- -([(FRT) * (FRA)] - (FRB))
    FRT <-   [(FRT) * (FRA)] + (FRB)
```

are performed.

The floating-point operand in register FRT is multiplied
by the floating-point operand in register FRA. The floating-point
operand in register FRB is subtracted from this intermediate result,
the difference is negated, and the result is stored in FRS.

Using the exact same values of FRT, FRA and FRB as used to create FRS,
the floating-point operand in register FRT is multiplied
by the floating-point operand in register FRA. The floating-point
operand in register FRB is added to this intermediate result,
and the result is stored in FRT.

FRT is created as if
a `fmadd` operation had been performed. FRS is created as if
a `fnmsub` operation had simultaneously been performed with
the exact same register operands, in parallel, independently,
at exactly the same time.

FRT is a Read-Modify-Write operation.  
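The two parallel results can be sketched in Python, whose `float` is binary64. This is an illustrative model only: `fpmuladd64` here is a stand-in written for this example, with the sign arguments read as `mulsign*(a*b) + addsign*c` to agree with the two formulas above. It uses exact rational arithmetic to imitate the single rounding of a fused multiply-add for finite inputs, and it does not model NaNs, infinities, signed zeros or any FPSCR flags.

```python
from fractions import Fraction


def fpmuladd64(a: float, b: float, c: float, mulsign: int, addsign: int) -> float:
    """Compute mulsign*(a*b) + addsign*c with a single final rounding.

    The exact value is formed with Fractions and rounded once to
    binary64, imitating a fused multiply-add for finite inputs.
    """
    exact = mulsign * Fraction(a) * Fraction(b) + addsign * Fraction(c)
    return float(exact)


def ffmadd(frt: float, fra: float, frb: float) -> tuple[float, float]:
    """Return (new FRT, FRS), both computed from the original inputs."""
    frs = fpmuladd64(frt, fra, frb, -1, 1)     # -((FRT * FRA) - FRB)
    new_frt = fpmuladd64(frt, fra, frb, 1, 1)  #   (FRT * FRA) + FRB
    return new_frt, frs


print(ffmadd(1.5, 2.0, 0.25))
```

Both results are computed before either is returned, mirroring the Read-Modify-Write behaviour of FRT: the old value of FRT feeds both operations.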

Note that if Rc=1 an Illegal Instruction is raised.
Rc=1 is `RESERVED`

Similar to `FRTp`, this instruction produces an implicit result,
`FRS`, which under Scalar circumstances is defined as `FRT+1`.
For SVP64 if `FRT` is a Vector, `FRS` begins immediately after the
Vector `FRT` where the length of `FRT` is set by `SVSTATE.MAXVL`
(Max Vector Length).

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```


## [DRAFT] Floating Add FFT/DCT [Single]

A-Form

* ffadds FRT,FRA,FRB (Rc=0)
* ffadds. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPADD32(FRA, FRB)
    FRS <- FPSUB32(FRB, FRA)
```
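A minimal Python sketch of the draft pseudo-code follows. It makes two assumptions: `FPSUB32(x, y)` is read as `x - y`, and single precision is imitated by rounding through `struct` to binary32. The helper names `to_f32` and `ffadds` are chosen for this example; exception flags and CR1 are not modelled.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def ffadds(fra: float, frb: float) -> tuple[float, float]:
    """Return (FRT, FRS) = (FRA + FRB, FRB - FRA), each rounded to binary32."""
    a, b = to_f32(fra), to_f32(frb)
    frt = to_f32(a + b)  # FPADD32(FRA, FRB)
    frs = to_f32(b - a)  # FPSUB32(FRB, FRA), assumed to mean FRB - FRA
    return frt, frs


print(ffadds(1.5, 0.25))
```

For the inputs shown, the sum 1.75 goes to FRT and the difference -1.25 goes to FRS.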

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```

## [DRAFT] Floating Add FFT/DCT [Double]

A-Form

* ffadd FRT,FRA,FRB (Rc=0)
* ffadd. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPADD64(FRA, FRB)
    FRS <- FPSUB64(FRB, FRA)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```

## [DRAFT] Floating Subtract FFT/DCT [Single]

A-Form

* ffsubs FRT,FRA,FRB (Rc=0)
* ffsubs. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPSUB32(FRB, FRA)
    FRS <- FPADD32(FRA, FRB)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```

## [DRAFT] Floating Subtract FFT/DCT [Double]

A-Form

* ffsub FRT,FRA,FRB (Rc=0)
* ffsub. FRT,FRA,FRB (Rc=1)

Pseudo-code:

```
    FRT <- FPSUB64(FRB, FRA)
    FRS <- FPADD64(FRA, FRB)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI
    CR1          (if Rc=1)
```