# Introduction

<!-- hide -->
* <https://bugs.libre-soc.org/show_bug.cgi?id=1074>
* <https://libre-soc.org/openpower/sv/biginteger/> for format and
  information about implicit RS/FRS
* <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_dct.py;hb=HEAD>
* [[openpower/isa/svfparith]]
* [[openpower/isa/svfixedarith]]
<!-- show -->

# Rationale for Twin Butterfly Integer DCT Instruction(s)

DCT has a very wide range of general-purpose uses. Without
these Twin-Butterfly instructions, each butterfly takes
**eight** instructions, and because it is extremely common
to explicitly loop-unroll them, sequences of hundreds to
thousands of instructions are dismayingly common
(for all ISAs).

The goal is to implement instructions that calculate the expression:

```
    fdct_round_shift((a +/- b) * c)
```

For the single-coefficient butterfly instruction, and:

```
    fdct_round_shift(a * c1  +/- b * c2)
```

For the double-coefficient butterfly instruction.

`fdct_round_shift` is defined as `ROUND_POWER_OF_TWO(x, 14)`, where:

```
    #define ROUND_POWER_OF_TWO(value, n) (((value) + (1 << ((n)-1))) >> (n))
```
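As a rough illustration of the two expressions above, here is a minimal Python sketch. The function names `round_power_of_two`, `butterfly_single` and `butterfly_double` are made up for this example, and the coefficient `11585` used in the usage note is just a sample value. It relies on Python's `>>` being an arithmetic shift on negative integers.

```python
DCT_CONST_BITS = 14  # the shift amount quoted in the text above


def round_power_of_two(value, n):
    """Python equivalent of the ROUND_POWER_OF_TWO macro (requires n >= 1)."""
    return (value + (1 << (n - 1))) >> n


def fdct_round_shift(x):
    """ROUND_POWER_OF_TWO(x, 14), as defined above."""
    return round_power_of_two(x, DCT_CONST_BITS)


def butterfly_single(a, b, c):
    """Single-coefficient butterfly: both the '+' and the '-' result."""
    return fdct_round_shift((a + b) * c), fdct_round_shift((a - b) * c)


def butterfly_double(a, b, c1, c2):
    """Double-coefficient butterfly: both the '+' and the '-' result."""
    return (fdct_round_shift(a * c1 + b * c2),
            fdct_round_shift(a * c1 - b * c2))
```

For example, `butterfly_single(100, 50, 11585)` evaluates `(150 * 11585 + 8192) >> 14` and `(50 * 11585 + 8192) >> 14`.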

These operations are at the core of **ALL** FDCT calculations in many major video codecs, including but not limited to VP8/VP9 and AV1.
Arm includes special instructions to optimize these operations, exposed through the intrinsics `vqrdmulhq_s16`/`vqrdmulhq_s32`, although they are limited in precision.

The suggestion is to have a single instruction that calculates both values, `((a + b) * c) >> N` and `((a - b) * c) >> N`.
The instruction will run in accumulate mode, so to calculate the two-coefficient version one would just call the same instruction again with `a` and `b` in a different order and a different constant `c`.

## Integer Butterfly Multiply Add/Sub FFT/DCT

**Add the following to Book I Section 3.3.9.1**

A-Form

```
    |0     |6     |11      |16     |21      |26    |31 |
    | PO   |  RT  |   RA   |   RB  |   SH   |   XO |/  |
```

* maddsubrs  RT,RA,SH,RB

Pseudo-code:

```
    n <- SH
    sum <- (RT) + (RA)
    diff <- (RT) - (RA)
    prod1 <- MULS(RB, sum)[XLEN:(XLEN*2)-1]
    prod2 <- MULS(RB, diff)[XLEN:(XLEN*2)-1]
    res1 <- ROTL64(prod1, XLEN-n)
    res2 <- ROTL64(prod2, XLEN-n)
    m <- MASK(n, (XLEN-1))
    signbit1 <- res1[0]
    signbit2 <- res2[0]
    smask1 <- ([signbit1]*XLEN) & ¬m
    smask2 <- ([signbit2]*XLEN) & ¬m
    s64_1 <- [0]*(XLEN-1) || signbit1
    s64_2 <- [0]*(XLEN-1) || signbit2
    RT <- (res1 & m | smask1) + s64_1
    RS <- (res2 & m | smask2) + s64_2
```

Special Registers Altered:

```
    None
```
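The arithmetic intent of `maddsubrs` can be sketched in Python as below. This is only a model of the rationale (`((a + b) * c) >> N` and `((a - b) * c) >> N` with rounding), assuming XLEN=64 and two's-complement wrap-around. It is not a bit-for-bit transcription of the pseudo-code above, and the names `to_signed` and `maddsubrs_model` are hypothetical.

```python
XLEN = 64                 # register width assumed for this sketch
MASK = (1 << XLEN) - 1


def to_signed(x):
    """Interpret the low XLEN bits of x as a two's-complement value."""
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x


def maddsubrs_model(rt, ra, rb, sh):
    """Return (new RT, RS) as unsigned XLEN-bit register values."""
    total = to_signed(rt + ra)                  # sum  <- (RT) + (RA)
    diff = to_signed(rt - ra)                   # diff <- (RT) - (RA)
    prod1 = to_signed(to_signed(rb) * total)    # low XLEN bits of the product
    prod2 = to_signed(to_signed(rb) * diff)
    if sh == 0:
        return prod1 & MASK, prod2 & MASK
    rnd = 1 << (sh - 1)                         # rounding constant
    # Python's >> on a negative int is an arithmetic (sign-preserving) shift.
    res1 = (prod1 + rnd) >> sh
    res2 = (prod2 + rnd) >> sh
    return res1 & MASK, res2 & MASK
```

The results are returned as raw register contents, so a negative value such as -35 comes back as `2**64 - 35`; `to_signed` converts it back for inspection.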

# Twin Butterfly Floating-Point DCT and FFT Instruction(s)

## Floating Twin Multiply-Add DCT [Single]

**Add the following to Book I Section 4.6.6.3**

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |/  |
```

* fdmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPADD32(FRT, FRB)
    sub <- FPSUB32(FRT, FRB)
    FRT <- FPMUL32(FRA, sub)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```
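The three pseudo-code steps of `fdmadds` can be modelled in Python roughly as follows. This is a sketch only: `round_f32` and `fdmadds_model` are names invented here, single-precision rounding is emulated by packing through `struct`, and status flags, NaN handling and overflow (where `struct.pack` raises `OverflowError`) are not modelled.

```python
import struct


def round_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fdmadds_model(frt, fra, frb):
    """Return (new FRT, FRS), following the pseudo-code step by step."""
    frs = round_f32(frt + frb)        # FRS <- FPADD32(FRT, FRB)
    sub = round_f32(frt - frb)        # sub <- FPSUB32(FRT, FRB)
    new_frt = round_f32(fra * sub)    # FRT <- FPMUL32(FRA, sub)
    return new_frt, frs
```

With `FRT=3.0`, `FRA=0.5` and `FRB=1.0`, the model gives `FRS = 4.0` and a new `FRT` of `0.5 * (3.0 - 1.0) = 1.0`.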

## Floating Multiply-Add FFT [Single]

**Add the following to Book I Section 4.6.6.3**

X-Form

```
    |0     |6     |11      |16     |21      |31 |
    | PO   |  FRT |  FRA   |  FRB  |   XO   |/  |
```

* ffmadds FRT,FRA,FRB (Rc=0)

Pseudo-code:

```
    FRS <- FPMULADD32(FRT, FRA, FRB, -1, 1)
    FRT <- FPMULADD32(FRT, FRA, FRB, 1, 1)
```

Special Registers Altered:

```
    FPRF FR FI
    FX OX UX XX
    VXSNAN VXISI VXIMZ
```
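A rough Python model of `ffmadds` is sketched below. It rests on an assumption that the text does not confirm: that `FPMULADD32(a, c, b, mulsign, addsign)` computes `mulsign * (a * c) + addsign * b`, rounded to single precision. The names `fpmuladd32_model` and `ffmadds_model` are hypothetical, and because the product and sum are evaluated in double precision before a final rounding, results may differ in the last bit from a true fused operation.

```python
import struct


def round_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fpmuladd32_model(a, c, b, mulsign, addsign):
    # ASSUMPTION: argument order and sign semantics are guessed from the
    # pseudo-code above; they are not defined in this section.
    return round_f32(mulsign * (a * c) + addsign * b)


def ffmadds_model(frt, fra, frb):
    """Return (new FRT, FRS) under the assumption stated above."""
    frs = fpmuladd32_model(frt, fra, frb, -1, 1)     # FRS <- -(FRT*FRA) + FRB
    new_frt = fpmuladd32_model(frt, fra, frb, 1, 1)  # FRT <-  (FRT*FRA) + FRB
    return new_frt, frs
```

Under that assumption, `FRT=2.0`, `FRA=3.0`, `FRB=1.0` yields a new `FRT` of `7.0` and `FRS` of `-5.0`.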