The list of uses for DCT is enormous - well over a hundred.
<https://en.wikipedia.org/wiki/Discrete_cosine_transform#General_applications>
-The number of uses for FFT is also equally known to be extremely high
+The number of uses for FFT, DFT and NTT is likewise known to be extremely high
<https://en.wikipedia.org/wiki/Fast_Fourier_transform#Applications>
ARM has already added `vqrdmulhq_s16/32` instructions as their inclusion
-in any ISA replaces **eight** non-Twin-Butterfly instructions, which
+in any ISA replaces **eight** equivalent non-Twin-Butterfly instructions, which
are often loop-unrolled, resulting in L1 I-Cache stripmining as well
as requiring far greater resources (double the number of intermediate
Vector registers) or much more complex hardware to
**Notes and Observations**:
1. Whilst it is easy to justify these high-value instructions they are
- sufficiently complex as to warrant placement as optional SFFS in
- the new EXT2xx area (marked as Vectoriseable).
+ sufficiently complex as to warrant placement as optional SFFS in the
+ new EXT2xx area (marked as Vectoriseable).
2. Although they are 3-in 2-out the actual encoding is as a double-overwrite
reducing the actual number of operands down to three (RT RA and RB)
- where RT is a Read-Modify-Write and an additional RS (normally RT+1) is implicit.
+ where RT is a Read-Modify-Write and an additional RS (normally RT+1)
+ is implicit.
3. As with the biginteger set of 3-in 2-out instructions if Power ISA did not
- already have LD/ST-with-Update, Load/Store-Quad, and other RTp and RAp instructions,
- these instructions would not be proposed.
+ already have LD/ST-with-Update, Load/Store-Quad, and other RTp and
+ RAp instructions, these instructions would not be proposed.
4. The read and write of two overlapping registers normally requires
- an intermediate register (similar to the justification for CAS - Compare-and-Swap).
- When Vectorised the situation becomes even worse: an entire *Vector*
- of intermediate temporaries is required.
- Thus *even if implemented inefficiently* requiring more cycles to complete
- (taking an extra cycle to write the second result) these instructions still
- save on resources.
+ an intermediate register (similar to the justification for CAS -
+ Compare-and-Swap). When Vectorised the situation becomes even
+ worse: an entire *Vector* of intermediate temporaries is required.
+ Thus *even if implemented inefficiently* requiring more cycles to
+ complete (taking an extra cycle to write the second result) these
+ instructions still save on resources.
5. Macro-op fusion equivalents of these instructions are *not possible* for
- exactly the same reason that the equivalent CAS sequence may not be macro-op
- fused. Full in-place Vectorised FFT and DCT algorithms *only* become
- possible due to these instructions atomically reading **both** operands
- into internal Reservation Stations (exactly like CAS).
+ exactly the same reason that the equivalent CAS sequence may not be
+ macro-op fused. Full in-place Vectorised FFT and DCT algorithms *only*
+ become possible due to these instructions atomically reading **both**
+ operands into internal Reservation Stations (exactly like CAS).
6. Although desirable (particularly to detect overflow) Rc=1 is hard to
conceptualise. It is likely that instead, Simple-V "saturation" if
enabled will create an Rc=1 CR.SO flag (including SVP64Single).
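
The register-overlap problem described in notes 4 and 5 can be illustrated with a minimal sketch. It assumes a generic butterfly of the form `(a, b) -> (a + b, (a - b) * coeff)` acting on two slots of a list standing in for a register file. The function names, the operand assignment and the exact arithmetic are hypothetical and are not the defined semantics of the proposed instructions.

```python
# Illustrative model only: "regs" is a plain list standing in for a register
# file, and the butterfly formula is an assumed generic one, not the RFC's
# instruction definition.

def butterfly_naive(regs, i, j, coeff):
    """Two dependent steps with no temporary: step 2 sees a clobbered input."""
    regs[i] = regs[i] + regs[j]            # original regs[i] is lost here
    regs[j] = (regs[i] - regs[j]) * coeff  # wrong: uses the *new* regs[i]


def butterfly_with_temp(regs, i, j, coeff):
    """Correct two-step form, at the cost of one extra (temporary) register."""
    tmp = regs[i]
    regs[i] = tmp + regs[j]
    regs[j] = (tmp - regs[j]) * coeff


def butterfly_atomic(regs, i, j, coeff):
    """Model of a fused operation: read both inputs first, then write both results."""
    a, b = regs[i], regs[j]
    regs[i], regs[j] = a + b, (a - b) * coeff


if __name__ == "__main__":
    for fn in (butterfly_naive, butterfly_with_temp, butterfly_atomic):
        regs = [3, 5]
        fn(regs, 0, 1, 2)
        print(fn.__name__, regs)
```

In this model the naive form yields `[8, 6]` while the other two yield `[8, -4]`. Applied element by element across a Vector, the temporary in `butterfly_with_temp` becomes a whole Vector of temporaries, which is the resource cost the notes refer to.
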
| Form | Book | Page | Version | Mnemonic | Description |
|------|------|------|---------|----------|-------------|
-| A | I | # | 3.2B | maddsubrs | Integer DCT/FFT Twin-Butterfly |
-| X | I | # | 3.2B | fdmadds | FP DCT Twin-Butterfly Single |
-| X | I | # | 3.2B | ffmadds | FP FFT Twin-Butterfly Single |
-| X | I | # | 3.2B | fdmadd | FP DCT Twin-Butterfly Double |
-| X | I | # | 3.2B | ffmadd | FP FFT Twin-Butterfly Double |
-| X | I | # | 3.2B | ffadds | FP FFT Twin-Butterfly Single |
-| X | I | # | 3.2B | ffadd | FP FFT Twin-Butterfly Double |
-| X | I | # | 3.2B | ffsubs | FP FFT Twin-Butterfly Single |
-| X | I | # | 3.2B | ffsub | FP FFT Twin-Butterfly Double |
+| A | I | # | 3.2B |maddsubrs | Integer DCT/FFT Twin-Butterfly |
+| X | I | # | 3.2B |fdmadds | FP DCT Twin-Butterfly Single |
+| X | I | # | 3.2B |ffmadds | FP FFT Twin-Butterfly Single |
+| X | I | # | 3.2B |fdmadd | FP DCT Twin-Butterfly Double |
+| X | I | # | 3.2B |ffmadd | FP FFT Twin-Butterfly Double |
+| X | I | # | 3.2B |ffadds | FP FFT Twin-Butterfly Single |
+| X | I | # | 3.2B |ffadd | FP FFT Twin-Butterfly Double |
+| X | I | # | 3.2B |ffsubs | FP FFT Twin-Butterfly Single |
+| X | I | # | 3.2B |ffsub | FP FFT Twin-Butterfly Double |
[[!tag opf_rfc]]