From b4c37145b5045b1b54fc032656fef26676c5e186 Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Fri, 19 May 2023 19:36:44 +0100
Subject: [PATCH]

---
 openpower/sv/remap.mdwn | 150 ++++++++++++++++------------------------
 1 file changed, 58 insertions(+), 92 deletions(-)

diff --git a/openpower/sv/remap.mdwn b/openpower/sv/remap.mdwn
index 42ec1d269..a9d56ff2d 100644
--- a/openpower/sv/remap.mdwn
+++ b/openpower/sv/remap.mdwn
@@ -117,100 +117,8 @@ and briefly goes over their characteristics and limitations.
 Further details on the Deterministic Precise-Interruptible algorithms
 used in these Schedules is found in the [[sv/remap/appendix]].
 
-### Matrix (1D/2D/3D shaping)
 
 
-
-
-### FFT/DCT Triple Loop
-
-DCT and FFT are some of the most astonishingly used algorithms in
-Computer Science.  Radar, Audio, Video, R.F. Baseband and dozens more.  At least
-two DSPs, TMS320 and Hexagon, have VLIW instructions specially tailored
-to FFT.
-
-An in-depth analysis showed that it is possible to do in-place in-register
-DCT and FFT as long as twin-result "butterfly" instructions are provided.
-These can be found in the [[openpower/isa/svfparith]] page if performing
-IEEE754 FP transforms. *(For fixed-point transforms, equivalent 3-in 2-out
-integer operations would be required)*. These "butterfly" instructions
-avoid the need for a temporary register because the two array positions
-being overwritten will be "in-flight" in any In-Order or Out-of-Order
-micro-architecture.
-
-DCT and FFT Schedules are currently limited to RADIX2 sizes and do not
-accept predicate masks.  Given that it is common to perform recursive
-convolutions combining smaller Power-2 DCT/FFT to create larger DCT/FFTs
-in practice the RADIX2 limit is not a problem.  A Bluestein convolution
-to compute arbitrary length is demonstrated by
-[Project Nayuki](https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py)
-
-### Indexed
-
-The purpose of Indexing is to provide a generalised version of
-Vector ISA "Permute" instructions, such as VSX `vperm`.  The
-Indexing is abstracted out and may be applied to much more
-than an element move/copy, and is not limited for example
-to the number of bytes that can fit into a VSX register.
-Indexing may be applied to LD/ST (even on Indexed LD/ST
-instructions such as `sv.lbzx`), arithmetic operations,
-extsw: there is no artificial limit.
-
-The only major caveat is that the registers to be used as
-Indices must not be modified by any instruction after Indexed Mode
-is established, and neither must MAXVL be altered. Additionally,
-no register used as an Index may exceed MAXVL-1.
-
-Failure to observe
-these conditions results in `UNDEFINED` behaviour.
-These conditions allow a Read-After-Write (RAW) Hazard to be created on
-the entire range of Indices to be subsequently used, but a corresponding
-Write-After-Read Hazard by any instruction that modifies the Indices
-**does not have to be created**. Given the large number of registers
-involved in Indexing this is a huge resource saving and reduction
-in micro-architectural complexity. MAXVL is likewise
-included in the RAW Hazards because it is involved in calculating
-how many registers are to be considered Indices.
-
-With these Hazard Mitigations in place, high-performance implementations
-may read-cache the Indices at the point where a given `svindex` instruction
-is called (or SVSHAPE SPRs - and MAXVL - directly altered) by issuing
-background GPR register file reads whilst other instructions are being
-issued and executed.
-
-The original motivation for Indexed REMAP was to mitigate the need to add
-an expensive `mv.x` to the Scalar ISA, which was likely to be rejected as
-a stand-alone instruction
-(`GPR(RT) <- GPR(GPR(RA))`).  Usually a Vector ISA would add a non-conflicting
-variant (as in VSX `vperm`) but it is common to need to permute by source,
-with the risk of conflict, that has to be resolved, for example, in AVX-512
-with `conflictd`.
-
-Indexed REMAP on the other hand **does not prevent conflicts** (overlapping
-destinations), which on a superficial analysis may be perceived to be a
-problem, until it is recalled that, firstly, Simple-V is designed specifically
-to require Program Order to be respected, and that Matrix, DCT and FFT
-all *already* critically depend on overlapping Reads/Writes: Matrix
-uses overlapping registers as accumulators.  Thus the Register Hazard
-Management needed by Indexed REMAP *has* to be in place anyway.
-
-*Programmer's Note: `hphint` may be used to help hardware identify
-parallelism opportunities but it is critical to remember that the
-groupings are by `FLOOR(step/MAXVL)` not `FLOOR(REMAP(step)/MAXVL)`.*
-
-The cost compared to Matrix and other REMAPs (and Pack/Unpack) is
-clearly that of the additional reading of the GPRs to be used as Indices,
-plus the setup cost associated with creating those same Indices.
-If any Deterministic REMAP can cover the required task, clearly it
-is adviseable to use it instead.
-
-*Programmer's note: some algorithms may require skipping of Indices exceeding
-VL-1, not MAXVL-1. This may be achieved programmatically by performing
-an `sv.cmp *BF,*RA,RB` where RA is the same GPRs used in the Indexed REMAP,
-and RB contains the value of VL returned from `setvl`. The resultant
-CR Fields may then be used as Predicate Masks to exclude those operations
-with an Index exceeding VL-1.*
-
 ### Parallel Reduction
 
 Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture.  Like Scalar reduction, the "Scalar Base"
@@ -740,6 +648,64 @@ SVSHAPEs to simultaneously use the same
 Indices (use the same GPRs), even if one SVSHAPE has different
 2D dimensions and ordering from the others.
 
+**Caveats and Limitations**
+
+The purpose of Indexing is to provide a generalised version of
+Vector ISA "Permute" instructions, such as VSX `vperm`.  The
+Indexing is abstracted out and may be applied to much more
+than an element move/copy, and is not limited for example
+to the number of bytes that can fit into a VSX register.
+Indexing may be applied to LD/ST (even on Indexed LD/ST
+instructions such as `sv.lbzx`), arithmetic operations,
+extsw: there is no artificial limit.
+
+The only major caveat is that the registers to be used as
+Indices must not be modified by any instruction after Indexed Mode
+is established, and neither must MAXVL be altered. Additionally,
+no register used as an Index may exceed MAXVL-1.
+
+Failure to observe
+these conditions results in `UNDEFINED` behaviour.
+These conditions allow a Read-After-Write (RAW) Hazard to be created on
+the entire range of Indices to be subsequently used, but a corresponding
+Write-After-Read Hazard by any instruction that modifies the Indices
+**does not have to be created**. Given the large number of registers
+involved in Indexing this is a huge resource saving and reduction
+in micro-architectural complexity. MAXVL is likewise
+included in the RAW Hazards because it is involved in calculating
+how many registers are to be considered Indices.
+
+With these Hazard Mitigations in place, high-performance implementations
+may read-cache the Indices at the point where a given `svindex` instruction
+is called (or SVSHAPE SPRs - and MAXVL - directly altered) by issuing
+background GPR register file reads whilst other instructions are being
+issued and executed.
+
+Indexed REMAP **does not prevent conflicts** (overlapping
+destinations), which on a superficial analysis may be perceived to be a
+problem, until it is recalled that, firstly, Simple-V is designed specifically
+to require Program Order to be respected, and that Matrix, DCT and FFT
+all *already* critically depend on overlapping Reads/Writes: Matrix
+uses overlapping registers as accumulators.  Thus the Register Hazard
+Management needed by Indexed REMAP *has* to be in place anyway.
+
+*Programmer's Note: `hphint` may be used to help hardware identify
+parallelism opportunities but it is critical to remember that the
+groupings are by `FLOOR(step/MAXVL)` not `FLOOR(REMAP(step)/MAXVL)`.*
+
+The cost compared to Matrix and other REMAPs (and Pack/Unpack) is
+clearly that of the additional reading of the GPRs to be used as Indices,
+plus the setup cost associated with creating those same Indices.
+If any Deterministic REMAP can cover the required task, clearly it
+is adviseable to use it instead.
+
+*Programmer's note: some algorithms may require skipping of Indices exceeding
+VL-1, not MAXVL-1. This may be achieved programmatically by performing
+an `sv.cmp *BF,*RA,RB` where RA is the same GPRs used in the Indexed REMAP,
+and RB contains the value of VL returned from `setvl`. The resultant
+CR Fields may then be used as Predicate Masks to exclude those operations
+with an Index exceeding VL-1.*
+
 -------------
 
 \newpage{}
-- 
2.30.2