From b4c37145b5045b1b54fc032656fef26676c5e186 Mon Sep 17 00:00:00 2001
From: lkcl
Date: Fri, 19 May 2023 19:36:44 +0100
Subject: [PATCH]

---
 openpower/sv/remap.mdwn | 150 ++++++++++++++++------------------------
 1 file changed, 58 insertions(+), 92 deletions(-)

diff --git a/openpower/sv/remap.mdwn b/openpower/sv/remap.mdwn
index 42ec1d269..a9d56ff2d 100644
--- a/openpower/sv/remap.mdwn
+++ b/openpower/sv/remap.mdwn
@@ -117,100 +117,8 @@ and briefly goes over their characteristics and limitations.
 Further details on the Deterministic Precise-Interruptible algorithms
 used in these Schedules is found in the [[sv/remap/appendix]].
 
-### Matrix (1D/2D/3D shaping)
-
-
-### FFT/DCT Triple Loop
-
-DCT and FFT are some of the most astonishingly used algorithms in
-Computer Science. Radar, Audio, Video, R.F. Baseband and dozens more. At least
-two DSPs, TMS320 and Hexagon, have VLIW instructions specially tailored
-to FFT.
-
-An in-depth analysis showed that it is possible to do in-place in-register
-DCT and FFT as long as twin-result "butterfly" instructions are provided.
-These can be found in the [[openpower/isa/svfparith]] page if performing
-IEEE754 FP transforms. *(For fixed-point transforms, equivalent 3-in 2-out
-integer operations would be required)*. These "butterfly" instructions
-avoid the need for a temporary register because the two array positions
-being overwritten will be "in-flight" in any In-Order or Out-of-Order
-micro-architecture.
-
-DCT and FFT Schedules are currently limited to RADIX2 sizes and do not
-accept predicate masks. Given that it is common to perform recursive
-convolutions combining smaller Power-2 DCT/FFT to create larger DCT/FFTs
-in practice the RADIX2 limit is not a problem. A Bluestein convolution
-to compute arbitrary length is demonstrated by
-[Project Nayuki](https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py)
-
-### Indexed
-
-The purpose of Indexing is to provide a generalised version of
-Vector ISA "Permute" instructions, such as VSX `vperm`. The
-Indexing is abstracted out and may be applied to much more
-than an element move/copy, and is not limited for example
-to the number of bytes that can fit into a VSX register.
-Indexing may be applied to LD/ST (even on Indexed LD/ST
-instructions such as `sv.lbzx`), arithmetic operations,
-extsw: there is no artificial limit.
-
-The only major caveat is that the registers to be used as
-Indices must not be modified by any instruction after Indexed Mode
-is established, and neither must MAXVL be altered. Additionally,
-no register used as an Index may exceed MAXVL-1.
-
-Failure to observe
-these conditions results in `UNDEFINED` behaviour.
-These conditions allow a Read-After-Write (RAW) Hazard to be created on
-the entire range of Indices to be subsequently used, but a corresponding
-Write-After-Read Hazard by any instruction that modifies the Indices
-**does not have to be created**. Given the large number of registers
-involved in Indexing this is a huge resource saving and reduction
-in micro-architectural complexity. MAXVL is likewise
-included in the RAW Hazards because it is involved in calculating
-how many registers are to be considered Indices.
-
-With these Hazard Mitigations in place, high-performance implementations
-may read-cache the Indices at the point where a given `svindex` instruction
-is called (or SVSHAPE SPRs - and MAXVL - directly altered) by issuing
-background GPR register file reads whilst other instructions are being
-issued and executed.
-
-The original motivation for Indexed REMAP was to mitigate the need to add
-an expensive `mv.x` to the Scalar ISA, which was likely to be rejected as
-a stand-alone instruction
-(`GPR(RT) <- GPR(GPR(RA))`). Usually a Vector ISA would add a non-conflicting
-variant (as in VSX `vperm`) but it is common to need to permute by source,
-with the risk of conflict, that has to be resolved, for example, in AVX-512
-with `conflictd`.
-
-Indexed REMAP on the other hand **does not prevent conflicts** (overlapping
-destinations), which on a superficial analysis may be perceived to be a
-problem, until it is recalled that, firstly, Simple-V is designed specifically
-to require Program Order to be respected, and that Matrix, DCT and FFT
-all *already* critically depend on overlapping Reads/Writes: Matrix
-uses overlapping registers as accumulators. Thus the Register Hazard
-Management needed by Indexed REMAP *has* to be in place anyway.
-
-*Programmer's Note: `hphint` may be used to help hardware identify
-parallelism opportunities but it is critical to remember that the
-groupings are by `FLOOR(step/MAXVL)` not `FLOOR(REMAP(step)/MAXVL)`.*
-
-The cost compared to Matrix and other REMAPs (and Pack/Unpack) is
-clearly that of the additional reading of the GPRs to be used as Indices,
-plus the setup cost associated with creating those same Indices.
-If any Deterministic REMAP can cover the required task, clearly it
-is adviseable to use it instead.
-
-*Programmer's note: some algorithms may require skipping of Indices exceeding
-VL-1, not MAXVL-1. This may be achieved programmatically by performing
-an `sv.cmp *BF,*RA,RB` where RA is the same GPRs used in the Indexed REMAP,
-and RB contains the value of VL returned from `setvl`. The resultant
-CR Fields may then be used as Predicate Masks to exclude those operations
-with an Index exceeding VL-1.*
-
 ### Parallel Reduction
 
 Vector Reduce Mode issues a deterministic tree-reduction schedule to the
 underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
@@ -740,6 +648,64 @@ SVSHAPEs to simultaneously use the same Indices (use the same GPRs),
 even if one SVSHAPE has different 2D dimensions and ordering from
 the others.
 
+**Caveats and Limitations**
+
+The purpose of Indexing is to provide a generalised version of
+Vector ISA "Permute" instructions, such as VSX `vperm`. The
+Indexing is abstracted out and may be applied to much more
+than an element move/copy, and is not limited for example
+to the number of bytes that can fit into a VSX register.
+Indexing may be applied to LD/ST (even on Indexed LD/ST
+instructions such as `sv.lbzx`), arithmetic operations,
+extsw: there is no artificial limit.
+
+The only major caveat is that the registers to be used as
+Indices must not be modified by any instruction after Indexed Mode
+is established, and neither must MAXVL be altered. Additionally,
+no register used as an Index may exceed MAXVL-1.
+
+Failure to observe
+these conditions results in `UNDEFINED` behaviour.
+These conditions allow a Read-After-Write (RAW) Hazard to be created on
+the entire range of Indices to be subsequently used, but a corresponding
+Write-After-Read Hazard by any instruction that modifies the Indices
+**does not have to be created**. Given the large number of registers
+involved in Indexing this is a huge resource saving and reduction
+in micro-architectural complexity. MAXVL is likewise
+included in the RAW Hazards because it is involved in calculating
+how many registers are to be considered Indices.
+
+With these Hazard Mitigations in place, high-performance implementations
+may read-cache the Indices at the point where a given `svindex` instruction
+is called (or SVSHAPE SPRs - and MAXVL - directly altered) by issuing
+background GPR register file reads whilst other instructions are being
+issued and executed.
+
+Indexed REMAP **does not prevent conflicts** (overlapping
+destinations), which on a superficial analysis may be perceived to be a
+problem, until it is recalled that, firstly, Simple-V is designed specifically
+to require Program Order to be respected, and that Matrix, DCT and FFT
+all *already* critically depend on overlapping Reads/Writes: Matrix
+uses overlapping registers as accumulators. Thus the Register Hazard
+Management needed by Indexed REMAP *has* to be in place anyway.
+
+*Programmer's Note: `hphint` may be used to help hardware identify
+parallelism opportunities but it is critical to remember that the
+groupings are by `FLOOR(step/MAXVL)` not `FLOOR(REMAP(step)/MAXVL)`.*
+
+The cost compared to Matrix and other REMAPs (and Pack/Unpack) is
+clearly that of the additional reading of the GPRs to be used as Indices,
+plus the setup cost associated with creating those same Indices.
+If any Deterministic REMAP can cover the required task, clearly it
+is adviseable to use it instead.
+
+*Programmer's note: some algorithms may require skipping of Indices exceeding
+VL-1, not MAXVL-1. This may be achieved programmatically by performing
+an `sv.cmp *BF,*RA,RB` where RA is the same GPRs used in the Indexed REMAP,
+and RB contains the value of VL returned from `setvl`. The resultant
+CR Fields may then be used as Predicate Masks to exclude those operations
+with an Index exceeding VL-1.*
+
 -------------
 
 \newpage{}
-- 
2.30.2