From: lkcl
Date: Fri, 19 May 2023 16:49:58 +0000 (+0100)
Subject: (no commit message)
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=e4a3d83a5472d2521e77050ce19a2fa790b09578;p=libreriscv.git
---

diff --git a/openpower/sv/rfc/ls009.mdwn b/openpower/sv/rfc/ls009.mdwn
index 533f9ce50..8559f2b11 100644
--- a/openpower/sv/rfc/ls009.mdwn
+++ b/openpower/sv/rfc/ls009.mdwn
@@ -96,6 +96,158 @@ Add the following entries to:

\newpage{}

# Rationale and background

In certain common algorithms there are clear patterns of behaviour which
typically require inline loop-unrolled instructions comprising interleaved
permute and arithmetic operations. REMAP was invented to split the
permuting from the arithmetic, and to allow the Indexing to be done as a
hardware Schedule.

A normal Vector Add:

```
for i in range(VL):
    GPR[RT+i] <= GPR[RA+i] + GPR[RB+i];
```

A Hardware-assisted REMAP Vector Add:

```
for i in range(VL):
    GPR[RT+remap1(i)] <= GPR[RA+remap2(i)] + GPR[RB+remap3(i)];
```

The result is a huge saving on register file accesses (there is no need to
calculate Indices and then use Permutation instructions), on instruction
count (a Matrix Multiply of up to 127 FMACs is three instructions), and on
programmer sanity.

# REMAP types

This section summarises the motivation for each REMAP Schedule
and briefly goes over their characteristics and limitations.
Further details on the Deterministic Precise-Interruptible algorithms
used in these Schedules are found in the [[sv/remap/appendix]].

## Matrix (1D/2D/3D shaping)

Matrix Multiplication is a huge part of High-Performance Compute and
of 3D graphics.
In many PackedSIMD as well as Scalable Vector ISAs, non-power-of-two
Matrix sizes are a serious challenge. PackedSIMD ISAs, in order to
cope with, for example, 3x4 Matrices, recommend data-repetition combined
with loop-unrolling.
Aside from the cost of the load on the L1 I-Cache, the trick only
works if one of the dimensions X or Y is a power of two.
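To illustrate, the kind of index sequence a 2D REMAP Schedule produces can
be sketched in a few lines of Python (illustrative only: this is *not* the
actual SVSHAPE hardware state machine). Row-major and Transposed orderings
of a 3x4 matrix are shown, and nothing in the pattern requires
power-of-two dimensions:

```python
# Illustrative sketch of the index sequences a 2D REMAP Schedule
# produces (NOT the actual SVSHAPE hardware algorithm).  Yields the
# element offsets applied to a register, i.e. the "remap(i)" of
# GPR[RT+remap(i)] shown above.
def remap_2d(xdim, ydim, transpose=False):
    for y in range(ydim):
        for x in range(xdim):
            yield x * ydim + y if transpose else y * xdim + x

# 3x4 matrix, row-major: plain linear order 0..11
assert list(remap_2d(4, 3)) == list(range(12))
# 3x4 matrix, transposed in-place: strided order, no Memory round-trip
assert list(remap_2d(4, 3, transpose=True)) == \
    [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]
# prime dimensions work just as well: 5x7 is still a permutation of 0..34
assert sorted(remap_2d(5, 7, transpose=True)) == list(range(35))
```

Rotation and mirroring would, in this sketch, simply be further variants
of the yield expression.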
Prime-number dimensions (5x7, 3x5) become deeply problematic to unroll.

Even traditional Scalable Vector ISAs have issues with Matrices, often
having to perform data Transposition by pushing out through Memory and
back (costly), or computing Transposition Indices (costly) and then
copying to another Vector (costly).

Matrix REMAP was thus designed to solve these issues by providing
Hardware-Assisted "Schedules" that can view what would otherwise be
limited to a strictly linear Vector as instead being 2D (even 3D),
*in-place* reordered. With both Transposition and non-power-of-two
Matrix sizes supported, the issues faced by other ISAs are mitigated.

Limitations of Matrix REMAP are that the Vector Length (VL) is currently
restricted to 127: up to 127 FMAs (or other operations) may be performed
in total. Also, given that it is in-registers only at present, some care
has to be taken over regfile resource utilisation. However it is
perfectly possible to use Matrix REMAP to perform the three innermost
"kernel" loops of the usual 6-level "Tiled" large Matrix Multiply,
without the usual difficulties associated with SIMD.

Also, the `svshape` instruction only provides access to part of the
Matrix REMAP capability. Rotation and mirroring need to be done by
programming the SVSHAPE SPRs directly, which can take many more
instructions. Future versions of SVP64 will include EXT1xx-prefixed
variants (`psvshape`) which provide more comprehensive capability and
mitigate the need to write directly to the SVSHAPE SPRs.

## FFT/DCT Triple Loop

DCT and FFT are some of the most heavily-used algorithms in Computer
Science, appearing in Radar, Audio, Video, R.F. Baseband processing and
dozens of other areas. At least two DSPs, TMS320 and Hexagon, have VLIW
instructions specially tailored to FFT.

An in-depth analysis showed that it is possible to do in-place,
in-register DCT and FFT as long as twin-result "butterfly" instructions
are provided.
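The twin-result requirement can be made concrete with a minimal sketch.
The following Python (function names are illustrative, not ISA mnemonics)
implements a radix-2 butterfly whose two results overwrite its two inputs,
plus a hand-unrolled 4-point FFT built from nothing but that butterfly; at
no point is a third, temporary, element needed:

```python
# Illustrative twin-result radix-2 "butterfly": both results are
# written back over the two inputs, in-place, with no temporary
# visible at the architectural level.  Names are illustrative only.
def butterfly(vec, i, j, twiddle):
    a, b = vec[i], vec[j]        # the two inputs, "in-flight"
    vec[i] = a + twiddle * b     # first result
    vec[j] = a - twiddle * b     # twin (second) result

def fft4(vec):
    # hand-unrolled 4-point DIT FFT using only butterflies.
    # outputs are left in bit-reversed order: [X0, X2, X1, X3]
    butterfly(vec, 0, 2, 1)
    butterfly(vec, 1, 3, 1)
    butterfly(vec, 0, 1, 1)
    butterfly(vec, 2, 3, -1j)

data = [1+0j, 2+0j, 3+0j, 4+0j]
fft4(data)
assert data == [10+0j, -2+0j, -2+2j, -2-2j]  # X0, X2, X1, X3
```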
For IEEE754 FP transforms these butterfly instructions can be found in
the [[openpower/isa/svfparith]] page. *(For fixed-point transforms,
equivalent 3-in 2-out integer operations would be required.)* These
"butterfly" instructions avoid the need for a temporary register because
the two array positions being overwritten will be "in-flight" in any
In-Order or Out-of-Order micro-architecture.

DCT and FFT Schedules are currently limited to RADIX2 sizes and do not
accept predicate masks. Given that it is common to perform recursive
convolutions, combining smaller power-of-two DCTs/FFTs to create larger
ones, in practice the RADIX2 limit is not a problem. A Bluestein
convolution to compute arbitrary lengths is demonstrated by
[Project Nayuki](https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py).

## Indexed

The purpose of Indexing is to provide a generalised version of Vector
ISA "Permute" instructions, such as VSX `vperm`. The Indexing is
abstracted out and may be applied to much more than an element
move/copy, and is not limited, for example, to the number of bytes that
can fit into a VSX register. Indexing may be applied to LD/ST (even to
Indexed LD/ST instructions such as `sv.lbzx`), to arithmetic operations,
to `extsw`: there is no artificial limit.

The original motivation for Indexed REMAP was to mitigate the need to
add an expensive `mv.x` instruction (`GPR(RT) <- GPR(GPR(RA))`) to the
Scalar ISA, which was highly likely to be rejected as a stand-alone
instruction. Usually a Vector ISA would add a non-conflicting variant
(as in VSX `vperm`), but it is common to need to permute by source, with
the attendant risk of conflicts that have to be resolved, for example in
AVX-512 with `vpconflictd`.

Indexed REMAP is the "Get out of Jail Free" card which (for a price)
allows any general-purpose arbitrary Element Permutation not covered by
the other Hardware Schedules.
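As a rough sketch of the effect (conceptual Python only, with the index
table as a plain list; this is not the SVSHAPE SPR encoding), Indexed
REMAP applied to the RA operand of a simple Vector add looks like this:

```python
# Conceptual sketch of Indexed REMAP on a Vector add: the element
# number of one operand (here RA) is first looked up in a
# programmer-supplied index table.  NOT the actual SVSHAPE encoding.
def indexed_vector_add(RT, RA, RB, index, VL):
    for i in range(VL):
        # GPR[RT+i] <= GPR[RA+index[i]] + GPR[RB+i]
        RT[i] = RA[index[i]] + RB[i]

a = [10, 20, 30, 40]
b = [1, 2, 3, 4]
out = [0] * 4
indexed_vector_add(out, a, b, index=[3, 2, 1, 0], VL=4)
assert out == [41, 32, 23, 14]   # `a` reversed, then added to `b`
```

With an identity index table this degenerates to a plain Vector add; with
an all-zero `RB` it is exactly the `mv.x` (`GPR(RT) <- GPR(GPR(RA))`)
behaviour described above.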
## Parallel Reduction

Vector Reduce Mode issues a deterministic tree-reduction schedule to the
underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
(Power ISA v3.0B) operation is leveraged, unmodified, to give the
*appearance* and *effect* of Reduction. Parallel Reduction is not
limited to powers of two but is limited, as usual, by the total number
of element operations (127) as well as by available register file size.

Parallel Reduction is normally done explicitly as a loop-unrolled
operation, taking up a significant number of instructions. With REMAP it
is just three instructions: two for setup and one Scalar Base.

## Parallel Prefix Sum

This is a work-efficient Parallel Schedule that, for example, produces
Triangular or Factorial number sequences. Half of the Prefix Sum
Schedule is near-identical to Parallel Reduction. Whilst the Arithmetic
mapreduce Mode (`/mr`) may achieve the same end-result, implementations
may only implement Mapreduce in serial form (or give Programmers the
appearance of having done so). The Parallel Prefix Schedule is
*required* to be implemented in such a way that its Deterministic
Schedule may be parallelised. Like the Reduction Schedule it is 100%
Deterministic and consequently may be used with non-commutative
operations.

Parallel Reduction and Parallel Prefix Sum can be combined to help
perform AI non-linear interpolation in around twelve to fifteen
instructions.

----------------

\newpage{}

Add the following to SV Book

[[!inline pages="openpower/sv/remap" raw=yes ]]

# Forms