From 2c92e0f6f29a90bcb87bdf3742a58232135b6afe Mon Sep 17 00:00:00 2001
From: lkcl
Date: Fri, 19 May 2023 19:30:28 +0100
Subject: [PATCH]

---
 openpower/sv/remap.mdwn | 84 ++++++++++++++++++-----------------------
 1 file changed, 36 insertions(+), 48 deletions(-)

diff --git a/openpower/sv/remap.mdwn b/openpower/sv/remap.mdwn
index 0d5c6eacc..42ec1d269 100644
--- a/openpower/sv/remap.mdwn
+++ b/openpower/sv/remap.mdwn
@@ -119,54 +119,8 @@ used in these Schedules is found in the [[sv/remap/appendix]].

 ### Matrix (1D/2D/3D shaping)

-Matrix Multiplication is a huge part of High-Performance Compute,
-and 3D.
-In many PackedSIMD as well as Scalable Vector ISAs, non-power-of-two
-Matrix sizes are a serious challenge. PackedSIMD ISAs, in order to
-cope with for example 3x4 Matrices, recommend rolling data-repetition and loop-unrolling.
-Aside from the cost of the load on the L1 I-Cache, the trick only
-works if one of the dimensions X or Y are power-two. Prime Numbers
-(5x7, 3x5) become deeply problematic to unroll.
-
-Even traditional Scalable Vector ISAs have issues with Matrices, often
-having to perform data Transpose by pushing out through Memory and back
-(costly),
-or computing Transposition Indices (costly) then copying to another
-Vector (costly).
-
-Matrix REMAP was thus designed to solve these issues by providing Hardware
-Assisted
-"Schedules" that can view what would otherwise be limited to a strictly
-linear Vector as instead being 2D (even 3D) *in-place* reordered.
-With both Transposition and non-power-two being supported the issues
-faced by other ISAs are mitigated.
-Limitations of Matrix REMAP are that the Vector Length (VL) is currently
-restricted to 127: up to 127 FMAs (or other operation)
-may be performed in total.
-Also given that it is in-registers only at present some care has to be
-taken on regfile resource utilisation. However it is perfectly possible
-to utilise Matrix REMAP to perform the three inner-most "kernel" loops of
-the usual 6-level "Tiled" large Matrix Multiply, without the usual
-difficulties associated with SIMD.
-Also the `svshape` instruction only provides access to part of the
-Matrix REMAP capability. Rotation and mirroring need to be done by
-programming the SVSHAPE SPRs directly, which can take a lot more
-instructions. Future versions of SVP64 will include EXT1xx prefixed
-variants (`psvshape`) which provide more comprehensive capacity and
-mitigate the need to write direct to the SVSHAPE SPRs.
-
-*Hardware Architectural note: with the Scheduling applying as a Phase between
-Decode and Issue in a Deterministic fashion the Register Hazards may be
-easily computed and a standard Out-of-Order Micro-Architecture exploited to good
-effect. Even an In-Order system may observe that for large Outer Product
-Schedules there will be no stalls, but if the Matrices are particularly
-small size an In-Order system would have to stall, just as it would if
-the operations were loop-unrolled without Simple-V. Thus: regardless
-of the Micro-Architecture the Hardware Engineer should first consider
-how best to process the exact same equivalent loop-unrolled instruction
-stream.*

 ### FFT/DCT Triple Loop

@@ -621,7 +575,7 @@ column-based in-place DCT/FFT.
 ## Matrix Mode

 In Matrix Mode, skip allows dimensions to be skipped from being included
-in the resultant output index. this allows sequences to be repeated:
+in the resultant output index. This allows sequences to be repeated:
 ```0 0 0 1 1 1 2 2 2 ...```
 or in the case of skip=0b11 this results in modulo
 ```0 1 2 0 1 2 ...```
@@ -641,7 +595,7 @@ offset will have the effect of offsetting the result by ```offset``` elements:
 GPR(RT + remap(i) + SVSHAPE.offset) = ....
 ```

-this appears redundant because the register RT could simply be changed by a compiler, until element width overrides are introduced. also
+This appears redundant because the register RT could simply be changed by a compiler, until element width overrides are introduced. Also
 bear in mind that unlike a static compiler SVSHAPE.offset may be set dynamically at runtime.

@@ -685,6 +639,40 @@ With all these options it is possible to support in-place transpose,
 in-place rotate, Matrix Multiply and Convolutions, without being limited
 to Power-of-Two dimension sizes.

+**Limitations and caveats**
+
+Limitations of Matrix REMAP are that the Vector Length (VL) is currently
+restricted to 127: up to 127 FMAs (or other operation)
+may be performed in total.
+Also given that it is in-registers only at present some care has to be
+taken on regfile resource utilisation. However it is perfectly possible
+to utilise Matrix REMAP to perform the three inner-most "kernel" loops of
+the usual 6-level "Tiled" large Matrix Multiply, without the usual
+difficulties associated with SIMD.
+
+Also the `svshape` instruction only provides access to *part* of the
+Matrix REMAP capability. Rotation and mirroring need to be done by
+programming the SVSHAPE SPRs directly, which can take a lot more
+instructions. Future versions of SVP64 will include EXT1xx prefixed
+variants (`psvshape`) which provide more comprehensive capacity and
+mitigate the need to write direct to the SVSHAPE SPRs.
+
+Additionally there is not yet a way to set Matrix sizes from registers
+with `svshape`: this was an intentional decision to simplify Hardware, that
+may be corrected in a future version of SVP64. The limitation may presently
+be overcome by direct programming of the SVSHAPE SPRs.
+
+*Hardware Architectural note: with the Scheduling applying as a Phase between
+Decode and Issue in a Deterministic fashion the Register Hazards may be
+easily computed and a standard Out-of-Order Micro-Architecture exploited to good
+effect. Even an In-Order system may observe that for large Outer Product
+Schedules there will be no stalls, but if the Matrices are particularly
+small size an In-Order system would have to stall, just as it would if
+the operations were loop-unrolled without Simple-V. Thus: regardless
+of the Micro-Architecture the Hardware Engineer should first consider
+how best to process the exact same equivalent loop-unrolled instruction
+stream. Once solved Matrix REMAP will fit naturally.*
+
 ## Indexed Mode

 Indexed Mode activates reading of the element indices from the GPR
-- 
2.30.2