From b1e46b7244727bd2dfd29cadc6a82dbd44222db5 Mon Sep 17 00:00:00 2001
From: lkcl
Date: Fri, 19 May 2023 18:00:31 +0100
Subject: [PATCH]

---
 openpower/sv/remap.mdwn | 106 ++++++++----------------------------------
 1 file changed, 20 insertions(+), 86 deletions(-)

diff --git a/openpower/sv/remap.mdwn b/openpower/sv/remap.mdwn
index e3d46aa34..ef9d4be37 100644
--- a/openpower/sv/remap.mdwn
+++ b/openpower/sv/remap.mdwn
@@ -21,63 +21,38 @@ abstracted analog to `xxperm`).
 REMAP allows the usual sequential vector loop `0..VL-1` to be "reshaped"
 (re-mapped) from a linear form to a 2D or 3D transposed form, or "offset"
 to permit arbitrary access to elements (when elwidth overrides are
-used), independently on each Vector src or dest register. Aside from
+used), independently on each Vector src or dest register.
+
+A normal Vector Add:
+
+```
+    for i in range(VL):
+        GPR[RT+i] <= GPR[RA+i] + GPR[RB+i];
+```
+
+A Hardware-assisted REMAP Vector Add:
+
+```
+    for i in range(VL):
+        GPR[RT+remap1(i)] <= GPR[RA+remap2(i)] + GPR[RB+remap3(i)];
+```
+
+Aside from
 Indexed REMAP this is entirely Hardware-accelerated reordering and
-consequently not costly in terms of register access. It will however
+consequently not costly in terms of register access for the Indices. It will however
 place a burden on Multi-Issue systems but no more than if the equivalent
 Scalar instructions were explicitly loop-unrolled without SVP64, and
 some advanced implementations may even find the Deterministic nature of
 the Scheduling to be easier on resources.
-The initial primary motivation of REMAP was for Matrix Multiplication,
-reordering of sequential data in-place: in-place DCT and FFT were
-easily justified given the exceptionally high usage in Computer Science.
-Four SPRs are provided which may be applied to any GPR, FPR or CR Field so
-that for example a single FMAC may be used in a single hardware-controlled
-100% Deterministic loop to perform 5x3 times 3x4 Matrix multiplication,
-generating 60 FMACs *without needing explicit assembler unrolling*.
-Additional uses include regular "Structure Packing" such as RGB pixel
-data extraction and reforming (although less costly vec2/3/4 reshaping
-is achievable with `PACK/UNPACK`).
-
-Even once designed as an independent RISC-paradigm abstraction system
-it was realised that Matrix REMAP could be applied to min/max instructions to
-achieve Floyd-Warshall Graph computations, or to AND/OR Ternary
-bitmanipulation to compute Warshall Transitive Closure, or
-to perform Cryptographic Matrix operations with Galois Field
-variants of Multiply-Accumulate and many more uses expected to be
-discovered. This *without
-adding actual explicit Vector opcodes for any of the same*.
-
-Thus it should be very clear:
-REMAP, like all of SV, is abstracted out, meaning that unlike traditional
-Vector ISAs which would typically only have a limited set of instructions
-that can be structure-packed (LD/ST and Move operations
-being the most common), REMAP may be applied to
-literally any instruction: CRs, Arithmetic, Logical, LD/ST, even
-Vectorised Branch-Conditional.
-
-When SUBVL is greater than 1 a given group of Subvector
-elements are kept together: effectively the group becomes the
-element, and with REMAP applying to elements
-(not sub-elements) each group is REMAPed together.
-Swizzle *can* however be applied to the same
-instruction as REMAP, providing re-sequencing of
-Subvector elements which REMAP cannot.
 Also as explained in [[sv/mv.swizzle]], [[sv/mv.vec]] and the [[svp64/appendix]], Pack and Unpack Mode bits
-can extend down into Sub-vector elements to influence vec2/vec3/vec4
-sequential reordering, but even here, REMAP reordering is not *individually*
-extended down to the actual sub-vector elements themselves.
-This keeps the relevant Predicate Mask bit applicable to the Subvector
-group, just as it does when REMAP is not active.
-
-In its general form, REMAP is quite expensive to set up, and on some
+*Hardware note: in its general form, REMAP is quite expensive to set up, and on some
 implementations may introduce latency, so should realistically be used
 only where it is worthwhile. Given that even with latency the fact that
 up to 127 operations can be Deterministically issued (from a single
 instruction) it should be clear that REMAP should not be dismissed for
 *possible* latency alone. Commonly-used patterns such as Matrix
 Multiply, DCT and FFT have helper instruction options which make REMAP
-easier to use.
+easier to use.*
 
 *Future specification note: future versions of the REMAP Management instructions
 will extend to EXT1xx Prefixed variants. This will overcome some of the limitations
@@ -125,47 +100,6 @@ it is possible to service Precise Interrupts without affecting latency
 Parallel Reduction instructions, because their Architectural State cannot
 hold the partial results).
 
-
-## Example Usage
-
-* `svshape` to set the type of reordering to be applied to an
-  otherwise usual `0..VL-1` hardware for-loop
-* `svremap` to set which registers a given reordering is to apply to
-  (RA, RT etc)
-* `sv.{instruction}` where any Vectorised register marked by `svremap`
-  will have its ordering REMAPPED according to the schedule set
-  by `svshape`.
-
-The following illustrative example multiplies a 3x4 and a 5x3
-matrix to create
-a 5x4 result:
-
-```
- svshape 5,4,3,0,0 # Outer Product 5x4 by 4x3
- svremap 15,1,2,3,0,0,0,0 # link Schedule to registers
- sv.fmadds *0,*32,*64,*0 # 60 FMACs get executed here
-```
-
-* svshape sets up the four SVSHAPE SPRS for a Matrix Schedule
-* svremap activates four out of five registers RA RB RC RT RS (15)
-* svremap requests:
-  - RA to use SVSHAPE1
-  - RB to use SVSHAPE2
-  - RC to use SVSHAPE3
-  - RT to use SVSHAPE0
-  - RS Remapping to not be activated
-* sv.fmadds has vectors RT=0, RA=32, RB=64, RC=0
-* With REMAP being active each register's element index is
-  *independently* transformed using the specified SHAPEs.
-
-Thus the Vector Loop is arranged such that the use of
-the multiply-and-accumulate instruction executes precisely the required
-Schedule to perform an in-place in-registers Outer Product
-Matrix Multiply with no
-need to perform additional Transpose or register copy instructions.
-The example above may be executed as a unit test and demo,
-[here](https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_matrix.py;h=c15479db9a36055166b6b023c7495f9ca3637333;hb=a17a252e474d5d5bf34026c25a19682e3f2015c3#l94)
-
 *Hardware Architectural note: with the Scheduling applying as a Phase between
 Decode and Issue in a Deterministic fashion the Register Hazards may be
 easily computed and a standard Out-of-Order Micro-Architecture exploited to good
-- 
2.30.2
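
Note on the two pseudocode loops this patch adds ("A normal Vector Add" and "A Hardware-assisted REMAP Vector Add"): they can be modelled in a few lines of Python. This is only a sketch under stated assumptions. The register file is a plain list, `<=` is read as assignment, and `identity` and `make_transpose` are hypothetical stand-ins for `remap1`, `remap2` and `remap3`. The real Schedules are produced by hardware from the SVSHAPE SPRs and are not reproduced here.

```python
def vector_add(gpr, rt, ra, rb, vl):
    """Normal Vector Add: every operand uses the same linear index i."""
    for i in range(vl):
        gpr[rt + i] = gpr[ra + i] + gpr[rb + i]


def remap_vector_add(gpr, rt, ra, rb, vl, remap1, remap2, remap3):
    """REMAP Vector Add: each operand's element index is transformed
    independently by its own remap function (assumed to be pure)."""
    for i in range(vl):
        gpr[rt + remap1(i)] = gpr[ra + remap2(i)] + gpr[rb + remap3(i)]


def identity(i):
    """Hypothetical remap that leaves the linear order unchanged."""
    return i


def make_transpose(rows, cols):
    """Hypothetical remap: step i visits a row-major rows x cols block
    in column-major order, i.e. a 2D transposed walk."""
    def remap(i):
        r, c = i % rows, i // rows
        return r * cols + c
    return remap
```

With `remap2 = make_transpose(2, 3)` and the other two left as `identity`, the loop reads the RA operand in the order 0, 3, 1, 4, 2, 5 while RT and RB are still walked linearly, which illustrates the "independently on each Vector src or dest register" wording in the context lines.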