### Matrix (1D/2D/3D shaping)
-Matrix Multiplication is a huge part of High-Performance Compute,
-and 3D.
-In many PackedSIMD as well as Scalable Vector ISAs, non-power-of-two
-Matrix sizes are a serious challenge. PackedSIMD ISAs, in order to
-cope with for example 3x4 Matrices, recommend rolling data-repetition and loop-unrolling.
-Aside from the cost of the load on the L1 I-Cache, the trick only
-works if one of the dimensions X or Y are power-two. Prime Numbers
-(5x7, 3x5) become deeply problematic to unroll.
-
-Even traditional Scalable Vector ISAs have issues with Matrices, often
-having to perform data Transpose by pushing out through Memory and back
-(costly),
-or computing Transposition Indices (costly) then copying to another
-Vector (costly).
-
-Matrix REMAP was thus designed to solve these issues by providing Hardware
-Assisted
-"Schedules" that can view what would otherwise be limited to a strictly
-linear Vector as instead being 2D (even 3D) *in-place* reordered.
-With both Transposition and non-power-two being supported the issues
-faced by other ISAs are mitigated.
-Limitations of Matrix REMAP are that the Vector Length (VL) is currently
-restricted to 127: up to 127 FMAs (or other operation)
-may be performed in total.
-Also given that it is in-registers only at present some care has to be
-taken on regfile resource utilisation. However it is perfectly possible
-to utilise Matrix REMAP to perform the three inner-most "kernel" loops of
-the usual 6-level "Tiled" large Matrix Multiply, without the usual
-difficulties associated with SIMD.
-Also the `svshape` instruction only provides access to part of the
-Matrix REMAP capability. Rotation and mirroring need to be done by
-programming the SVSHAPE SPRs directly, which can take a lot more
-instructions. Future versions of SVP64 will include EXT1xx prefixed
-variants (`psvshape`) which provide more comprehensive capacity and
-mitigate the need to write direct to the SVSHAPE SPRs.
-
-*Hardware Architectural note: with the Scheduling applying as a Phase between
-Decode and Issue in a Deterministic fashion the Register Hazards may be
-easily computed and a standard Out-of-Order Micro-Architecture exploited to good
-effect. Even an In-Order system may observe that for large Outer Product
-Schedules there will be no stalls, but if the Matrices are particularly
-small size an In-Order system would have to stall, just as it would if
-the operations were loop-unrolled without Simple-V. Thus: regardless
-of the Micro-Architecture the Hardware Engineer should first consider
-how best to process the exact same equivalent loop-unrolled instruction
-stream.*
### FFT/DCT Triple Loop
## Matrix Mode
In Matrix Mode, skip allows dimensions to be skipped from being included
-in the resultant output index. this allows sequences to be repeated:
+in the resultant output index. This allows sequences to be repeated:
```0 0 0 1 1 1 2 2 2 ...``` or in the case of skip=0b11 this results in
modulo ```0 1 2 0 1 2 ...```
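
The effect of leaving a dimension out of the output index can be sketched in a few lines of Python. This is only an illustrative model, not the SVSHAPE pseudocode: the function name `remap_indices` and the flags `include_x` and `include_y` are hypothetical stand-ins, and the actual bit encoding of `skip` is deliberately not reproduced here.

```python
def remap_indices(xdim, ydim, include_x=True, include_y=True):
    """Yield one output index per step of a 2D (y outer, x inner) loop.

    Hypothetical model: a dimension that is not included still iterates,
    but contributes nothing to the resulting index.
    """
    for y in range(ydim):
        for x in range(xdim):
            idx = 0
            stride = 1
            if include_x:
                idx += x * stride
                stride *= xdim
            if include_y:
                idx += y * stride
            yield idx


# Inner dimension left out: each outer value is repeated.
repeated = list(remap_indices(3, 3, include_x=False))
# Outer dimension left out: the inner sequence wraps (modulo).
modulo = list(remap_indices(3, 3, include_y=False))
```

Under these assumptions `repeated` is `0 0 0 1 1 1 2 2 2` and `modulo` is `0 1 2 0 1 2 0 1 2`, matching the two sequences described above.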
GPR(RT + remap(i) + SVSHAPE.offset) = ....
```
-this appears redundant because the register RT could simply be changed by a compiler, until element width overrides are introduced. also
+This appears redundant because the register RT could simply be changed by a compiler, until element width overrides are introduced. Also
bear in mind that unlike a static compiler SVSHAPE.offset may
be set dynamically at runtime.
in-place rotate, Matrix Multiply and Convolutions, without being
limited to Power-of-Two dimension sizes.
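
As a rough illustration of why non-Power-of-Two sizes pose no difficulty for an index-based schedule, the sketch below walks three nested loops and emits flat element offsets, one triple per multiply-add. It is an assumption-laden model rather than the Matrix REMAP algorithm itself: `matmul_schedule`, `matmul_flat` and the `max_ops` parameter are hypothetical names, the loop order is chosen for readability, and the limit of 127 operations is taken from the VL restriction noted in this section.

```python
def matmul_schedule(rows_a, inner, cols_b):
    """Yield (a_idx, b_idx, out_idx) flat offsets, one triple per FMA."""
    for i in range(rows_a):
        for j in range(cols_b):
            for k in range(inner):
                yield (i * inner + k, k * cols_b + j, i * cols_b + j)


def matmul_flat(a, b, rows_a, inner, cols_b, max_ops=127):
    """Multiply row-major flat matrices by following the schedule."""
    steps = list(matmul_schedule(rows_a, inner, cols_b))
    if len(steps) > max_ops:
        # Models the VL restriction: larger products need outer tiling.
        raise ValueError(f"{len(steps)} operations exceeds {max_ops}")
    out = [0] * (rows_a * cols_b)
    for a_idx, b_idx, out_idx in steps:
        out[out_idx] += a[a_idx] * b[b_idx]
    return out


# 2x3 times 3x2: neither operand has power-of-two dimensions throughout.
product = matmul_flat([1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], 2, 3, 2)
```

In this model a 3x4 by 4x5 product needs 60 operations and fits in one schedule, whereas 5x7 by 7x5 needs 175 and would have to be split by the outer "Tiled" loops.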
+**Limitations and caveats**
+
+Matrix REMAP currently restricts the Vector Length (VL) to 127, so at most
+127 FMAs (or other operations) may be performed in total.
+Also, because it presently operates in-registers only, some care has to be
+taken over regfile resource utilisation. However it is perfectly possible
+to utilise Matrix REMAP to perform the three inner-most "kernel" loops of
+the usual 6-level "Tiled" large Matrix Multiply, without the usual
+difficulties associated with SIMD.
+
+Also, the `svshape` instruction only provides access to *part* of the
+Matrix REMAP capability. Rotation and mirroring need to be done by
+programming the SVSHAPE SPRs directly, which can take a lot more
+instructions. Future versions of SVP64 will include EXT1xx prefixed
+variants (`psvshape`) which provide more comprehensive capability and
+mitigate the need to write directly to the SVSHAPE SPRs.
+
+Additionally, there is not yet a way to set Matrix sizes from registers
+with `svshape`. This was an intentional decision to simplify Hardware, which
+may be revisited in a future version of SVP64. The limitation may presently
+be overcome by direct programming of the SVSHAPE SPRs.
+
+*Hardware Architectural note: because the Scheduling is applied as a Phase between
+Decode and Issue, in a Deterministic fashion, the Register Hazards may be
+easily computed and a standard Out-of-Order Micro-Architecture exploited to good
+effect. Even an In-Order system may observe that for large Outer Product
+Schedules there will be no stalls, but if the Matrices are particularly
+small an In-Order system would have to stall, just as it would if
+the operations were loop-unrolled without Simple-V. Thus, regardless
+of the Micro-Architecture, the Hardware Engineer should first consider
+how best to process the equivalent loop-unrolled instruction
+stream. Once that is solved, Matrix REMAP will fit naturally.*
+
## Indexed Mode
Indexed Mode activates reading of the element indices from the GPR