Parallel Reduction instructions, because their Architectural State cannot
hold the partial results).
## Basic principle

The following illustrates why REMAP was added.

* normal vector element reads/writes of operands are sequential
 (0 1 2 3 ...)
* this is not appropriate for (e.g.) Matrix multiply, which requires
 accessing elements in alternate sequences (0 3 6 1 4 7 ...)
* normal Vector ISAs use either Indexed-MV or Indexed-LD/ST to "cope"
 with this. Both are expensive (copying large vectors, spilling through
 memory), and very few Packed SIMD ISAs cope with non-power-of-two sizes
 (duplicate-data inline loop-unrolling is the costly solution)
* REMAP **redefines** the order of access according to set
 (Deterministic) "Schedules" (see the sketch after this list)
* Matrix Schedules are not at all restricted to power-of-two boundaries,
 making it unnecessary to have, for example, the specialised 3x4 transpose
 instructions of other Vector ISAs.
* DCT and FFT REMAP are Radix-2 limited, but this is the case in existing
 Packed/Predicated SIMD ISAs anyway (and Bluestein Convolution is typically
 deployed to solve that).

Only the most commonly-used algorithms in computer science have REMAP
support, due to the high cost in both the ISA and in hardware. For
arbitrary remapping, the `Indexed` REMAP may be used.
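
A correspondingly minimal sketch of the `Indexed` REMAP concept, assuming
an arbitrary index vector supplied by the programmer (again illustrative
only, not the actual instruction semantics):

```python
def indexed_remap(indices, vec):
    """Access vec in the arbitrary order given by an index vector,
    instead of sequentially (illustrative sketch only)."""
    return [vec[i] for i in indices]

# arbitrary reordering of a 4-element vector
print(indexed_remap([2, 0, 3, 1], ["a", "b", "c", "d"]))  # ['c', 'a', 'd', 'b']
```
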
## Example Usage