\newpage{}
# SVP64 24-bit Prefixes
-The SVP64 24-bit Prefix (RM) provides several options,
-all fitting within the 24-bit space (and no other).
+The SVP64 24-bit Prefix (RM) options aim to reduce instruction count
+and assembler complexity.
These Modes do not interact with SVSTATE per se. SVSTATE
primarily controls the looping (quantity, order), RM
influences the *elements* (the Suffix). There is however
some close interaction when it comes to predication.
REMAP is outlined separately.
-The primary options all of which are aimed at reducing instruction
-count and reducing assembler complexity are:
* **element-width overrides**, which dynamically redefine each SFFS or SFS
Scalar prefixed instruction to be 8-bit, 16-bit, 32-bit or 64-bit
CTR by the number of bits set in a GPR, if that GPR is given as the predicate
mask `sv.bc/pm=r3`.
+# LD/ST RM Modes
+
+Traditional Vector ISAs have vastly more (and more complex) addressing
+modes than Scalar ISAs: unit-strided, element-strided, Indexed, and
+Structure Packing. All of these had to be fitted on top of **existing
+Scalar instructions without modifying the Scalar instructions**. A small
+conceptual "cheat" was therefore needed. In some Modes the Immediate (D)
+is multiplied by the element index, which gives element-strided.
+For unit-strided, the width of the operation (8 bytes for `ld`) is taken
+as the multiplier instead. The hardware-level modifications needed to
+support this "cheat" on top of pre-existing Scalar HDL and Simulators
+have both turned out to be minimal.
+
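The two strided forms described above can be sketched as a small address model. This is an illustrative sketch only, not the specification's pseudocode: the function names are hypothetical, and adding `D` once as a constant offset in the unit-strided case is an assumption, since the text above states only that the operation width becomes the multiplier.

```python
def element_strided_eas(base, d, vl):
    """Element-strided: the Immediate D is multiplied by the element index."""
    return [base + d * i for i in range(vl)]


def unit_strided_eas(base, d, vl, op_width=8):
    """Unit-strided: the operation width (8 bytes for `ld`) is the multiplier.

    Assumption: D is still added once as a plain constant offset.
    """
    return [base + d + op_width * i for i in range(vl)]


if __name__ == "__main__":
    print([hex(ea) for ea in element_strided_eas(0x1000, 24, 4)])
    print([hex(ea) for ea in unit_strided_eas(0x1000, 16, 3)])
```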
+Also added was the option to perform signed or unsigned Effective
+Address calculation, which comes into play only on LD/ST Indexed,
+when elwidth overrides are used. Another quirk: `RA` is never
+allowed to have its width altered: it remains 64-bit, as it is
+the Base Address.
+
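As a rough illustration of why signedness only matters once an element-width override narrows the index, the sketch below extends a narrow `RB` element to 64 bits before adding it to the always-64-bit `RA`. The helper names are hypothetical, and the exact extension rule is an assumption drawn from the paragraph above rather than from the specification's pseudocode.

```python
MASK64 = (1 << 64) - 1


def extend(value, elwidth, signed):
    """Zero- or sign-extend an elwidth-bit element to a Python integer."""
    value &= (1 << elwidth) - 1
    if signed and (value >> (elwidth - 1)) & 1:
        value -= 1 << elwidth  # top bit set: treat as a negative offset
    return value


def indexed_ea(ra, rb_elem, elwidth, signed):
    """EA = RA + RB, with RA kept at 64 bits and RB narrowed by elwidth."""
    return (ra + extend(rb_elem, elwidth, signed)) & MASK64


if __name__ == "__main__":
    # The same 8-bit index 0xF0 is +240 unsigned but -16 signed.
    print(hex(indexed_ea(0x1000, 0xF0, 8, signed=False)))
    print(hex(indexed_ea(0x1000, 0xF0, 8, signed=True)))
```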
+One potential source of confusion is the similar naming of LD/ST Indexed
+and REMAP Indexed: some care is taken in the spec to distinguish the two.
+LD/ST Indexed is Scalar `EA=RA+RB` (where **either** RA or RB
+may be marked as Vectorised), and the Vector of RA (or RB) is read
+in the usual linear sequential order. REMAP Indexed affects the
+**order** in which the Vector of RA (or RB) is accessed,
+according to a schedule determined by *another* vector of offsets
+in the register file. Effectively this combines VSX `vperm`
+back-to-back with LD/ST operations *in the calculation of each
+Effective Address* in one instruction.
+
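The distinction can be made concrete with a simplified model in which `RA` is a scalar base and `RB` is the Vectorised operand. Everything here is a hypothetical illustration (function names, the list-based register model, and treating the schedule as plain element indices); it is not the specification's REMAP pseudocode.

```python
def ldst_indexed_eas(ra, rb_vector):
    """LD/ST Indexed: EA = RA + RB, with RB elements read in linear order."""
    return [ra + rb for rb in rb_vector]


def remap_indexed_eas(ra, rb_vector, schedule):
    """REMAP Indexed (simplified): a second vector chooses WHICH RB element
    is used at each step, so only the access order changes."""
    return [ra + rb_vector[idx] for idx in schedule]


if __name__ == "__main__":
    rb = [0, 8, 16, 24]
    print(ldst_indexed_eas(0x2000, rb))
    print(remap_indexed_eas(0x2000, rb, schedule=[3, 1, 2, 0]))
```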
+For DCT and FFT, normally it is very expensive to perform the
+"bit-inversion" needed for address calculation and/or reordering
+of elements. DCT in particular needs both bit-inversion *and
+Gray-Coding* offsets. DCT/FFT REMAP **automatically** performs
+the required offset adjustment to get data loaded and stored in
+the required order. Matrix REMAP can likewise perform up to 3
+Dimensions of reordering (on both Immediate and Indexed), and
+when combined with vec2/3/4 the reordering can even go as far as
+four dimensions (four nested fixed size loops).
+
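For readers unfamiliar with the index transformations mentioned above, the following sketch shows the standard bit-reversal permutation used by FFTs and the standard binary-reflected Gray code, as they would be computed in software. How DCT/FFT REMAP combines them in hardware is not shown; the function names are hypothetical and the code illustrates only the cost that REMAP is described as removing.

```python
def bit_reverse(index, nbits):
    """Reverse the low nbits bits of index (e.g. 0b001 -> 0b100 for nbits=3)."""
    result = 0
    for _ in range(nbits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Element order for a power-of-two length n, as used by in-place FFTs."""
    nbits = n.bit_length() - 1
    return [bit_reverse(i, nbits) for i in range(n)]


def gray_code(index):
    """Binary-reflected Gray code of index."""
    return index ^ (index >> 1)


if __name__ == "__main__":
    print(bit_reversed_order(8))
    print([gray_code(i) for i in range(8)])
```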
+Overall the LD/ST Modes available are extremely powerful, especially
+when combining arithmetic (lharx) with saturation, element-width overrides,
+vec2/3/4 Structure Packing *and* REMAP. The resulting combinations far
+exceed anything seen in any other Vector ISA in history.
+
# SVP64Single 24-bits
The `SVP64-Single` 24-bit encoding focusses primarily on ensuring that