From b6afa537cef112bd3e12bad665f79edc5805196b Mon Sep 17 00:00:00 2001
From: lkcl <lkcl@web>
Date: Sun, 18 Sep 2022 10:25:48 +0100
Subject: [PATCH]

---
 openpower/sv/rfc/ls001.mdwn | 52 ++++++++++++++++++++++++++++++++++---
 1 file changed, 48 insertions(+), 4 deletions(-)
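
Note on the LD/ST RM Modes section added below: the element-strided and
unit-strided address calculations it describes can be sketched in a few
lines of Python. This is an illustrative model only, not the simulator or
the spec pseudocode. All function names are hypothetical. The sketch assumes
that in unit-strided mode the Immediate (D) is added once as a fixed offset,
since the section only states that the operation width is the multiplier.
`bit_reverse_order` assumes that "bit-inversion" means the standard FFT
bit-reversal permutation; it is not the DCT/FFT REMAP schedule itself.

```python
MASK64 = (1 << 64) - 1  # Effective Addresses are modelled as 64-bit values


def element_strided_eas(ra, d, vl):
    """Element-strided: the Immediate D is multiplied by the element index."""
    return [(ra + d * i) & MASK64 for i in range(vl)]


def unit_strided_eas(ra, d, width, vl):
    """Unit-strided: the operation width (in bytes) is the multiplier.

    Assumption: D is applied once as a fixed offset from RA.
    """
    return [(ra + d + i * width) & MASK64 for i in range(vl)]


def bit_reverse_order(n):
    """Standard FFT bit-reversal permutation of range(n), n a power of two."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a positive power of two")
    bits = n.bit_length() - 1
    order = []
    for i in range(n):
        rev = 0
        for b in range(bits):
            if i & (1 << b):
                # bit b of i becomes bit (bits - 1 - b) of the result
                rev |= 1 << (bits - 1 - b)
        order.append(rev)
    return order


if __name__ == "__main__":
    # Four elements, base address 0x1000, D=16: stride of 16 bytes
    print([hex(ea) for ea in element_strided_eas(0x1000, 16, 4)])
    # Four 8-byte (`ld`) elements, D=0: stride of 8 bytes
    print([hex(ea) for ea in unit_strided_eas(0x1000, 0, 8, 4)])
    # Order in which 8 elements would be visited under bit-reversal
    print(bit_reverse_order(8))
```

With a base of 0x1000, the element-strided call steps by D (16 bytes) per
element, while the unit-strided call steps by the 8-byte width of `ld`.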

diff --git a/openpower/sv/rfc/ls001.mdwn b/openpower/sv/rfc/ls001.mdwn
index 1ef398dd0..6f02468e1 100644
--- a/openpower/sv/rfc/ls001.mdwn
+++ b/openpower/sv/rfc/ls001.mdwn
@@ -221,15 +221,13 @@ the same space):
 \newpage{}
 # SVP64 24-bit Prefixes
 
-The SVP64 24-bit Prefix (RM) provides several options,
-all fitting within the 24-bit space (and no other).
+The SVP64 24-bit Prefix (RM) options aim to reduce instruction count
+and assembler complexity.
 These Modes do not interact with SVSTATE per se.  SVSTATE
 primarily controls the looping (quantity, order), RM
 influences the *elements* (the Suffix).  There is however
 some close interaction when it comes to predication.
 REMAP is outlined separately.
-The primary options all of which are aimed at reducing instruction
-count and reducing assembler complexity are:
 
 * **element-width overrides**, which dynamically redefine each SFFS or SFS
   Scalar prefixed instruction to be 8-bit, 16-bit, 32-bit or 64-bit
@@ -342,6 +340,52 @@ set to the next instruction (CIA+8). For example it may be used to reduce
 CTR by the number of bits set in a GPR, if that GPR is given as the predicate
 mask `sv.bc/pm=r3`.
 
+# LD/ST RM Modes
+
+Traditional Vector ISAs have vastly more (and more complex) addressing
+modes than Scalar ISAs: unit-strided, element-strided, Indexed and
+Structure Packing.  All of these had to be added on top of **existing
+Scalar instructions without modifying the Scalar instructions**.  A small
+conceptual "cheat" was therefore needed.  In some Modes the Immediate (D)
+is multiplied by the element index, which gives element-strided.
+For unit-strided the width of the operation (8 bytes for `ld`) is taken
+as the multiplier instead.  The hardware-level modifications needed to
+support this "cheat" on top of pre-existing Scalar HDL (and Simulators)
+have in both cases turned out to be minimal.
+
+Also added was the option to perform signed or unsigned Effective
+Address calculation, which comes into play only on LD/ST Indexed,
+when elwidth overrides are used.  Another quirk is that `RA` is never
+allowed to have its width altered.  It remains 64-bit, as it is
+the Base Address.
+
+One potential source of confusion is the unfortunate naming of LD/ST
+Indexed and REMAP Indexed: some care is taken in the spec to distinguish
+the two.  LD/ST Indexed is Scalar `EA=RA+RB` (where **either** RA or RB
+may be marked as Vectorised), and that Vector of RA (or RB) is read
+in the usual linear sequential order.
+REMAP Indexed, by contrast, affects the
+**order** in which the Vector of RA (or RB) is accessed,
+according to a schedule determined by *another* vector of offsets
+in the register file.  Effectively this combines VSX `vperm`
+back-to-back with LD/ST operations *in the calculation of each
+Effective Address* in one instruction.
+
+For DCT and FFT, normally it is very expensive to perform the
+"bit-inversion" needed for address calculation and/or reordering
+of elements.  DCT in particular needs both bit-inversion *and
+Gray-Coding* offsets.  DCT/FFT REMAP **automatically** performs
+the required offset adjustment to get data loaded and stored in
+the required order.  Matrix REMAP can likewise perform up to three
+Dimensions of reordering (on both Immediate and Indexed), and
+when combined with vec2/3/4 the reordering can even go as far as
+four dimensions (four nested fixed-size loops).
+
+Overall the LD/ST Modes available are extremely powerful, especially
+when combining arithmetic (lharx) with saturation, element-width overrides,
+vec2/3/4 Structure Packing *and* REMAP.  The combinations far exceed anything
+seen in any other Vector ISA in history.
+
 # SVP64Single 24-bits
 
 The `SVP64-Single` 24-bit encoding focusses primarily on ensuring that
-- 
2.30.2