From b6afa537cef112bd3e12bad665f79edc5805196b Mon Sep 17 00:00:00 2001
From: lkcl
Date: Sun, 18 Sep 2022 10:25:48 +0100
Subject: [PATCH]

---
 openpower/sv/rfc/ls001.mdwn | 52 ++++++++++++++++++++++++++++++++++---
 1 file changed, 48 insertions(+), 4 deletions(-)

diff --git a/openpower/sv/rfc/ls001.mdwn b/openpower/sv/rfc/ls001.mdwn
index 1ef398dd0..6f02468e1 100644
--- a/openpower/sv/rfc/ls001.mdwn
+++ b/openpower/sv/rfc/ls001.mdwn
@@ -221,15 +221,13 @@ the same space):
 \newpage{}
 # SVP64 24-bit Prefixes
 
-The SVP64 24-bit Prefix (RM) provides several options,
-all fitting within the 24-bit space (and no other).
+The SVP64 24-bit Prefix (RM) options aim to reduce instruction count
+and assembler complexity.
 These Modes do not interact with SVSTATE per se. SVSTATE primarily controls the looping (quantity, order), RM influences the *elements* (the Suffix). There is however some close interaction when it comes to predication. REMAP is outlined separately.
-The primary options all of which are aimed at reducing instruction
-count and reducing assembler complexity are:
 
 * **element-width overrides**, which dynamically redefine each SFFS or SFS Scalar prefixed instruction to be 8-bit, 16-bit, 32-bit or 64-bit
@@ -342,6 +340,52 @@ set to the next instruction (CIA+8). For example it may be used to
 reduce CTR by the number of bits set in a GPR, if that GPR is given as the predicate mask `sv.bc/pm=r3`.
 
+# LD/ST RM Modes
+
+Traditional Vector ISAs have vastly more (and more complex) addressing
+modes: unit-strided, element-strided, Indexed, Structure Packing. All
+of these had to be jammed in on top of **existing Scalar instructions
+without modifying the Scalar instructions**. A small conceptual
+"cheat" was therefore needed. The Immediate (D) is in some Modes
+multiplied by the element index, which gives us element-strided.
+For unit-strided the width of the operation (`ld`, 8 byte) is taken
+as the multiplier. Hardware-level modifications to support this
+"cheat" on top of pre-existing Scalar HDL (and Simulators)
+have both turned out to be minimal.
+
+Also added was the option to perform signed or unsigned Effective
+Address calculation, which comes into play only on LD/ST Indexed,
+when elwidth overrides are used. Another quirk: `RA` is never
+allowed to have its width altered: it remains 64-bit, as it is
+the Base Address.
+
+One confusing thing is the unfortunate naming of LD/ST Indexed and
+REMAP Indexed: some care is taken in the spec to discern the two.
+LD/ST Indexed is Scalar `EA=RA+RB` (where **either** RA or RB
+may be marked as Vectorised), where obviously that Vector of RA
+(or RB) is read in the usual linear sequential
+fashion. REMAP Indexed affects the
+**order** in which the Vector of RA (or RB) is accessed,
+according to a schedule determined by *another* vector of offsets
+in the register file. Effectively this combines VSX `vperm`
+back-to-back with LD/ST operations *in the calculation of each
+Effective Address* in one instruction.
+
+For DCT and FFT, normally it is very expensive to perform the
+"bit-inversion" needed for address calculation and/or reordering
+of elements. DCT in particular needs both bit-inversion *and
+Gray-Coding* offsets. DCT/FFT REMAP **automatically** performs
+the required offset adjustment to get data loaded and stored in
+the required order. Matrix REMAP can likewise perform up to 3
+Dimensions of reordering (on both Immediate and Indexed), and
+when combined with vec2/3/4 the reordering can even go as far as
+four dimensions (four nested fixed size loops).
+
+Overall the LD/ST Modes available are extremely powerful, especially
+when combining arithmetic (lharx) with saturation, element-width overrides,
+vec2/3/4 Structure Packing *and* REMAP: the combinations far exceed anything
+seen in any other Vector ISA in history.
+
 # SVP64Single 24-bits
 
 The `SVP64-Single` 24-bit encoding focusses primarily on ensuring that
-- 
2.30.2
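
The addressing modes described in the `LD/ST RM Modes` section added by this patch can be sketched roughly as below. This is an illustrative Python model only, not the specification pseudocode: the function names (`strided_eas`, `indexed_eas`), the `mode` strings and the `schedule` parameter are hypothetical, and it is an assumption here that in unit-strided mode the Immediate `D` is still added once as a plain displacement. Only the scalar-`RA`, Vectorised-`RB` case of Indexed is modelled.

```python
MASK64 = (1 << 64) - 1  # RA is never narrowed: Effective Addresses stay 64-bit


def strided_eas(ra, d, vl, mode, op_width=8):
    """Per-element Effective Addresses for the two Immediate (D) modes."""
    if mode == "element-strided":
        # The Immediate D is multiplied by the element index.
        offsets = [d * i for i in range(vl)]
    elif mode == "unit-strided":
        # The operation width (8 bytes for `ld`) is the multiplier.
        # Assumption: D is added once, unmultiplied.
        offsets = [d + op_width * i for i in range(vl)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [(ra + off) & MASK64 for off in offsets]


def indexed_eas(ra, rb_vector, schedule=None):
    """EA=RA+RB with RB Vectorised and RA scalar.

    schedule=None reads RB in linear sequential order (plain LD/ST Indexed).
    A schedule (hypothetical stand-in for REMAP Indexed) changes the ORDER
    in which the elements of RB are read.
    """
    if schedule is None:
        schedule = range(len(rb_vector))
    return [(ra + rb_vector[j]) & MASK64 for j in schedule]


if __name__ == "__main__":
    print([hex(a) for a in strided_eas(0x1000, 16, 4, "element-strided")])
    print([hex(a) for a in strided_eas(0x1000, 0, 4, "unit-strided")])
    print([hex(a) for a in indexed_eas(0x2000, [0, 8, 16, 24], [3, 1, 2, 0])])
```

The point of keeping `indexed_eas` separate is the naming clash the text warns about: the offsets in `rb_vector` are the LD/ST Indexed part, while `schedule` only permutes which of those offsets is used for each element.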