* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
* [[simple_v_extension/specification/ld.x]]
+# Rationale
+
+All Vector ISAs dating back fifty years have extensive and comprehensive
+Load and Store operations that go far beyond the capabilities of Scalar
+RISC or CISC processors, yet at their heart, on an individual element
+basis, they are no different from RISC Scalar equivalents.
+
+The resource savings from Vector LD/ST are significant and stem from
+the fact that one single instruction can trigger a dozen (or, in some
+microarchitectures such as Cray or NEC SX Aurora, hundreds of)
+element-level Memory accesses.
+
+Additionally, and simply: if the Arithmetic side of an ISA supports
+Vector Operations, then in order to keep the ALUs 100% occupied the
+Memory infrastructure (and the ISA itself) correspondingly needs Vector
+Memory Operations as well.
+
+Vectorised Load and Store also presents an extra dimension (literally)
+which creates scenarios unique to Vector applications, that a Scalar
+(and even a SIMD) ISA simply never encounters. SVP64 endeavours to
+add such modes without changing the behaviour of the underlying Base
+(Scalar) v3.0B operations.
+
+# Modes overview
+
Vectorisation of Load and Store requires creation, from scalar operations,
of a number of different modes:
* fixed stride (contiguous sequence with no gaps) aka "unit" stride
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
-* fail-first on the same (where it makes sense to do so)
+* Speculative fail-first (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).
+Also included in SVP64 LD/ST are both signed and unsigned Saturation,
+as well as Element-width overrides and Twin-Predication.
+
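The three addressing patterns listed above can be sketched in generic terms as follows. This is an illustration of the access patterns only, not SVP64's exact Effective Address calculation; the function names and parameters are hypothetical and chosen for this example.

```python
# Generic sketch of the three Vector LD/ST addressing patterns.
# All names here are hypothetical; this is NOT the SVP64 EA pseudocode.

def unit_stride(base, elwidth_bytes, vl):
    """Contiguous elements, no gaps: each address follows the previous."""
    return [base + i * elwidth_bytes for i in range(vl)]

def element_strided(base, stride_bytes, vl):
    """Regularly offset elements: gaps appear when stride > element width."""
    return [base + i * stride_bytes for i in range(vl)]

def vector_indexed(bases, offsets):
    """One base and one offset per element, added pairwise."""
    return [b + o for b, o in zip(bases, offsets)]
```

With 8-byte elements, `unit_stride(0x1000, 8, 4)` touches four adjacent doublewords, whereas `element_strided(0x1000, 24, 3)` skips 16 bytes between each element.
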
+# Vectorisation of Scalar Power ISA v3.0B
+
OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:
ireg[RT+j] <= MEM[EA];
if (!RT.isvec)
break # destination scalar, end immediately
- if (!RA.isvec && !RB.isvec)
- break # scalar-scalar
+ if svctx.ldstmode != elementstride:
+ if (!RA.isvec && !RB.isvec)
+ break # scalar-scalar
if (RA.isvec) i++;
if (RAupdate.isvec) u++;
if (RB.isvec) k++;
| 0-1 | 2 | 3 4 | description |
| --- | --- |---------|--------------------------- |
-| 00 | 0 | dz els | normal mode |
-| 00 | 1 | dz shf | shift mode |
+| 00 | 0 | zz els | normal mode |
+| 00 | 1 | zz shf | shift mode |
| 01 | inv | CR-bit | Rc=1: ffirst CR sel |
| 01 | inv | els RC1 | Rc=0: ffirst z/nonz |
-| 10 | N | dz els | sat mode: N=0/1 u/s |
+| 10 | N | zz els | sat mode: N=0/1 u/s |
| 11 | inv | CR-bit | Rc=1: pred-result CR sel |
| 11 | inv | els RC1 | Rc=0: pred-result z/nonz |
| 01 | SEA | dz sz | Strided (scalar only source) |
| 10 | N | dz sz | sat mode: N=0/1 u/s |
| 11 | inv | CR-bit | Rc=1: pred-result CR sel |
-| 11 | inv | dz RC1 | Rc=0: pred-result z/nonz |
+| 11 | inv | zz RC1 | Rc=0: pred-result z/nonz |
Vector Indexed Strided Mode is qualified as follows:
RA in order to calculate the Effective Address, if SEA is
set RB is sign-extended from elwidth bits to the full 64
bits. For other Modes (ffirst, saturate),
-all EA computation is unsigned.
+all EA computation with elwidth overrides is unsigned.
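
The effect of SEA on an element-width-overridden RB can be sketched as below. This is a minimal illustration: the helper names are hypothetical, and treating RB as zero-extended when SEA is clear is an assumption based on the statement that other EA computation is unsigned.

```python
# Sketch of sign- versus zero-extension of RB from elwidth bits to 64.
# Helper names are hypothetical, not taken from the specification.

MASK64 = (1 << 64) - 1

def sign_extend(value, elwidth):
    """Sign-extend the low `elwidth` bits of value to a 64-bit pattern."""
    value &= (1 << elwidth) - 1
    sign = 1 << (elwidth - 1)
    return ((value ^ sign) - sign) & MASK64

def zero_extend(value, elwidth):
    """Keep only the low `elwidth` bits; upper bits become zero."""
    return value & ((1 << elwidth) - 1)

def indexed_ea(ra, rb, elwidth, sea):
    """EA = RA + RB, with RB extended according to SEA (assumption)."""
    if sea:
        offset = sign_extend(rb, elwidth)
    else:
        offset = zero_extend(rb, elwidth)
    return (ra + offset) & MASK64
```

With an 8-bit elwidth, an RB element of `0xFF` acts as an offset of -1 when SEA is set and +255 when it is clear.
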
Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will perform **multiple** LD/ST operations, sequentially. `ldcix` even with scalar src will read the same memory location *multiple times*, storing the result in successive Vector destination registers. This is because the cache-inhibit instructions are used to read and write memory-mapped peripherals.
If a genuine cache-inhibited LD-VSPLAT is required then a *scalar*
## LD/ST ffirst
+LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
+ordinary one. Exceptions occur "as normal". However for elements 1
+and above, if an exception would occur, then VL is **truncated** to the
+previous element: the exception is **not** then raised because the
+LD/ST was effectively speculative.
+
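The behaviour described above can be modelled with a small sketch. It assumes a toy memory (a dictionary in which a missing address stands in for a faulting access) and reads "truncated to the previous element" as VL becoming the index of the faulting element; `MemFault` and `ffirst_load` are hypothetical names, not part of the specification.

```python
# Toy model of Fail-First LD: element 0 faults normally, later faults
# truncate VL instead of raising. All names are hypothetical.

class MemFault(Exception):
    """Stands in for any exception an ordinary LD would raise."""

def ffirst_load(mem, addrs, vl):
    """Return (loaded_values, new_vl) for a fail-first vector load."""
    results = []
    for i in range(vl):
        addr = addrs[i]
        if addr not in mem:          # this element would fault
            if i == 0:
                raise MemFault(addr)  # first element: ordinary exception
            return results, i         # later element: truncate VL, no trap
        results.append(mem[addr])
    return results, vl
```

For example, with only addresses 0 and 8 mapped, a four-element load of `[0, 8, 16, 24]` returns two values and a new VL of 2, and no exception is raised.
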
ffirst LD/ST to multiple pages via a Vectorised Index base is considered a security risk due to the abuse of probing multiple pages in rapid succession and getting feedback on which pages would fail. Therefore Vector Indexed LD/ST is prohibited entirely, and the Mode bit instead used for element-strided LD/ST. See <https://bugs.libre-soc.org/show_bug.cgi?id=561>
for(i = 0; i < VL; i++)
at least not require applications to be rewritten.
Low-performance simpler hardware implementations may
-choose to also set VL=1 as the bare minimum compliant implementation of
+also choose to always set VL=1 as the bare minimum compliant implementation of
LD/ST Fail-First. It is however critically important to remember that
the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.
**MUST** raise exceptions exactly like an ordinary LD/ST.
+For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
+for any implementation-specific reason. For example: it is perfectly
+reasonable for implementations to alter VL when ffirst LD or ST
+operations are initiated on a nonaligned boundary, such that the next
+iteration of a loop begins its ffirst LD/ST operations on an aligned
+boundary such as the beginning of a cache line, or beginning of a
+Virtual Memory page. Implementations may likewise truncate VL to reduce
+workloads or balance resources.
+
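One way an implementation might pick such a truncated VL is sketched below. It assumes unit-stride accesses and a 64-byte boundary; the function name and its parameters are hypothetical, and this is only one possible policy among many.

```python
# Sketch: truncate VL so the next loop iteration starts on a boundary.
# Assumes unit-stride elements; names and the 64-byte default are
# illustrative assumptions, not requirements of the specification.

def truncate_vl_to_boundary(start_addr, elwidth_bytes, vl, boundary=64):
    """Return a nonzero VL that stops at the next aligned boundary."""
    if vl == 0:
        return 0                      # nothing to execute
    misalign = start_addr % boundary
    if misalign == 0:
        return vl                     # already aligned: keep full VL
    elems = (boundary - misalign) // elwidth_bytes
    # Always make forward progress: never truncate to zero.
    return max(1, min(vl, elems))
```

Starting at `0x1030` with 8-byte elements and VL=16, this policy runs two elements in the first iteration, so the following iteration begins at the aligned address `0x1040`.
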
+Vertical-First Mode is slightly strange in that only one element
+at a time is ever executed anyway. Given that programmers may
+legitimately choose to alter srcstep and dststep in non-sequential
+order as part of explicit loops, it is neither possible nor
+safe to make speculative assumptions about future LD/STs.
+Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`.
+This is very different from Arithmetic (Data-dependent) FFirst
+where Vertical-First Mode is deterministic, not speculative.
+
# LOAD/STORE Elwidths <a name="elwidth"></a>
Loads and Stores are almost unique in that the OpenPOWER Scalar ISA