[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>

# Rationale

All Vector ISAs dating back fifty years have extensive and comprehensive
Load and Store operations that go far beyond the capabilities of Scalar
RISC or CISC processors, yet at their heart on an individual element
basis may be found to be no different from RISC Scalar equivalents.

The resource savings from Vector LD/ST are significant and stem from
the fact that one single instruction can trigger a dozen (or in some
microarchitectures such as Cray or NEC SX Aurora) hundreds of element-level Memory accesses.

Additionally, and simply: if the Arithmetic side of an ISA supports
Vector Operations, then in order to keep the ALUs 100% occupied the
Memory infrastructure (and the ISA itself) correspondingly needs Vector
Memory Operations as well.

Vectorised Load and Store also presents an extra dimension (literally)
which creates scenarios unique to Vector applications, that a Scalar
(and even a SIMD) ISA simply never encounters.  SVP64 endeavours to
add such modes without changing the behaviour of the underlying Base
(Scalar) v3.0B operations.

# Modes overview

Vectorisation of Load and Store requires creation, from scalar operations,
a number of different modes:

* fixed aka "unit" stride (contiguous sequence with no gaps)
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* Speculative fail-first (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).

Also included in SVP64 LD/ST is both signed and unsigned Saturation,
as well as Element-width overrides and Twin-Predication.

*Despite being constructed from Scalar LD/ST none of these Modes
exist or make sense in any Scalar ISA. They **only** exist in Vector ISAs*

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk.  Fail-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace, where
strided fail-first (by creating contiguous sequential LDs) does not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
 Vector source RA makes no sense either (or, is a quirk). Realistically we need
an alternative table meaning for [[sv/svp64]] mode.  The following modes make sense:

* saturation
* predicate-result (mostly for cache-inhibited LD/ST)
* normal
* fail-first (where Vector Indexed is banned)
* Signed Effective Address computation (Vector Indexed only)

More than that however it is necessary to fit the usual Vector ISA
capabilities onto both Power ISA LD/ST with immediate and to
LD/ST Indexed. They present subtly different Mode tables.

**LD/ST immediate**

The table for [[sv/svp64]] for `immed(RA)` is:

| 0-1 |  2  |  3   4  |  description               |
| --- | --- |---------|--------------------------- |
| 00  | 0   |  zz els | normal mode                |
| 00  | 1   |  rsvd   | reserved                   |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel        |
| 01  | inv | els RC1 |  Rc=0: ffirst z/nonz       |
| 10  |   N | zz  els |  sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  |  Rc=1: pred-result CR sel  |
| 11  | inv | els RC1 |  Rc=0: pred-result z/nonz  |

The `els` bit is only relevant when `RA.isvec` is clear: this indicates
whether stride is unit or element:

    if RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
in effect the multiplication of the immediate-offset by zero results
in reading from the exact same memory location.

For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
just the once and be copied, rather than hitting the Data Cache
multiple times with the same memory read at the same location.
This would allow for memory-mapped peripherals to have multiple
data values read in quick succession and stored in sequentially
numbered registers.

For non-cache-inhibited ST from a vector source onto a scalar
destination: with the Vector
loop effectively creating multiple memory writes to the same location,
we can deduce that the last of these will be the "successful" one. Thus,
implementations are free and clear to optimise out the overwriting STs,
leaving just the last one as the "winner".  Bear in mind that predicate
masks will skip some elements (in source non-zeroing mode).
Cache-inhibited ST operations on the other hand **MUST** write out
a Vector source multiple successive times to the exact same Scalar
destination.

Note that there are no immediate versions of cache-inhibited LD/ST.

**LD/ST Indexed**

The modes for `RA+RB` indexed version are slightly different:

| 0-1 |  2  |  3   4  |  description              |
| --- | --- |---------|-------------------------- |
| 00  | SEA |  dz  sz | normal mode        |
| 01  | SEA | dz sz   | Strided (scalar only source)   |
| 10  |   N | dz   sz |  sat mode: N=0/1 u/s |
| 11  | inv | CR-bit  |  Rc=1: pred-result CR sel |
| 11  | inv | zz  RC1 |  Rc=0: pred-result z/nonz |

Vector Indexed Strided Mode is qualified as follows:

    if mode = 0b01 and !RA.isvec and !RB.isvec:
        svctx.ldstmode = elementstride

A summary of the effect of Vectorisation of src or dest:
 
     imm(RA)  RT.v   RA.v   no stride allowed
     imm(RA)  RT.s   RA.v   no stride allowed
     imm(RA)  RT.v   RA.s   stride-select allowed
     imm(RA)  RT.s   RA.s   not vectorised
     RA,RB    RT.v  {RA|RB}.v UNDEFINED
     RA,RB    RT.s  {RA|RB}.v UNDEFINED
     RA,RB    RT.v  {RA&RB}.s VSPLAT possible. stride selectable
     RA,RB    RT.s  {RA&RB}.s not vectorised

Signed Effective Address computation is only relevant for
Vector Indexed Mode, when elwidth overrides are applied.
The source override applies to RB, and before adding to
RA in order to calculate the Effective Address, if SEA is
set RB is sign-extended from elwidth bits to the full 64
bits.  For other Modes (ffirst, saturate),
all EA computation with elwidth overrides is unsigned.

Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will perform **multiple** LD/ST operations, sequentially.  `ldcix` even with scalar src will read the same memory location *multiple times*, storing the result in successive Vector destination registers.  This because the cache-inhibit instructions are used to read and write memory-mapped peripherals.
If a genuine cache-inhibited LD-VSPLAT is required then a *scalar*
cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv.

# Vectorisation of Scalar Power ISA v3.0B

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride or element stride.  With there being no way to tell which from the OpenPOWER v3.0B Scalar opcode alone, the choice is provided instead by the SV Context.

    # LD not VLD!  format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip nonpredicates elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

Indexed LD is:
 
    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
        # skip nonpredicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
            EA = ireg[RA] + ireg[RB]*j   # register-strided
        else
            EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if svctx.ldstmode != elementstride:
            if (!RA.isvec && !RB.isvec)
                break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update" mode (`ldux`) to be effectively a *completely different* register from RA-as-a-source.  This because there is room in svp64 to extend RA-as-src as well as RA-as-dest, both independently as scalar or vector *and* independently extending their range.

# LD/ST ffirst

LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
ordinary one.  Exceptions, if any are needed occur "as normal" exactly as
they would on any Scalar v3.0 Power ISA LD/ST. However for elements 1
and above, if an exception would occur, then VL is **truncated** to the
previous element: the exception is **not** then raised because the
LD/ST that would otherwise have caused an exception is *required* to be cancelled.

ffirst LD/ST to multiple pages via a Vectorised Index base is considered a security risk due to the abuse of probing multiple pages in rapid succession and getting feedback on which pages would fail.  Therefore Vector Indexed LD/ST is prohibited entirely, and the Mode bit instead used for element-strided LD/ST.  See <https://bugs.libre-soc.org/show_bug.cgi?id=561>

    for(i = 0; i < VL; i++)
        reg[rt + i] = mem[reg[ra] + i * reg[rb]];

High security implementations where any kind of speculative probing
of memory pages is considered a risk should take advantage of the fact that
implementations may truncate VL at any point, without requiring software
to be rewritten and made non-portable. Such implementations may choose
to *always* set VL=1 which will have the effect of terminating any
speculative probing (and also adversely affect performance), but will
at least not require applications to be rewritten.

Low-performance simpler hardware implementations may also
choose (always) to also set VL=1 as the bare minimum compliant implementation of
LD/ST Fail-First. It is however critically important to remember that
the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.
**MUST** raise exceptions exactly like an ordinary LD/ST.

For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value for any implementation-specific reason. For example: it is perfectly reasonable for implementations to alter VL when ffirst LD or ST operations are initiated on a nonaligned boundary, such that within a loop the subsequent iteration of that loop begins the following ffirst LD/ST operations on an aligned boundary
such as the beginning of a cache line, or beginning of a Virtual Memory
page. Likewise, to reduce workloads or balance resources.

Vertical-First Mode is slightly strange in that only one element
at a time is ever executed anyway.  Given that programmers may
legitimately choose to alter srcstep and dststep in non-sequential
order as part of explicit loops, it is neither possible nor
safe to make speculative assumptions about future LD/STs.
Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`.
This is very different from Arithmetic (Data-dependent) FFirst
where Vertical-First Mode is fully deterministic, not speculative.

# LOAD/STORE Elwidths <a name="elwidth"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld).  Only `extsb` and
others like it provide an explicit operation width.  There are therefore
*three* widths involved:

* operation width (lb=8, lh=16, lw=32, ld=64)
* src elelent width override
* destination element width override

Some care is therefore needed to express and make clear the transformations, 
which are expressly in this order:

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
   - zero-extension or truncation from operation width to source elwidth
   - zero/truncation to dest elwidth
* Saturated mode:
   - Sign-extension or truncation from operation width to source width
   - signed/unsigned saturation down to dest elwidth

In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation.  This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

It is unfortunately possible to request an elwidth override on the memory side which
does not mesh with the operation width: these result in `UNDEFINED`
behaviour.  The reason is that the effect of attempting a 64-bit `sv.ld`
operation with a source elwidth override of 8/16/32 would result in
overlapping memory requests, particularly on unit and element strided
operations.  Thus it is `UNDEFINED` when the elwidth is smaller than
the memory operation width. Examples include `sv.lw/sw=16/els` which
requests (overlapping) 4-byte memory reads offset from
each other at 2-byte intervals.  Store likewise is also `UNDEFINED`
where the dest elwidth override is less than the operation width.

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svpctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;

# Remapped LD/ST

In the [[sv/remap]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs.  The usual interest in such re-mapping
is for example in separating out 24-bit RGB channel data into separate
contiguous registers.  NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.

Also LD/ST with immediate has a Pack/Unpack option similar to VSX
'vpack' and 'vunpack'.

# notes from lxo

this section covers assembly notation for the immediate and indexed LD/ST.
the summary is that in immediate mode for LD it is not clear that if the 
destination register is Vectorised `RT.v` but the source `imm(RA)` is scalar
the memory being read is *still a vector load*, known as "unit or element strides".

This anomaly is made clear with the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being implicitly identical to the above, is prohibited and is a syntax error:
 
    sv.ld RT.v, imm(RA)

Notes taken from IRC conversation

    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other memory-operand constraints), append .v to it and you're done addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the address is a sum of registers

permutations of vector selection, to identify above asm-syntax:

     imm(RA)  RT.v   RA.v   nonstrided
         sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
           mem@     0+r#2   offs+(r#2+1)  offs+(r#2+2)
           destreg  r#      r#+1          r#+2
     imm(RA)  RT.s   RA.v   nonstrided
         sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
           (dest r# is scalar) -> VSELECT mode
     imm(RA)  RT.v   RA.s   fixed stride: unit or element
         sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
           mem@r#2  +0   +1   +2
           destreg  r#   r#+1 r#+2
         sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
           mem@r#2  +0 ...   +offs ...  +offs*2
           destreg  r#       r#+1       r#+2
     imm(RA)  RT.s   RA.s   not vectorised
         sv.ld r#, ofst(r#2)

indexed mode:

     RA,RB    RT.v  RA.v  RB.v
        sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
     RA,RB    RT.v  RA.s  RB.v
        sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
     RA,RB    RT.v  RA.v  RB.s
        sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
     RA,RB    RT.v  RA.s  RB.s
        sv.ldx r#.v, r#2, r#3 -> VSPLAT mode
     RA,RB    RT.s  RA.v  RB.v
     RA,RB    RT.s  RA.s  RB.v
     RA,RB    RT.s  RA.v  RB.s
     RA,RB    RT.s  RA.s  RB.s not vectorised