[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>

Vectorisation of Load and Store requires creation, from scalar operations,
a number of different types:

* fixed stride (contiguous sequence with no gaps)
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* fail-first on the same (where it makes sense to do so)

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be set to the element width
in order to give unit stride.

At the minimum however it is possible to provide unit stride and vector
mode, as follows:

    # LD not VLD!
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, update):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        # skip nonpredicates elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = i * op_width
        elif RA.isvec:
          # indirect mode (multi mode)
          srcbase = ireg[RA+i]
          offs = immed;
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA? load from memory
        if update: ireg[rsv+i] = EA;
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RT.isvec) j++;

Indexed LD is:
 
    function op_ldx(RT, RA, RB, update=False) # LD not VLD!
      rdv = map_dest_extra(RT);
      rsv = map_src_extra(RA);
      rso = map_src_extra(RB);
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0; i < VL && j < VL && k < VL):
        # skip nonpredicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[rsv+i] + ireg[rso+k] # indexed address
        if update: ireg[rsv+i] = EA
        ireg[rdv+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
            break # scalar-scalar
        if (RA.isvec) i++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk.  Fail-first on Vector Indexed
allows attackers to probe large numbers of pages from userspace, where
strided fail-first (by creating contiguous sequential LDs) does not.

In addition, even in other modes, Vector source RA makes no sense for
computing offsets, and reduce mode even less.  Realistically we need
an alternative table meaning for [[sv/svp64]] mode.

TODO

| 0-1 |  2  |  3   4  |  description              |
| --- | --- |---------|-------------------------- |
| 00  |   0 |  sz  dz | normal mode                      |
| 00  |   1 | sz CRM  | reduce mode (mapreduce), SUBVL=1 |
| 00  |   1 | SVM CRM | subvector reduce mode, SUBVL>1   |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel              |
| 01  | inv | sz  RC1 |  Rc=0: ffirst z/nonz |
| 10  |   N | sz   dz |  sat mode: N=0/1 u/s |
| 11  | inv | CR-bit  |  Rc=1: pred-result CR sel |
| 11  | inv | sz  RC1 |  Rc=0: pred-result z/nonz |

# LOAD/STORE Elwidths <a name="ldst"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld).  Only `extsb` and
others like it provide an explicit operation width.  In order to fit the
different types of LD/ST Modes into SV the src elwidth field is used to
select that Mode, and the actual src elwidth is implicitly the same as
the operation width.  We then still apply Twin Predication but using:

* operation width (lb=8, lh=16, lw=32, ld=64) as src elwidth
* destination element width override

Saturation (and other transformations) occur on the value loaded from
memory as if it was an "infinite bitwidth", sign-extended (if Saturation
requests signed) from the source width (lb, lh, lw, ld) followed then
by the actual Saturation to the destination width.

In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation.  This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if RA.isvec:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit stride mode, compute the address
            srcbase = ireg[RA] + i * op_width;

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # now truncate/extend to over-ridden width.
        if not svpctx.saturation_mode:
            memread = adjust_wid(memread, op_width, svctx.dest_elwidth)
        else:
            ... saturation adjustment...

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;

When RA is marked as Vectorised the mode switches to an anomalous
version similar to Indexed.  The element indices increment to select a
64 bit base address, effectively as if the src elwidth was hard-set to
"default".  The important thing to note is that `i*op_width` is *not*
added on to the base address unless RA is marked as a scalar address.

# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs.  The usual interest in such re-mapping
is for example in separating out 24-bit RGB channel data into separate
contiguous registers.  NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.