[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>

# Rationale

All Vector ISAs dating back fifty years have extensive and comprehensive
Load and Store operations that go far beyond the capabilities of Scalar
RISC or CISC processors, yet at their heart, on an individual element
basis, may be found to be no different from RISC Scalar equivalents.

The resource savings from Vector LD/ST are significant and stem from
the fact that one single instruction can trigger a dozen (or, in some
microarchitectures such as Cray or NEC SX Aurora, hundreds of)
element-level Memory accesses.

Additionally, and simply: if the Arithmetic side of an ISA supports
Vector Operations, then in order to keep the ALUs 100% occupied the
Memory infrastructure (and the ISA itself) correspondingly needs Vector
Memory Operations as well.

Vectorised Load and Store also presents an extra dimension (literally)
which creates scenarios unique to Vector applications, that a Scalar
(and even a SIMD) ISA simply never encounters. SVP64 endeavours to
add such modes without changing the behaviour of the underlying Base
(Scalar) v3.0B operations.

# Modes overview

Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different modes (the first three of which
are sketched below):

* **fixed aka "unit" stride** - contiguous sequence with no gaps
* **element strided** - sequential but regularly offset, with gaps
* **vector indexed** - vector of base addresses and vector of offsets
* **Speculative fail-first** - where it makes sense to do so
* **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.

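A minimal illustrative sketch (not part of the specification: the names
`base`, `stride`, `bases` and `offsets` are hypothetical) of the element
addresses that each of the three basic modes generates, for `VL` elements
of `width` bytes each:

    # illustrative only: element addresses produced by each mode
    def unit_stride(base, width, VL):
        return [base + i * width for i in range(VL)]       # no gaps

    def element_stride(base, stride, VL):
        return [base + i * stride for i in range(VL)]      # regular gaps

    def vector_indexed(bases, offsets, VL):
        return [bases[i] + offsets[i] for i in range(VL)]  # arbitrary

    # unit_stride(0x1000, 8, 4)     -> [0x1000, 0x1008, 0x1010, 0x1018]
    # element_stride(0x1000, 32, 4) -> [0x1000, 0x1020, 0x1040, 0x1060]
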
Also included in SVP64 LD/ST are both signed and unsigned Saturation,
as well as Element-width overrides and Twin-Predication.

*Despite being constructed from Scalar LD/ST, none of these Modes
exist or make sense in any Scalar ISA. They **only** exist in Vector ISAs*

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk. Fail-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace,
whereas strided fail-first (by creating contiguous sequential LDs) does not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either (or, is a quirk). Realistically
we need an alternative table meaning for [[sv/svp64]] mode. The following
modes make sense:

* saturation
* predicate-result (mostly for cache-inhibited LD/ST)
* normal
* fail-first (where Vector Indexed is banned)
* Signed Effective Address computation (Vector Indexed only)

More than that, however, it is necessary to fit the usual Vector ISA
capabilities onto both Power ISA LD/ST with immediate and
LD/ST Indexed. They present subtly different Mode tables.

Fields used in tables below:

* **sz / dz**: if predication is enabled, will put zeros into the dest
  (or as src in the case of twin pred) when the predicate bit is zero.
  Otherwise the element is ignored or skipped, depending on context.
* **zz**: both sz and dz are set equal to this flag.
* **inv CR bit**: just as in branches (BO) these bits allow testing of
  a CR bit and whether it is set (inv=0) or unset (inv=1)
* **N**: sets signed/unsigned saturation.
* **RC1**: as if Rc=1, stores CRs *but not the result*
* **SEA**: Signed Effective Address, if enabled performs sign-extension on
  registers that have been reduced due to elwidth overrides

**LD/ST immediate**

The table for [[sv/svp64]] for `immed(RA)` is:

| 0-1 |  2  |   3 4   | description                |
| --- | --- | ------- | -------------------------- |
| 00  | 0   | zz els  | normal mode                |
| 00  | 1   | rsvd    | reserved                   |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel        |
| 01  | inv | els RC1 | Rc=0: ffirst z/nonz        |
| 10  | N   | zz els  | sat mode: N=0/1 u/s        |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel   |
| 11  | inv | els RC1 | Rc=0: pred-result z/nonz   |

The `els` bit is only relevant when `RA.isvec` is clear: this indicates
whether stride is unit or element:

    if RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
in effect multiplying the element index by a zero immediate-offset
results in reading from the exact same memory location, *even with a
Vector register*. (Normally this type of behaviour is reserved for the
mapreduce modes.)

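A minimal sketch of the effect (the names `regs` and `mem` are
illustrative stand-ins for the register file and memory, not part of the
specification): with a scalar `RA` and a zero immediate the computed
offset is zero for every element, so all reads hit the same EA and splat
into successive destination registers.

    # LD-VSPLAT sketch: scalar RA, immediate of zero, Vector RT
    def ld_vsplat(regs, mem, RT, RA, VL):
        EA = regs[RA] + 0        # immed = 0: same EA for every element
        value = mem[EA]          # non-cache-inhibited: may be read just once
        for j in range(VL):
            regs[RT + j] = value # copied into successive dest registers
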
For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
just the once and be copied, rather than hitting the Data Cache
multiple times with the same memory read at the same location.
The benefit of Cache-inhibited LD-splats is that it allows
for memory-mapped peripherals to have multiple
data values read in quick succession and stored in sequentially
numbered registers (but, see Note below).

For non-cache-inhibited ST from a vector source onto a scalar
destination: with the Vector
loop effectively creating multiple memory writes to the same location,
we can deduce that the last of these will be the "successful" one. Thus,
implementations are free and clear to optimise out the overwriting STs,
leaving just the last one as the "winner". Bear in mind that predicate
masks will skip some elements (in source non-zeroing mode).
Cache-inhibited ST operations on the other hand **MUST** write out
a Vector source multiple successive times to the exact same Scalar
destination. Just like Cache-inhibited LDs, multiple values may be
written out in quick succession to a memory-mapped peripheral from
sequentially-numbered registers.

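A sketch of the cache-inhibited Vector-source ST case described above
(`mmio_write` is a hypothetical stand-in for an uncached memory write;
none of these names are part of the specification). Unlike the cacheable
case, none of the stores may be optimised away: every element must reach
the peripheral, in order.

    # cache-inhibited ST: Vector source, scalar (fixed) destination EA
    def st_cache_inhibited(regs, mmio_write, RS, EA, VL):
        for j in range(VL):
            mmio_write(EA, regs[RS + j])  # each element written to the same EA
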
Note that there are no immediate versions of cache-inhibited LD/ST
(there are no *Scalar* cache-inhibited immediate instructions to Vectorise).
A future version of the Power ISA *may* have such Scalar instructions.

**LD/ST Indexed**

The modes for the `RA+RB` indexed version are slightly different:

| 0-1 |  2  |   3 4   | description                  |
| --- | --- | ------- | ---------------------------- |
| 00  | SEA | dz sz   | normal mode                  |
| 01  | SEA | dz sz   | Strided (scalar only source) |
| 10  | N   | dz sz   | sat mode: N=0/1 u/s          |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel     |
| 11  | inv | zz RC1  | Rc=0: pred-result z/nonz     |

Vector Indexed Strided Mode is qualified as follows:

    if mode == 0b01 and !RA.isvec and !RB.isvec:
        svctx.ldstmode = elementstride

A summary of the effect of Vectorisation of src or dest:

    imm(RA)  RT.v  RA.v       no stride allowed
    imm(RA)  RT.s  RA.v       no stride allowed
    imm(RA)  RT.v  RA.s       stride-select allowed
    imm(RA)  RT.s  RA.s       not vectorised
    RA,RB    RT.v  {RA|RB}.v  Standard Indexed
    RA,RB    RT.s  {RA|RB}.v  Indexed but single LD (no VSPLAT)
    RA,RB    RT.v  {RA&RB}.s  VSPLAT possible. stride selectable
    RA,RB    RT.s  {RA&RB}.s  not vectorised (scalar identity)

Signed Effective Address computation is only relevant for
Vector Indexed Mode, when elwidth overrides are applied.
The source override applies to RB: if SEA is set, RB is
sign-extended from elwidth bits to the full 64 bits before
being added to RA in order to calculate the Effective
Address. For other Modes (ffirst, saturate),
all EA computation with elwidth overrides is unsigned.

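A minimal sketch of that calculation (helper names are illustrative,
not part of the specification); `elwidth` is the overridden source
width in bits:

    # Signed Effective Address sketch (Vector Indexed only)
    def sign_extend(value, elwidth):
        value &= (1 << elwidth) - 1
        if value & (1 << (elwidth - 1)):
            value -= 1 << elwidth          # negative: extend the sign
        return value

    def effective_address(RA_val, RB_elem, elwidth, SEA):
        if SEA:
            RB_elem = sign_extend(RB_elem, elwidth)   # signed extension
        else:
            RB_elem &= (1 << elwidth) - 1             # unsigned (zero-extend)
        return (RA_val + RB_elem) & ((1 << 64) - 1)   # 64-bit EA
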
Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will
perform **multiple** LD/ST operations, sequentially. `ldcix` even with a
scalar src will read the same memory location *multiple times*, storing
the result in successive Vector destination registers. This is because
the cache-inhibit instructions are used to read and write memory-mapped
peripherals.
If a genuine cache-inhibited LD-VSPLAT is required then a *scalar*
cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv,
copying the one *scalar* value into multiple register destinations.

Note also that cache-inhibited VSPLAT with Predicate-result is possible.
This allows, for example, a massive batch of memory-mapped
peripheral reads to be issued, stopping at the first NULL (zero) character
and truncating VL to that point. No branch is needed to issue that large
burst of LDs.

The multiple reads/writes to/from the same destination address are,
in Vector-Indexed LD/ST, very similar to the relaxed constraints of
mapreduce mode.

# Vectorisation of Scalar Power ISA v3.0B

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride. Since there is no way to tell which from the OpenPOWER
v3.0B Scalar opcode alone, the choice is provided instead by the SV Context.

    # LD not VLD!  format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, RAupdate):
        ps = get_pred_val(FALSE, RA); # predication on src
        pd = get_pred_val(FALSE, RT); # ... AND on dest
        for (i=0, j=0, u=0; i < VL && j < VL;):
            # skip nonpredicated elements
            if (RA.isvec) while (!(ps & 1<<i)) i++;
            if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
            if (RT.isvec) while (!(pd & 1<<j)) j++;
            if svctx.ldstmode == elementstride:
                # element stride mode
                srcbase = ireg[RA]
                offs = i * immed              # j*immed for a ST
            elif svctx.ldstmode == unitstride:
                # unit stride mode
                srcbase = ireg[RA]
                offs = immed + (i * op_width) # j*op_width for ST
            elif RA.isvec:
                # quirky Vector indexed mode but with an immediate
                srcbase = ireg[RA+i]
                offs = immed;
            else
                # standard scalar mode (but predicated)
                # no stride multiplier means VSPLAT mode
                srcbase = ireg[RA]
                offs = immed

            # compute EA
            EA = srcbase + offs
            # update RA?
            if RAupdate: ireg[RAupdate+u] = EA;
            # load from memory
            ireg[RT+j] <= MEM[EA];
            if (!RT.isvec)
                break # destination scalar, end now
            if (RA.isvec) i++;
            if (RAupdate.isvec) u++;
            if (RT.isvec) j++;

Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
        ps = get_pred_val(FALSE, RA); # predication on src
        pd = get_pred_val(FALSE, RT); # ... AND on dest
        for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
            # skip nonpredicated RA, RB and RT
            if (RA.isvec) while (!(ps & 1<<i)) i++;
            if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
            if (RB.isvec) while (!(ps & 1<<k)) k++;
            if (RT.isvec) while (!(pd & 1<<j)) j++;
            if svctx.ldstmode == elementstride:
                EA = ireg[RA] + ireg[RB]*j   # register-strided
            else
                EA = ireg[RA+i] + ireg[RB+k] # indexed address
            if RAupdate: ireg[RAupdate+u] = EA
            ireg[RT+j] <= MEM[EA];
            if (!RT.isvec)
                break # destination scalar, end immediately
            if svctx.ldstmode != elementstride:
                if (!RA.isvec && !RB.isvec)
                    break # scalar-scalar
            if (RA.isvec) i++;
            if (RAupdate.isvec) u++;
            if (RB.isvec) k++;
            if (RT.isvec) j++;

Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source. This is because there is room in svp64 to extend
RA-as-src as well as RA-as-dest, both independently as scalar or vector
*and* independently extending their range.

# LD/ST ffirst

LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
ordinary one. Exceptions, if any are needed, occur "as normal" exactly as
they would on any Scalar v3.0 Power ISA LD/ST. However for elements 1
and above, if an exception would occur, then VL is **truncated** to the
previous element: the exception is **not** then raised because the
LD/ST that would otherwise have caused an exception is *required* to be
cancelled.

ffirst LD/ST to multiple pages via a Vectorised Index base is considered
a security risk due to the abuse of probing multiple pages in rapid
succession and getting feedback on which pages would fail. Therefore
Vector Indexed LD/ST is prohibited entirely, and the Mode bit instead
used for element-strided LD/ST. See
<https://bugs.libre-soc.org/show_bug.cgi?id=561>

    for(i = 0; i < VL; i++)
        reg[rt + i] = mem[reg[ra] + i * reg[rb]];

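A sketch of the truncation rule above (the `would_fault` predicate is a
hypothetical stand-in for the MMU/permission check; `regs` and `mem` are
illustrative): element 0 faults "as normal", whereas a would-be fault on
any later element truncates VL instead.

    # fail-first sketch for the element-strided LD shown above
    def ldst_ffirst(regs, mem, would_fault, rt, ra, rb, VL):
        for i in range(VL):
            EA = regs[ra] + i * regs[rb]
            if i > 0 and would_fault(EA):
                VL = i                    # truncate to the last good element
                break                     # the faulting access is cancelled
            regs[rt + i] = mem[EA]        # element 0 may raise an exception
        return VL
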
High security implementations where any kind of speculative probing
of memory pages is considered a risk should take advantage of the fact that
implementations may truncate VL at any point, without requiring software
to be rewritten and made non-portable. Such implementations may choose
to *always* set VL=1 which will have the effect of terminating any
speculative probing (and also adversely affect performance), but will
at least not require applications to be rewritten.

Low-performance simpler hardware implementations may likewise
choose to *always* set VL=1 as the bare minimum compliant implementation of
LD/ST Fail-First. It is however critically important to remember that
the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.
**MUST** raise exceptions exactly like an ordinary LD/ST.

For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
for any implementation-specific reason. For example: it is perfectly
reasonable for implementations to alter VL when ffirst LD or ST operations
are initiated on a nonaligned boundary, such that within a loop the
subsequent iteration of that loop begins the following ffirst LD/ST
operations on an aligned boundary
such as the beginning of a cache line, or beginning of a Virtual Memory
page. Likewise, to reduce workloads or balance resources.

Vertical-First Mode is slightly strange in that only one element
at a time is ever executed anyway. Given that programmers may
legitimately choose to alter srcstep and dststep in non-sequential
order as part of explicit loops, it is neither possible nor
safe to make speculative assumptions about future LD/STs.
Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`.
This is very different from Arithmetic (Data-dependent) FFirst
where Vertical-First Mode is fully deterministic, not speculative.

# LOAD/STORE Elwidths <a name="elwidth"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
others like it provide an explicit operation width. There are therefore
*three* widths involved:

* operation width (lb=8, lh=16, lw=32, ld=64)
* src element width override
* destination element width override

Some care is therefore needed to express and make clear the transformations,
which are expressly in this order:

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
  - zero-extension or truncation from operation width to source elwidth
  - zero/truncation to dest elwidth
* Saturated mode:
  - Sign-extension or truncation from operation width to source width
  - signed/unsigned saturation down to dest elwidth

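A runnable sketch of that ordering for a load (byte-reversal omitted;
widths given in bits; helper and parameter names are illustrative, and
only the signed-saturation case is shown):

    # non-saturated: zero-extend/truncate to src elwidth, then to dest elwidth
    # saturated:     sign-extend at src elwidth, then saturate to dest elwidth
    def truncate(v, bits):
        return v & ((1 << bits) - 1)

    def saturate_signed(v, bits):
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
        return max(lo, min(hi, v))

    def ld_adjust(memread, src_ew, dest_ew, saturated):
        v = truncate(memread, src_ew)
        if not saturated:
            return truncate(v, dest_ew)
        if v & (1 << (src_ew - 1)):
            v -= 1 << src_ew                 # sign-extend from src elwidth
        return saturate_signed(v, dest_ew)
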
In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation. This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

It is unfortunately possible to request an elwidth override on the
memory side which
does not mesh with the operation width: these result in `UNDEFINED`
behaviour. The reason is that the effect of attempting a 64-bit `sv.ld`
operation with a source elwidth override of 8/16/32 would result in
overlapping memory requests, particularly on unit and element strided
operations. Thus it is `UNDEFINED` when the elwidth is smaller than
the memory operation width. Examples include `sv.lw/sw=16/els` which
requests (overlapping) 4-byte memory reads offset from
each other at 2-byte intervals. Store likewise is also `UNDEFINED`
where the dest elwidth override is less than the operation width.

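A worked illustration of the `sv.lw/sw=16/els` example above (names are
illustrative): each element advances the address by only 2 bytes while
the operation still reads 4 bytes, so successive reads overlap.

    base = 0x1000
    op_width = 4        # lw reads 4 bytes
    el_spacing = 2      # sw=16 override: 2-byte spacing between elements
    for i in range(4):
        EA = base + i * el_spacing
        print(hex(EA), "..", hex(EA + op_width - 1))
    # 0x1000..0x1003, 0x1002..0x1005, 0x1004..0x1007, 0x1006..0x1009
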
Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector
capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
        for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

            if not svctx.unit/el-strided:
                # strange vector mode, compute 64 bit address which is
                # not polymorphic! elwidth hardcoded to 64 here
                srcbase = get_polymorphed_reg(RA, 64, i)
            else:
                # unit / element stride mode, compute 64 bit address
                srcbase = get_polymorphed_reg(RA, 64, 0)
                # adjust for unit/el-stride
                srcbase += ....

            # takes care of (merges) processor LE/BE and ld/ldbrx
            bytereverse = brev XNOR MSR.LE

            # read the underlying memory
            memread <= mem[srcbase + imm_offs];

            # optionally performs byteswap at op width
            if (bytereverse):
                memread = byteswap(memread, op_width)

            # check saturation.
            if svctx.saturation_mode:
                ... saturation adjustment...
            else:
                # truncate/extend to over-ridden source width.
                memread = adjust_wid(memread, op_width, svctx.src_elwidth)

            # takes care of inserting memory-read (now correctly byteswapped)
            # into regfile underlying LE-defined order, into the right place
            # within the NEON-like register, respecting destination element
            # bitwidth, and the element index (j)
            set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

            # increments both src and dest element indices (no predication here)
            i++;
            j++;

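A minimal sketch of what the two register-access helpers above might do,
assuming (purely for illustration) that the register file is modelled as
a flat bytearray with 64-bit registers packed consecutively and elements
packed LSB-first; the real packing order is defined by SVP64 itself:

    # regfile is a bytearray; elwidth is in bits; elidx is the element index
    def get_polymorphed_reg(regfile, reg, elwidth, elidx):
        nbytes = elwidth // 8
        start = reg * 8 + elidx * nbytes
        return int.from_bytes(regfile[start:start + nbytes], "little")

    def set_polymorphed_reg(regfile, reg, elwidth, elidx, value):
        nbytes = elwidth // 8
        start = reg * 8 + elidx * nbytes
        regfile[start:start + nbytes] = value.to_bytes(nbytes, "little")
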
# Remapped LD/ST

In the [[sv/remap]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs. The usual interest in such re-mapping
is for example in separating out 24-bit RGB channel data into separate
contiguous registers. NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.

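A sketch of the *effect* being described (only the index arithmetic is
modelled here, not the actual REMAP configuration; `mem`, `base` and
`npixels` are illustrative): interleaved RGB bytes are gathered into
three separate, contiguous destination vectors, exactly as the NEON
structured load in the diagram does.

    # de-interleave RGBRGBRGB... into three contiguous vectors
    def rgb_deinterleave(mem, base, npixels):
        r = [mem[base + 3 * i + 0] for i in range(npixels)]
        g = [mem[base + 3 * i + 1] for i in range(npixels)]
        b = [mem[base + 3 * i + 2] for i in range(npixels)]
        return r, g, b
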
Also LD/ST with immediate has a Pack/Unpack option similar to VSX
'vpack' and 'vunpack'.

# notes from lxo

This section covers assembly notation for the immediate and indexed LD/ST.
The summary is that, in immediate mode for LD, it is not obvious that if
the destination register is Vectorised `RT.v` but the source `imm(RA)` is
scalar, the memory being read is *still a vector load*, known as "unit or
element strided".

This anomaly is made clear with the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being
implicitly identical to the above, is prohibited and is a syntax error:

    sv.ld RT.v, imm(RA)

Notes taken from IRC conversation

    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other memory-operand constraints), append .v to it and you're done addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the address is a sum of registers)

permutations of vector selection, to identify above asm-syntax:

    imm(RA) RT.v RA.v  nonstrided
       sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
       mem@    0+r#2   offs+(r#2+1)   offs+(r#2+2)
       destreg r#      r#+1           r#+2
    imm(RA) RT.s RA.v  nonstrided
       sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
       (dest r# is scalar) -> VSELECT mode
    imm(RA) RT.v RA.s  fixed stride: unit or element
       sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
       mem@r#2  +0    +1    +2
       destreg  r#    r#+1  r#+2
       sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
       mem@r#2  +0    ...   +offs  ...  +offs*2
       destreg  r#    r#+1  r#+2
    imm(RA) RT.s RA.s  not vectorised
       sv.ld r#, ofst(r#2)

indexed mode:

    RA,RB RT.v RA.v RB.v
       sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
    RA,RB RT.v RA.s RB.v
       sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
    RA,RB RT.v RA.v RB.s
       sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
    RA,RB RT.v RA.s RB.s
       sv.ldx r#.v, r#2, r#3 -> VSPLAT mode
    RA,RB RT.s RA.v RB.v
    RA,RB RT.s RA.s RB.v
    RA,RB RT.s RA.v RB.s
    RA,RB RT.s RA.s RB.s  not vectorised