openpower/sv/ldst.mdwn

   1 [[!tag standards]]
   2
   3 # SV Load and Store
   4
   5 Links:
   6
   7 * <https://bugs.libre-soc.org/show_bug.cgi?id=561>
   8 * <https://bugs.libre-soc.org/show_bug.cgi?id=572>
   9 * <https://bugs.libre-soc.org/show_bug.cgi?id=571>
  10 * <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
  11 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
  12 * [[ldst/discussion]]
  13
  14 # Rationale
  15
  16 All Vector ISAs dating back fifty years have extensive and comprehensive
  17 Load and Store operations that go far beyond the capabilities of Scalar
  18 RISC or CISC processors, yet at their heart on an individual element
  19 basis may be found to be no different from RISC Scalar equivalents.
  20
The resource savings from Vector LD/ST are significant and stem from
the fact that one single instruction can trigger a dozen (or, in some
microarchitectures such as Cray or NEC SX Aurora, hundreds of)
element-level Memory accesses.
  24
  25 Additionally, and simply: if the Arithmetic side of an ISA supports
  26 Vector Operations, then in order to keep the ALUs 100% occupied the
  27 Memory infrastructure (and the ISA itself) correspondingly needs Vector
  28 Memory Operations as well.
  29
  30 Vectorised Load and Store also presents an extra dimension (literally)
  31 which creates scenarios unique to Vector applications, that a Scalar
  32 (and even a SIMD) ISA simply never encounters.  SVP64 endeavours to
  33 add the modes typically found in *all* Scalable Vector ISAs,
  34 without changing the behaviour of the underlying Base
  35 (Scalar) v3.0B operations in any way.
  36
  37 # Modes overview
  38
Vectorisation of Load and Store requires the creation, from scalar operations,
of a number of different modes:
  41
  42 * **fixed aka "unit" stride** - contiguous sequence with no gaps
  43 * **element strided** - sequential but regularly offset, with gaps
  44 * **vector indexed** - vector of base addresses and vector of offsets
  45 * **Speculative fail-first** - where it makes sense to do so
  46 * **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.
  47
*Despite being constructed from Scalar LD/ST, none of these Modes
exist or make sense in any Scalar ISA. They **only** exist in Vector ISAs.*

Also included in SVP64 LD/ST are both signed and unsigned Saturation,
as well as Element-width overrides and Twin-Predication.
  53
  54 Note also that Indexed [[sv/remap]] mode may be applied to both
  55 v3.0 LD/ST Immediate instructions *and* v3.0 LD/ST Indexed instructions.
  56 LD/ST-Indexed should not be conflated with Indexed REMAP mode: clarification
  57 is provided below.
  58
  59 # Determining the LD/ST Modes
  60
A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk.  Fail-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace, whereas
strided fail-first (by creating contiguous sequential LDs) does not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either (or is, at best, a quirk).
Realistically an alternative table meaning is needed for [[sv/svp64]] mode.
The following modes make sense:
  70
  71 * saturation
  72 * predicate-result (mostly for cache-inhibited LD/ST)
  73 * normal
  74 * fail-first (where Vector Indexed is banned)
  75 * Signed Effective Address computation (Vector Indexed only)
  76 * Pack/Unpack (on LD/ST immediate operations only)
  77
  78 More than that however it is necessary to fit the usual Vector ISA
  79 capabilities onto both Power ISA LD/ST with immediate and to
  80 LD/ST Indexed. They present subtly different Mode tables.
  81
  82 Fields used in tables below:
  83
* **sz / dz**: if predication is enabled, these put zeros into the dest (or use zero as the src, in the case of twin predication) when the predicate bit is zero. Otherwise the element is ignored or skipped, depending on context.
* **zz**: both sz and dz are set equal to this flag.
* **inv CR bit**: just as in branches (BO), these bits allow testing of a CR bit and whether it is set (inv=0) or unset (inv=1).
* **N**: selects unsigned (N=0) or signed (N=1) saturation.
* **RC1**: as if Rc=1, stores CRs *but not the result*.
* **SEA**: Signed Effective Address. If enabled, performs sign-extension on
  registers that have been reduced due to elwidth overrides.
  91
  92 **LD/ST immediate**
  93
  94 The table for [[sv/svp64]] for `immed(RA)` is:
  95
| 0-1 |  2  |  3   4  | description               |
| --- | --- | ------- | ------------------------- |
| 00  | 0   | zz els  | normal mode               |
| 00  | 1   | zz els  | Structured Pack/Unpack    |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | els RC1 | Rc=0: ffirst z/nonz       |
| 10  | N   | zz els  | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | els RC1 | Rc=0: pred-result z/nonz  |
 105
 106 The `els` bit is only relevant when `RA.isvec` is clear: this indicates
 107 whether stride is unit or element:
 108
 109     if RA.isvec:
 110         svctx.ldstmode = indexed
 111     elif els == 0:
 112         svctx.ldstmode = unitstride
 113     elif immediate != 0:
 114         svctx.ldstmode = elementstride
 115
 116 An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
 117 in effect the multiplication of the immediate-offset by zero results
 118 in reading from the exact same memory location, *even with a Vector
 119 register*. (Normally this type of behaviour is reserved for the
 120 mapreduce modes)
 121
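The selection logic above, including the zero-immediate case, can be sketched in Python. This is only an illustration: the function name `ldst_imm_mode` and the string mode names (in particular `"vsplat"`) are assumptions made for this sketch, not names defined by the specification.

```python
def ldst_imm_mode(ra_isvec: bool, els: int, immediate: int) -> str:
    """Hypothetical helper: classify an SVP64 LD/ST-immediate operation."""
    if ra_isvec:
        return "indexed"        # vector of base addresses: els is not relevant
    if els == 0:
        return "unitstride"     # contiguous elements
    if immediate != 0:
        return "elementstride"  # regularly spaced, immediate is the stride
    return "vsplat"             # els set, immediate zero: same address each time
```

Note that the final case is the one that the pseudocode leaves unassigned: no stride multiplier is applied, so every element accesses the same location.
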
 122 For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
 123 just the once and be copied, rather than hitting the Data Cache
 124 multiple times with the same memory read at the same location.
 125 The benefit of Cache-inhibited LD-splats is that it allows
 126 for memory-mapped peripherals to have multiple
 127 data values read in quick succession and stored in sequentially
 128 numbered registers (but, see Note below).
 129
 130 For non-cache-inhibited ST from a vector source onto a scalar
 131 destination: with the Vector
 132 loop effectively creating multiple memory writes to the same location,
 133 we can deduce that the last of these will be the "successful" one. Thus,
 134 implementations are free and clear to optimise out the overwriting STs,
 135 leaving just the last one as the "winner".  Bear in mind that predicate
 136 masks will skip some elements (in source non-zeroing mode).
 137 Cache-inhibited ST operations on the other hand **MUST** write out
 138 a Vector source multiple successive times to the exact same Scalar
 139 destination. Just like Cache-inhibited LDs, multiple values may be
 140 written out in quick succession to a memory-mapped peripheral from
 141 sequentially-numbered registers.
 142
 143 Note that any memory location may be Cache-inhibited
 144 (Power ISA v.1, Book III, 1.6.1, p1033)
 145
 146 **LD/ST Indexed**
 147
 148 The modes for `RA+RB` indexed version are slightly different:
 149
| 0-1 |  2  |  3   4  | description                  |
| --- | --- | ------- | ---------------------------- |
| 00  | SEA | dz  sz  | normal mode                  |
| 01  | SEA | dz  sz  | Strided (scalar only source) |
| 10  | N   | dz  sz  | sat mode: N=0/1 u/s          |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel     |
| 11  | inv | zz  RC1 | Rc=0: pred-result z/nonz     |
 157
Vector Indexed Strided Mode is qualified as follows:

    if mode == 0b01 and !RA.isvec and !RB.isvec:
        svctx.ldstmode = elementstride
 162
 163 A summary of the effect of Vectorisation of src or dest:
 164
 165      imm(RA)  RT.v   RA.v   no stride allowed
 166      imm(RA)  RT.s   RA.v   no stride allowed
 167      imm(RA)  RT.v   RA.s   stride-select allowed
 168      imm(RA)  RT.s   RA.s   not vectorised
 169      RA,RB    RT.v  {RA|RB}.v Standard Indexed
 170      RA,RB    RT.s  {RA|RB}.v Indexed but single LD (no VSPLAT)
 171      RA,RB    RT.v  {RA&RB}.s VSPLAT possible. stride selectable
 172      RA,RB    RT.s  {RA&RB}.s not vectorised (scalar identity)
 173
 174 Signed Effective Address computation is only relevant for
 175 Vector Indexed Mode, when elwidth overrides are applied.
 176 The source override applies to RB, and before adding to
 177 RA in order to calculate the Effective Address, if SEA is
 178 set RB is sign-extended from elwidth bits to the full 64
 179 bits.  For other Modes (ffirst, saturate),
 180 all EA computation with elwidth overrides is unsigned.
 181
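As a minimal sketch of the SEA behaviour just described, the following Python models the Effective Address computation for one element of a Vector Indexed LD/ST. The helper names `sext` and `indexed_ea` and their parameters are assumptions for illustration; widths are in bits and addresses wrap at 64 bits.

```python
MASK64 = (1 << 64) - 1

def sext(value: int, from_bits: int, to_bits: int = 64) -> int:
    """Sign-extend the low from_bits of value to to_bits."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):
        value -= 1 << from_bits
    return value & ((1 << to_bits) - 1)

def indexed_ea(ra: int, rb_elem: int, src_elwidth: int, sea: bool) -> int:
    """EA for one element: RA is full width, RB uses the source elwidth."""
    offs = rb_elem & ((1 << src_elwidth) - 1)
    if sea:
        offs = sext(offs, src_elwidth, 64)  # treat the RB element as signed
    return (ra + offs) & MASK64
```

With a 16-bit source elwidth, an RB element of `0xFFFE` is an offset of -2 when SEA is set, and an offset of 65534 when it is not.
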
Note that cache-inhibited LD/ST, when VSPLAT is activated, will perform
**multiple** LD/ST operations, sequentially.  Even with a scalar src, a
Cache-inhibited LD will read the same memory location *multiple times*,
storing the result in successive Vector destination registers.  This is
because the cache-inhibit instructions are typically used to read and
write memory-mapped peripherals.
If a genuine cache-inhibited LD-VSPLAT is required then a single *scalar*
cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv,
copying the one *scalar* value into multiple register destinations.
 187
Note also that cache-inhibited VSPLAT with Predicate-result is possible.
This allows, for example, a massive batch of memory-mapped
peripheral reads to be issued, stopping at the terminating NULL character
and truncating VL to that point. No branch is needed to issue that large
burst of LDs, which may be valuable in Embedded scenarios.
 193
 194 # Vectorisation of Scalar Power ISA v3.0B
 195
 196 OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
 197 [[isa/fixedstore]] pseudocode to be of the form:
 198
 199     lbux RT, RA, RB
 200     EA <- (RA) + (RB)
 201     RT <- MEM(EA)
 202
 203 and for immediate variants:
 204
    lb RT,D(RA)
    EA <- (RA) + EXTS(D)
    RT <- MEM(EA)
 208
 209 Thus in the first example, the source registers may each be independently
 210 marked as scalar or vector, and likewise the destination; in the second
 211 example only the one source and one dest may be marked as scalar or
 212 vector.
 213
 214 Thus we can see that Vector Indexed may be covered, and, as demonstrated
 215 with the pseudocode below, the immediate can be used to give unit stride or element stride.  With there being no way to tell which from the OpenPOWER v3.0B Scalar opcode alone, the choice is provided instead by the SV Context.
 216
    # LD not VLD!  format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip non-predicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else:
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;
 256
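The address generation in the pseudocode above can be exercised with a small runnable Python model. This is a simplified sketch under stated assumptions: RT is always a vector, predication and RA-update are left out, memory is a `bytes` object read in little-endian order, and the function name `sv_ld_imm` and the string mode names are hypothetical.

```python
def sv_ld_imm(mem, ireg, rt, ra, immed, op_width, vl, mode, ra_isvec=False):
    """Simplified model of a Vectorised LD-immediate with a vector RT.

    Returns the list of Effective Addresses that were accessed.
    """
    eas = []
    for i in range(vl):
        if mode == "elementstride":
            ea = ireg[ra] + i * immed             # immediate is the stride
        elif mode == "unitstride":
            ea = ireg[ra] + immed + i * op_width  # contiguous elements
        elif ra_isvec:
            ea = ireg[ra + i] + immed             # vector of base addresses
        else:
            ea = ireg[ra] + immed                 # same address: VSPLAT
        # little-endian byte order is an assumption of this sketch
        ireg[rt + i] = int.from_bytes(mem[ea:ea + op_width], "little")
        eas.append(ea)
    return eas

mem = bytes(range(64))   # byte at address n holds the value n
ireg = [0] * 32
ireg[1] = 8              # RA: scalar base address
unit = sv_ld_imm(mem, ireg, rt=4, ra=1, immed=4, op_width=2, vl=3,
                 mode="unitstride")
elem = sv_ld_imm(mem, ireg, rt=8, ra=1, immed=4, op_width=2, vl=3,
                 mode="elementstride")
```

Here `unit` covers addresses 12, 14 and 16 (the immediate is an offset and elements are contiguous), whereas `elem` covers addresses 8, 12 and 16 (the immediate is the distance between elements).
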
 257 Indexed LD is:
 258
    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL;):
        # skip non-predicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
            EA = ireg[RA] + ireg[RB]*j   # register-strided
        else:
            EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if svctx.ldstmode != elementstride:
            if (!RA.isvec && !RB.isvec)
                break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;
 284
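The difference between register-strided and indexed addressing in the Indexed LD pseudocode can be seen from the Effective Addresses alone. The following Python is a sketch that assumes a vector RT and no predication; the function name `ldx_eas` is hypothetical.

```python
def ldx_eas(ireg, ra, rb, vl, elementstride, ra_isvec=False, rb_isvec=False):
    """List the Effective Addresses generated by a Vectorised Indexed LD."""
    eas = []
    for n in range(vl):
        if elementstride:
            eas.append(ireg[ra] + ireg[rb] * n)      # register-strided
        else:
            i = n if ra_isvec else 0
            k = n if rb_isvec else 0
            eas.append(ireg[ra + i] + ireg[rb + k])  # indexed address
            if not ra_isvec and not rb_isvec:
                break                                # scalar-scalar: one LD
    return eas

ireg = [0] * 32
ireg[1] = 0x1000             # RA: scalar base
ireg[2] = 16                 # RB as a scalar stride
ireg[10:13] = [4, 40, 8]     # RB as a vector of offsets
strided = ldx_eas(ireg, ra=1, rb=2, vl=3, elementstride=True)
indexed = ldx_eas(ireg, ra=1, rb=10, vl=3, elementstride=False, rb_isvec=True)
```

In this example `strided` is a regular sequence 16 bytes apart, while `indexed` follows the arbitrary offsets held in the RB vector.
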
Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update" mode
(`ldux`) to be effectively a *completely different* register from
RA-as-a-source.  This is because there is room in svp64 to extend RA-as-src
as well as RA-as-dest, both independently as scalar or vector *and*
independently extending their range.
 286
 287 # LD/ST Indexed vs Indexed REMAP
 288
 289 Unfortunately the word "Indexed" is used twice in completely different
 290 contexts, potentially causing confusion.
 291
* Instructions of the form `ld RT,RA,RB` have existed in the Power ISA since
  its creation: these are called "LD/ST Indexed" instructions and their
  name and meaning are well-established.
 295 * There now exists, in Simple-V, a [[sv/remap]] mode called "Indexed"
 296   Mode that can be applied to *any* instruction **including those
 297   named LD/ST Indexed**.
 298
 299 Whilst it may be costly in terms of register reads to allow REMAP
 300 Indexed Mode to be applied to any Vectorised LD/ST Indexed operation such as
 301 `sv.ld *RT,RA,*RB`, or even viewed as redundant, firstly the strict
 302 application of the RISC Paradigm that Simple-V follows makes it awkward
 303 to consider *preventing* the application of Indexed REMAP to such
 304 operations, and secondly they are not actually the same at all.
 305
 306 Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB`
 307 effectively performs an *in-place* re-ordering of the offsets, RB.
 308 To achieve the same effect without Indexed REMAP would require taking
 309 a *copy* of the Vector of offsets starting at RB, manually explicitly
 310 reordering them, and finally using the copy of re-ordered offsets in
 311 a non-REMAP'ed `sv.ld`.  Pseudocode showing what actually occurs:
 312
 313     # sv.ld *RT,RA,*RB with Index REMAP applied to RB
 314     for i in 0..VL-1:
 315         rb_idx = indexed_remap(i) # normally rb_idx = i
 316         EA = GPR(RA) + GPR(RB+rb_idx)
 317         GPR(RT+i) = MEM(EA, 8)
 318
 319 Thus it can be seen that the use of Indexed REMAP saves copying
 320 and manual reordering of the Vector of RB offsets.
 321
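The saving can be illustrated with a small Python comparison of the two approaches. This is a sketch only: memory is modelled as a dictionary from address to value, the register numbers are arbitrary, and the function names and the `remap` list (standing in for `indexed_remap(i)`) are assumptions.

```python
def ld_indexed_remap(mem, gpr, rt, ra, rb, vl, remap):
    """sv.ld *RT,RA,*RB with Indexed REMAP applied to RB (no copy needed)."""
    for i in range(vl):
        rb_idx = remap[i]                  # without REMAP: rb_idx = i
        ea = gpr[ra] + gpr[rb + rb_idx]
        gpr[rt + i] = mem[ea]

def ld_indexed_copy(mem, gpr, rt, ra, rb, vl, remap, tmp):
    """Same effect without REMAP: copy and reorder the offsets first."""
    for i in range(vl):
        gpr[tmp + i] = gpr[rb + remap[i]]  # explicit reordered copy
    for i in range(vl):
        gpr[rt + i] = mem[gpr[ra] + gpr[tmp + i]]

mem = {0x100 + 8 * n: 1000 + n for n in range(8)}  # address -> value

def fresh_gpr():
    gpr = [0] * 64
    gpr[3] = 0x100                         # RA: scalar base address
    gpr[16:20] = [0, 8, 16, 24]            # RB: vector of offsets
    return gpr

a, b = fresh_gpr(), fresh_gpr()
ld_indexed_remap(mem, a, rt=8, ra=3, rb=16, vl=4, remap=[3, 1, 0, 2])
ld_indexed_copy(mem, b, rt=8, ra=3, rb=16, vl=4, remap=[3, 1, 0, 2], tmp=32)
```

Both calls leave the same values in the destination registers, but the second needs `VL` extra registers and an extra pass over the offsets.
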
 322 # LD/ST ffirst
 323
LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
is not active) as an
ordinary one, with all behaviour with respect to Interrupts, Exceptions,
Page Faults and Memory Management being identical in every regard to Scalar
v3.0 Power ISA LD/ST. However for elements 1
and above, if an exception would occur, then VL is **truncated** to the
previous element: the exception is **not** then raised because the
LD/ST that would otherwise have caused an exception is *required* to be
cancelled. Additionally an implementor may choose to truncate VL for
any arbitrary reason *except at the very first element*.
 333
ffirst LD/ST to multiple pages via a Vectorised Index base is considered a
security risk due to the abuse of probing multiple pages in rapid succession
and getting speculative feedback on which pages would fail.  Therefore Vector
Indexed LD/ST is prohibited entirely, and the Mode bit is instead used for
element-strided LD/ST.  See <https://bugs.libre-soc.org/show_bug.cgi?id=561>
 335
 336     for(i = 0; i < VL; i++)
 337         reg[rt + i] = mem[reg[ra] + i * reg[rb]];
 338
 339 High security implementations where any kind of speculative probing
 340 of memory pages is considered a risk should take advantage of the fact that
 341 implementations may truncate VL at any point, without requiring software
 342 to be rewritten and made non-portable. Such implementations may choose
 343 to *always* set VL=1 which will have the effect of terminating any
 344 speculative probing (and also adversely affect performance), but will
 345 at least not require applications to be rewritten.
 346
Low-performance simpler hardware implementations may also
choose to (always) set VL=1 as the bare minimum compliant implementation of
LD/ST Fail-First. It is however critically important to remember that
the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.
**MUST** raise exceptions exactly like an ordinary LD/ST.
 352
For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value for
any implementation-specific reason. For example: it is perfectly reasonable
for implementations to alter VL when ffirst LD or ST operations are
initiated on a nonaligned boundary, such that within a loop the subsequent
iteration of that loop begins the following ffirst LD/ST operations on an
aligned boundary such as the beginning of a cache line, or beginning of a
Virtual Memory page. Likewise, VL may be truncated to reduce workloads or
to balance resources.
 356
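The fail-first rule (first element behaves as a scalar LD, later faults truncate VL instead of raising) can be modelled in a few lines of Python. This is a sketch under stated assumptions: `PageFault` stands in for any exception a scalar LD could raise, `load` is a hypothetical memory callback, and only element-strided addressing is shown.

```python
class PageFault(Exception):
    """Stand-in for any exception a scalar LD could raise."""

def ffirst_ld(load, base, stride, vl):
    """Return (loaded values, new VL) for an element-strided ffirst LD."""
    results = []
    for i in range(vl):
        try:
            results.append(load(base + i * stride))
        except PageFault:
            if i == 0:
                raise              # first element: behaves as a scalar LD
            return results, i      # cancel this LD and truncate VL
    return results, vl

def load(addr):
    # hypothetical memory: everything at or above 0x3000 is unmapped
    if addr >= 0x3000:
        raise PageFault(hex(addr))
    return addr & 0xFF

values, new_vl = ffirst_ld(load, base=0x2FF0, stride=8, vl=4)
```

In this run the third element crosses into the unmapped region, so two elements are loaded and VL becomes 2 with no exception raised. A compliant implementation could equally return after the first element for any reason of its own.
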
 357 Vertical-First Mode is slightly strange in that only one element
 358 at a time is ever executed anyway.  Given that programmers may
 359 legitimately choose to alter srcstep and dststep in non-sequential
 360 order as part of explicit loops, it is neither possible nor
 361 safe to make speculative assumptions about future LD/STs.
 362 Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`.
 363 This is very different from Arithmetic (Data-dependent) FFirst
 364 where Vertical-First Mode is fully deterministic, not speculative.
 365
 366 # LD/ST Pack/Unpack Mode
 367
 368 As described in [[sv/normal]],
 369 Structured Pack/Unpack is similar to VSX `vpack` and `vunpack` except
 370 generalised not only to a Schedule to be applied to any operation but
 371 also extended to vec2/3/4.
 372
Just as in [[sv/normal]] operations,
setting this mode changes the meaning of bits 4-5 in `RM` from being
`ELWIDTH` to a pair of Pack/Unpack bits.  Thus it is not possible
to separately override source and destination elwidths at the same
time as using Pack/Unpack: the `SRC_ELWIDTH` bits (6-7) must be used as
both source and destination elwidth.
 379
 380 Pack/Unpack only applies to LD/ST-immediate operations.
 381 See [[sv/svp64/appendix]] for details on how Pack/Unpack
 382 is implemented.
 383
 384 # LOAD/STORE Elwidths <a name="elwidth"></a>
 385
 386 Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
 387 provides a width for the operation (lb, lh, lw, ld).  Only `extsb` and
 388 others like it provide an explicit operation width.  There are therefore
 389 *three* widths involved:
 390
 391 * operation width (lb=8, lh=16, lw=32, ld=64)
 392 * src element width override (8/16/32/default)
 393 * destination element width override (8/16/32/default)
 394
 395 Some care is therefore needed to express and make clear the transformations,
 396 which are expressly in this order:
 397
 398 * Calculate the Effective Address from RA at full width
 399   but (on Indexed Load) allow srcwidth overrides on RB
 400 * Load at the operation width (lb/lh/lw/ld) as usual
 401 * byte-reversal as usual
 402 * Non-saturated mode:
 403    - zero-extension or truncation from operation width to dest elwidth
 404    - place result in destination at dest elwidth
 405 * Saturated mode:
 406    - Sign-extension or truncation from operation width to dest width
 407    - signed/unsigned saturation down to dest elwidth
 408
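The last two steps of the ordering above can be sketched in Python. The names `adjust_wid` and `clamp` follow the pseudocode later in this section, but their bodies here are an assumption, as is the extra `signed` parameter (corresponding to the `N` bit). All widths are in bits.

```python
def adjust_wid(value: int, op_width: int, dest_elwidth: int) -> int:
    """Non-saturated: zero-extend or truncate from op width to dest elwidth."""
    value &= (1 << op_width) - 1              # value as loaded at op width
    return value & ((1 << dest_elwidth) - 1)  # zero-extension is implicit

def clamp(value: int, op_width: int, dest_elwidth: int, signed: bool) -> int:
    """Saturated: clamp the loaded value into the dest elwidth range."""
    value &= (1 << op_width) - 1
    if signed:
        if value >> (op_width - 1):           # interpret as a signed number
            value -= 1 << op_width
        lo = -(1 << (dest_elwidth - 1))
        hi = (1 << (dest_elwidth - 1)) - 1
    else:
        lo, hi = 0, (1 << dest_elwidth) - 1
    value = max(lo, min(hi, value))
    return value & ((1 << dest_elwidth) - 1)  # bit pattern at dest elwidth
```

For a 16-bit load of `0x1234` into an 8-bit destination element, plain truncation keeps `0x34`, unsigned saturation gives `0xFF` and signed saturation gives `0x7F`.
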
 409 In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
 410 is treated effectively as completely separate and distinct from SV
 411 augmentation.  This is primarily down to quirks surrounding LE/BE and
 412 byte-reversal in OpenPOWER.
 413
It is rather unfortunately possible to request an elwidth override
on the memory side which does not mesh with the overridden operation width:
these result in `UNDEFINED` behaviour.  The reason is that attempting a
64-bit `sv.ld` operation with a source elwidth override of 8/16/32 would
result in overlapping memory requests, particularly on unit and element
strided operations.  Thus it is `UNDEFINED` when the elwidth is smaller than
the memory operation width. Examples include `sv.lw/sw=16/els` which
requests (overlapping) 4-byte memory reads offset from
each other at 2-byte intervals.  Store likewise is also `UNDEFINED`
where the dest elwidth override is less than the operation width.
 426
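A quick way to see why such combinations are problematic is to list the byte ranges that successive elements would request. The Python below is an illustrative sketch only: it assumes, as in the `sv.lw/sw=16/els` example, that successive requests are spaced at the elwidth, and the function name `overlapping_requests` is hypothetical.

```python
def overlapping_requests(op_width_bytes: int, elwidth_bytes: int, vl: int) -> bool:
    """True if consecutive element requests would touch the same bytes."""
    spans = [(i * elwidth_bytes, i * elwidth_bytes + op_width_bytes)
             for i in range(vl)]
    # spans are in ascending order, so only neighbours need checking
    return any(spans[i][1] > spans[i + 1][0] for i in range(len(spans) - 1))
```

A 4-byte operation at 2-byte intervals overlaps, whereas a 4-byte operation at 4-byte intervals does not.
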
 427 Note the following regarding the pseudocode to follow:
 428
 429 * `scalar identity behaviour` SV Context parameter conditions turn this
 430   into a straight absolute fully-compliant Scalar v3.0B LD operation
 431 * `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
 432   rather than `ld`)
 433 * `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
 434   a "normal" part of Scalar v3.0B LD
 435 * `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
 436   as a "normal" part of Scalar v3.0B LD
 437 * `svctx` specifies the SV Context and includes VL as well as
 438   source and destination elwidth overrides.
 439
 440 Below is the pseudocode for Unit-Strided LD (which includes Vector capability). Observe in particular that RA, as the base address in
 441 both Immediate and Indexed LD/ST,
 442 does not have element-width overriding applied to it.
 443
 444 Note that predication, predication-zeroing,
 445 and other modes except saturation have all been removed,
 446 for clarity and simplicity:
 447
    # LD not VLD!
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):
        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # read the underlying memory
        memread <= MEM(srcbase + imm_offs, op_width)

        # check saturation.
        if svctx.saturation_mode:
            # ... saturation adjustment...
            memread = clamp(memread, op_width, svctx.dest_elwidth)
        else:
            # truncate/extend to over-ridden dest width.
            memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
 482
Note above that the source elwidth is *not used at all* in LD-immediate.
 484
 485 For LD/Indexed, the key is that in the calculation of the Effective Address,
 486 RA has no elwidth override but RB does.  Pseudocode below is simplified
 487 for clarity: predication and all modes except saturation are removed:
 488
    # LD not VLD! ld*rx if brev else ld*
    function op_ld(RT, RA, RB, op_width, svctx, brev)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):
        if not svctx.el-strided:
            # RA not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # element stride mode, again RA not polymorphic
            srcbase = get_polymorphed_reg(RA, 64, 0)
        # RB *is* polymorphic
        offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
        # sign-extend
        if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= MEM(srcbase + offs, op_width)

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        if svctx.saturation_mode:
            # ... saturation adjustment...
            memread = clamp(memread, op_width, svctx.dest_elwidth)
        else:
            # truncate/extend to over-ridden dest width.
            memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
 529
 530 # Remapped LD/ST
 531
 532 In the [[sv/remap]] page the concept of "Remapping" is described.
 533 Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
 534 a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
 535 elements worth of LDs or STs.  The usual interest in such re-mapping
 536 is for example in separating out 24-bit RGB channel data into separate
 537 contiguous registers.  NEON covers this as shown in the diagram below:
 538
 539 <img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >
 540
 541 Remap easily covers this capability, and with dest
 542 elwidth overrides and saturation may do so with built-in conversion that
 543 would normally require additional width-extension, sign-extension and
 544 min/max Vectorised instructions as post-processing stages.
 545
 546 Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
 547 because the generic abstracted concept of "Remapping", when applied to
 548 LD/ST, will give that same capability, with far more flexibility.
 549
Also LD/ST with immediate has a Pack/Unpack option similar to VSX
`vpack` and `vunpack`.
 552