\newpage{}
-# SV Load and Store
-
-**Rationale**
-
-All Vector ISAs dating back fifty years have extensive and comprehensive
-Load and Store operations that go far beyond the capabilities of Scalar
-RISC and most CISC processors, yet at their heart, on an individual
-element basis, they may be found to be no different from RISC Scalar
-equivalents.
-
-The resource savings from Vector LD/ST are significant and stem
-from the fact that one single instruction can trigger a dozen (or, in
-some microarchitectures such as Cray or NEC SX Aurora, hundreds of)
-element-level Memory accesses.
-
-Additionally, and simply: if the Arithmetic side of an ISA supports
-Vector Operations, then in order to keep the ALUs 100% occupied the
-Memory infrastructure (and the ISA itself) correspondingly needs Vector
-Memory Operations as well.
-
-Vectorised Load and Store also presents an extra dimension (literally)
-which creates scenarios unique to Vector applications, that a Scalar (and
-even a SIMD) ISA simply never encounters. SVP64 endeavours to add the
-modes typically found in *all* Scalable Vector ISAs, without changing the
-behaviour of the underlying Base (Scalar) v3.0B operations in any way.
-(The sole apparent exception is Post-Increment Mode on LD/ST-update
-instructions)
-
-## Modes overview
-
-Vectorisation of Load and Store requires the creation, from scalar
-operations, of a number of different modes:
-
-* **fixed aka "unit" stride** - contiguous sequence with no gaps
-* **element strided** - sequential but regularly offset, with gaps
-* **vector indexed** - vector of base addresses and vector of offsets
-* **Speculative fail-first** - where it makes sense to do so
-* **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.
-
-*Despite being constructed from Scalar LD/ST, none of these Modes exist
-or make sense in any Scalar ISA. They **only** exist in Vector ISAs.*
-
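-The first three of these modes can be illustrated informally by their
-per-element Effective Address patterns (a sketch only: the normative
-pseudocode appears later in this section, and `i` denotes the element
-step):
-
-```
-    # illustrative sketch only, not normative pseudocode
-    # unit stride:     EA = GPR(RA) + imm + i*op_width
-    # element stride:  EA = GPR(RA) + i*imm
-    # vector indexed:  EA = GPR(RA+i) + GPR(RB+i)
-```
-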
-Also included in SVP64 LD/ST is both signed and unsigned Saturation,
-as well as Element-width overrides and Twin-Predication.
-
-Note also that Indexed [[sv/remap]] mode may be applied to both v3.0
-LD/ST Immediate instructions *and* v3.0 LD/ST Indexed instructions.
-LD/ST-Indexed should not be conflated with Indexed REMAP mode:
-clarification is provided below.
-
-**Determining the LD/ST Modes**
-
-A minor complication (caused by the retro-fitting of modern Vector
-features to a Scalar ISA) is that certain features do not exactly make
-sense or are considered a security risk. Fail-first on Vector Indexed
-would allow attackers to probe large numbers of pages from userspace,
-whereas strided fail-first (by creating contiguous sequential LDs) does not.
-
-In addition, reduce mode makes no sense. Realistically we need an
-alternative table definition for [[sv/svp64]] `RM.MODE`. The following
-modes make sense:
-
-* saturation
-* predicate-result (mostly for cache-inhibited LD/ST)
-* simple (no augmentation)
-* fail-first (where Vector Indexed is banned)
-* Signed Effective Address computation (Vector Indexed only)
-
-More than that, however, it is necessary to fit the usual Vector ISA
-capabilities onto both Power ISA LD/ST with immediate and onto LD/ST
-Indexed. They present subtly different Mode tables, which, due to lack
-of space, have the following quirks:
-
-* LD/ST Immediate has no individual control over src/dest zeroing,
- whereas LD/ST Indexed does.
-* LD/ST Indexed has limited zeroing on pred-result, LD/ST Immediate has
- *no* option to select zeroing on pred-result.
-
-## Format and fields
-
-Fields used in tables below:
-
-* **sz / dz** if predication is enabled will put zeros into the dest
-  (or as src in the case of twin pred) when the predicate bit is zero;
-  otherwise the element is ignored or skipped, depending on context.
-* **zz**: both sz and dz are set equal to this flag.
-* **inv CR bit** just as in branches (BO) these bits allow testing of
- a CR bit and whether it is set (inv=0) or unset (inv=1)
-* **N** sets signed/unsigned saturation.
-* **RC1** as if Rc=1, stores CRs *but not the result*
-* **SEA** - Signed Effective Address, if enabled performs sign-extension on
- registers that have been reduced due to elwidth overrides
-* **PI** - post-increment mode (applies to LD/ST with update only).
-  The Effective Address utilised is always just RA, i.e. the computation
-  of EA is stored in RA **after** it is actually used.
-* **LF** - Load/Store Fail or Fault First: for any reason Load or Store Vectors
- may be truncated to (at least) one element, and VL altered to indicate such.
-
-**LD/ST immediate**
-
-The table for [[sv/svp64]] for `immed(RA)` which is `RM.MODE`
-(bits 19:23 of `RM`) is:
-
-| 0-1 | 2 | 3 4 | description |
-| --- | --- |---------|--------------------------- |
-| 00 | 0 | zz els | simple mode |
-| 00 | 1 | PI LF | post-increment and Fault-First |
-| 01 | inv | CR-bit | Rc=1: ffirst CR sel |
-| 01 | inv | els RC1 | Rc=0: ffirst z/nonz |
-| 10 | N | zz els | sat mode: N=0/1 u/s |
-| 11 | inv | CR-bit | Rc=1: pred-result CR sel |
-| 11 | inv | els RC1 | Rc=0: pred-result z/nonz |
-
-The `els` bit is only relevant when `RA.isvec` is clear: this indicates
-whether stride is unit or element:
-
-```
- if RA.isvec:
- svctx.ldstmode = indexed
- elif els == 0:
- svctx.ldstmode = unitstride
- elif immediate != 0:
- svctx.ldstmode = elementstride
-```
-
-An immediate of zero is a safety-valve to allow `LD-VSPLAT`: in effect,
-with a zero immediate-offset, every element reads from the exact same
-memory location, *even with a Vector destination register*. (Normally
-this type of behaviour is reserved for the mapreduce modes.)
-
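-A minimal sketch of the resulting behaviour, for a Vector destination,
-Scalar RA and a zero immediate (illustrative only, using the same notation
-as the pseudocode later in this section):
-
-```
-    # sv.ld *RT, 0(RA) with RT a Vector and RA a Scalar (sketch only)
-    for i in range(VL):
-        GPR(RT+i) = MEM(GPR(RA) + 0, 8)  # every element reads the same EA
-```
-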
-For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur just
-the once and be copied, rather than hitting the Data Cache multiple
-times with the same memory read at the same location. The benefit of
-Cache-inhibited LD-splats is that it allows for memory-mapped peripherals
-to have multiple data values read in quick succession and stored in
-sequentially numbered registers (but see the Note below).
-
-For non-cache-inhibited ST from a vector source onto a scalar destination:
-with the Vector loop effectively creating multiple memory writes to
-the same location, we can deduce that the last of these will be the
-"successful" one. Thus, implementations are free and clear to optimise
-out the overwriting STs, leaving just the last one as the "winner".
-Bear in mind that predicate masks will skip some elements (in source
-non-zeroing mode). Cache-inhibited ST operations on the other hand
-**MUST** write out a Vector source multiple successive times to the exact
-same Scalar destination. Just like Cache-inhibited LDs, multiple values
-may be written out in quick succession to a memory-mapped peripheral
-from sequentially-numbered registers.
-
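-A rough sketch of the non-cache-inhibited Vector-source ST described
-above (illustrative only; predication omitted, and the assembler syntax
-is an assumption):
-
-```
-    # sv.std *RS, 0(RA) with RS a Vector and RA a Scalar (sketch only)
-    for i in range(VL):
-        MEM(GPR(RA), 8) = GPR(RS+i)  # same EA every time: only the last
-                                     # ST is architecturally observable
-```
-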
-Note that any memory location may be Cache-inhibited
-(Power ISA v3.1, Book III, 1.6.1, p1033)
-
-*Programmer's Note: an immediate-mode "VSPLAT" with a Scalar source is
-simply not possible: there are not enough Mode bits. One single Scalar
-Load operation may be used instead, followed by any arithmetic operation
-(including a simple mv) in "Splat" mode.*
-
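-One possible two-instruction sequence achieving the equivalent effect
-(an illustrative sketch: the exact mnemonics and register numbers here
-are assumptions, not taken from the specification):
-
-```
-    ld       r3, 0(r10)    # one single Scalar Load
-    sv.addi  *r16, r3, 0   # Scalar r3 "splatted" into r16..r16+VL-1
-```
-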
-**LD/ST Indexed**
-
-The modes for `RA+RB` indexed version are slightly different
-but are the same `RM.MODE` bits (19:23 of `RM`):
-
-| 0-1 | 2 | 3 4 | description |
-| --- | --- |---------|-------------------------- |
-| 00 | SEA | dz sz | simple mode |
-| 01 | SEA | dz sz | Strided (scalar only source) |
-| 10 | N | dz sz | sat mode: N=0/1 u/s |
-| 11 | inv | CR-bit | Rc=1: pred-result CR sel |
-| 11 | inv | zz RC1 | Rc=0: pred-result z/nonz |
-
-Vector Indexed Strided Mode is qualified as follows:
-
- if mode = 0b01 and !RA.isvec and !RB.isvec:
- svctx.ldstmode = elementstride
-
-A summary of the effect of Vectorisation of src or dest:
-
-```
- imm(RA) RT.v RA.v no stride allowed
- imm(RA) RT.s RA.v no stride allowed
- imm(RA) RT.v RA.s stride-select allowed
- imm(RA) RT.s RA.s not vectorised
- RA,RB RT.v {RA|RB}.v Standard Indexed
- RA,RB RT.s {RA|RB}.v Indexed but single LD (no VSPLAT)
- RA,RB RT.v {RA&RB}.s VSPLAT possible. stride selectable
- RA,RB RT.s {RA&RB}.s not vectorised (scalar identity)
-```
-
-Signed Effective Address computation is only relevant for Vector Indexed
-Mode, when elwidth overrides are applied. The source override applies to
-RB: if SEA is set, RB is sign-extended from elwidth bits to the full
-64 bits before being added to RA to calculate the Effective Address.
-For other Modes (ffirst, saturate), all EA computation with elwidth
-overrides is unsigned.
-
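-A short worked example (values chosen purely for illustration, assuming
-a source elwidth override of 8 on RB):
-
-```
-    # RB element value (8 bits): 0xFF
-    #   SEA=0: offs = 0x0000_0000_0000_00FF  (+255, zero-extended)
-    #   SEA=1: offs = 0xFFFF_FFFF_FFFF_FFFF  (-1, sign-extended)
-    # EA = GPR(RA) + offs   (RA is always read at the full 64-bit width)
-```
-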
-Note that cache-inhibited LD/ST when VSPLAT is activated will perform
-**multiple** LD/ST operations, sequentially. Even with scalar src
-a Cache-inhibited LD will read the same memory location *multiple
-times*, storing the result in successive Vector destination registers.
-This is because the cache-inhibit instructions are typically used to read
-and write memory-mapped peripherals. If a genuine cache-inhibited
-LD-VSPLAT is required then a single *scalar* cache-inhibited LD should
-be performed, followed by a VSPLAT-augmented mv, copying the one *scalar*
-value into multiple register destinations.
-
-Note also that cache-inhibited VSPLAT with Predicate-result is possible.
-This makes it possible, for example, to issue a massive batch of
-memory-mapped peripheral reads, stopping at the first NULL character and
-truncating VL to that point. No branch is needed to issue that large
-burst of LDs, which may be valuable in Embedded scenarios.
-
-## Vectorisation of Scalar Power ISA v3.0B
-
-Scalar Power ISA Load/Store operations may be seen from their
-pseudocode to be of the form:
-
-```
- lbux RT, RA, RB
- EA <- (RA) + (RB)
- RT <- MEM(EA)
-```
-
-and for immediate variants:
-
-```
- lb RT,D(RA)
- EA <- RA + EXTS(D)
- RT <- MEM(EA)
-```
-
-Thus in the first example, the source registers may each be independently
-marked as scalar or vector, and likewise the destination; in the second
-example only the one source and one dest may be marked as scalar or
-vector.
-
-Thus we can see that Vector Indexed may be covered, and, as demonstrated
-with the pseudocode below, the immediate can be used to give unit
-stride or element stride. Since there is no way to tell which, from the
-Power v3.0B Scalar opcode alone, the choice is provided instead by the
-SV Context.
-
-```
- # LD not VLD! format - ldop RT, immed(RA)
- # op_width: lb=1, lh=2, lw=4, ld=8
- op_load(RT, RA, op_width, immed, svctx, RAupdate):
- ps = get_pred_val(FALSE, RA); # predication on src
- pd = get_pred_val(FALSE, RT); # ... AND on dest
- for (i=0, j=0, u=0; i < VL && j < VL;):
-      # skip non-predicated elements
- if (RA.isvec) while (!(ps & 1<<i)) i++;
- if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
- if (RT.isvec) while (!(pd & 1<<j)) j++;
- if postinc:
- offs = 0; # added afterwards
- if RA.isvec: srcbase = ireg[RA+i]
- else srcbase = ireg[RA]
- elif svctx.ldstmode == elementstride:
- # element stride mode
- srcbase = ireg[RA]
- offs = i * immed # j*immed for a ST
- elif svctx.ldstmode == unitstride:
- # unit stride mode
- srcbase = ireg[RA]
- offs = immed + (i * op_width) # j*op_width for ST
- elif RA.isvec:
- # quirky Vector indexed mode but with an immediate
- srcbase = ireg[RA+i]
- offs = immed;
- else
- # standard scalar mode (but predicated)
- # no stride multiplier means VSPLAT mode
- srcbase = ireg[RA]
- offs = immed
-
- # compute EA
- EA = srcbase + offs
- # load from memory
- ireg[RT+j] <= MEM[EA];
- # check post-increment of EA
- if postinc: EA = srcbase + immed;
- # update RA?
- if RAupdate: ireg[RAupdate+u] = EA;
- if (!RT.isvec)
- break # destination scalar, end now
- if (RA.isvec) i++;
- if (RAupdate.isvec) u++;
- if (RT.isvec) j++;
-```
-
-Indexed LD is:
-
-```
- # format: ldop RT, RA, RB
- function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
- ps = get_pred_val(FALSE, RA); # predication on src
- pd = get_pred_val(FALSE, RT); # ... AND on dest
- for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
- # skip nonpredicated RA, RB and RT
- if (RA.isvec) while (!(ps & 1<<i)) i++;
- if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
- if (RB.isvec) while (!(ps & 1<<k)) k++;
- if (RT.isvec) while (!(pd & 1<<j)) j++;
- if svctx.ldstmode == elementstride:
- EA = ireg[RA] + ireg[RB]*j # register-strided
- else
- EA = ireg[RA+i] + ireg[RB+k] # indexed address
- if RAupdate: ireg[RAupdate+u] = EA
- ireg[RT+j] <= MEM[EA];
- if (!RT.isvec)
- break # destination scalar, end immediately
- if (RA.isvec) i++;
- if (RAupdate.isvec) u++;
- if (RB.isvec) k++;
- if (RT.isvec) j++;
-```
-
-Note that Element-Strided Mode uses the Destination Step (j) because,
-with both sources being Scalar as a prerequisite condition for activation
-of Element-Stride Mode, the source step (being Scalar) would never advance.
-
-Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
-mode (`ldux`) to be effectively a *completely different* register from
-RA-as-a-source. This is because there is room in svp64 to extend RA-as-src
-as well as RA-as-dest, both independently as scalar or vector *and*
-independently extending their range.
-
-*Programmer's note: being able to set RA-as-a-source as separate from
-RA-as-a-destination as Scalar is **extremely valuable** once it is
-remembered that Simple-V element operations must be in Program Order,
-especially in loops, for saving on multiple address computations. Care
-does have to be taken however that RA-as-src is not overwritten by
-RA-as-dest unless intentionally desired, especially in element-strided
-Mode.*
-
-## LD/ST Indexed vs Indexed REMAP
-
-Unfortunately the word "Indexed" is used twice in completely different
-contexts, potentially causing confusion.
-
-* Instructions of the form `ld RT,RA,RB` have existed in the Power ISA
-  since its creation: these are called "LD/ST Indexed" instructions and
-  their name and meaning is well-established.
-* There now exists, in Simple-V, a REMAP mode called "Indexed"
- Mode that can be applied to *any* instruction **including those
- named LD/ST Indexed**.
-
-Whilst it may be costly in terms of register reads to allow REMAP Indexed
-Mode to be applied to any Vectorised LD/ST Indexed operation such as
-`sv.ld *RT,RA,*RB`, and whilst such a combination might misleadingly be
-labelled redundant, firstly the strict application of the RISC Paradigm
-that Simple-V follows makes it awkward to consider *preventing* the
-application of Indexed REMAP to such operations, and secondly the two
-are not actually the same at all.
-
-Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB`
-effectively performs an *in-place* re-ordering of the offsets, RB.
-To achieve the same effect without Indexed REMAP would require taking
-a *copy* of the Vector of offsets starting at RB, manually explicitly
-reordering them, and finally using the copy of re-ordered offsets in a
-non-REMAP'ed `sv.ld`. Using non-strided LD as an example, the pseudocode
-below shows what actually occurs (the pseudocode for `indexed_remap`
-may be found in [[sv/remap]]):
-
-```
- # sv.ld *RT,RA,*RB with Index REMAP applied to RB
- for i in 0..VL-1:
- if remap.indexed:
- rb_idx = indexed_remap(i) # remap
- else:
- rb_idx = i # use the index as-is
- EA = GPR(RA) + GPR(RB+rb_idx)
- GPR(RT+i) = MEM(EA, 8)
-```
-
-Thus it can be seen that the use of Indexed REMAP saves copying
-and manual reordering of the Vector of RB offsets.
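-
-For contrast, a sketch of what would be required *without* Indexed REMAP
-(illustrative only: `COPY` is a hypothetical scratch Vector of registers):
-
-```
-    # step 1: explicit re-ordered copy of the RB offsets
-    for i in 0..VL-1:
-        GPR(COPY+i) = GPR(RB + indexed_remap(i))
-    # step 2: plain (non-REMAP'ed) indexed load using the copied offsets
-    for i in 0..VL-1:
-        EA = GPR(RA) + GPR(COPY+i)
-        GPR(RT+i) = MEM(EA, 8)
-```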
-
-## LD/ST ffirst
-
-LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
-is not active) as an ordinary one, with all behaviour with respect to
-Interrupts, Exceptions, Page Faults and Memory Management being identical
-in every regard to Scalar v3.0 Power ISA LD/ST. However for elements
-1 and above, if an exception would occur, then VL is **truncated**
-to the previous element: the exception is **not** then raised because
-the LD/ST that would otherwise have caused an exception is *required*
-to be cancelled. Additionally an implementor may choose to truncate VL
-for any arbitrary reason *except for the very first*.
-
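-A conceptual sketch of that rule (not normative pseudocode: the exception
-mechanism is simplified to a try/except purely for clarity):
-
-```
-    for i in range(VL):
-        try:
-            GPR(RT+i) = MEM(EA(i), 8)
-        except MemoryFault:
-            if i == 0: raise  # element 0 behaves as an ordinary Scalar LD
-            VL = i            # truncate VL: the faulting LD is cancelled
-            break
-```
-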
-ffirst LD/ST to multiple pages via a Vectorised Index base is
-considered a security risk due to the abuse of probing multiple
-pages in rapid succession and getting speculative feedback on which
-pages would fail. Therefore Vector Indexed LD/ST is prohibited
-entirely, and the Mode bit instead used for element-strided LD/ST.
-
-```
- for(i = 0; i < VL; i++)
- reg[rt + i] = mem[reg[ra] + i * reg[rb]];
-```
-
-High security implementations where any kind of speculative probing of
-memory pages is considered a risk should take advantage of the fact
-that implementations may truncate VL at any point, without requiring
-software to be rewritten and made non-portable. Such implementations may
-choose to *always* set VL=1 which will have the effect of terminating
-any speculative probing (and also adversely affect performance), but
-will at least not require applications to be rewritten.
-
-Low-performance simpler hardware implementations may also choose to
-(always) set VL=1 as the bare minimum compliant implementation of LD/ST
-Fail-First. It is however critically important to remember that the first
-element LD/ST **MUST** be treated as an ordinary LD/ST, i.e. **MUST**
-raise exceptions exactly like an ordinary LD/ST.
-
-For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
-for any implementation-specific reason. For example: it is perfectly
-reasonable for implementations to alter VL when ffirst LD or ST operations
-are initiated on a nonaligned boundary, such that within a loop the
-subsequent iteration of that loop begins the following ffirst LD/ST
-operations on an aligned boundary such as the beginning of a cache line,
-or beginning of a Virtual Memory page. Likewise, to reduce workloads or
-balance resources.
-
-Vertical-First Mode is slightly strange in that only one element at a time
-is ever executed anyway. Given that programmers may legitimately choose
-to alter srcstep and dststep in non-sequential order as part of explicit
-loops, it is neither possible nor safe to make speculative assumptions
-about future LD/STs. Therefore, Fail-First LD/ST in Vertical-First is
-`UNDEFINED`. This is very different from Arithmetic (Data-dependent)
-FFirst where Vertical-First Mode is fully deterministic, not speculative.
-
-## Data-Dependent Fail-First (not Fail/Fault-First)
-
-Not to be confused with Fail/Fault First, Data-Fail-First performs an
-additional check on the data, recording the outcome in a Condition Register
-Field; if a test on the CR Field fails then VL is truncated and further
-looping terminates.
-This is precisely the same as Arithmetic Data-Dependent Fail-First, the
-only difference being that the result comes from the LD/ST.
-
-In the case of Store operations there is a quirk when VLi (VL Inclusive)
-is clear. Bear in mind that the criterion is that the truncated Vector
-of results, when VLi is clear, must all pass the "test", but when VLi is
-set the *current failed test* is permitted to be included. Thus, the actual
-update (store) to Memory is **not permitted to take place** should the
-test fail. Therefore, on testing the value to be stored, and after updating
-the corresponding CR Field Element, when VLi=0 and the test fails, the
-Memory store must **not** occur. By contrast, if VLi=1 and the test fails,
-the Store may proceed *and then* looping terminates.
-In this way, when non-Inclusive, the Vector of Truncated results contains
-only Stores that passed the test, and when Inclusive the Vector of
-Truncated results contains the first-failed data.
-
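-A sketch of that Store-side behaviour, mirroring the style of the Load
-example below (illustrative only: the CR test shown is an assumption):
-
-```
-    # Data-Dependent Fail-First Store sketch (not normative)
-    for i in range(VL):
-        data = GPR(RS+i)
-        CR.field(i) = conditions(data)   # CR Field Element always updated
-        if CR.field(i).EQ == testbit:    # element fails the test
-            if VLi:
-                MEM(EA(i), 8) = data     # inclusive: the failing Store occurs
-                VL = i+1
-            else:
-                VL = i                   # exclusive: failing Store suppressed
-            break
-        MEM(EA(i), 8) = data             # test passed: Store proceeds
-```
-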
-Below is an example of loading the starting addresses of Linked-List nodes.
-If VLi=1 it will load the NULL pointer into the Vector of results.
-If however VLi=0 it will *exclude* the NULL pointer by truncating VL to
-one Element earlier.
-
-```
- RT=1 # vec - deliberately overlaps by one with RA
- RA=0 # vec - first one is valid, contains ptr
- imm = 8 # offset_of(ptr->next)
- for i in range(VL):
- EA = GPR(RA+i) + imm # ptr + offset(next)
- data = MEM(EA, 8) # 64-bit address of ptr->next
- GPR(RT+i) = data # happens to be read on next loop!
- # was a normal ld up to this point. now the Data-Fail-First
- CR.field(i) = conditions(data)
- if CR.field(i).EQ == testbit: # check if zero
-           if VLi then VL = i+1 # update VL, inclusive
- else VL = i # update VL
- break # stop looping
-```
-
-## LOAD/STORE Elwidths <a name="elwidth"></a>
-
-Loads and Stores are almost unique in that the Power Scalar ISA
-provides a width for the operation (lb, lh, lw, ld); only `extsb` and
-others like it similarly provide an explicit operation width. There are
-therefore *three* widths involved:
-
-* operation width (lb=8, lh=16, lw=32, ld=64)
-* src element width override (8/16/32/default)
-* destination element width override (8/16/32/default)
-
-Some care is therefore needed to express and make clear the transformations,
-which are expressly in this order:
-
-* Calculate the Effective Address from RA at full width
- but (on Indexed Load) allow srcwidth overrides on RB
-* Load at the operation width (lb/lh/lw/ld) as usual
-* byte-reversal as usual
-* Non-saturated mode:
- - zero-extension or truncation from operation width to dest elwidth
- - place result in destination at dest elwidth
-* Saturated mode:
- - Sign-extension or truncation from operation width to dest width
- - signed/unsigned saturation down to dest elwidth
-
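-A short worked example of the two paths (widths and values chosen purely
-for illustration):
-
-```
-    # sv.lh (op_width=16) with a dest elwidth override of 8
-    #   halfword read from memory:   0x7FFF  (+32767)
-    #   Saturated (signed):          clamp to 8 bits -> 0x7F (+127)
-    #   Non-saturated:               truncate to 8 bits -> 0xFF
-```
-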
-In order to respect Power v3.0B Scalar behaviour the memory side
-is treated effectively as completely separate and distinct from SV
-augmentation. This is primarily down to quirks surrounding LE/BE and
-byte-reversal.
-
-It is rather unfortunately possible to request an elwidth override on
-the memory side which does not mesh with the overridden operation width:
-this results in `UNDEFINED` behaviour. The reason is that the effect
-of attempting a 64-bit `sv.ld` operation with a source elwidth override
-of 8/16/32 would result in overlapping memory requests, particularly
-on unit and element strided operations. Thus it is `UNDEFINED` when
-the elwidth is smaller than the memory operation width. Examples include
-`sv.lw/sw=16/els` which requests (overlapping) 4-byte memory reads offset
-from each other at 2-byte intervals. Store likewise is also `UNDEFINED`
-where the dest elwidth override is less than the operation width.
-
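-As an illustration of why such combinations are `UNDEFINED` (the base
-address EA is an example only):
-
-```
-    # sv.lw/sw=16/els: 4-byte operation width, 2-byte element stride
-    #   element 0: 4-byte read at EA+0
-    #   element 1: 4-byte read at EA+2   <- overlaps element 0's read
-    #   element 2: 4-byte read at EA+4   <- overlaps element 1's read
-```
-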
-Note the following regarding the pseudocode to follow:
-
-* `scalar identity behaviour` SV Context parameter conditions turn this
- into a straight absolute fully-compliant Scalar v3.0B LD operation
-* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
- rather than `ld`)
-* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
- a "normal" part of Scalar v3.0B LD
-* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
- as a "normal" part of Scalar v3.0B LD
-* `svctx` specifies the SV Context and includes VL as well as
- source and destination elwidth overrides.
-
-Below is the pseudocode for Unit-Strided LD (which includes Vector
-capability). Observe in particular that RA, as the base address in both
-Immediate and Indexed LD/ST, does not have element-width overriding
-applied to it.
-
-Note that predication, predication-zeroing, and other modes except
-saturation have all been removed, for clarity and simplicity:
-
-```
- # LD not VLD!
- # this covers unit stride mode and a type of vector offset
- function op_ld(RT, RA, op_width, imm_offs, svctx)
- for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
- if not svctx.unit/el-strided:
- # strange vector mode, compute 64 bit address which is
- # not polymorphic! elwidth hardcoded to 64 here
- srcbase = get_polymorphed_reg(RA, 64, i)
- else:
- # unit / element stride mode, compute 64 bit address
- srcbase = get_polymorphed_reg(RA, 64, 0)
- # adjust for unit/el-stride
- srcbase += ....
-
- # read the underlying memory
- memread <= MEM(srcbase + imm_offs, op_width)
-
- # check saturation.
-        if svctx.saturation_mode:
- # ... saturation adjustment...
- memread = clamp(memread, op_width, svctx.dest_elwidth)
- else:
- # truncate/extend to over-ridden dest width.
- memread = adjust_wid(memread, op_width, svctx.dest_elwidth)
-
- # takes care of inserting memory-read (now correctly byteswapped)
- # into regfile underlying LE-defined order, into the right place
- # within the NEON-like register, respecting destination element
- # bitwidth, and the element index (j)
- set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)
-
- # increments both src and dest element indices (no predication here)
- i++;
- j++;
-```
-
-Note above that the source elwidth is *not used at all* in LD-immediate.
-
-For LD/Indexed, the key is that in the calculation of the Effective Address,
-RA has no elwidth override but RB does. Pseudocode below is simplified
-for clarity: predication and all modes except saturation are removed:
-
-```
- # LD not VLD! ld*rx if brev else ld*
- function op_ld(RT, RA, RB, op_width, svctx, brev)
- for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
- if not svctx.el-strided:
- # RA not polymorphic! elwidth hardcoded to 64 here
- srcbase = get_polymorphed_reg(RA, 64, i)
- else:
- # element stride mode, again RA not polymorphic
- srcbase = get_polymorphed_reg(RA, 64, 0)
- # RB *is* polymorphic
- offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
- # sign-extend
- if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)
-
- # takes care of (merges) processor LE/BE and ld/ldbrx
- bytereverse = brev XNOR MSR.LE
-
- # read the underlying memory
- memread <= MEM(srcbase + offs, op_width)
-
- # optionally performs byteswap at op width
- if (bytereverse):
- memread = byteswap(memread, op_width)
-
-        if svctx.saturation_mode:
- # ... saturation adjustment...
- memread = clamp(memread, op_width, svctx.dest_elwidth)
- else:
- # truncate/extend to over-ridden dest width.
- memread = adjust_wid(memread, op_width, svctx.dest_elwidth)
-
- # takes care of inserting memory-read (now correctly byteswapped)
- # into regfile underlying LE-defined order, into the right place
- # within the NEON-like register, respecting destination element
- # bitwidth, and the element index (j)
- set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)
-
- # increments both src and dest element indices (no predication here)
- i++;
- j++;
-```
-
-## Remapped LD/ST
-
-In the [[sv/remap]] page the concept of "Remapping" is described. Whilst
-it is expensive to set up (2 64-bit opcodes minimum) it provides a way to
-arbitrarily perform 1D, 2D and 3D "remapping" of up to 64 elements worth
-of LDs or STs. The usual interest in such re-mapping is for example in
-separating out 24-bit RGB channel data into separate contiguous registers.
-
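-A conceptual sketch of that RGB example (illustrative only: the register
-numbers and the REMAP configuration are assumptions; the actual setup is
-described in [[sv/remap]]):
-
-```
-    # packed bytes in memory: R0 G0 B0 R1 G1 B1 R2 G2 B2 ...
-    # after a suitably REMAPped sv.lb (dest elwidth=8), conceptually:
-    #   r8  onwards: R0 R1 R2 ...  (all red bytes, contiguous)
-    #   r12 onwards: G0 G1 G2 ...  (all green bytes, contiguous)
-    #   r16 onwards: B0 B1 B2 ...  (all blue bytes, contiguous)
-```
-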
-REMAP easily covers this capability, and with dest elwidth overrides
-and saturation may do so with built-in conversion that would normally
-require additional width-extension, sign-extension and min/max Vectorised
-instructions as post-processing stages.
-
-Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
-because the generic abstracted concept of "Remapping", when applied to
-LD/ST, will give that same capability, with far more flexibility.
-
-It is worth noting that Pack/Unpack Modes of SVSTATE, which may be
-established through `svstep`, are also an easy way to perform regular
-Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond that,
-REMAP will need to be used.
-
---------
-
-\newpage{}
-
# Condition Register SVP64 Operations
Condition Register Fields are only 4 bits wide: this presents some
```
[[!tag opf_rfc]]
+
+--------
+
+\newpage{}
+