1 [[!tag standards]]
2
3 # SV Load and Store
4
5 Links:
6
7 * <https://bugs.libre-soc.org/show_bug.cgi?id=561>
8 * <https://bugs.libre-soc.org/show_bug.cgi?id=572>
9 * <https://bugs.libre-soc.org/show_bug.cgi?id=571>
10 * <https://bugs.libre-soc.org/show_bug.cgi?id=940> autoincrement mode
11 * <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
12 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
13 * [[ldst/discussion]]
14
15 # Rationale
16
All Vector ISAs dating back fifty years have extensive and comprehensive
Load and Store operations that go far beyond the capabilities of Scalar
RISC and most CISC processors, yet at their heart, on an individual element
basis, may be found to be no different from RISC Scalar equivalents.
21
The resource savings from Vector LD/ST are significant and stem from
the fact that one single instruction can trigger a dozen (or, in some
microarchitectures such as Cray or NEC SX Aurora, hundreds of)
element-level Memory accesses.
25
26 Additionally, and simply: if the Arithmetic side of an ISA supports
27 Vector Operations, then in order to keep the ALUs 100% occupied the
28 Memory infrastructure (and the ISA itself) correspondingly needs Vector
29 Memory Operations as well.
30
31 Vectorised Load and Store also presents an extra dimension (literally)
32 which creates scenarios unique to Vector applications, that a Scalar
33 (and even a SIMD) ISA simply never encounters. SVP64 endeavours to
34 add the modes typically found in *all* Scalable Vector ISAs,
35 without changing the behaviour of the underlying Base
36 (Scalar) v3.0B operations in any way.
37
38 # Modes overview
39
Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different modes:
42
43 * **fixed aka "unit" stride** - contiguous sequence with no gaps
44 * **element strided** - sequential but regularly offset, with gaps
45 * **vector indexed** - vector of base addresses and vector of offsets
46 * **Speculative fail-first** - where it makes sense to do so
47 * **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.
48
49 *Despite being constructed from Scalar LD/ST none of these Modes
50 exist or make sense in any Scalar ISA. They **only** exist in Vector ISAs*
51
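As an illustrative sketch (Python, not part of the specification) of the
first three modes above, the Effective Addresses generated for element `i`
may be summarised as follows. `base`, `immed`, `offsets` and `op_width`
are hypothetical stand-ins for RA, the instruction immediate, the RB
Vector and the LD/ST operation width; the indexed case shows only the
vector-of-offsets variant:

    # Illustrative sketch only: EA patterns for the three basic LD/ST modes.
    def effective_addresses(mode, base, immed, op_width, VL, offsets=None):
        eas = []
        for i in range(VL):
            if mode == "unitstride":      # contiguous, no gaps
                eas.append(base + immed + i * op_width)
            elif mode == "elementstride": # regularly offset, with gaps
                eas.append(base + i * immed)
            elif mode == "indexed":       # per-element offset from a Vector
                eas.append(base + offsets[i])
        return eas

    # example: 8-byte (ld) accesses, 4 elements
    print(effective_addresses("unitstride", 0x1000, 0, 8, 4))
    # -> [4096, 4104, 4112, 4120]
    print(effective_addresses("elementstride", 0x1000, 32, 8, 4))
    # -> [4096, 4128, 4160, 4192]
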
Also included in SVP64 LD/ST are both signed and unsigned Saturation,
as well as Element-width overrides and Twin-Predication.
54
55 Note also that Indexed [[sv/remap]] mode may be applied to both
56 v3.0 LD/ST Immediate instructions *and* v3.0 LD/ST Indexed instructions.
57 LD/ST-Indexed should not be conflated with Indexed REMAP mode: clarification
58 is provided below.
59
60 **Determining the LD/ST Modes**
61
62 A minor complication (caused by the retro-fitting of modern Vector
63 features to a Scalar ISA) is that certain features do not exactly make
64 sense or are considered a security risk. Fail-first on Vector Indexed
65 would allow attackers to probe large numbers of pages from userspace, where
66 strided fail-first (by creating contiguous sequential LDs) does not.
67
In addition, reduce mode makes no sense for LD/ST.
Realistically an alternative table definition is needed
for [[sv/svp64]] `RM.MODE`.
The following modes make sense:
72
73 * saturation
74 * predicate-result (mostly for cache-inhibited LD/ST)
75 * simple (no augmentation)
76 * fail-first (where Vector Indexed is banned)
77 * Signed Effective Address computation (Vector Indexed only)
78 * Pack/Unpack (on LD/ST immediate operations only)
79
More than that, however, it is necessary to fit the usual Vector ISA
capabilities onto both Power ISA LD/ST with immediate and onto
LD/ST Indexed. These present subtly different Mode tables which, due
to lack of space, have the following quirks:
84
85 * LD/ST Immediate has no individual control over src/dest zeroing,
86 whereas LD/ST Indexed does.
87 * LD/ST Immediate has no Saturated Pack/Unpack (Arithmetic Mode does)
88 * LD/ST Indexed has no Pack/Unpack (REMAP may be used instead)
89
90 # Format and fields
91
92 Fields used in tables below:
93
* **sz / dz** if predication is enabled will put zeros into the dest
  (or as src in the case of twin pred) when the predicate bit is zero.
  Otherwise the element is ignored or skipped, depending on context.
* **zz**: both sz and dz are set equal to this flag.
* **inv CR bit** just as in branches (BO), these bits allow testing of
  a CR bit and whether it is set (inv=0) or unset (inv=1)
97 * **N** sets signed/unsigned saturation.
98 * **RC1** as if Rc=1, stores CRs *but not the result*
99 * **SEA** - Signed Effective Address, if enabled performs sign-extension on
100 registers that have been reduced due to elwidth overrides
101
102 **LD/ST immediate**
103
The table for [[sv/svp64]] `immed(RA)`, which is `RM.MODE`
(bits 19:23 of `RM`), is:
106
107 | 0-1 | 2 | 3 4 | description |
108 | --- | --- |---------|--------------------------- |
109 | 00 | 0 | zz els | simple mode |
110 | 00 | 1 | PI LF | post-increment and Fault-First |
111 | 01 | inv | CR-bit | Rc=1: ffirst CR sel |
112 | 01 | inv | els RC1 | Rc=0: ffirst z/nonz |
113 | 10 | N | zz els | sat mode: N=0/1 u/s |
114 | 11 | inv | CR-bit | Rc=1: pred-result CR sel |
115 | 11 | inv | els RC1 | Rc=0: pred-result z/nonz |
116
The `els` bit is only relevant when `RA.isvec` is clear: this indicates
whether stride is unit or element:

    if RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride
126
127 An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
128 in effect the multiplication of the immediate-offset by zero results
129 in reading from the exact same memory location, *even with a Vector
130 register*. (Normally this type of behaviour is reserved for the
131 mapreduce modes)
132
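A brief sketch (Python, illustrative only) of why a zero immediate
produces a splat in element-strided mode: the per-element offset
`i * immed` collapses to zero, so every element reads the same
Effective Address:

    # Illustrative only: element-strided offsets with a zero immediate.
    immed, VL = 0, 4
    offs = [i * immed for i in range(VL)]   # [0, 0, 0, 0]
    # every element therefore loads from the same EA (srcbase + 0):
    # an LD-VSPLAT of one memory location into VL destination elements.
    print(offs)
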
133 For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
134 just the once and be copied, rather than hitting the Data Cache
135 multiple times with the same memory read at the same location.
136 The benefit of Cache-inhibited LD-splats is that it allows
137 for memory-mapped peripherals to have multiple
138 data values read in quick succession and stored in sequentially
139 numbered registers (but, see Note below).
140
141 For non-cache-inhibited ST from a vector source onto a scalar
142 destination: with the Vector
143 loop effectively creating multiple memory writes to the same location,
144 we can deduce that the last of these will be the "successful" one. Thus,
145 implementations are free and clear to optimise out the overwriting STs,
146 leaving just the last one as the "winner". Bear in mind that predicate
147 masks will skip some elements (in source non-zeroing mode).
148 Cache-inhibited ST operations on the other hand **MUST** write out
149 a Vector source multiple successive times to the exact same Scalar
150 destination. Just like Cache-inhibited LDs, multiple values may be
151 written out in quick succession to a memory-mapped peripheral from
152 sequentially-numbered registers.
153
154 Note that any memory location may be Cache-inhibited
155 (Power ISA v3.1, Book III, 1.6.1, p1033)
156
*Programmer's Note: a "VSPLAT" mode with an immediate offset and a
Scalar source is simply not possible: there are not enough
Mode bits. One single Scalar Load operation may be used instead, followed
by any arithmetic operation (including a simple mv) in "Splat"
mode.*
162
163 **LD/ST Indexed**
164
The modes for the `RA+RB` indexed version are slightly different
but use the same `RM.MODE` bits (19:23 of `RM`):
167
168 | 0-1 | 2 | 3 4 | description |
169 | --- | --- |---------|-------------------------- |
170 | 00 | SEA | dz sz | simple mode |
171 | 01 | SEA | dz sz | Strided (scalar only source) |
172 | 10 | N | dz sz | sat mode: N=0/1 u/s |
173 | 11 | inv | CR-bit | Rc=1: pred-result CR sel |
174 | 11 | inv | zz RC1 | Rc=0: pred-result z/nonz |
175
Vector Indexed Strided Mode is qualified as follows:

    if mode = 0b01 and !RA.isvec and !RB.isvec:
        svctx.ldstmode = elementstride
180
A summary of the effect of Vectorisation of src or dest:

    imm(RA)  RT.v  RA.v       no stride allowed
    imm(RA)  RT.s  RA.v       no stride allowed
    imm(RA)  RT.v  RA.s       stride-select allowed
    imm(RA)  RT.s  RA.s       not vectorised
    RA,RB    RT.v  {RA|RB}.v  Standard Indexed
    RA,RB    RT.s  {RA|RB}.v  Indexed but single LD (no VSPLAT)
    RA,RB    RT.v  {RA&RB}.s  VSPLAT possible. stride selectable
    RA,RB    RT.s  {RA&RB}.s  not vectorised (scalar identity)
191
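Gathering the mode-selection rules for both forms into one place, a
hedged Python sketch (illustrative only, not normative pseudocode;
`RA_isvec`, `RB_isvec`, `els`, `immediate` and `mode` stand in for the
SVP64 fields described above):

    # Illustrative consolidation of the LD/ST mode-selection rules above.
    def select_ldstmode(is_indexed_form, RA_isvec, RB_isvec=False,
                        els=0, immediate=0, mode=0b00):
        if is_indexed_form:
            # RA+RB form: element stride only when both RA and RB are scalar
            if mode == 0b01 and not RA_isvec and not RB_isvec:
                return "elementstride"
            return "indexed"
        # immediate form: els is only relevant when RA is scalar
        if RA_isvec:
            return "indexed"        # quirky vector-of-bases mode
        if els == 0:
            return "unitstride"
        if immediate != 0:
            return "elementstride"
        # zero immediate: offsets collapse to zero, i.e. LD-VSPLAT (see above)
        return "elementstride"

    print(select_ldstmode(False, RA_isvec=False, els=1, immediate=16))
    # -> elementstride
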
Signed Effective Address computation is only relevant for
Vector Indexed Mode, when elwidth overrides are applied.
The source elwidth override applies to RB: if SEA is set,
RB is sign-extended from elwidth bits to the full 64 bits
before being added to RA to calculate the Effective Address.
For other Modes (ffirst, saturate),
all EA computation with elwidth overrides is unsigned.
199
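A minimal sketch (Python, illustrative only) of the sign-extension
applied to RB when SEA is set; the name `sext` matches the helper used
in the Indexed pseudocode later in this page, but its body here is an
assumption:

    # Illustrative only: sign-extend an elwidth-sized RB element to 64 bits.
    def sext(value, elwidth_bits, to_bits=64):
        sign = 1 << (elwidth_bits - 1)
        value &= (1 << elwidth_bits) - 1
        if value & sign:
            value -= (1 << elwidth_bits)
        return value & ((1 << to_bits) - 1)

    # with SEA=1 and src elwidth=16, an RB element of 0xFFFE acts as -2:
    RA, RB_elem = 0x2000, 0xFFFE
    EA = (RA + sext(RB_elem, 16)) & ((1 << 64) - 1)
    print(hex(EA))   # -> 0x1ffe
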
Note that cache-inhibited LD/ST when VSPLAT is activated will perform
**multiple** LD/ST operations, sequentially. Even with a scalar src a
Cache-inhibited LD will read the same memory location *multiple times*,
storing the result in successive Vector destination registers. This is
because the cache-inhibited instructions are typically used to read and
write memory-mapped peripherals.
If a genuine cache-inhibited LD-VSPLAT is required then a single *scalar*
cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv,
copying the one *scalar* value into multiple register destinations.
205
Note also that cache-inhibited VSPLAT with Predicate-result is possible.
This allows, for example, a massive batch of memory-mapped
peripheral reads to be issued, stopping at the first NULL character and
truncating VL to that point. No branch is needed to issue that large burst
of LDs, which may be valuable in Embedded scenarios.
211
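A conceptual Python sketch (illustrative only) of the effect described
above, with `peripheral_read` as a hypothetical stand-in for a
cache-inhibited read of the memory-mapped peripheral:

    # Illustrative only: conceptual effect of cache-inhibited VSPLAT reads
    # combined with a test against zero (NULL byte), truncating VL.
    def read_until_null(peripheral_read, VL):
        data = []
        for i in range(VL):
            byte = peripheral_read()   # same memory-mapped address each time
            if byte == 0:
                return data, i         # VL truncated at the first NULL
            data.append(byte)
        return data, VL
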
212 # Vectorisation of Scalar Power ISA v3.0B
213
214 Scalar Power ISA Load/Store operations may be seen from [[isa/fixedload]] and
215 [[isa/fixedstore]] pseudocode to be of the form:
216
    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)
226
227 Thus in the first example, the source registers may each be independently
228 marked as scalar or vector, and likewise the destination; in the second
229 example only the one source and one dest may be marked as scalar or
230 vector.
231
Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride. As there is no way to tell which from the Power v3.0B
Scalar opcode alone, the choice is provided instead by the SV Context.
234
    # LD not VLD! format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip non-predicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;
274
275 Indexed LD is:
276
    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
        # skip non-predicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          EA = ireg[RA] + ireg[RB]*j   # register-strided
        else
          EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;
299
Note that Element-Strided uses the Destination Step (j): because both
sources being Scalar is a prerequisite condition for activation of
Element-Stride Mode, the source step (being Scalar) would never advance.
303
Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source. This is because there is room in svp64 to extend
RA-as-src as well as RA-as-dest, both independently as scalar or vector
*and* independently extending their range.
305
306 *Programmer's note: being able to set RA-as-a-source
307 as separate from RA-as-a-destination as Scalar is **extremely valuable**
308 once it is remembered that Simple-V element operations must
309 be in Program Order, especially in loops, for saving on
310 multiple address computations. Care does have
311 to be taken however that RA-as-src is not overwritten by
312 RA-as-dest unless intentionally desired, especially in element-strided Mode.*
313
314 # LD/ST Indexed vs Indexed REMAP
315
316 Unfortunately the word "Indexed" is used twice in completely different
317 contexts, potentially causing confusion.
318
* Instructions of the form `ld RT,RA,RB` have existed in the Power ISA since
  its creation: these are called "LD/ST Indexed" instructions and their
  name and meaning is well-established.
322 * There now exists, in Simple-V, a [[sv/remap]] mode called "Indexed"
323 Mode that can be applied to *any* instruction **including those
324 named LD/ST Indexed**.
325
Whilst it may be costly in terms of register reads to allow REMAP
Indexed Mode to be applied to any Vectorised LD/ST Indexed operation such as
`sv.ld *RT,RA,*RB`, and whilst the combination may even misleadingly be
labelled as redundant, firstly the strict
application of the RISC Paradigm that Simple-V follows makes it awkward
to consider *preventing* the application of Indexed REMAP to such
operations, and secondly they are not actually the same at all.
333
334 Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB`
335 effectively performs an *in-place* re-ordering of the offsets, RB.
336 To achieve the same effect without Indexed REMAP would require taking
337 a *copy* of the Vector of offsets starting at RB, manually explicitly
338 reordering them, and finally using the copy of re-ordered offsets in
a non-REMAP'ed `sv.ld`. Using non-strided LD as an example, the
pseudocode below shows what actually occurs (the pseudocode for
`indexed_remap` may be found in [[sv/remap]]):
342
    # sv.ld *RT,RA,*RB with Index REMAP applied to RB
    for i in 0..VL-1:
        if remap.indexed:
            rb_idx = indexed_remap(i) # remap
        else:
            rb_idx = i                # use the index as-is
        EA = GPR(RA) + GPR(RB+rb_idx)
        GPR(RT+i) = MEM(EA, 8)
351
352 Thus it can be seen that the use of Indexed REMAP saves copying
353 and manual reordering of the Vector of RB offsets.
354
355 # LD/ST ffirst
356
LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
is not active) as an
ordinary one, with all behaviour with respect to Interrupts, Exceptions,
Page Faults and Memory Management being identical in every regard to Scalar
v3.0 Power ISA LD/ST. However for elements 1
and above, if an exception would occur, then VL is **truncated** to the
previous element: the exception is **not** then raised because the
LD/ST that would otherwise have caused an exception is *required* to be
cancelled. Additionally an implementor may choose to truncate VL for
any arbitrary reason *except for the very first element*.
366
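A sketch (Python, illustrative only) of the Fail-First truncation rule:
element 0 faults as a normal Scalar LD/ST would, while a fault on any
later element truncates VL instead of raising the exception.
`would_fault` and `mem_load` are hypothetical stand-ins for the MMU and
memory subsystem:

    # Illustrative only: LD/ST Fail-First VL truncation.
    def ffirst_load(eas, VL, would_fault, mem_load):
        results = []
        for i in range(VL):
            if would_fault(eas[i]):
                if i == 0:
                    raise MemoryError("element 0 faults as an ordinary LD")
                return results, i   # VL truncated to i, no exception raised
            results.append(mem_load(eas[i]))
        return results, VL
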
ffirst LD/ST to multiple pages via a Vectorised Index base is considered
a security risk due to the abuse of probing multiple pages in rapid
succession and getting speculative feedback on which pages would fail.
Therefore Vector Indexed LD/ST is prohibited entirely, and the Mode bit
instead used for element-strided LD/ST.
See <https://bugs.libre-soc.org/show_bug.cgi?id=561>

    for(i = 0; i < VL; i++)
        reg[rt + i] = mem[reg[ra] + i * reg[rb]];
371
372 High security implementations where any kind of speculative probing
373 of memory pages is considered a risk should take advantage of the fact that
374 implementations may truncate VL at any point, without requiring software
375 to be rewritten and made non-portable. Such implementations may choose
376 to *always* set VL=1 which will have the effect of terminating any
377 speculative probing (and also adversely affect performance), but will
378 at least not require applications to be rewritten.
379
Low-performance simpler hardware implementations may also
choose to always set VL=1 as the bare minimum compliant implementation of
LD/ST Fail-First. It is however critically important to remember that
the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.
**MUST** raise exceptions exactly like an ordinary LD/ST.
385
For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
for any implementation-specific reason. For example: it is perfectly
reasonable for implementations to alter VL when ffirst LD or ST operations
are initiated on a nonaligned boundary, such that within a loop the
subsequent iteration of that loop begins the following ffirst LD/ST
operations on an aligned boundary
such as the beginning of a cache line, or beginning of a Virtual Memory
page. Likewise, to reduce workloads or balance resources.
389
390 Vertical-First Mode is slightly strange in that only one element
391 at a time is ever executed anyway. Given that programmers may
392 legitimately choose to alter srcstep and dststep in non-sequential
393 order as part of explicit loops, it is neither possible nor
394 safe to make speculative assumptions about future LD/STs.
395 Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`.
396 This is very different from Arithmetic (Data-dependent) FFirst
397 where Vertical-First Mode is fully deterministic, not speculative.
398
399 # LOAD/STORE Elwidths <a name="elwidth"></a>
400
401 Loads and Stores are almost unique in that the Power Scalar ISA
402 provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
403 others like it provide an explicit operation width. There are therefore
404 *three* widths involved:
405
406 * operation width (lb=8, lh=16, lw=32, ld=64)
407 * src element width override (8/16/32/default)
408 * destination element width override (8/16/32/default)
409
410 Some care is therefore needed to express and make clear the transformations,
411 which are expressly in this order:
412
413 * Calculate the Effective Address from RA at full width
414 but (on Indexed Load) allow srcwidth overrides on RB
415 * Load at the operation width (lb/lh/lw/ld) as usual
416 * byte-reversal as usual
417 * Non-saturated mode:
418 - zero-extension or truncation from operation width to dest elwidth
419 - place result in destination at dest elwidth
420 * Saturated mode:
421 - Sign-extension or truncation from operation width to dest width
422 - signed/unsigned saturation down to dest elwidth
423
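The Saturated and Non-saturated steps just listed correspond to the
`clamp` and `adjust_wid` helpers used in the pseudocode later in this
section. As a hedged Python sketch of one possible interpretation (the
helper names are taken from that pseudocode; their bodies here are an
assumption, and only the unsigned case is shown for brevity):

    # Illustrative only: possible meaning of the clamp/adjust_wid helpers
    # used in the LD pseudocode below (unsigned case shown for brevity).
    def adjust_wid(memread, op_width, dest_elwidth_bits):
        # zero-extend or truncate the loaded value to the dest elwidth
        # (op_width retained only to mirror the pseudocode signature)
        return memread & ((1 << dest_elwidth_bits) - 1)

    def clamp(memread, op_width, dest_elwidth_bits):
        # unsigned saturation down to the dest elwidth
        maxval = (1 << dest_elwidth_bits) - 1
        return min(memread, maxval)
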
424 In order to respect Power v3.0B Scalar behaviour the memory side
425 is treated effectively as completely separate and distinct from SV
426 augmentation. This is primarily down to quirks surrounding LE/BE and
427 byte-reversal.
428
It is rather unfortunately possible to request an elwidth override on
the memory side which does not mesh with the operation width: this
results in `UNDEFINED` behaviour. The reason is that the effect of
attempting a 64-bit `sv.ld` operation with a source elwidth override of
8/16/32 would result in overlapping memory requests, particularly on
unit and element strided operations. Thus it is `UNDEFINED` when the
elwidth is smaller than the memory operation width. Examples include
`sv.lw/sw=16/els` which requests (overlapping) 4-byte memory reads offset
from each other at 2-byte intervals. Store likewise is also `UNDEFINED`
where the dest elwidth override is less than the operation width.
441
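A small worked illustration (Python, not normative) of the
`sv.lw/sw=16/els` example above, depicting 4-byte reads issued at
2-byte intervals and therefore overlapping:

    # Illustrative only: sv.lw (4-byte op) with a 16-bit src elwidth
    # override issues reads at 2-byte intervals, so the ranges overlap.
    op_width, elwidth_bytes, VL = 4, 2, 4
    reads = [(i * elwidth_bytes, i * elwidth_bytes + op_width)
             for i in range(VL)]
    print(reads)   # -> [(0, 4), (2, 6), (4, 8), (6, 10)]  -- overlapping
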
442 Note the following regarding the pseudocode to follow:
443
444 * `scalar identity behaviour` SV Context parameter conditions turn this
445 into a straight absolute fully-compliant Scalar v3.0B LD operation
446 * `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
447 rather than `ld`)
448 * `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
449 a "normal" part of Scalar v3.0B LD
450 * `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
451 as a "normal" part of Scalar v3.0B LD
452 * `svctx` specifies the SV Context and includes VL as well as
453 source and destination elwidth overrides.
454
455 Below is the pseudocode for Unit-Strided LD (which includes Vector capability). Observe in particular that RA, as the base address in
456 both Immediate and Indexed LD/ST,
457 does not have element-width overriding applied to it.
458
459 Note that predication, predication-zeroing,
460 and other modes except saturation have all been removed,
461 for clarity and simplicity:
462
    # LD not VLD!
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # read the underlying memory
        memread <= MEM(srcbase + imm_offs, op_width)

        # check saturation.
        if svctx.saturation_mode:
            # ... saturation adjustment...
            memread = clamp(memread, op_width, svctx.dest_elwidth)
        else:
            # truncate/extend to over-ridden dest width.
            memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
497
498 Note above that the source elwidth is *not used at all* in LD-immediate.
499
500 For LD/Indexed, the key is that in the calculation of the Effective Address,
501 RA has no elwidth override but RB does. Pseudocode below is simplified
502 for clarity: predication and all modes except saturation are removed:
503
    # LD not VLD! ld*rx if brev else ld*
    function op_ld(RT, RA, RB, op_width, svctx, brev)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
        if not svctx.el-strided:
            # RA not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # element stride mode, again RA not polymorphic
            srcbase = get_polymorphed_reg(RA, 64, 0)
        # RB *is* polymorphic
        offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
        # sign-extend
        if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= MEM(srcbase + offs, op_width)

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        if svctx.saturation_mode:
            # ... saturation adjustment...
            memread = clamp(memread, op_width, svctx.dest_elwidth)
        else:
            # truncate/extend to over-ridden dest width.
            memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
544
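The `get_polymorphed_reg` / `set_polymorphed_reg` helpers used in the
pseudocode above model elwidth-overridden access to the register file.
Purely as a reading aid, a simplified, non-normative Python sketch,
assuming the regfile is viewed as a flat little-endian byte array:

    # Simplified, illustrative sketch of elwidth-overridden register access.
    regfile = bytearray(128 * 8)   # 128 x 64-bit GPRs as a flat LE byte array

    def get_polymorphed_reg(reg, elwidth_bits, element_idx):
        nbytes = elwidth_bits // 8
        start = reg * 8 + element_idx * nbytes
        return int.from_bytes(regfile[start:start + nbytes], "little")

    def set_polymorphed_reg(reg, elwidth_bits, element_idx, value):
        nbytes = elwidth_bits // 8
        start = reg * 8 + element_idx * nbytes
        regfile[start:start + nbytes] = value.to_bytes(nbytes, "little")

    set_polymorphed_reg(4, 16, 3, 0xBEEF)        # element 3 of r4, elwidth=16
    print(hex(get_polymorphed_reg(4, 16, 3)))    # -> 0xbeef
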
545 # Remapped LD/ST
546
547 In the [[sv/remap]] page the concept of "Remapping" is described.
548 Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
549 a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
550 elements worth of LDs or STs. The usual interest in such re-mapping
551 is for example in separating out 24-bit RGB channel data into separate
552 contiguous registers. NEON covers this as shown in the diagram below:
553
![Load/Store remap](/openpower/sv/load-store.svg)
555
556 Remap easily covers this capability, and with dest
557 elwidth overrides and saturation may do so with built-in conversion that
558 would normally require additional width-extension, sign-extension and
559 min/max Vectorised instructions as post-processing stages.
560
561 Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
562 because the generic abstracted concept of "Remapping", when applied to
563 LD/ST, will give that same capability, with far more flexibility.
564
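As a purely illustrative Python sketch (not the REMAP specification) of
the RGB example: de-interleaving packed R,G,B elements into three
contiguous groups is simply a re-ordering of element indices, which is
the kind of schedule that Indexed/2D REMAP supplies, as described in
[[sv/remap]]:

    # Illustrative only: index re-ordering that separates packed R,G,B
    # elements into three contiguous groups (the effect REMAP achieves).
    packed = ["R0","G0","B0","R1","G1","B1","R2","G2","B2"]
    npixels = len(packed) // 3
    # remapped index: column-major walk of an (npixels x 3) matrix
    remap = [pixel * 3 + channel
             for channel in range(3)
             for pixel in range(npixels)]
    deinterleaved = [packed[idx] for idx in remap]
    print(deinterleaved)  # -> ['R0','R1','R2','G0','G1','G2','B0','B1','B2']
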
565 It is worth noting that Pack/Unpack Modes of SVSTATE, which may be
566 established through sv.setvl, are also an easy way to perform regular
567 Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond
568 that, REMAP will need to be used.