1 # SV Load and Store
2
3 Links:
4
5 * <https://bugs.libre-soc.org/show_bug.cgi?id=561>
6 * <https://bugs.libre-soc.org/show_bug.cgi?id=572>
7 * <https://bugs.libre-soc.org/show_bug.cgi?id=571>
8 * <https://bugs.libre-soc.org/show_bug.cgi?id=940> autoincrement mode
9 * <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
10 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
11 * [[ldst/discussion]]
12
13 ## Rationale
14
All Vector ISAs dating back fifty years have extensive and comprehensive
Load and Store operations that go far beyond the capabilities of Scalar
RISC and most CISC processors, yet at their heart, on an individual element
basis, they are no different from their RISC Scalar equivalents.
19
The resource savings from Vector LD/ST are significant and stem
from the fact that one single instruction can trigger a dozen (or, in
some microarchitectures such as Cray or NEC SX Aurora, hundreds) of
element-level Memory accesses.
24
25 Additionally, and simply: if the Arithmetic side of an ISA supports
26 Vector Operations, then in order to keep the ALUs 100% occupied the
27 Memory infrastructure (and the ISA itself) correspondingly needs Vector
28 Memory Operations as well.
29
30 Vectorised Load and Store also presents an extra dimension (literally)
31 which creates scenarios unique to Vector applications, that a Scalar (and
32 even a SIMD) ISA simply never encounters. SVP64 endeavours to add the
33 modes typically found in *all* Scalable Vector ISAs, without changing the
34 behaviour of the underlying Base (Scalar) v3.0B operations in any way.
(The sole apparent exception is Post-Increment Mode on LD/ST-update
instructions.)
37
38 ## Modes overview
39
Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different modes:
42
43 * **fixed aka "unit" stride** - contiguous sequence with no gaps
44 * **element strided** - sequential but regularly offset, with gaps
45 * **vector indexed** - vector of base addresses and vector of offsets
46 * **Speculative fail-first** - where it makes sense to do so
47 * **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.
48
*Despite being constructed from Scalar LD/ST, none of these Modes exist
or make sense in any Scalar ISA. They **only** exist in Vector ISAs.*

Also included in SVP64 LD/ST are both signed and unsigned Saturation,
as well as Element-width overrides and Twin-Predication.
54
55 Note also that Indexed [[sv/remap]] mode may be applied to both v3.0
56 LD/ST Immediate instructions *and* v3.0 LD/ST Indexed instructions.
57 LD/ST-Indexed should not be conflated with Indexed REMAP mode:
58 clarification is provided below.
59
60 **Determining the LD/ST Modes**
61
62 A minor complication (caused by the retro-fitting of modern Vector
63 features to a Scalar ISA) is that certain features do not exactly make
64 sense or are considered a security risk. Fail-first on Vector Indexed
65 would allow attackers to probe large numbers of pages from userspace,
66 where strided fail-first (by creating contiguous sequential LDs) does not.
67
68 In addition, reduce mode makes no sense. Realistically we need an
69 alternative table definition for [[sv/svp64]] `RM.MODE`. The following
70 modes make sense:
71
72 * saturation
73 * predicate-result would be useful but is lower priority than Data-Dependent Fail-First
74 * simple (no augmentation)
75 * fail-first (where Vector Indexed is banned)
76 * Signed Effective Address computation (Vector Indexed only)
77
78 More than that however it is necessary to fit the usual Vector ISA
79 capabilities onto both Power ISA LD/ST with immediate and to LD/ST
80 Indexed. They present subtly different Mode tables, which, due to lack
81 of space, have the following quirks:
82
83 * LD/ST Immediate has no individual control over src/dest zeroing,
84 whereas LD/ST Indexed does.
85 * LD/ST Immediate has saturation but LD/ST Indexed does not.
86
87 ## Format and fields
88
89 Fields used in tables below:
90
* **sz / dz**: if predication is enabled, setting these puts zeros into
  the dest (or supplies zeros as the src, in the case of twin predication)
  when the predicate bit is zero. Otherwise the element is ignored or
  skipped, depending on context.
94 * **zz**: both sz and dz are set equal to this flag.
95 * **inv CR bit** just as in branches (BO) these bits allow testing of
96 a CR bit and whether it is set (inv=0) or unset (inv=1)
97 * **N** sets signed/unsigned saturation.
98 * **RC1** as if Rc=1, stores CRs *but not the result*
99 * **SEA** - Signed Effective Address, if enabled performs sign-extension on
100 registers that have been reduced due to elwidth overrides
* **PI** - post-increment mode (applies to LD/ST with update only).
  The Effective Address utilised is always just RA, i.e. the computation of
  EA is stored in RA **after** it is actually used.
* **LF** - Load/Store Fail or Fault First: for any reason Load or Store Vectors
  may be truncated to (at least) one element, and VL altered to indicate such.
* **VLi** - Inclusive Data-Dependent Fail-First: the failing element is included
  in the Truncated Vector.
108
109 When VLi=0 on Store Operations the Memory update does **not** take place
110 on the element that failed. EA does **not** update into RA on Load/Store
111 with Update instructions either.
112
113 **LD/ST immediate**
114
115 The table for [[sv/svp64]] for `immed(RA)` which is `RM.MODE`
116 (bits 19:23 of `RM`) is:
117
118 | 0 | 1 | 2 | 3 4 | description |
119 |---|---| --- |---------|--------------------------- |
120 | 0 | 0 | 0 | zz els | simple mode |
121 | 0 | 0 | 1 | PI LF | post-increment and Fault-First |
122 | 1 | 0 | N | zz els | sat mode: N=0/1 u/s |
123 |VLi| 1 | inv | CR-bit | Rc=1: ffirst CR sel |
124 |VLi| 1 | inv | els RC1 | Rc=0: ffirst z/nonz |
125
126
127 The `els` bit is only relevant when `RA.isvec` is clear: this indicates
128 whether stride is unit or element:
129
```
if RA.isvec:
    svctx.ldstmode = indexed
elif els == 0:
    svctx.ldstmode = unitstride
elif immediate != 0:
    svctx.ldstmode = elementstride
```
138
139 An immediate of zero is a safety-valve to allow `LD-VSPLAT`: in effect
140 the multiplication of the immediate-offset by zero results in reading from
141 the exact same memory location, *even with a Vector register*. (Normally
142 this type of behaviour is reserved for the mapreduce modes)
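
The selection logic above, including the zero-immediate `LD-VSPLAT` case,
can be sketched in Python as follows. This is only an illustrative sketch:
the function name `ldst_imm_mode` and the string label `"vsplat"` are
assumptions made for this example and do not appear in the specification.

```python
def ldst_imm_mode(ra_isvec: bool, els: int, immediate: int) -> str:
    """Pick the addressing mode for a Vectorised LD/ST-immediate.

    Mirrors the pseudocode above; the "vsplat" label for the
    els=1, immediate=0 case is a name invented for this sketch.
    """
    if ra_isvec:
        return "indexed"        # vector of base addresses
    if els == 0:
        return "unitstride"     # contiguous elements
    if immediate != 0:
        return "elementstride"  # element i at i * immediate
    # els set but immediate zero: every element reads the same address
    return "vsplat"
```

For example, `ldst_imm_mode(False, 1, 0)` returns `"vsplat"`, whereas the
same call with an immediate of 8 returns `"elementstride"`.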
143
144 For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur just
145 the once and be copied, rather than hitting the Data Cache multiple
146 times with the same memory read at the same location. The benefit of
147 Cache-inhibited LD-splats is that it allows for memory-mapped peripherals
148 to have multiple data values read in quick succession and stored in
149 sequentially numbered registers (but, see Note below).
150
151 For non-cache-inhibited ST from a vector source onto a scalar destination:
152 with the Vector loop effectively creating multiple memory writes to
153 the same location, we can deduce that the last of these will be the
154 "successful" one. Thus, implementations are free and clear to optimise
155 out the overwriting STs, leaving just the last one as the "winner".
156 Bear in mind that predicate masks will skip some elements (in source
157 non-zeroing mode). Cache-inhibited ST operations on the other hand
158 **MUST** write out a Vector source multiple successive times to the exact
159 same Scalar destination. Just like Cache-inhibited LDs, multiple values
160 may be written out in quick succession to a memory-mapped peripheral
161 from sequentially-numbered registers.
162
163 Note that any memory location may be Cache-inhibited
164 (Power ISA v3.1, Book III, 1.6.1, p1033)
165
166 *Programmer's Note: an immediate also with a Scalar source as a "VSPLAT"
167 mode is simply not possible: there are not enough Mode bits. One single
168 Scalar Load operation may be used instead, followed by any arithmetic
169 operation (including a simple mv) in "Splat" mode.*
170
171 **LD/ST Indexed**
172
173 The modes for `RA+RB` indexed version are slightly different
174 but are the same `RM.MODE` bits (19:23 of `RM`):
175
176 | 0 | 1 | 2 | 3 4 | description |
177 |---|---| --- |---------|--------------------------- |
178 |els| 0 | SEA | dz sz | simple mode |
179 |VLi| 1 | inv | CR-bit | Rc=1: ffirst CR sel |
180 |VLi| 1 | inv | els RC1 | Rc=0: ffirst z/nonz |
181
182 Vector Indexed Strided Mode is qualified as follows:
183
```
if els and !RA.isvec and !RB.isvec:
    svctx.ldstmode = elementstride
```
188
189 A summary of the effect of Vectorisation of src or dest:
190
```
imm(RA)  RT.v   RA.v         no stride allowed
imm(RA)  RT.s   RA.v         no stride allowed
imm(RA)  RT.v   RA.s         stride-select allowed
imm(RA)  RT.s   RA.s         not vectorised
RA,RB    RT.v  {RA|RB}.v     Standard Indexed
RA,RB    RT.s  {RA|RB}.v     Indexed but single LD (no VSPLAT)
RA,RB    RT.v  {RA&RB}.s     VSPLAT possible. stride selectable
RA,RB    RT.s  {RA&RB}.s     not vectorised (scalar identity)
```
201
202 Signed Effective Address computation is only relevant for Vector Indexed
203 Mode, when elwidth overrides are applied. The source override applies to
204 RB, and before adding to RA in order to calculate the Effective Address,
205 if SEA is set RB is sign-extended from elwidth bits to the full 64 bits.
206 For other Modes (ffirst, saturate), all EA computation with elwidth
207 overrides is unsigned.
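
A small Python sketch of the effect of SEA on one element is given
below. It assumes a 64-bit wrap-around Effective Address, and the helper
names `sign_extend` and `indexed_ea` are hypothetical, chosen only for
this illustration.

```python
MASK64 = (1 << 64) - 1

def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a signed integer."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def indexed_ea(ra: int, rb_elem: int, src_elwidth: int, sea: bool) -> int:
    """EA for one element: RA at full width, RB narrowed by elwidth."""
    offs = rb_elem & ((1 << src_elwidth) - 1)  # elwidth-reduced RB
    if sea:
        offs = sign_extend(offs, src_elwidth)  # SEA=1: signed offset
    return (ra + offs) & MASK64                # SEA=0: unsigned offset
```

With an 8-bit source elwidth and an RB element of `0xFF`, SEA=0 gives
`0x1000 + 255 = 0x10FF` whereas SEA=1 gives `0x1000 - 1 = 0x0FFF`.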
208
Note that cache-inhibited LD/ST when VSPLAT is activated will perform
**multiple** LD/ST operations, sequentially. Even with a scalar src,
a Cache-inhibited LD will read the same memory location *multiple
times*, storing the result in successive Vector destination registers.
This is because the cache-inhibit instructions are typically used to read
and write memory-mapped peripherals. If a genuine cache-inhibited
LD-VSPLAT is required then a single *scalar* cache-inhibited LD should
be performed, followed by a VSPLAT-augmented mv, copying the one *scalar*
value into multiple register destinations.
218
219 Note also that cache-inhibited VSPLAT with Predicate-result is possible.
220 This allows for example to issue a massive batch of memory-mapped
221 peripheral reads, stopping at the first NULL-terminated character and
222 truncating VL to that point. No branch is needed to issue that large
223 burst of LDs, which may be valuable in Embedded scenarios.
224
225 ## Vectorisation of Scalar Power ISA v3.0B
226
227 Scalar Power ISA Load/Store operations may be seen from [[isa/fixedload]]
228 and [[isa/fixedstore]] pseudocode to be of the form:
229
230 ```
231 lbux RT, RA, RB
232 EA <- (RA) + (RB)
233 RT <- MEM(EA)
234 ```
235
236 and for immediate variants:
237
238 ```
239 lb RT,D(RA)
240 EA <- RA + EXTS(D)
241 RT <- MEM(EA)
242 ```
243
244 Thus in the first example, the source registers may each be independently
245 marked as scalar or vector, and likewise the destination; in the second
246 example only the one source and one dest may be marked as scalar or
247 vector.
248
249 Thus we can see that Vector Indexed may be covered, and, as demonstrated
250 with the pseudocode below, the immediate can be used to give unit
251 stride or element stride. With there being no way to tell which from
252 the Power v3.0B Scalar opcode alone, the choice is provided instead by
253 the SV Context.
254
```
# LD not VLD! format - ldop RT, immed(RA)
# op_width: lb=1, lh=2, lw=4, ld=8
op_load(RT, RA, op_width, immed, svctx, RAupdate):
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, u=0; i < VL && j < VL;):
    # skip nonpredicated elements
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if postinc:
        offs = 0; # added afterwards
        if RA.isvec: srcbase = ireg[RA+i]
        else         srcbase = ireg[RA]
    elif svctx.ldstmode == elementstride:
        # element stride mode
        srcbase = ireg[RA]
        offs = i * immed # j*immed for a ST
    elif svctx.ldstmode == unitstride:
        # unit stride mode
        srcbase = ireg[RA]
        offs = immed + (i * op_width) # j*op_width for ST
    elif RA.isvec:
        # quirky Vector indexed mode but with an immediate
        srcbase = ireg[RA+i]
        offs = immed;
    else
        # standard scalar mode (but predicated)
        # no stride multiplier means VSPLAT mode
        srcbase = ireg[RA]
        offs = immed

    # compute EA
    EA = srcbase + offs
    # load from memory
    ireg[RT+j] <= MEM[EA];
    # check post-increment of EA
    if postinc: EA = srcbase + immed;
    # update RA?
    if RAupdate: ireg[RAupdate+u] = EA;
    if (!RT.isvec)
        break # destination scalar, end now
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RT.isvec) j++;
```
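
To make the difference between the two strided cases concrete, the
following Python sketch lists the Effective Addresses that each would
generate. It is deliberately reduced: predication, post-increment and
RA-update are omitted, a single element index `i` stands in for the
element step, and the function name `strided_load_eas` is an assumption
made for this example.

```python
def strided_load_eas(base: int, immed: int, op_width: int,
                     mode: str, vl: int) -> list[int]:
    """Effective Addresses for a scalar-RA LD-immediate of vl elements.

    op_width is in bytes (lb=1, lh=2, lw=4, ld=8), as in the
    pseudocode above.
    """
    eas = []
    for i in range(vl):
        if mode == "unitstride":
            offs = immed + i * op_width  # contiguous, no gaps
        elif mode == "elementstride":
            offs = i * immed             # the immediate is the stride
        else:
            raise ValueError(f"unsupported mode: {mode}")
        eas.append(base + offs)
    return eas
```

With `base=0x100`, `immed=8`, `op_width=4` and `vl=4`, unit stride
gives `0x108, 0x10c, 0x110, 0x114` and element stride gives
`0x100, 0x108, 0x110, 0x118`.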
302
303 Indexed LD is:
304
```
# format: ldop RT, RA, RB
function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
    # skip nonpredicated RA, RB and RT
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RB.isvec) while (!(ps & 1<<k)) k++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if svctx.ldstmode == elementstride:
        EA = ireg[RA] + ireg[RB]*j # register-strided
    else
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
    if RAupdate: ireg[RAupdate+u] = EA
    ireg[RT+j] <= MEM[EA];
    if (!RT.isvec)
        break # destination scalar, end immediately
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RB.isvec) k++;
    if (RT.isvec) j++;
```
329
330 Note that Element-Strided uses the Destination Step because with both
331 sources being Scalar as a prerequisite condition of activation of
332 Element-Stride Mode, the source step (being Scalar) would never advance.
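
The stepping behaviour can be checked with a reduced Python model of the
Indexed LD loop. This is a sketch only: predication and RA-update are
left out, the register file is a plain list, and the function name
`ldx_eas` and its parameters are hypothetical.

```python
def ldx_eas(gpr, ra, rb, vl, ra_isvec, rb_isvec, rt_isvec, els):
    """Return the Effective Addresses generated by an Indexed LD."""
    elementstride = els and not ra_isvec and not rb_isvec
    i = j = k = 0
    eas = []
    while i < vl and j < vl and k < vl:
        if elementstride:
            # both sources are scalar, so only j ever advances
            eas.append(gpr[ra] + gpr[rb] * j)
        else:
            eas.append(gpr[ra + i] + gpr[rb + k])
        if not rt_isvec:
            break                # scalar destination: one LD only
        if ra_isvec:
            i += 1
        if rb_isvec:
            k += 1
        j += 1
    return eas

gpr = [0] * 32
gpr[3] = 0x1000                          # RA: scalar base address
gpr[4:8] = [0x10, 0x8, 0x40, 0x18]       # RB: scalar stride or offsets
strided = ldx_eas(gpr, 3, 4, 4, False, False, True, True)
indexed = ldx_eas(gpr, 3, 4, 4, False, True, True, False)
```

Here `strided` steps by `gpr[4]` (0x10) per destination element, whilst
`indexed` adds each successive RB element to the scalar base.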
333
Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source. This is because there is room in svp64 to extend RA-as-src
as well as RA-as-dest, both independently as scalar or vector *and*
independently extending their range.
339
340 *Programmer's note: being able to set RA-as-a-source as separate from
341 RA-as-a-destination as Scalar is **extremely valuable** once it is
342 remembered that Simple-V element operations must be in Program Order,
343 especially in loops, for saving on multiple address computations. Care
344 does have to be taken however that RA-as-src is not overwritten by
345 RA-as-dest unless intentionally desired, especially in element-strided
346 Mode.*
347
348 ## LD/ST Indexed vs Indexed REMAP
349
350 Unfortunately the word "Indexed" is used twice in completely different
351 contexts, potentially causing confusion.
352
* Instructions of the form `ld RT,RA,RB` have existed in the Power ISA
  since its creation: these are called "LD/ST Indexed" instructions and
  their name and meaning are well-established.
356 * There now exists, in Simple-V, a [[sv/remap]] mode called "Indexed"
357 Mode that can be applied to *any* instruction **including those
358 named LD/ST Indexed**.
359
360 Whilst it may be costly in terms of register reads to allow REMAP Indexed
361 Mode to be applied to any Vectorised LD/ST Indexed operation such as
362 `sv.ld *RT,RA,*RB`, or even misleadingly labelled as redundant, firstly
363 the strict application of the RISC Paradigm that Simple-V follows makes
364 it awkward to consider *preventing* the application of Indexed REMAP to
365 such operations, and secondly they are not actually the same at all.
366
367 Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB`
368 effectively performs an *in-place* re-ordering of the offsets, RB.
369 To achieve the same effect without Indexed REMAP would require taking
370 a *copy* of the Vector of offsets starting at RB, manually explicitly
371 reordering them, and finally using the copy of re-ordered offsets in a
372 non-REMAP'ed `sv.ld`. Using non-strided LD as an example, pseudocode
373 showing what actually occurs, where the pseudocode for `indexed_remap`
374 may be found in [[sv/remap]]:
375
```
# sv.ld *RT,RA,*RB with Index REMAP applied to RB
for i in 0..VL-1:
    if remap.indexed:
        rb_idx = indexed_remap(i) # remap
    else:
        rb_idx = i # use the index as-is
    EA = GPR(RA) + GPR(RB+rb_idx)
    GPR(RT+i) = MEM(EA, 8)
```
386
387 Thus it can be seen that the use of Indexed REMAP saves copying
388 and manual reordering of the Vector of RB offsets.
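
The equivalence can be illustrated with a short Python sketch. It
assumes that a plain list `index` stands in for the result of
`indexed_remap(i)` and that memory is a dictionary keyed by address;
both function names are invented for this example.

```python
def ld_with_indexed_remap(mem, base, offsets, index):
    """Offsets are picked in-place through the remap index."""
    return [mem[base + offsets[index[i]]] for i in range(len(index))]

def ld_with_manual_reorder(mem, base, offsets, index):
    """Same result without REMAP: copy and reorder the offsets first."""
    reordered = [offsets[k] for k in index]  # the extra copy step
    return [mem[base + offs] for offs in reordered]

offsets = [0, 8, 16, 24]                     # the Vector starting at RB
index = [3, 1, 0, 2]                         # stand-in for indexed_remap
mem = {0x100 + o: 1000 + o for o in offsets}
remapped = ld_with_indexed_remap(mem, 0x100, offsets, index)
copied = ld_with_manual_reorder(mem, 0x100, offsets, index)
```

Both calls load the same values in the same order; the second simply
needs the additional `reordered` copy.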
389
390 ## LD/ST ffirst
391
LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
is not active) as an ordinary one, with all behaviour with respect to
Interrupts, Exceptions, Page Faults and Memory Management being identical
in every regard to Scalar v3.0 Power ISA LD/ST. However for elements
1 and above, if an exception would occur, then VL is **truncated**
to the previous element: the exception is **not** then raised because
the LD/ST that would otherwise have caused an exception is *required*
to be cancelled. Additionally an implementor may choose to truncate VL
for any arbitrary reason *except at the very first element*.
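
A behavioural sketch of this rule in Python is shown below. It is a
simplification under stated assumptions: memory is a dictionary, a
missing key models an address that would fault, and the names
`PageFault` and `ld_ffirst` are hypothetical.

```python
class PageFault(Exception):
    """Stand-in for the exception an ordinary LD would raise."""

def ld_ffirst(mem, eas):
    """Fail-first LD over the addresses eas; returns (values, new VL)."""
    values = []
    for i, ea in enumerate(eas):
        if ea not in mem:
            if i == 0:
                # first element behaves exactly like a Scalar LD
                raise PageFault(hex(ea))
            # later elements: cancel the LD and truncate VL, no exception
            return values, i
        values.append(mem[ea])
    return values, len(eas)
```

With only addresses 0 and 8 mapped, `ld_ffirst(mem, [0, 8, 16, 24])`
returns the first two values and a VL of 2, whereas a fault on the very
first address raises `PageFault`.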
401
402 ffirst LD/ST to multiple pages via a Vectorised Index base is
403 considered a security risk due to the abuse of probing multiple
404 pages in rapid succession and getting speculative feedback on which
405 pages would fail. Therefore Vector Indexed LD/ST is prohibited
406 entirely, and the Mode bit instead used for element-strided LD/ST.
407 See <https://bugs.libre-soc.org/show_bug.cgi?id=561>
408
```
for(i = 0; i < VL; i++)
    reg[rt + i] = mem[reg[ra] + i * reg[rb]];
```
413
414 High security implementations where any kind of speculative probing of
415 memory pages is considered a risk should take advantage of the fact
416 that implementations may truncate VL at any point, without requiring
417 software to be rewritten and made non-portable. Such implementations may
418 choose to *always* set VL=1 which will have the effect of terminating
419 any speculative probing (and also adversely affect performance), but
420 will at least not require applications to be rewritten.
421
422 Low-performance simpler hardware implementations may also choose (always)
423 to also set VL=1 as the bare minimum compliant implementation of LD/ST
424 Fail-First. It is however critically important to remember that the first
425 element LD/ST **MUST** be treated as an ordinary LD/ST, i.e. **MUST**
426 raise exceptions exactly like an ordinary LD/ST.
427
428 For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
429 for any implementation-specific reason. For example: it is perfectly
430 reasonable for implementations to alter VL when ffirst LD or ST operations
431 are initiated on a nonaligned boundary, such that within a loop the
432 subsequent iteration of that loop begins the following ffirst LD/ST
433 operations on an aligned boundary such as the beginning of a cache line,
434 or beginning of a Virtual Memory page. Likewise, to reduce workloads or
435 balance resources.
436
437 Vertical-First Mode is slightly strange in that only one element at a time
438 is ever executed anyway. Given that programmers may legitimately choose
439 to alter srcstep and dststep in non-sequential order as part of explicit
440 loops, it is neither possible nor safe to make speculative assumptions
441 about future LD/STs. Therefore, Fail-First LD/ST in Vertical-First is
442 `UNDEFINED`. This is very different from Arithmetic (Data-dependent)
443 FFirst where Vertical-First Mode is fully deterministic, not speculative.
444
445 ## Data-Dependent Fail-First (not Fail/Fault-First)
446
447 Not to be confused with Fail/Fault First, Data-Fail-First performs an
448 additional check on the data into a Condition Register Field and if a test on
449 the CR Field fails then VL is truncated and further looping terminates.
450 This is precisely the same as Arithmetic Data-Dependent Fail-First, the
451 only difference being that the result comes from the LD/ST.
452
In the case of Store operations there is a quirk when VLi (VL inclusive
is "Valid") is clear. Bear in mind that the criterion is that, when VLi
is clear, the truncated Vector of results must all pass the "test",
whereas when VLi is set the *current failed test* is permitted to be
included. Thus, when VLi=0, the actual update (store) to Memory is
**not permitted to take place** should the test fail: on testing the
value to be stored, and after updating the corresponding CR Field
Element, if the test is found to fail then the Memory store must
**not** occur.
462
463 Additionally, when VLi=0 and a test fails then RA does **not** receive a
464 copy of the Effective Address. Hardware implementations with Out-of-Order
465 Micro-Architectures should use speculative Shadow-Hold and Cancellation
466 when the test fails.
467
468 By contrast if VLi=1
469 and the test fails, Store may proceed *and then* looping terminates.
470 In this way, when non-Inclusive, the Vector of Truncated results contains
471 only Stores that passed the test (and RA=EA updates if any), and when Inclusive the Vector of
472 Truncated results contains the first-failed data.
473
474 Below is an example of loading the starting addresses of Linked-List nodes.
475 If VLi=1 it will load the NULL pointer into the Vector of results.
476 If however VLi=0 it will *exclude* the NULL pointer by truncating VL to
477 one Element earlier.
478
```
RT=1 # vec - deliberately overlaps by one with RA
RA=0 # vec - first one is valid, contains ptr
imm = 8 # offset_of(ptr->next)
for i in range(VL):
    EA = GPR(RA+i) + imm # ptr + offset(next)
    data = MEM(EA, 8) # 64-bit address of ptr->next
    GPR(RT+i) = data # happens to be read on next loop!
    # was a normal ld up to this point. now the Data-Fail-First
    CR.field(i) = conditions(data)
    if CR.field(i).EQ == testbit: # check if zero
        if VLi then VL = i+1 # update VL, inclusive
        else VL = i # update VL
        break # stop looping
```
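
A runnable Python rendering of the same walk is sketched below, using a
three-node list as test data. The node addresses, the dictionary used as
memory and the function name `ld_ddff_linked_list` are all assumptions
made for the example, and the CR Field test is reduced to a comparison
with zero.

```python
def ld_ddff_linked_list(mem, gpr, rt, ra, imm, vl, vli):
    """Walk ptr->next pointers; return the (possibly truncated) VL."""
    for i in range(vl):
        ea = gpr[ra + i] + imm   # ptr + offset(next)
        data = mem[ea]           # address of the next node
        gpr[rt + i] = data       # read as RA+i+1 on the next iteration
        if data == 0:            # the test fails on a NULL pointer
            return i + 1 if vli else i
    return vl

# nodes at 0x100 -> 0x200 -> 0x300 -> NULL, "next" field at offset 8
mem = {0x108: 0x200, 0x208: 0x300, 0x308: 0}

gpr_excl = [0] * 16
gpr_excl[0] = 0x100
vl_excl = ld_ddff_linked_list(mem, gpr_excl, 1, 0, 8, 8, vli=False)

gpr_incl = [0] * 16
gpr_incl[0] = 0x100
vl_incl = ld_ddff_linked_list(mem, gpr_incl, 1, 0, 8, 8, vli=True)
```

In this sketch `vl_excl` ends up as 2 (the NULL is excluded) and
`vl_incl` as 3 (the NULL is the last element of the Vector of results).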
494
495 ## LOAD/STORE Elwidths <a name="elwidth"></a>
496
497 Loads and Stores are almost unique in that the Power Scalar ISA
498 provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
499 others like it provide an explicit operation width. There are therefore
500 *three* widths involved:
501
502 * operation width (lb=8, lh=16, lw=32, ld=64)
503 * src element width override (8/16/32/default)
504 * destination element width override (8/16/32/default)
505
506 Some care is therefore needed to express and make clear the transformations,
507 which are expressly in this order:
508
509 * Calculate the Effective Address from RA at full width
510 but (on Indexed Load) allow srcwidth overrides on RB
511 * Load at the operation width (lb/lh/lw/ld) as usual
512 * byte-reversal as usual
513 * Non-saturated mode:
514 - zero-extension or truncation from operation width to dest elwidth
515 - place result in destination at dest elwidth
516 * Saturated mode:
517 - Sign-extension or truncation from operation width to dest width
518 - signed/unsigned saturation down to dest elwidth
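
The two width-conversion paths listed above can be sketched in Python as
follows. All widths are in bits. The extra `signed` parameter on `clamp`
and the choice to return signed results as negative Python integers are
assumptions of this sketch, not something the specification defines.

```python
def adjust_wid(value: int, op_width: int, dest_elwidth: int) -> int:
    """Non-saturated: zero-extend or truncate to the dest elwidth."""
    value &= (1 << op_width) - 1             # value as loaded
    return value & ((1 << dest_elwidth) - 1)

def clamp(value: int, op_width: int, dest_elwidth: int,
          signed: bool) -> int:
    """Saturated: clamp to the range representable in dest elwidth."""
    value &= (1 << op_width) - 1
    if signed:
        sign = 1 << (op_width - 1)
        value = (value ^ sign) - sign        # sign-extend from op width
        lo = -(1 << (dest_elwidth - 1))
        hi = (1 << (dest_elwidth - 1)) - 1
    else:
        lo, hi = 0, (1 << dest_elwidth) - 1
    return max(lo, min(hi, value))
```

For a 16-bit load of `0x1234` into an 8-bit destination, `adjust_wid`
truncates to `0x34` whereas unsigned `clamp` saturates to `0xFF`.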
519
520 In order to respect Power v3.0B Scalar behaviour the memory side
521 is treated effectively as completely separate and distinct from SV
522 augmentation. This is primarily down to quirks surrounding LE/BE and
523 byte-reversal.
524
It is rather unfortunately possible to request an elwidth override on
the memory side which does not mesh with the overridden operation width:
this results in `UNDEFINED` behaviour. The reason is that attempting
a 64-bit `sv.ld` operation with a source elwidth override
of 8/16/32 would result in overlapping memory requests, particularly
on unit and element strided operations. Thus it is `UNDEFINED` when
the elwidth is smaller than the memory operation width. Examples include
`sv.lw/sw=16/els` which requests (overlapping) 4-byte memory reads offset
from each other at 2-byte intervals. Store likewise is also `UNDEFINED`
where the dest elwidth override is less than the operation width.
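
The overlap can be demonstrated with a few lines of Python. This sketch
assumes a positive stride and widths given in bytes; `requests_overlap`
is a hypothetical helper, not part of any specification.

```python
def requests_overlap(op_width: int, stride: int, vl: int) -> bool:
    """True if consecutive element accesses share any byte."""
    spans = [(i * stride, i * stride + op_width) for i in range(vl)]
    return any(prev_end > next_start
               for (_, prev_end), (next_start, _) in zip(spans, spans[1:]))
```

`requests_overlap(4, 2, 4)` models the `sv.lw/sw=16/els` case (4-byte
reads at 2-byte intervals) and returns `True`; with a 4-byte stride it
returns `False`.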
535
536 Note the following regarding the pseudocode to follow:
537
538 * `scalar identity behaviour` SV Context parameter conditions turn this
539 into a straight absolute fully-compliant Scalar v3.0B LD operation
540 * `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
541 rather than `ld`)
542 * `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
543 a "normal" part of Scalar v3.0B LD
544 * `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
545 as a "normal" part of Scalar v3.0B LD
546 * `svctx` specifies the SV Context and includes VL as well as
547 source and destination elwidth overrides.
548
549 Below is the pseudocode for Unit-Strided LD (which includes Vector
550 capability). Observe in particular that RA, as the base address in both
551 Immediate and Indexed LD/ST, does not have element-width overriding
552 applied to it.
553
554 Note that predication, predication-zeroing, and other modes except
555 saturation have all been removed, for clarity and simplicity:
556
```
# LD not VLD!
# this covers unit stride mode and a type of vector offset
function op_ld(RT, RA, op_width, imm_offs, svctx)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.unit/el-strided:
        # strange vector mode, compute 64 bit address which is
        # not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # unit / element stride mode, compute 64 bit address
        srcbase = get_polymorphed_reg(RA, 64, 0)
        # adjust for unit/el-stride
        srcbase += ....

    # read the underlying memory
    memread <= MEM(srcbase + imm_offs, op_width)

    # check saturation.
    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
```
593
594 Note above that the source elwidth is *not used at all* in LD-immediate.
595
596 For LD/Indexed, the key is that in the calculation of the Effective Address,
597 RA has no elwidth override but RB does. Pseudocode below is simplified
598 for clarity: predication and all modes except saturation are removed:
599
```
# LD not VLD! ld*rx if brev else ld*
function op_ld(RT, RA, RB, op_width, svctx, brev)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.el-strided:
        # RA not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # element stride mode, again RA not polymorphic
        srcbase = get_polymorphed_reg(RA, 64, 0)
    # RB *is* polymorphic
    offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
    # sign-extend
    if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)

    # takes care of (merges) processor LE/BE and ld/ldbrx
    bytereverse = brev XNOR MSR.LE

    # read the underlying memory
    memread <= MEM(srcbase + offs, op_width)

    # optionally performs byteswap at op width
    if (bytereverse):
        memread = byteswap(memread, op_width)

    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
```
642
643 ## Remapped LD/ST
644
645 In the [[sv/remap]] page the concept of "Remapping" is described. Whilst
646 it is expensive to set up (2 64-bit opcodes minimum) it provides a way to
647 arbitrarily perform 1D, 2D and 3D "remapping" of up to 64 elements worth
648 of LDs or STs. The usual interest in such re-mapping is for example in
649 separating out 24-bit RGB channel data into separate contiguous registers.
650 NEON covers this as shown in the diagram below:
651
652 ![Load/Store remap](/openpower/sv/load-store.svg)
653
654 REMAP easily covers this capability, and with dest elwidth overrides
655 and saturation may do so with built-in conversion that would normally
656 require additional width-extension, sign-extension and min/max Vectorised
657 instructions as post-processing stages.
658
659 Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
660 because the generic abstracted concept of "Remapping", when applied to
661 LD/ST, will give that same capability, with far more flexibility.
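
As a rough illustration of the RGB case, the Python sketch below
performs the index re-ordering that separates interleaved channel bytes
into three contiguous planes. It only demonstrates the re-ordering idea:
it does not model the actual REMAP schedules described in [[sv/remap]],
and the function name `deinterleave_rgb` is an assumption.

```python
def deinterleave_rgb(mem: bytes, base: int, n_pixels: int):
    """Split packed RGBRGB... bytes into [R plane, G plane, B plane]."""
    planes = [[], [], []]
    for i in range(3 * n_pixels):
        # destination order is channel-major; source order is pixel-major
        channel, pixel = divmod(i, n_pixels)
        planes[channel].append(mem[base + 3 * pixel + channel])
    return planes

packed = bytes([10, 20, 30, 11, 21, 31, 12, 22, 32])
planes = deinterleave_rgb(packed, 0, 3)
```

Here `planes` holds the red bytes `[10, 11, 12]`, then the green bytes,
then the blue bytes, each contiguous as in the diagram above.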
662
663 It is worth noting that Pack/Unpack Modes of SVSTATE, which may be
664 established through `svstep`, are also an easy way to perform regular
665 Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond that,
666 REMAP will need to be used.
667
668 --------
669
670 [[!tag standards]]
671
672 \newpage{}