1 # SV Load and Store
2
3 Links:
4
5 * <https://bugs.libre-soc.org/show_bug.cgi?id=561>
6 * <https://bugs.libre-soc.org/show_bug.cgi?id=572>
7 * <https://bugs.libre-soc.org/show_bug.cgi?id=571>
8 * <https://bugs.libre-soc.org/show_bug.cgi?id=940> post autoincrement mode
9 * <https://bugs.libre-soc.org/show_bug.cgi?id=1047> Data-Dependent Fail-First
10 * <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
11 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
12 * [[ldst/discussion]]
13
14 ## Rationale
15
16 All Vector ISAs dating back fifty years have extensive and comprehensive
17 Load and Store operations that go far beyond the capabilities of Scalar
18 RISC and most CISC processors, yet at their heart on an individual element
19 basis may be found to be no different from RISC Scalar equivalents.
20
The resource savings from Vector LD/ST are significant and stem
from the fact that one single instruction can trigger a dozen (or, in
some microarchitectures such as Cray or NEC SX Aurora, hundreds of)
element-level Memory accesses.
25
26 Additionally, and simply: if the Arithmetic side of an ISA supports
27 Vector Operations, then in order to keep the ALUs 100% occupied the
28 Memory infrastructure (and the ISA itself) correspondingly needs Vector
29 Memory Operations as well.
30
31 Vectorised Load and Store also presents an extra dimension (literally)
32 which creates scenarios unique to Vector applications, that a Scalar (and
33 even a SIMD) ISA simply never encounters. SVP64 endeavours to add the
34 modes typically found in *all* Scalable Vector ISAs, without changing the
35 behaviour of the underlying Base (Scalar) v3.0B operations in any way.
36 (The sole apparent exception is Post-Increment Mode on LD/ST-update
37 instructions)
38
39 ## Modes overview
40
Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different modes (a conceptual sketch of the
first three follows below):
43
44 * **fixed aka "unit" stride** - contiguous sequence with no gaps
45 * **element strided** - sequential but regularly offset, with gaps
46 * **vector indexed** - vector of base addresses and vector of offsets
47 * **Speculative fail-first** - where it makes sense to do so
48 * **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.
49
*Despite being constructed from Scalar LD/ST, none of these Modes exists
or makes sense in any Scalar ISA. They **only** exist in Vector ISAs.*
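
Before the detailed tables, a minimal Python sketch (purely illustrative,
not part of the specification; `gpr` simply holds example register
contents) of how the per-element Effective Address is formed in the first
three of these modes:

```
# Illustrative only: how each mode forms the per-element Effective Address.
def unit_stride_eas(gpr, ra, imm, vl, op_width):
    # contiguous: EA = (RA) + imm + i*op_width
    return [gpr[ra] + imm + i * op_width for i in range(vl)]

def element_stride_eas(gpr, ra, imm, vl):
    # sequential but regularly offset, with gaps: EA = (RA) + i*imm
    return [gpr[ra] + i * imm for i in range(vl)]

def vector_indexed_eas(gpr, ra, rb, vl):
    # vector of offsets: EA = (RA) + (RB+i)
    return [gpr[ra] + gpr[rb + i] for i in range(vl)]

gpr = {3: 0x1000, 10: 0, 11: 8, 12: 40}    # example register contents
print(unit_stride_eas(gpr, 3, 0, 4, 8))    # [4096, 4104, 4112, 4120]
print(element_stride_eas(gpr, 3, 24, 4))   # [4096, 4120, 4144, 4168]
print(vector_indexed_eas(gpr, 3, 10, 3))   # [4096, 4104, 4136]
```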
52
53 Also included in SVP64 LD/ST is both signed and unsigned Saturation,
54 as well as Element-width overrides and Twin-Predication.
55
56 Note also that Indexed [[sv/remap]] mode may be applied to both v3.0
57 LD/ST Immediate instructions *and* v3.0 LD/ST Indexed instructions.
58 LD/ST-Indexed should not be conflated with Indexed REMAP mode:
59 clarification is provided below.
60
61 **Determining the LD/ST Modes**
62
63 A minor complication (caused by the retro-fitting of modern Vector
64 features to a Scalar ISA) is that certain features do not exactly make
65 sense or are considered a security risk. Fail-first on Vector Indexed
66 would allow attackers to probe large numbers of pages from userspace,
67 where strided fail-first (by creating contiguous sequential LDs) does not.
68
69 In addition, reduce mode makes no sense. Realistically we need an
70 alternative table definition for [[sv/svp64]] `RM.MODE`. The following
71 modes make sense:
72
73 * saturation
74 * predicate-result would be useful but is lower priority than Data-Dependent Fail-First
75 * simple (no augmentation)
76 * fail-first (where Vector Indexed is banned)
77 * Signed Effective Address computation (Vector Indexed only)
78
More than that, however, it is necessary to fit the usual Vector ISA
capabilities onto both Power ISA LD/ST with immediate and onto LD/ST
Indexed. They present subtly different Mode tables which, due to lack
of space, have the following quirks:
83
84 * LD/ST Immediate has no individual control over src/dest zeroing,
85 whereas LD/ST Indexed does.
86 * LD/ST Immediate has saturation but LD/ST Indexed does not.
87
88 ## Format and fields
89
90 Fields used in tables below:
91
* **sz / dz** if predication is enabled will put zeros into the dest
(or into the src in the case of twin predication) when the predicate bit
is zero. Otherwise the element is ignored or skipped, depending on context.
95 * **zz**: both sz and dz are set equal to this flag.
96 * **inv CR bit** just as in branches (BO) these bits allow testing of
97 a CR bit and whether it is set (inv=0) or unset (inv=1)
98 * **N** sets signed/unsigned saturation.
99 * **RC1** as if Rc=1, stores CRs *but not the result*
100 * **SEA** - Signed Effective Address, if enabled performs sign-extension on
101 registers that have been reduced due to elwidth overrides
* **PI** - post-increment mode (applies to LD/ST with update only).
The Effective Address utilised is always just RA, i.e. the computed
EA is stored in RA **after** it is actually used.
105 * **LF** - Load/Store Fail or Fault First: for any reason Load or Store Vectors
106 may be truncated to (at least) one element, and VL altered to indicate such.
107 * **VLi** - Inclusive Data-Dependent Fail-First: the failing element is included
108 in the Truncated Vector.
109 * **els** - Element-strided Mode: the element index (after REMAP)
110 is multiplied by the immediate offset (or Scalar RB for Indexed). Restrictions apply.
111
112 When VLi=0 on Store Operations the Memory update does **not** take place
113 on the element that failed. EA does **not** update into RA on Load/Store
114 with Update instructions either.
115
116 **LD/ST immediate**
117
118 The table for [[sv/svp64]] for `immed(RA)` which is `RM.MODE`
119 (bits 19:23 of `RM`) is:
120
121 | 0 | 1 | 2 | 3 4 | description |
122 |---|---| --- |---------|--------------------------- |
123 | 0 | 0 | 0 | zz els | simple mode |
124 | 0 | 0 | 1 | PI LF | post-increment and Fault-First |
125 | 1 | 0 | N | zz els | sat mode: N=0/1 u/s |
126 |VLi| 1 | inv | CR-bit | Rc=1: ffirst CR sel |
127 |VLi| 1 | inv | els RC1 | Rc=0: ffirst z/nonz |
128
129 The `els` bit is only relevant when `RA.isvec` is clear: this indicates
130 whether stride is unit or element:
131
```
if RA.isvec:
    svctx.ldstmode = indexed
elif els == 0:
    svctx.ldstmode = unitstride
elif immediate != 0:
    svctx.ldstmode = elementstride
```
140
An immediate of zero is a safety-valve to allow `LD-VSPLAT`: in effect
the multiplication of the element index by an immediate-offset of zero
results in reading from the exact same memory location, *even with a
Vector register*. (Normally this type of behaviour is reserved for the
mapreduce modes.)
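
A minimal Python sketch of the resulting `LD-VSPLAT` behaviour (an
illustration only: `gpr` and `mem_read` are hypothetical helpers, and the
authoritative pseudocode appears later in this section):

```
# LD-VSPLAT: scalar RA, immediate of zero, Vector RT.
# Every element reads the same Effective Address.
def ld_vsplat(gpr, mem_read, rt, ra, vl, op_width=8):
    ea = gpr[ra] + 0                 # immediate is zero: EA never changes
    value = mem_read(ea, op_width)   # may be read once and copied (non-cache-inhibited)
    for j in range(vl):
        gpr[rt + j] = value          # splat the single value across the Vector
```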
145
146 For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur just
147 the once and be copied, rather than hitting the Data Cache multiple
148 times with the same memory read at the same location. The benefit of
149 Cache-inhibited LD-splats is that it allows for memory-mapped peripherals
150 to have multiple data values read in quick succession and stored in
151 sequentially numbered registers (but, see Note below).
152
153 For non-cache-inhibited ST from a vector source onto a scalar destination:
154 with the Vector loop effectively creating multiple memory writes to
155 the same location, we can deduce that the last of these will be the
156 "successful" one. Thus, implementations are free and clear to optimise
157 out the overwriting STs, leaving just the last one as the "winner".
158 Bear in mind that predicate masks will skip some elements (in source
159 non-zeroing mode). Cache-inhibited ST operations on the other hand
160 **MUST** write out a Vector source multiple successive times to the exact
161 same Scalar destination. Just like Cache-inhibited LDs, multiple values
162 may be written out in quick succession to a memory-mapped peripheral
163 from sequentially-numbered registers.
164
165 Note that any memory location may be Cache-inhibited
166 (Power ISA v3.1, Book III, 1.6.1, p1033)
167
168 *Programmer's Note: an immediate also with a Scalar source as a "VSPLAT"
169 mode is simply not possible: there are not enough Mode bits. One single
170 Scalar Load operation may be used instead, followed by any arithmetic
171 operation (including a simple mv) in "Splat" mode.*
172
173 **LD/ST Indexed**
174
175 The modes for `RA+RB` indexed version are slightly different
176 but are the same `RM.MODE` bits (19:23 of `RM`):
177
178 | 0 | 1 | 2 | 3 4 | description |
179 |---|---| --- |---------|--------------------------- |
180 |els| 0 | SEA | dz sz | simple mode |
181 |VLi| 1 | inv | CR-bit | Rc=1: ffirst CR sel |
182 |VLi| 1 | inv | els RC1 | Rc=0: ffirst z/nonz |
183
184 Vector Indexed Strided Mode is qualified as follows:
185
```
if els and !RA.isvec and !RB.isvec:
    svctx.ldstmode = elementstride
```
190
191 A summary of the effect of Vectorisation of src or dest:
192
```
imm(RA)  RT.v  RA.v        no stride allowed
imm(RA)  RT.s  RA.v        no stride allowed
imm(RA)  RT.v  RA.s        stride-select allowed
imm(RA)  RT.s  RA.s        not vectorised
RA,RB    RT.v  {RA|RB}.v   Standard Indexed
RA,RB    RT.s  {RA|RB}.v   Indexed but single LD (no VSPLAT)
RA,RB    RT.v  {RA&RB}.s   VSPLAT possible. stride selectable
RA,RB    RT.s  {RA&RB}.s   not vectorised (scalar identity)
```
203
Signed Effective Address computation is only relevant for Vector Indexed
Mode, when elwidth overrides are applied. The source override applies to
RB: before adding RB to RA in order to calculate the Effective Address,
if SEA is set, RB is sign-extended from elwidth bits to the full 64 bits.
For other Modes (ffirst, saturate), all EA computation with elwidth
overrides is unsigned.
210
211 Note that cache-inhibited LD/ST when VSPLAT is activated will perform
212 **multiple** LD/ST operations, sequentially. Even with scalar src
213 a Cache-inhibited LD will read the same memory location *multiple
214 times*, storing the result in successive Vector destination registers.
This is because the cache-inhibit instructions are typically used to read
216 and write memory-mapped peripherals. If a genuine cache-inhibited
217 LD-VSPLAT is required then a single *scalar* cache-inhibited LD should
218 be performed, followed by a VSPLAT-augmented mv, copying the one *scalar*
219 value into multiple register destinations.
220
Note also that cache-inhibited VSPLAT with Data-Dependent Fail-First is
possible. This makes it possible, for example, to issue a large batch of
memory-mapped peripheral reads, stopping at the first NULL character and
truncating VL to that point. No branch is needed to issue that large
burst of LDs, which may be valuable in Embedded scenarios.
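
A sketch of that idea in Python (purely conceptual: `mmio_read` is a
stand-in for a cache-inhibited read of a memory-mapped peripheral, and the
CR test is reduced to a zero check):

```
# Cache-inhibited LD-VSPLAT combined with Data-Dependent Fail-First:
# the same peripheral address is read repeatedly, and VL is truncated
# at the first zero (NULL) value, with no branch in the issued sequence.
def ci_vsplat_ddffirst(gpr, mmio_read, rt, ra, vl, vli=False):
    for i in range(vl):
        data = mmio_read(gpr[ra])       # cache-inhibited: really read every time
        gpr[rt + i] = data
        if data == 0:                   # the "test" fails here
            return i + 1 if vli else i  # truncated VL (inclusive if VLi)
    return vl
```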
226
227 ## Vectorisation of Scalar Power ISA v3.0B
228
229 Scalar Power ISA Load/Store operations may be seen from [[isa/fixedload]]
230 and [[isa/fixedstore]] pseudocode to be of the form:
231
232 ```
233 lbux RT, RA, RB
234 EA <- (RA) + (RB)
235 RT <- MEM(EA)
236 ```
237
238 and for immediate variants:
239
240 ```
241 lb RT,D(RA)
242 EA <- RA + EXTS(D)
243 RT <- MEM(EA)
244 ```
245
246 Thus in the first example, the source registers may each be independently
247 marked as scalar or vector, and likewise the destination; in the second
248 example only the one source and one dest may be marked as scalar or
249 vector.
250
251 Thus we can see that Vector Indexed may be covered, and, as demonstrated
252 with the pseudocode below, the immediate can be used to give unit
253 stride or element stride. With there being no way to tell which from
254 the Power v3.0B Scalar opcode alone, the choice is provided instead by
255 the SV Context.
256
```
# LD not VLD! format - ldop RT, immed(RA)
# op_width: lb=1, lh=2, lw=4, ld=8
op_load(RT, RA, op_width, immed, svctx, RAupdate):
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, u=0; i < VL && j < VL;):
    # skip nonpredicated elements
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if postinc:
        offs = 0; # added afterwards
        if RA.isvec: srcbase = ireg[RA+i]
        else         srcbase = ireg[RA]
    elif svctx.ldstmode == elementstride:
        # element stride mode
        srcbase = ireg[RA]
        offs = i * immed              # j*immed for a ST
    elif svctx.ldstmode == unitstride:
        # unit stride mode
        srcbase = ireg[RA]
        offs = immed + (i * op_width) # j*op_width for ST
    elif RA.isvec:
        # quirky Vector indexed mode but with an immediate
        srcbase = ireg[RA+i]
        offs = immed;
    else
        # standard scalar mode (but predicated)
        # no stride multiplier means VSPLAT mode
        srcbase = ireg[RA]
        offs = immed

    # compute EA
    EA = srcbase + offs
    # load from memory
    ireg[RT+j] <= MEM[EA];
    # check post-increment of EA
    if postinc: EA = srcbase + immed;
    # update RA?
    if RAupdate: ireg[RAupdate+u] = EA;
    if (!RT.isvec)
        break # destination scalar, end now
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RT.isvec) j++;
```
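
As a quick numeric check of the offset selection above (illustrative
Python, values chosen arbitrarily): for an `lw` (op_width=4) with an
immediate of 8, the first four per-element offsets are:

```
# offs per element index i, as computed in the pseudocode above
immed, op_width = 8, 4
print([i * immed for i in range(4)])             # elementstride: [0, 8, 16, 24]
print([immed + i * op_width for i in range(4)])  # unitstride:    [8, 12, 16, 20]
print([immed for i in range(4)])                 # scalar/VSPLAT: [8, 8, 8, 8]
```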
304
305 Indexed LD is:
306
```
# format: ldop RT, RA, RB
function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
    # skip nonpredicated RA, RB and RT
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RB.isvec) while (!(ps & 1<<k)) k++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if svctx.ldstmode == elementstride:
        EA = ireg[RA] + ireg[RB]*j   # register-strided
    else
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
    if RAupdate: ireg[RAupdate+u] = EA
    ireg[RT+j] <= MEM[EA];
    if (!RT.isvec)
        break # destination scalar, end immediately
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RB.isvec) k++;
    if (RT.isvec) j++;
```
331
332 Note that Element-Strided uses the Destination Step because with both
333 sources being Scalar as a prerequisite condition of activation of
334 Element-Stride Mode, the source step (being Scalar) would never advance.
335
336 Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
337 mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source. This is because there is room in svp64 to extend RA-as-src
339 as well as RA-as-dest, both independently as scalar or vector *and*
340 independently extending their range.
341
342 *Programmer's note: being able to set RA-as-a-source as separate from
343 RA-as-a-destination as Scalar is **extremely valuable** once it is
344 remembered that Simple-V element operations must be in Program Order,
345 especially in loops, for saving on multiple address computations. Care
346 does have to be taken however that RA-as-src is not overwritten by
347 RA-as-dest unless intentionally desired, especially in element-strided
348 Mode.*
349
350 ## LD/ST Indexed vs Indexed REMAP
351
352 Unfortunately the word "Indexed" is used twice in completely different
353 contexts, potentially causing confusion.
354
* Instructions of the form `ld RT,RA,RB` have existed in the Power ISA since
its creation: these are called "LD/ST Indexed" instructions and their
name and meaning is well-established.
358 * There now exists, in Simple-V, a [[sv/remap]] mode called "Indexed"
359 Mode that can be applied to *any* instruction **including those
360 named LD/ST Indexed**.
361
Whilst it may be costly in terms of register reads to allow REMAP Indexed
Mode to be applied to any Vectorised LD/ST Indexed operation such as
`sv.ld *RT,RA,*RB`, and whilst it might be misleadingly labelled as
redundant, firstly the strict application of the RISC Paradigm that
Simple-V follows makes it awkward to consider *preventing* the application
of Indexed REMAP to such operations, and secondly the two are not actually
the same at all.
368
369 Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB`
370 effectively performs an *in-place* re-ordering of the offsets, RB.
371 To achieve the same effect without Indexed REMAP would require taking
372 a *copy* of the Vector of offsets starting at RB, manually explicitly
373 reordering them, and finally using the copy of re-ordered offsets in a
374 non-REMAP'ed `sv.ld`. Using non-strided LD as an example, pseudocode
375 showing what actually occurs, where the pseudocode for `indexed_remap`
376 may be found in [[sv/remap]]:
377
```
# sv.ld *RT,RA,*RB with Index REMAP applied to RB
for i in 0..VL-1:
    if remap.indexed:
        rb_idx = indexed_remap(i)  # remap
    else:
        rb_idx = i                 # use the index as-is
    EA = GPR(RA) + GPR(RB+rb_idx)
    GPR(RT+i) = MEM(EA, 8)
```
388
389 Thus it can be seen that the use of Indexed REMAP saves copying
390 and manual reordering of the Vector of RB offsets.
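
The same equivalence can be written out in Python. This is an illustration
only: `indexed_remap` is stubbed out here as an arbitrary fixed permutation
rather than the real algorithm described in [[sv/remap]], and `gpr` /
`mem_read` are hypothetical helpers:

```
# Indexed REMAP on RB versus manually copying and re-ordering the offsets.
def indexed_remap(i, perm=(3, 1, 0, 2)):       # stand-in permutation only
    return perm[i]

def ld_indexed_with_remap(gpr, mem_read, rt, ra, rb, vl):
    for i in range(vl):
        rb_idx = indexed_remap(i)              # in-place re-ordering of offsets
        gpr[rt + i] = mem_read(gpr[ra] + gpr[rb + rb_idx], 8)

def ld_indexed_manual_copy(gpr, mem_read, rt, ra, rb, rb_copy, vl):
    # without REMAP: copy the offsets, re-order the copy, then a plain sv.ld
    for i in range(vl):
        gpr[rb_copy + i] = gpr[rb + indexed_remap(i)]
    for i in range(vl):
        gpr[rt + i] = mem_read(gpr[ra] + gpr[rb_copy + i], 8)
```

Both routines leave identical values in RT; the REMAP form avoids the
extra copy of the offset Vector.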
391
392 ## LD/ST ffirst
393
394 LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
395 is not active) as an ordinary one, with all behaviour with respect to
Interrupts, Exceptions, Page Faults and Memory Management being identical
397 in every regard to Scalar v3.0 Power ISA LD/ST. However for elements
398 1 and above, if an exception would occur, then VL is **truncated**
399 to the previous element: the exception is **not** then raised because
400 the LD/ST that would otherwise have caused an exception is *required*
401 to be cancelled. Additionally an implementor may choose to truncate VL
402 for any arbitrary reason *except for the very first*.
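
Expressed as a Python sketch (illustrative only: an exception from the
hypothetical `mem_read` helper stands in for a Page Fault or other trap):

```
# Fail/Fault-First LD: element 0 behaves exactly like a Scalar LD (any
# exception is raised as normal); an exception on element 1 or above
# instead truncates VL to the elements already completed and is not raised.
def ffirst_ld(gpr, mem_read, rt, ra, imm, vl, op_width=8):
    for i in range(vl):
        try:
            gpr[rt + i] = mem_read(gpr[ra] + imm + i * op_width, op_width)
        except MemoryError:      # stand-in for a Page Fault / trap
            if i == 0:
                raise            # first element: ordinary Scalar behaviour
            return i             # truncate VL; the exception is not raised
    return vl
```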
403
404 ffirst LD/ST to multiple pages via a Vectorised Index base is
405 considered a security risk due to the abuse of probing multiple
406 pages in rapid succession and getting speculative feedback on which
407 pages would fail. Therefore Vector Indexed LD/ST is prohibited
408 entirely, and the Mode bit instead used for element-strided LD/ST.
409 See <https://bugs.libre-soc.org/show_bug.cgi?id=561>
410
```
for(i = 0; i < VL; i++)
    reg[rt + i] = mem[reg[ra] + i * reg[rb]];
```
415
416 High security implementations where any kind of speculative probing of
417 memory pages is considered a risk should take advantage of the fact
418 that implementations may truncate VL at any point, without requiring
419 software to be rewritten and made non-portable. Such implementations may
420 choose to *always* set VL=1 which will have the effect of terminating
421 any speculative probing (and also adversely affect performance), but
422 will at least not require applications to be rewritten.
423
Low-performance simpler hardware implementations may also choose to
always set VL=1 as the bare minimum compliant implementation of LD/ST
Fail-First. It is however critically important to remember that the first
element LD/ST **MUST** be treated as an ordinary LD/ST, i.e. **MUST**
raise exceptions exactly like an ordinary LD/ST.
429
430 For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
431 for any implementation-specific reason. For example: it is perfectly
432 reasonable for implementations to alter VL when ffirst LD or ST operations
433 are initiated on a nonaligned boundary, such that within a loop the
434 subsequent iteration of that loop begins the following ffirst LD/ST
435 operations on an aligned boundary such as the beginning of a cache line,
436 or beginning of a Virtual Memory page. Likewise, to reduce workloads or
437 balance resources.
438
439 Vertical-First Mode is slightly strange in that only one element at a time
440 is ever executed anyway. Given that programmers may legitimately choose
441 to alter srcstep and dststep in non-sequential order as part of explicit
442 loops, it is neither possible nor safe to make speculative assumptions
443 about future LD/STs. Therefore, Fail-First LD/ST in Vertical-First is
444 `UNDEFINED`. This is very different from Arithmetic (Data-dependent)
445 FFirst where Vertical-First Mode is fully deterministic, not speculative.
446
447 ## Data-Dependent Fail-First (not Fail/Fault-First)
448
449 Not to be confused with Fail/Fault First, Data-Fail-First performs an
450 additional check on the data into a Condition Register Field and if a test
451 on the CR Field fails then VL is truncated and further looping terminates.
452 This is precisely the same as Arithmetic Data-Dependent Fail-First,
453 the only difference being that the result comes from the LD/ST.
454
In the case of Store operations there is a quirk when VLi ("VL-inclusive")
is clear. Bear in mind the criterion is that the truncated Vector of
results, when VLi is clear, must all pass the "test", but when VLi is
set the *current failed test* is permitted to be included. Thus, the
actual update (store) to Memory is **not permitted to take place** should
the test fail. Therefore, on testing the value to be stored, and after
updating the corresponding CR Field Element, when VLi=0 and the test
fails, the Memory store must **not** occur.
463
464 Additionally, when VLi=0 and a test fails then RA does **not** receive a
465 copy of the Effective Address. Hardware implementations with Out-of-Order
466 Micro-Architectures should use speculative Shadow-Hold and Cancellation
467 when the test fails.
468
469 By contrast if VLi=1 and the test fails, Store may proceed *and then*
470 looping terminates. In this way, when non-Inclusive, the Vector of
471 Truncated results contains only Stores that passed the test (and RA=EA
472 updates if any), and when Inclusive the Vector of Truncated results
473 contains the first-failed data.
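
A sketch of the Store behaviour in Python (conceptual only: the CR Field
test is reduced to a simple non-zero check, `gpr` and `mem_write` are
hypothetical helpers, and CR Field updates and RA=EA updates are omitted
for brevity):

```
# Data-Dependent Fail-First Store: with VLi=0 the element failing the test
# is *not* stored and VL excludes it; with VLi=1 the store still proceeds
# and VL includes the failing element.
def ddffirst_store(gpr, mem_write, rs, ra, imm, vl, vli, op_width=8):
    for i in range(vl):
        data = gpr[rs + i]
        test_ok = (data != 0)        # stand-in for the CR Field test
        if not test_ok and not vli:
            return i                 # truncate VL: failing store suppressed
        mem_write(gpr[ra] + imm + i * op_width, data, op_width)
        if not test_ok:
            return i + 1             # VLi=1: failing store included, then stop
    return vl
```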
474
475 Below is an example of loading the starting addresses of Linked-List
476 nodes. If VLi=1 it will load the NULL pointer into the Vector of results.
477 If however VLi=0 it will *exclude* the NULL pointer by truncating VL to
478 one Element earlier.
479
```
RT = 1   # vec - deliberately overlaps by one with RA
RA = 0   # vec - first one is valid, contains ptr
imm = 8  # offset_of(ptr->next)
for i in range(VL):
    EA = GPR(RA+i) + imm          # ptr + offset(next)
    data = MEM(EA, 8)             # 64-bit address of ptr->next
    GPR(RT+i) = data              # happens to be read on next loop!
    # was a normal ld up to this point. now the Data-Fail-First
    CR.field(i) = conditions(data)
    if CR.field(i).EQ == testbit: # check if zero
        if VLi then VL = i+1      # update VL, inclusive
        else        VL = i        # update VL
        break                     # stop looping
```
495
496 **Data-Dependent Fault-First on Store-Conditional (Rc=1)**
497
There are very few instructions that allow Rc=1 for Load/Store:
one of those is the `stdcx.` and other Atomic Store-Conditional
instructions. With Simple-V being a loop around Scalar instructions
strictly obeying Scalar Program Order, a Fail-First loop on an
Atomic Store-Conditional will always fail the second and all subsequent
Store-Conditional instructions in Horizontal-First Mode, because
Load-Reservation and Store-Conditional are required to be executed
in pairs.
506
507 By contrast, in Vertical-First Mode it is in fact possible to issue
508 the pairs, and consequently allowing Vectorised Data-Dependent Fail-First is
509 useful. Care should be taken however when VL is truncated in Vertical-First
510 Mode.
511
512 ## LOAD/STORE Elwidths <a name="elwidth"></a>
513
514 Loads and Stores are almost unique in that the Power Scalar ISA
515 provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
516 others like it provide an explicit operation width. There are therefore
517 *three* widths involved:
518
519 * operation width (lb=8, lh=16, lw=32, ld=64)
520 * src element width override (8/16/32/default)
521 * destination element width override (8/16/32/default)
522
Some care is therefore needed to express and make clear the
transformations, which are expressly in this order (the final two steps
are sketched in code after the list):
525
526 * Calculate the Effective Address from RA at full width
527 but (on Indexed Load) allow srcwidth overrides on RB
528 * Load at the operation width (lb/lh/lw/ld) as usual
529 * byte-reversal as usual
530 * Non-saturated mode:
531 - zero-extension or truncation from operation width to dest elwidth
532 - place result in destination at dest elwidth
533 * Saturated mode:
534 - Sign-extension or truncation from operation width to dest width
535 - signed/unsigned saturation down to dest elwidth
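
Below is a minimal Python sketch of those final steps. The names
`adjust_wid` and `clamp` match the pseudocode later in this section, but
the bodies here are assumptions for illustration only: widths are taken to
be in bits and only the unsigned case is shown:

```
# Illustrative only: non-saturated truncation versus unsigned saturation.
def adjust_wid(value, op_width, dest_elwidth):
    # non-saturated: truncate (or zero-extend) to the dest element width
    return value & ((1 << dest_elwidth) - 1)

def clamp(value, op_width, dest_elwidth):
    # saturated: clamp to the maximum representable unsigned dest value
    return min(value, (1 << dest_elwidth) - 1)

print(hex(adjust_wid(0x12345, 32, 16)))  # 0x2345 (wraps / truncates)
print(hex(clamp(0x12345, 32, 16)))       # 0xffff (saturates)
```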
536
537 In order to respect Power v3.0B Scalar behaviour the memory side
538 is treated effectively as completely separate and distinct from SV
539 augmentation. This is primarily down to quirks surrounding LE/BE and
540 byte-reversal.
541
542 It is rather unfortunately possible to request an elwidth override on
543 the memory side which does not mesh with the overridden operation width:
544 these result in `UNDEFINED` behaviour. The reason is that the effect
545 of attempting a 64-bit `sv.ld` operation with a source elwidth override
546 of 8/16/32 would result in overlapping memory requests, particularly
547 on unit and element strided operations. Thus it is `UNDEFINED` when
548 the elwidth is smaller than the memory operation width. Examples include
549 `sv.lw/sw=16/els` which requests (overlapping) 4-byte memory reads offset
550 from each other at 2-byte intervals. Store likewise is also `UNDEFINED`
551 where the dest elwidth override is less than the operation width.
552
553 Note the following regarding the pseudocode to follow:
554
555 * `scalar identity behaviour` SV Context parameter conditions turn this
556 into a straight absolute fully-compliant Scalar v3.0B LD operation
557 * `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
558 rather than `ld`)
559 * `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
560 a "normal" part of Scalar v3.0B LD
561 * `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
562 as a "normal" part of Scalar v3.0B LD
563 * `svctx` specifies the SV Context and includes VL as well as
564 source and destination elwidth overrides.
565
566 Below is the pseudocode for Unit-Strided LD (which includes Vector
567 capability). Observe in particular that RA, as the base address in both
568 Immediate and Indexed LD/ST, does not have element-width overriding
569 applied to it.
570
571 Note that predication, predication-zeroing, and other modes except
572 saturation have all been removed, for clarity and simplicity:
573
```
# LD not VLD!
# this covers unit stride mode and a type of vector offset
function op_ld(RT, RA, op_width, imm_offs, svctx)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.unit/el-strided:
        # strange vector mode, compute 64 bit address which is
        # not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # unit / element stride mode, compute 64 bit address
        srcbase = get_polymorphed_reg(RA, 64, 0)
        # adjust for unit/el-stride
        srcbase += ....

    # read the underlying memory
    memread <= MEM(srcbase + imm_offs, op_width)

    # check saturation.
    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
```
610
611 Note above that the source elwidth is *not used at all* in LD-immediate.
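
The helpers `get_polymorphed_reg` and `set_polymorphed_reg` are used above
without being defined in this section. A minimal Python sketch of plausible
semantics (an assumption, not the normative definition: the register file
is modelled as a flat little-endian byte array of 64-bit registers, with
elwidth given in bits):

```
# Assumed semantics only: element accesses into an LE-ordered register file.
REGFILE = bytearray(128 * 8)             # 128 x 64-bit GPRs as raw bytes

def get_polymorphed_reg(reg, elwidth, element):
    nbytes = elwidth // 8
    offs = reg * 8 + element * nbytes
    return int.from_bytes(REGFILE[offs:offs + nbytes], "little")

def set_polymorphed_reg(reg, elwidth, element, value):
    nbytes = elwidth // 8
    offs = reg * 8 + element * nbytes
    value &= (1 << elwidth) - 1
    REGFILE[offs:offs + nbytes] = value.to_bytes(nbytes, "little")

set_polymorphed_reg(3, 16, 2, 0xBEEF)       # element 2 of r3 at elwidth=16
print(hex(get_polymorphed_reg(3, 16, 2)))   # 0xbeef
print(hex(get_polymorphed_reg(3, 64, 0)))   # 0xbeef00000000
```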
612
613 For LD/Indexed, the key is that in the calculation of the Effective Address,
614 RA has no elwidth override but RB does. Pseudocode below is simplified
615 for clarity: predication and all modes except saturation are removed:
616
```
# LD not VLD! ld*rx if brev else ld*
function op_ld(RT, RA, RB, op_width, svctx, brev)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.el-strided:
        # RA not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # element stride mode, again RA not polymorphic
        srcbase = get_polymorphed_reg(RA, 64, 0)
    # RB *is* polymorphic
    offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
    # sign-extend
    if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)

    # takes care of (merges) processor LE/BE and ld/ldbrx
    bytereverse = brev XNOR MSR.LE

    # read the underlying memory
    memread <= MEM(srcbase + offs, op_width)

    # optionally performs byteswap at op width
    if (bytereverse):
        memread = byteswap(memread, op_width)

    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
```
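
The `brev XNOR MSR.LE` merge above can be spelled out with a short Python
check (illustrative; `op_width` is taken to be in bytes here):

```
# A byteswap at the operation width is needed exactly when brev == MSR.LE:
# ldbrx on a Little-Endian processor, or a plain ld on a Big-Endian one.
def byteswap(value, op_width):
    return int.from_bytes(value.to_bytes(op_width, "big"), "little")

for brev in (0, 1):                          # 0 = ld, 1 = ldbrx
    for msr_le in (0, 1):                    # 0 = Big-Endian, 1 = Little-Endian
        bytereverse = int(not (brev ^ msr_le))   # brev XNOR MSR.LE
        print(brev, msr_le, bytereverse)
```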
659
660 ## Remapped LD/ST
661
662 In the [[sv/remap]] page the concept of "Remapping" is described. Whilst
663 it is expensive to set up (2 64-bit opcodes minimum) it provides a way to
664 arbitrarily perform 1D, 2D and 3D "remapping" of up to 64 elements worth
665 of LDs or STs. The usual interest in such re-mapping is for example in
666 separating out 24-bit RGB channel data into separate contiguous registers.
667 NEON covers this as shown in the diagram below:
668
669 ![Load/Store remap](/openpower/sv/load-store.svg)
670
671 REMAP easily covers this capability, and with dest elwidth overrides
672 and saturation may do so with built-in conversion that would normally
673 require additional width-extension, sign-extension and min/max Vectorised
674 instructions as post-processing stages.
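
As a purely conceptual Python illustration of the RGB example (not REMAP
itself, whose algorithm is described in [[sv/remap]]), the effect to be
achieved is a strided de-interleave; REMAP applied to a single Vectorised
load produces the same result without the three separate loops written
out here:

```
# What "Structure Packing" of packed R,G,B bytes achieves, written longhand.
def deinterleave_rgb(mem, base, npixels):
    r = [mem[base + 3 * i + 0] for i in range(npixels)]
    g = [mem[base + 3 * i + 1] for i in range(npixels)]
    b = [mem[base + 3 * i + 2] for i in range(npixels)]
    return r, g, b

mem = bytes(range(12))               # 4 pixels of packed R,G,B bytes
print(deinterleave_rgb(mem, 0, 4))   # ([0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11])
```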
675
676 Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
677 because the generic abstracted concept of "Remapping", when applied to
678 LD/ST, will give that same capability, with far more flexibility.
679
680 It is worth noting that Pack/Unpack Modes of SVSTATE, which may be
681 established through `svstep`, are also an easy way to perform regular
682 Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond that,
683 REMAP will need to be used.
684
685 --------
686
687 [[!tag standards]]
688
689 \newpage{}