1 # SV Load and Store
2
3 Links:
4
5 * <https://bugs.libre-soc.org/show_bug.cgi?id=561>
6 * <https://bugs.libre-soc.org/show_bug.cgi?id=572>
7 * <https://bugs.libre-soc.org/show_bug.cgi?id=571>
8 * <https://bugs.libre-soc.org/show_bug.cgi?id=940> post autoincrement mode
9 * <https://bugs.libre-soc.org/show_bug.cgi?id=1047> Data-Dependent Fail-First
10 * <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
11 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
12 * [[ldst/discussion]]
13
14 ## Rationale
15
16 All Vector ISAs dating back fifty years have extensive and comprehensive
17 Load and Store operations that go far beyond the capabilities of Scalar
18 RISC and most CISC processors, yet at their heart on an individual element
19 basis may be found to be no different from RISC Scalar equivalents.
20
The resource savings from Vector LD/ST are significant and stem
from the fact that one single instruction can trigger a dozen (or, in
some microarchitectures such as Cray or NEC SX Aurora, hundreds of)
element-level Memory accesses.
25
26 Additionally, and simply: if the Arithmetic side of an ISA supports
27 Vector Operations, then in order to keep the ALUs 100% occupied the
28 Memory infrastructure (and the ISA itself) correspondingly needs Vector
29 Memory Operations as well.
30
31 Vectorised Load and Store also presents an extra dimension (literally)
32 which creates scenarios unique to Vector applications, that a Scalar (and
33 even a SIMD) ISA simply never encounters. SVP64 endeavours to add the
34 modes typically found in *all* Scalable Vector ISAs, without changing the
35 behaviour of the underlying Base (Scalar) v3.0B operations in any way.
36 (The sole apparent exception is Post-Increment Mode on LD/ST-update
37 instructions)
38
39 ## Modes overview
40
Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different modes (a conceptual sketch of the
first three follows below):
43
44 * **fixed aka "unit" stride** - contiguous sequence with no gaps
45 * **element strided** - sequential but regularly offset, with gaps
46 * **vector indexed** - vector of base addresses and vector of offsets
47 * **Speculative fail-first** - where it makes sense to do so
48 * **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.
49
*Despite being constructed from Scalar LD/ST, none of these Modes exists
or makes sense in any Scalar ISA. They **only** exist in Vector ISAs.*
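
Before the detailed tables, a minimal Python sketch (purely illustrative,
not part of the specification; `gpr` simply holds example register
contents) of how the per-element Effective Address is formed in the first
three of these modes:

```
# Illustrative only: how each mode forms the per-element Effective Address.
def unit_stride_eas(gpr, ra, imm, vl, op_width):
    # contiguous: EA = (RA) + imm + i*op_width
    return [gpr[ra] + imm + i * op_width for i in range(vl)]

def element_stride_eas(gpr, ra, imm, vl):
    # sequential but regularly offset, with gaps: EA = (RA) + i*imm
    return [gpr[ra] + i * imm for i in range(vl)]

def vector_indexed_eas(gpr, ra, rb, vl):
    # vector of offsets: EA = (RA) + (RB+i)
    return [gpr[ra] + gpr[rb + i] for i in range(vl)]

gpr = {3: 0x1000, 10: 0, 11: 8, 12: 40}    # example register contents
print(unit_stride_eas(gpr, 3, 0, 4, 8))    # [4096, 4104, 4112, 4120]
print(element_stride_eas(gpr, 3, 24, 4))   # [4096, 4120, 4144, 4168]
print(vector_indexed_eas(gpr, 3, 10, 3))   # [4096, 4104, 4136]
```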
52
53 Also included in SVP64 LD/ST is both signed and unsigned Saturation,
54 as well as Element-width overrides and Twin-Predication.
55
56 Note also that Indexed [[sv/remap]] mode may be applied to both v3.0
57 LD/ST Immediate instructions *and* v3.0 LD/ST Indexed instructions.
58 LD/ST-Indexed should not be conflated with Indexed REMAP mode:
59 clarification is provided below.
60
61 **Determining the LD/ST Modes**
62
63 A minor complication (caused by the retro-fitting of modern Vector
64 features to a Scalar ISA) is that certain features do not exactly make
65 sense or are considered a security risk. Fail-first on Vector Indexed
66 would allow attackers to probe large numbers of pages from userspace,
67 where strided fail-first (by creating contiguous sequential LDs) does not.
68
69 In addition, reduce mode makes no sense. Realistically we need an
70 alternative table definition for [[sv/svp64]] `RM.MODE`. The following
71 modes make sense:
72
73 * saturation
74 * predicate-result would be useful but is lower priority than Data-Dependent Fail-First
75 * simple (no augmentation)
76 * fail-first (where Vector Indexed is banned)
77 * Signed Effective Address computation (Vector Indexed only)
78
More than that, however, it is necessary to fit the usual Vector ISA
capabilities onto both Power ISA LD/ST with immediate and onto LD/ST
Indexed. They present subtly different Mode tables which, due to lack
of space, have the following quirks:
83
84 * LD/ST Immediate has no individual control over src/dest zeroing,
85 whereas LD/ST Indexed does.
86 * LD/ST Immediate has saturation but LD/ST Indexed does not.
87
88 ## Format and fields
89
90 Fields used in tables below:
91
* **sz / dz** if predication is enabled will put zeros into the dest
(or into the src in the case of twin predication) when the predicate bit
is zero. Otherwise the element is ignored or skipped, depending on context.
95 * **zz**: both sz and dz are set equal to this flag.
96 * **inv CR bit** just as in branches (BO) these bits allow testing of
97 a CR bit and whether it is set (inv=0) or unset (inv=1)
98 * **N** sets signed/unsigned saturation.
99 * **RC1** as if Rc=1, stores CRs *but not the result*
100 * **SEA** - Signed Effective Address, if enabled performs sign-extension on
101 registers that have been reduced due to elwidth overrides
* **PI** - post-increment mode (applies to LD/ST with update only).
The Effective Address utilised is always just RA, i.e. the computed
EA is stored in RA **after** it is actually used.
105 * **LF** - Load/Store Fail or Fault First: for any reason Load or Store Vectors
106 may be truncated to (at least) one element, and VL altered to indicate such.
107 * **VLi** - Inclusive Data-Dependent Fail-First: the failing element is included
108 in the Truncated Vector.
109 * **els** - Element-strided Mode: the element index (after REMAP)
110 is multiplied by the immediate offset (or Scalar RB for Indexed). Restrictions apply.
111
112 When VLi=0 on Store Operations the Memory update does **not** take place
113 on the element that failed. EA does **not** update into RA on Load/Store
114 with Update instructions either.
115
116 **LD/ST immediate**
117
118 The table for [[sv/svp64]] for `immed(RA)` which is `RM.MODE`
119 (bits 19:23 of `RM`) is:
120
121 | 0 | 1 | 2 | 3 4 | description |
122 |---|---| --- |---------|--------------------------- |
123 | 0 | 0 | 0 | zz els | simple mode |
124 | 0 | 0 | 1 | PI LF | post-increment and Fault-First |
125 | 1 | 0 | N | zz els | sat mode: N=0/1 u/s |
126 |VLi| 1 | inv | CR-bit | Rc=1: ffirst CR sel |
127 |VLi| 1 | inv | els RC1 | Rc=0: ffirst z/nonz |
128
129 The `els` bit is only relevant when `RA.isvec` is clear: this indicates
130 whether stride is unit or element:
131
```
if RA.isvec:
    svctx.ldstmode = indexed
elif els == 0:
    svctx.ldstmode = unitstride
elif immediate != 0:
    svctx.ldstmode = elementstride
```
140
An immediate of zero is a safety-valve to allow `LD-VSPLAT`: in effect
the multiplication of the element index by an immediate-offset of zero
results in reading from the exact same memory location, *even with a
Vector register*. (Normally this type of behaviour is reserved for the
mapreduce modes.)
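
A minimal Python sketch of the resulting `LD-VSPLAT` behaviour (an
illustration only: `gpr` and `mem_read` are hypothetical helpers, and the
authoritative pseudocode appears later in this section):

```
# LD-VSPLAT: scalar RA, immediate of zero, Vector RT.
# Every element reads the same Effective Address.
def ld_vsplat(gpr, mem_read, rt, ra, vl, op_width=8):
    ea = gpr[ra] + 0                 # immediate is zero: EA never changes
    value = mem_read(ea, op_width)   # may be read once and copied (non-cache-inhibited)
    for j in range(vl):
        gpr[rt + j] = value          # splat the single value across the Vector
```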
145
146 For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur just
147 the once and be copied, rather than hitting the Data Cache multiple
148 times with the same memory read at the same location. The benefit of
149 Cache-inhibited LD-splats is that it allows for memory-mapped peripherals
150 to have multiple data values read in quick succession and stored in
151 sequentially numbered registers (but, see Note below).
152
153 For non-cache-inhibited ST from a vector source onto a scalar destination:
154 with the Vector loop effectively creating multiple memory writes to
155 the same location, we can deduce that the last of these will be the
156 "successful" one. Thus, implementations are free and clear to optimise
157 out the overwriting STs, leaving just the last one as the "winner".
158 Bear in mind that predicate masks will skip some elements (in source
159 non-zeroing mode). Cache-inhibited ST operations on the other hand
160 **MUST** write out a Vector source multiple successive times to the exact
161 same Scalar destination. Just like Cache-inhibited LDs, multiple values
162 may be written out in quick succession to a memory-mapped peripheral
163 from sequentially-numbered registers.
164
165 Note that any memory location may be Cache-inhibited
166 (Power ISA v3.1, Book III, 1.6.1, p1033)
167
168 *Programmer's Note: an immediate also with a Scalar source as a "VSPLAT"
169 mode is simply not possible: there are not enough Mode bits. One single
170 Scalar Load operation may be used instead, followed by any arithmetic
171 operation (including a simple mv) in "Splat" mode.*
172
173 **LD/ST Indexed**
174
175 The modes for `RA+RB` indexed version are slightly different
176 but are the same `RM.MODE` bits (19:23 of `RM`):
177
178 | 0 | 1 | 2 | 3 4 | description |
179 |---|---| --- |---------|--------------------------- |
180 |els| 0 | SEA | dz sz | simple mode |
181 |VLi| 1 | inv | CR-bit | Rc=1: ffirst CR sel |
182 |VLi| 1 | inv | els RC1 | Rc=0: ffirst z/nonz |
183
184 Vector Indexed Strided Mode is qualified as follows:
185
```
if els and !RA.isvec and !RB.isvec:
    svctx.ldstmode = elementstride
```
190
191 A summary of the effect of Vectorisation of src or dest:
192
```
imm(RA)  RT.v  RA.v        no stride allowed
imm(RA)  RT.s  RA.v        no stride allowed
imm(RA)  RT.v  RA.s        stride-select allowed
imm(RA)  RT.s  RA.s        not vectorised
RA,RB    RT.v  {RA|RB}.v   Standard Indexed
RA,RB    RT.s  {RA|RB}.v   Indexed but single LD (no VSPLAT)
RA,RB    RT.v  {RA&RB}.s   VSPLAT possible. stride selectable
RA,RB    RT.s  {RA&RB}.s   not vectorised (scalar identity)
```
203
Signed Effective Address computation is only relevant for Vector Indexed
Mode, when elwidth overrides are applied. The source override applies to
RB: before adding RB to RA in order to calculate the Effective Address,
if SEA is set, RB is sign-extended from elwidth bits to the full 64 bits.
For other Modes (ffirst, saturate), all EA computation with elwidth
overrides is unsigned.
210
211 Note that cache-inhibited LD/ST when VSPLAT is activated will perform
212 **multiple** LD/ST operations, sequentially. Even with scalar src
213 a Cache-inhibited LD will read the same memory location *multiple
214 times*, storing the result in successive Vector destination registers.
This is because the cache-inhibit instructions are typically used to read
216 and write memory-mapped peripherals. If a genuine cache-inhibited
217 LD-VSPLAT is required then a single *scalar* cache-inhibited LD should
218 be performed, followed by a VSPLAT-augmented mv, copying the one *scalar*
219 value into multiple register destinations.
220
Note also that cache-inhibited VSPLAT with Data-Dependent Fail-First is
possible. This makes it possible, for example, to issue a large batch of
memory-mapped peripheral reads, stopping at the first NULL character and
truncating VL to that point. No branch is needed to issue that large
burst of LDs, which may be valuable in Embedded scenarios.
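
A sketch of that idea in Python (purely conceptual: `mmio_read` is a
stand-in for a cache-inhibited read of a memory-mapped peripheral, and the
CR test is reduced to a zero check):

```
# Cache-inhibited LD-VSPLAT combined with Data-Dependent Fail-First:
# the same peripheral address is read repeatedly, and VL is truncated
# at the first zero (NULL) value, with no branch in the issued sequence.
def ci_vsplat_ddffirst(gpr, mmio_read, rt, ra, vl, vli=False):
    for i in range(vl):
        data = mmio_read(gpr[ra])       # cache-inhibited: really read every time
        gpr[rt + i] = data
        if data == 0:                   # the "test" fails here
            return i + 1 if vli else i  # truncated VL (inclusive if VLi)
    return vl
```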
226
227 ## Vectorisation of Scalar Power ISA v3.0B
228
229 Scalar Power ISA Load/Store operations may be seen from [[isa/fixedload]]
230 and [[isa/fixedstore]] pseudocode to be of the form:
231
232 ```
233 lbux RT, RA, RB
234 EA <- (RA) + (RB)
235 RT <- MEM(EA)
236 ```
237
238 and for immediate variants:
239
240 ```
241 lb RT,D(RA)
242 EA <- RA + EXTS(D)
243 RT <- MEM(EA)
244 ```
245
246 Thus in the first example, the source registers may each be independently
247 marked as scalar or vector, and likewise the destination; in the second
248 example only the one source and one dest may be marked as scalar or
249 vector.
250
251 Thus we can see that Vector Indexed may be covered, and, as demonstrated
252 with the pseudocode below, the immediate can be used to give unit
253 stride or element stride. With there being no way to tell which from
254 the Power v3.0B Scalar opcode alone, the choice is provided instead by
255 the SV Context.
256
```
# LD not VLD! format - ldop RT, immed(RA)
# op_width: lb=1, lh=2, lw=4, ld=8
op_load(RT, RA, op_width, immed, svctx, RAupdate):
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, u=0; i < VL && j < VL;):
    # skip nonpredicated elements
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if postinc:
        offs = 0; # added afterwards
        if RA.isvec: srcbase = ireg[RA+i]
        else         srcbase = ireg[RA]
    elif svctx.ldstmode == elementstride:
        # element stride mode
        srcbase = ireg[RA]
        offs = i * immed              # j*immed for a ST
    elif svctx.ldstmode == unitstride:
        # unit stride mode
        srcbase = ireg[RA]
        offs = immed + (i * op_width) # j*op_width for ST
    elif RA.isvec:
        # quirky Vector indexed mode but with an immediate
        srcbase = ireg[RA+i]
        offs = immed;
    else
        # standard scalar mode (but predicated)
        # no stride multiplier means VSPLAT mode
        srcbase = ireg[RA]
        offs = immed

    # compute EA
    EA = srcbase + offs
    # load from memory
    ireg[RT+j] <= MEM[EA];
    # check post-increment of EA
    if postinc: EA = srcbase + immed;
    # update RA?
    if RAupdate: ireg[RAupdate+u] = EA;
    if (!RT.isvec)
        break # destination scalar, end now
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RT.isvec) j++;
```
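
As a quick numeric check of the offset selection above (illustrative
Python, values chosen arbitrarily): for an `lw` (op_width=4) with an
immediate of 8, the first four per-element offsets are:

```
# offs per element index i, as computed in the pseudocode above
immed, op_width = 8, 4
print([i * immed for i in range(4)])             # elementstride: [0, 8, 16, 24]
print([immed + i * op_width for i in range(4)])  # unitstride:    [8, 12, 16, 20]
print([immed for i in range(4)])                 # scalar/VSPLAT: [8, 8, 8, 8]
```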
304
305 Indexed LD is:
306
```
# format: ldop RT, RA, RB
function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
    # skip nonpredicated RA, RB and RT
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RB.isvec) while (!(ps & 1<<k)) k++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if svctx.ldstmode == elementstride:
        EA = ireg[RA] + ireg[RB]*j   # register-strided
    else
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
    if RAupdate: ireg[RAupdate+u] = EA
    ireg[RT+j] <= MEM[EA];
    if (!RT.isvec)
        break # destination scalar, end immediately
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RB.isvec) k++;
    if (RT.isvec) j++;
```
331
332 Note that Element-Strided uses the Destination Step because with both
333 sources being Scalar as a prerequisite condition of activation of
334 Element-Stride Mode, the source step (being Scalar) would never advance.
335
336 Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
337 mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source. This is because there is room in svp64 to extend RA-as-src
339 as well as RA-as-dest, both independently as scalar or vector *and*
340 independently extending their range.
341
342 *Programmer's note: being able to set RA-as-a-source as separate from
343 RA-as-a-destination as Scalar is **extremely valuable** once it is
344 remembered that Simple-V element operations must be in Program Order,
345 especially in loops, for saving on multiple address computations. Care
346 does have to be taken however that RA-as-src is not overwritten by
347 RA-as-dest unless intentionally desired, especially in element-strided
348 Mode.*
349
350 ## LD/ST Indexed vs Indexed REMAP
351
352 Unfortunately the word "Indexed" is used twice in completely different
353 contexts, potentially causing confusion.
354
* Instructions of the form `ld RT,RA,RB` have existed in the Power ISA since
its creation: these are called "LD/ST Indexed" instructions and their
name and meaning is well-established.
358 * There now exists, in Simple-V, a [[sv/remap]] mode called "Indexed"
359 Mode that can be applied to *any* instruction **including those
360 named LD/ST Indexed**.
361
Whilst it may be costly in terms of register reads to allow REMAP Indexed
Mode to be applied to any Vectorised LD/ST Indexed operation such as
`sv.ld *RT,RA,*RB`, and whilst it might be misleadingly labelled as
redundant, firstly the strict application of the RISC Paradigm that
Simple-V follows makes it awkward to consider *preventing* the application
of Indexed REMAP to such operations, and secondly the two are not actually
the same at all.
368
369 Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB`
370 effectively performs an *in-place* re-ordering of the offsets, RB.
371 To achieve the same effect without Indexed REMAP would require taking
372 a *copy* of the Vector of offsets starting at RB, manually explicitly
373 reordering them, and finally using the copy of re-ordered offsets in a
374 non-REMAP'ed `sv.ld`. Using non-strided LD as an example, pseudocode
375 showing what actually occurs, where the pseudocode for `indexed_remap`
376 may be found in [[sv/remap]]:
377
```
# sv.ld *RT,RA,*RB with Index REMAP applied to RB
for i in 0..VL-1:
    if remap.indexed:
        rb_idx = indexed_remap(i)  # remap
    else:
        rb_idx = i                 # use the index as-is
    EA = GPR(RA) + GPR(RB+rb_idx)
    GPR(RT+i) = MEM(EA, 8)
```
388
389 Thus it can be seen that the use of Indexed REMAP saves copying
390 and manual reordering of the Vector of RB offsets.
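
The same equivalence can be written out in Python. This is an illustration
only: `indexed_remap` is stubbed out here as an arbitrary fixed permutation
rather than the real algorithm described in [[sv/remap]], and `gpr` /
`mem_read` are hypothetical helpers:

```
# Indexed REMAP on RB versus manually copying and re-ordering the offsets.
def indexed_remap(i, perm=(3, 1, 0, 2)):       # stand-in permutation only
    return perm[i]

def ld_indexed_with_remap(gpr, mem_read, rt, ra, rb, vl):
    for i in range(vl):
        rb_idx = indexed_remap(i)              # in-place re-ordering of offsets
        gpr[rt + i] = mem_read(gpr[ra] + gpr[rb + rb_idx], 8)

def ld_indexed_manual_copy(gpr, mem_read, rt, ra, rb, rb_copy, vl):
    # without REMAP: copy the offsets, re-order the copy, then a plain sv.ld
    for i in range(vl):
        gpr[rb_copy + i] = gpr[rb + indexed_remap(i)]
    for i in range(vl):
        gpr[rt + i] = mem_read(gpr[ra] + gpr[rb_copy + i], 8)
```

Both routines leave identical values in RT; the REMAP form avoids the
extra copy of the offset Vector.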
391
392 ## LD/ST ffirst
393
394 LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
395 is not active) as an ordinary one, with all behaviour with respect to
Interrupts, Exceptions, Page Faults and Memory Management being identical
397 in every regard to Scalar v3.0 Power ISA LD/ST. However for elements
398 1 and above, if an exception would occur, then VL is **truncated**
399 to the previous element: the exception is **not** then raised because
400 the LD/ST that would otherwise have caused an exception is *required*
401 to be cancelled. Additionally an implementor may choose to truncate VL
402 for any arbitrary reason *except for the very first*.
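
Expressed as a Python sketch (illustrative only: an exception from the
hypothetical `mem_read` helper stands in for a Page Fault or other trap):

```
# Fail/Fault-First LD: element 0 behaves exactly like a Scalar LD (any
# exception is raised as normal); an exception on element 1 or above
# instead truncates VL to the elements already completed and is not raised.
def ffirst_ld(gpr, mem_read, rt, ra, imm, vl, op_width=8):
    for i in range(vl):
        try:
            gpr[rt + i] = mem_read(gpr[ra] + imm + i * op_width, op_width)
        except MemoryError:      # stand-in for a Page Fault / trap
            if i == 0:
                raise            # first element: ordinary Scalar behaviour
            return i             # truncate VL; the exception is not raised
    return vl
```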
403
404 ffirst LD/ST to multiple pages via a Vectorised Index base is
405 considered a security risk due to the abuse of probing multiple
406 pages in rapid succession and getting speculative feedback on which
407 pages would fail. Therefore Vector Indexed LD/ST is prohibited
408 entirely, and the Mode bit instead used for element-strided LD/ST.
409 See <https://bugs.libre-soc.org/show_bug.cgi?id=561>
410
```
for(i = 0; i < VL; i++)
    reg[rt + i] = mem[reg[ra] + i * reg[rb]];
```
415
416 High security implementations where any kind of speculative probing of
417 memory pages is considered a risk should take advantage of the fact
418 that implementations may truncate VL at any point, without requiring
419 software to be rewritten and made non-portable. Such implementations may
420 choose to *always* set VL=1 which will have the effect of terminating
421 any speculative probing (and also adversely affect performance), but
422 will at least not require applications to be rewritten.
423
Low-performance simpler hardware implementations may also choose to
always set VL=1 as the bare minimum compliant implementation of LD/ST
Fail-First. It is however critically important to remember that the first
element LD/ST **MUST** be treated as an ordinary LD/ST, i.e. **MUST**
raise exceptions exactly like an ordinary LD/ST.
429
430 For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
431 for any implementation-specific reason. For example: it is perfectly
432 reasonable for implementations to alter VL when ffirst LD or ST operations
433 are initiated on a nonaligned boundary, such that within a loop the
434 subsequent iteration of that loop begins the following ffirst LD/ST
435 operations on an aligned boundary such as the beginning of a cache line,
436 or beginning of a Virtual Memory page. Likewise, to reduce workloads or
437 balance resources.
438
439 Vertical-First Mode is slightly strange in that only one element at a time
440 is ever executed anyway. Given that programmers may legitimately choose
441 to alter srcstep and dststep in non-sequential order as part of explicit
442 loops, it is neither possible nor safe to make speculative assumptions
443 about future LD/STs. Therefore, Fail-First LD/ST in Vertical-First is
444 `UNDEFINED`. This is very different from Arithmetic (Data-dependent)
445 FFirst where Vertical-First Mode is fully deterministic, not speculative.
446
447 ## Data-Dependent Fail-First (not Fail/Fault-First)
448
449 Not to be confused with Fail/Fault First, Data-Fail-First performs an
450 additional check on the data into a Condition Register Field and if a test
451 on the CR Field fails then VL is truncated and further looping terminates.
452 This is precisely the same as Arithmetic Data-Dependent Fail-First,
453 the only difference being that the result comes from the LD/ST.
454
In the case of Store operations there is a quirk when VLi ("VL-inclusive")
is clear. Bear in mind the criterion is that the truncated Vector of
results, when VLi is clear, must all pass the "test", but when VLi is
set the *current failed test* is permitted to be included. Thus, the
actual update (store) to Memory is **not permitted to take place** should
the test fail. Therefore, on testing the value to be stored, and after
updating the corresponding CR Field Element, when VLi=0 and the test
fails, the Memory store must **not** occur.
463
464 Additionally, when VLi=0 and a test fails then RA does **not** receive a
465 copy of the Effective Address. Hardware implementations with Out-of-Order
466 Micro-Architectures should use speculative Shadow-Hold and Cancellation
467 when the test fails.
468
469 By contrast if VLi=1 and the test fails, Store may proceed *and then*
470 looping terminates. In this way, when non-Inclusive, the Vector of
471 Truncated results contains only Stores that passed the test (and RA=EA
472 updates if any), and when Inclusive the Vector of Truncated results
473 contains the first-failed data.
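
A sketch of the Store behaviour in Python (conceptual only: the CR Field
test is reduced to a simple non-zero check, `gpr` and `mem_write` are
hypothetical helpers, and CR Field updates and RA=EA updates are omitted
for brevity):

```
# Data-Dependent Fail-First Store: with VLi=0 the element failing the test
# is *not* stored and VL excludes it; with VLi=1 the store still proceeds
# and VL includes the failing element.
def ddffirst_store(gpr, mem_write, rs, ra, imm, vl, vli, op_width=8):
    for i in range(vl):
        data = gpr[rs + i]
        test_ok = (data != 0)        # stand-in for the CR Field test
        if not test_ok and not vli:
            return i                 # truncate VL: failing store suppressed
        mem_write(gpr[ra] + imm + i * op_width, data, op_width)
        if not test_ok:
            return i + 1             # VLi=1: failing store included, then stop
    return vl
```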
474
475 Below is an example of loading the starting addresses of Linked-List
476 nodes. If VLi=1 it will load the NULL pointer into the Vector of results.
477 If however VLi=0 it will *exclude* the NULL pointer by truncating VL to
478 one Element earlier.
479
```
RT = 1   # vec - deliberately overlaps by one with RA
RA = 0   # vec - first one is valid, contains ptr
imm = 8  # offset_of(ptr->next)
for i in range(VL):
    EA = GPR(RA+i) + imm          # ptr + offset(next)
    data = MEM(EA, 8)             # 64-bit address of ptr->next
    GPR(RT+i) = data              # happens to be read on next loop!
    # was a normal ld up to this point. now the Data-Fail-First
    CR.field(i) = conditions(data)
    if CR.field(i).EQ == testbit: # check if zero
        if VLi then VL = i+1      # update VL, inclusive
        else        VL = i        # update VL
        break                     # stop looping
```
495
496 **Data-Dependent Fault-First on Store-Conditional (Rc=1)**
497
There are very few instructions that allow Rc=1 for Load/Store:
one of those is the `stdcx.` and other Atomic Store-Conditional
instructions. With Simple-V being a loop around Scalar instructions
strictly obeying Scalar Program Order, a Fail-First loop on an
Atomic Store-Conditional will always fail the second and all subsequent
Store-Conditional instructions in Horizontal-First Mode, because
Load-Reservation and Store-Conditional are required to be executed
in pairs.
506
507 By contrast, in Vertical-First Mode it is in fact possible to issue
508 the pairs, and consequently allowing Vectorised Data-Dependent Fail-First is
509 useful. Care should be taken however when VL is truncated in Vertical-First
510 Mode.
511
512 ## LOAD/STORE Elwidths <a name="elwidth"></a>
513
514 Loads and Stores are almost unique in that the Power Scalar ISA
515 provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
516 others like it provide an explicit operation width. There are therefore
517 *three* widths involved:
518
519 * operation width (lb=8, lh=16, lw=32, ld=64)
520 * src element width override (8/16/32/default)
521 * destination element width override (8/16/32/default)
522
Some care is therefore needed to express and make clear the
transformations, which are expressly in this order (the final two steps
are sketched in code after the list):
525
526 * Calculate the Effective Address from RA at full width
527 but (on Indexed Load) allow srcwidth overrides on RB
528 * Load at the operation width (lb/lh/lw/ld) as usual
529 * byte-reversal as usual
530 * Non-saturated mode:
531 - zero-extension or truncation from operation width to dest elwidth
532 - place result in destination at dest elwidth
533 * Saturated mode:
534 - Sign-extension or truncation from operation width to dest width
535 - signed/unsigned saturation down to dest elwidth
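
Below is a minimal Python sketch of those final steps. The names
`adjust_wid` and `clamp` match the pseudocode later in this section, but
the bodies here are assumptions for illustration only: widths are taken to
be in bits and only the unsigned case is shown:

```
# Illustrative only: non-saturated truncation versus unsigned saturation.
def adjust_wid(value, op_width, dest_elwidth):
    # non-saturated: truncate (or zero-extend) to the dest element width
    return value & ((1 << dest_elwidth) - 1)

def clamp(value, op_width, dest_elwidth):
    # saturated: clamp to the maximum representable unsigned dest value
    return min(value, (1 << dest_elwidth) - 1)

print(hex(adjust_wid(0x12345, 32, 16)))  # 0x2345 (wraps / truncates)
print(hex(clamp(0x12345, 32, 16)))       # 0xffff (saturates)
```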
536
537 In order to respect Power v3.0B Scalar behaviour the memory side
538 is treated effectively as completely separate and distinct from SV
539 augmentation. This is primarily down to quirks surrounding LE/BE and
540 byte-reversal.
541
542 It is rather unfortunately possible to request an elwidth override on
543 the memory side which does not mesh with the overridden operation width:
544 these result in `UNDEFINED` behaviour. The reason is that the effect
545 of attempting a 64-bit `sv.ld` operation with a source elwidth override
546 of 8/16/32 would result in overlapping memory requests, particularly
547 on unit and element strided operations. Thus it is `UNDEFINED` when
548 the elwidth is smaller than the memory operation width. Examples include
549 `sv.lw/sw=16/els` which requests (overlapping) 4-byte memory reads offset
550 from each other at 2-byte intervals. Store likewise is also `UNDEFINED`
551 where the dest elwidth override is less than the operation width.
552
553 Note the following regarding the pseudocode to follow:
554
555 * `scalar identity behaviour` SV Context parameter conditions turn this
556 into a straight absolute fully-compliant Scalar v3.0B LD operation
557 * `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
558 rather than `ld`)
559 * `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
560 a "normal" part of Scalar v3.0B LD
561 * `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
562 as a "normal" part of Scalar v3.0B LD
563 * `svctx` specifies the SV Context and includes VL as well as
564 source and destination elwidth overrides.
565
566 Below is the pseudocode for Unit-Strided LD (which includes Vector
567 capability). Observe in particular that RA, as the base address in both
568 Immediate and Indexed LD/ST, does not have element-width overriding
569 applied to it.
570
571 Note that predication, predication-zeroing, and other modes except
572 saturation have all been removed, for clarity and simplicity:
573
```
# LD not VLD!
# this covers unit stride mode and a type of vector offset
function op_ld(RT, RA, op_width, imm_offs, svctx)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.unit/el-strided:
        # strange vector mode, compute 64 bit address which is
        # not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # unit / element stride mode, compute 64 bit address
        srcbase = get_polymorphed_reg(RA, 64, 0)
        # adjust for unit/el-stride
        srcbase += ....

    # read the underlying memory
    memread <= MEM(srcbase + imm_offs, op_width)

    # check saturation.
    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
```
610
611 Note above that the source elwidth is *not used at all* in LD-immediate.
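
The helpers `get_polymorphed_reg` and `set_polymorphed_reg` are used above
without being defined in this section. A minimal Python sketch of plausible
semantics (an assumption, not the normative definition: the register file
is modelled as a flat little-endian byte array of 64-bit registers, with
elwidth given in bits):

```
# Assumed semantics only: element accesses into an LE-ordered register file.
REGFILE = bytearray(128 * 8)             # 128 x 64-bit GPRs as raw bytes

def get_polymorphed_reg(reg, elwidth, element):
    nbytes = elwidth // 8
    offs = reg * 8 + element * nbytes
    return int.from_bytes(REGFILE[offs:offs + nbytes], "little")

def set_polymorphed_reg(reg, elwidth, element, value):
    nbytes = elwidth // 8
    offs = reg * 8 + element * nbytes
    value &= (1 << elwidth) - 1
    REGFILE[offs:offs + nbytes] = value.to_bytes(nbytes, "little")

set_polymorphed_reg(3, 16, 2, 0xBEEF)       # element 2 of r3 at elwidth=16
print(hex(get_polymorphed_reg(3, 16, 2)))   # 0xbeef
print(hex(get_polymorphed_reg(3, 64, 0)))   # 0xbeef00000000
```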
612
613 For LD/Indexed, the key is that in the calculation of the Effective Address,
614 RA has no elwidth override but RB does. Pseudocode below is simplified
615 for clarity: predication and all modes except saturation are removed:
616
```
# LD not VLD! ld*rx if brev else ld*
function op_ld(RT, RA, RB, op_width, svctx, brev)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.el-strided:
        # RA not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # element stride mode, again RA not polymorphic
        srcbase = get_polymorphed_reg(RA, 64, 0)
    # RB *is* polymorphic
    offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
    # sign-extend
    if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)

    # takes care of (merges) processor LE/BE and ld/ldbrx
    bytereverse = brev XNOR MSR.LE

    # read the underlying memory
    memread <= MEM(srcbase + offs, op_width)

    # optionally performs byteswap at op width
    if (bytereverse):
        memread = byteswap(memread, op_width)

    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
```
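
The `brev XNOR MSR.LE` merge above can be spelled out with a short Python
check (illustrative; `op_width` is taken to be in bytes here):

```
# A byteswap at the operation width is needed exactly when brev == MSR.LE:
# ldbrx on a Little-Endian processor, or a plain ld on a Big-Endian one.
def byteswap(value, op_width):
    return int.from_bytes(value.to_bytes(op_width, "big"), "little")

for brev in (0, 1):                          # 0 = ld, 1 = ldbrx
    for msr_le in (0, 1):                    # 0 = Big-Endian, 1 = Little-Endian
        bytereverse = int(not (brev ^ msr_le))   # brev XNOR MSR.LE
        print(brev, msr_le, bytereverse)
```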
659
660 ## Remapped LD/ST
661
662 In the [[sv/remap]] page the concept of "Remapping" is described. Whilst
663 it is expensive to set up (2 64-bit opcodes minimum) it provides a way to
664 arbitrarily perform 1D, 2D and 3D "remapping" of up to 64 elements worth
665 of LDs or STs. The usual interest in such re-mapping is for example in
666 separating out 24-bit RGB channel data into separate contiguous registers.
667 NEON covers this as shown in the diagram below:
668
669 ![Load/Store remap](/openpower/sv/load-store.svg)
670
671 REMAP easily covers this capability, and with dest elwidth overrides
672 and saturation may do so with built-in conversion that would normally
673 require additional width-extension, sign-extension and min/max Vectorised
674 instructions as post-processing stages.
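
As a purely conceptual Python illustration of the RGB example (not REMAP
itself, whose algorithm is described in [[sv/remap]]), the effect to be
achieved is a strided de-interleave; REMAP applied to a single Vectorised
load produces the same result without the three separate loops written
out here:

```
# What "Structure Packing" of packed R,G,B bytes achieves, written longhand.
def deinterleave_rgb(mem, base, npixels):
    r = [mem[base + 3 * i + 0] for i in range(npixels)]
    g = [mem[base + 3 * i + 1] for i in range(npixels)]
    b = [mem[base + 3 * i + 2] for i in range(npixels)]
    return r, g, b

mem = bytes(range(12))               # 4 pixels of packed R,G,B bytes
print(deinterleave_rgb(mem, 0, 4))   # ([0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11])
```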
675
676 Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
677 because the generic abstracted concept of "Remapping", when applied to
678 LD/ST, will give that same capability, with far more flexibility.
679
680 It is worth noting that Pack/Unpack Modes of SVSTATE, which may be
681 established through `svstep`, are also an easy way to perform regular
682 Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond that,
683 REMAP will need to be used.
684
685 --------
686
687 [[!tag standards]]
688
689 \newpage{}