# SV Load and Store
2
3 Links:
4
5 * <https://bugs.libre-soc.org/show_bug.cgi?id=561>
6 * <https://bugs.libre-soc.org/show_bug.cgi?id=572>
7 * <https://bugs.libre-soc.org/show_bug.cgi?id=571>
8 * <https://bugs.libre-soc.org/show_bug.cgi?id=940> post autoincrement mode
9 * <https://bugs.libre-soc.org/show_bug.cgi?id=1047> Data-Dependent Fail-First
10 * <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
11 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
12 * [[ldst/discussion]]
13
14 ## Rationale
15
All Vector ISAs dating back fifty years have extensive and comprehensive
Load and Store operations that go far beyond the capabilities of Scalar
RISC and most CISC processors, yet at their heart, on an individual element
basis, are no different from RISC Scalar equivalents.
20
The resource savings from Vector LD/ST are significant and stem
from the fact that one single instruction can trigger a dozen (or, in
some microarchitectures such as Cray or NEC SX Aurora, hundreds of)
element-level Memory accesses.
25
26 Additionally, and simply: if the Arithmetic side of an ISA supports
27 Vector Operations, then in order to keep the ALUs 100% occupied the
28 Memory infrastructure (and the ISA itself) correspondingly needs Vector
29 Memory Operations as well.
30
31 Vectorised Load and Store also presents an extra dimension (literally)
32 which creates scenarios unique to Vector applications, that a Scalar (and
33 even a SIMD) ISA simply never encounters. SVP64 endeavours to add the
34 modes typically found in *all* Scalable Vector ISAs, without changing the
35 behaviour of the underlying Base (Scalar) v3.0B operations in any way.
36 (The sole apparent exception is Post-Increment Mode on LD/ST-update
37 instructions)
38
39 ## Modes overview
40
Vectorisation of Load and Store requires the creation, from scalar operations,
of a number of different modes:
43
44 * **fixed aka "unit" stride** - contiguous sequence with no gaps
45 * **element strided** - sequential but regularly offset, with gaps
46 * **vector indexed** - vector of base addresses and vector of offsets
47 * **Speculative fail-first** - where it makes sense to do so
48 * **Structure Packing** - covered in SV by [[sv/remap]] and Pack/Unpack Mode.
49
*Despite being constructed from Scalar LD/ST none of these Modes exist
or make sense in any Scalar ISA. They **only** exist in Vector ISAs
and are a critical part of their value*.
53
Also included in SVP64 LD/ST are both signed and unsigned Saturation,
as well as Element-width overrides and Twin-Predication.
56
57 Note also that Indexed [[sv/remap]] mode may be applied to both Scalar
58 LD/ST Immediate Defined Words *and* LD/ST Indexed Defined Words.
59 LD/ST-Indexed should not be conflated with Indexed REMAP mode:
60 clarification is provided below.
61
62 **Determining the LD/ST Modes**
63
A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk. Fail-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace,
whereas strided fail-first (by creating contiguous sequential LDs) does not.
69
70 In addition, reduce mode makes no sense. Realistically we need an
71 alternative table definition for [[sv/svp64]] `RM.MODE`. The following
72 modes make sense:
73
74 * saturation
75 * simple (no augmentation)
76 * fail-first (where Vector Indexed is banned)
77 * Signed Effective Address computation (Vector Indexed only)
78
More than that, however, it is necessary to fit the usual Vector ISA
capabilities onto both Power ISA LD/ST with immediate and onto LD/ST
Indexed. They present subtly different Mode tables, which, due to lack
of space, have the following quirks:
83
84 * LD/ST Immediate has no individual control over src/dest zeroing,
85 whereas LD/ST Indexed does.
86 * LD/ST Immediate has saturation but LD/ST Indexed does not.
87
88 ## Format and fields
89
90 Fields used in tables below:
91
* **sz / dz** if predication is enabled will put zeros into the dest
  (or into the src in the case of twin predication) when the predicate bit is zero.
  Otherwise the element is ignored or skipped, depending on context.
95 * **zz**: both sz and dz are set equal to this flag.
96 * **inv CR bit** just as in branches (BO) these bits allow testing of
97 a CR bit and whether it is set (inv=0) or unset (inv=1)
98 * **N** sets signed/unsigned saturation.
99 * **RC1** as if Rc=1, stores CRs *but not the result*
100 * **SEA** - Signed Effective Address, if enabled performs sign-extension on
101 registers that have been reduced due to elwidth overrides
* **PI** - post-increment mode (applies to LD/ST with update only).
  The Effective Address utilised is always just RA, i.e. the computed
  EA is stored in RA **after** it is actually used.
105 * **LF** - Load/Store Fail or Fault First: for any reason Load or Store Vectors
106 may be truncated to (at least) one element, and VL altered to indicate such.
107 * **VLi** - Inclusive Data-Dependent Fail-First: the failing element is included
108 in the Truncated Vector.
109 * **els** - Element-strided Mode: the element index (after REMAP)
110 is multiplied by the immediate offset (or Scalar RB for Indexed). Restrictions apply.
111
112 When VLi=0 on Store Operations the Memory update does **not** take place
113 on the element that failed. EA does **not** update into RA on Load/Store
114 with Update instructions either.
115
116 **LD/ST immediate**
117
The table for [[sv/svp64]] `immed(RA)`, being `RM.MODE`
(bits 19:23 of `RM`), is:
120
| 0 | 1 | 2 | 3 4 | description |
|---|---| --- |---------|--------------------------- |
| 0 | 0 | 0 | zz els | simple mode |
| 0 | 0 | 1 | PI LF | post-increment and Fault-First |
| 1 | 0 | N | zz els | sat mode: N=0/1 u/s |
|VLi| 1 | inv | CR-bit | ffirst CR sel |
127
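As an illustrative aid only (not normative), the following sketch, in the
same pseudocode style used throughout this document, shows one way the
table above may be interpreted (bit positions 0-4 follow the table columns):

```
# sketch: interpreting RM.MODE (bits 19:23 of RM) for LD/ST Immediate
mode = RM.MODE
if mode[1] == 1:        # ffirst CR sel (Data-Dependent Fail-First)
    VLi, inv, CRbit = mode[0], mode[2], mode[3:4]
elif mode[0] == 1:      # sat mode: N=0/1 u/s
    N, zz, els = mode[2], mode[3], mode[4]
elif mode[2] == 1:      # post-increment and Fault-First
    PI, LF = mode[3], mode[4]
else:                   # simple mode
    zz, els = mode[3], mode[4]
```
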
The `els` bit is only relevant when `RA.isvec` is clear: this indicates
whether stride is unit or element:

```
if RA.isvec:
    svctx.ldstmode = indexed
elif els == 0:
    svctx.ldstmode = unitstride
elif immediate != 0:
    svctx.ldstmode = elementstride
```
139
An immediate of zero is a safety-valve to allow `LD-VSPLAT`: in effect,
multiplying the element index by an immediate of zero means that every
element reads from the exact same memory location, *even with a Vector
register*. (Normally this type of behaviour is reserved for the mapreduce modes)
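
A minimal sketch of the resulting access pattern (assuming RT is a Vector,
RA a Scalar, and an immediate of zero), in the same pseudocode style used
elsewhere in this document:

```
# LD-VSPLAT sketch: every element reads the same Effective Address
for i in range(VL):
    EA = GPR(RA) + 0                  # immediate is zero for every element
    GPR(RT + i) = MEM(EA, op_width)   # same location copied into RT+i
```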
144
145 For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur just
146 the once and be copied, rather than hitting the Data Cache multiple
147 times with the same memory read at the same location. The benefit of
148 Cache-inhibited LD-splats is that it allows for memory-mapped peripherals
149 to have multiple data values read in quick succession and stored in
150 sequentially numbered registers (but, see Note below).
151
152 For non-cache-inhibited ST from a vector source onto a scalar destination:
153 with the Vector loop effectively creating multiple memory writes to
154 the same location, we can deduce that the last of these will be the
155 "successful" one. Thus, implementations are free and clear to optimise
156 out the overwriting STs, leaving just the last one as the "winner".
157 Bear in mind that predicate masks will skip some elements (in source
158 non-zeroing mode). Cache-inhibited ST operations on the other hand
159 **MUST** write out a Vector source multiple successive times to the exact
160 same Scalar destination. Just like Cache-inhibited LDs, multiple values
161 may be written out in quick succession to a memory-mapped peripheral
162 from sequentially-numbered registers.
163
164 Note that any memory location may be Cache-inhibited
165 (Power ISA v3.1, Book III, 1.6.1, p1033)
166
*Programmer's Note: an immediate combined with a Scalar source as a "VSPLAT"
mode is simply not possible: there are not enough Mode bits. One single
Scalar Load operation may be used instead, followed by any arithmetic
operation (including a simple mv) in "Splat" mode.*
171
172 **LD/ST Indexed**
173
The modes for the `RA+RB` Indexed version are slightly different
but use the same `RM.MODE` bits (19:23 of `RM`):

| 0 | 1 | 2 | 3 4 | description |
|---|---| --- |---------|--------------------------- |
|els| 0 | SEA | dz sz | simple mode |
|VLi| 1 | inv | CR-bit | ffirst CR sel |
181
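Again purely as an illustrative aid (not normative), the corresponding
sketch for the Indexed table:

```
# sketch: interpreting RM.MODE (bits 19:23 of RM) for LD/ST Indexed
mode = RM.MODE
if mode[1] == 1:        # ffirst CR sel (Data-Dependent Fail-First)
    VLi, inv, CRbit = mode[0], mode[2], mode[3:4]
else:                   # simple mode
    els, SEA, dz, sz = mode[0], mode[2], mode[3], mode[4]
```
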
Vector Indexed Strided Mode is qualified as follows:

```
if els and !RA.isvec and !RB.isvec:
    svctx.ldstmode = elementstride
```
188
A summary of the effect of Vectorisation of src or dest:

```
imm(RA)  RT.v  RA.v  no stride allowed
imm(RA)  RT.s  RA.v  no stride allowed
imm(RA)  RT.v  RA.s  stride-select allowed
imm(RA)  RT.s  RA.s  not vectorised
RA,RB    RT.v  {RA|RB}.v  Standard Indexed
RA,RB    RT.s  {RA|RB}.v  Indexed but single LD (no VSPLAT)
RA,RB    RT.v  {RA&RB}.s  VSPLAT possible. stride selectable
RA,RB    RT.s  {RA&RB}.s  not vectorised (scalar identity)
```
201
Signed Effective Address computation is only relevant for Vector Indexed
Mode, when elwidth overrides are applied. The source elwidth override
applies to RB: if SEA is set, RB is sign-extended from elwidth bits to
the full 64 bits before being added to RA to calculate the Effective
Address. For other Modes (ffirst, saturate), all EA computation with
elwidth overrides is unsigned.
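
A minimal sketch of that computation (the helpers here match their use in
the Indexed pseudocode further below; `i` is the source element index):

```
# sketch: Effective Address computation for Vector Indexed Mode with SEA
offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)   # RB at elwidth
if SEA:
    offs = sext(offs, svctx.src_elwidth, 64)           # sign-extend to 64 bits
EA = get_polymorphed_reg(RA, 64, i) + offs             # RA always at full width
```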
208
Note that cache-inhibited LD/ST when VSPLAT is activated will perform
**multiple** LD/ST operations, sequentially. Even with scalar src
a Cache-inhibited LD will read the same memory location *multiple
times*, storing the result in successive Vector destination registers.
This is because the cache-inhibit instructions are typically used to read
and write memory-mapped peripherals. If a genuine cache-inhibited
LD-VSPLAT is required then a single *scalar* cache-inhibited LD should
be performed, followed by a VSPLAT-augmented mv, copying the one *scalar*
value into multiple register destinations.
218
Note also that cache-inhibited VSPLAT with Data-Dependent Fail-First is possible.
This allows, for example, a massive batch of memory-mapped peripheral reads
to be issued, stopping at the first NULL character and truncating VL to
that point. No branch is needed to issue that large burst of LDs, which
may be valuable in Embedded scenarios.
224
225 ## Vectorisation of Scalar Power ISA v3.0B
226
227 Scalar Power ISA Load/Store operations may be seen from [[isa/fixedload]]
228 and [[isa/fixedstore]] pseudocode to be of the form:
229
230 ```
231 lbux RT, RA, RB
232 EA <- (RA) + (RB)
233 RT <- MEM(EA)
234 ```
235
236 and for immediate variants:
237
238 ```
239 lb RT,D(RA)
240 EA <- RA + EXTS(D)
241 RT <- MEM(EA)
242 ```
243
244 Thus in the first example, the source registers may each be independently
245 marked as scalar or vector, and likewise the destination; in the second
246 example only the one source and one dest may be marked as scalar or
247 vector.
248
249 Thus we can see that Vector Indexed may be covered, and, as demonstrated
250 with the pseudocode below, the immediate can be used to give unit
251 stride or element stride. With there being no way to tell which from
252 the Power v3.0B Scalar opcode alone, the choice is provided instead by
253 the SV Context.
254
```
# LD not VLD! format - ldop RT, immed(RA)
# op_width: lb=1, lh=2, lw=4, ld=8
op_load(RT, RA, op_width, immed, svctx, RAupdate):
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, u=0; i < VL && j < VL;):
    # skip nonpredicated elements
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if postinc:
        offs = 0; # added afterwards
        if RA.isvec: srcbase = ireg[RA+i]
        else         srcbase = ireg[RA]
    elif svctx.ldstmode == elementstride:
        # element stride mode
        srcbase = ireg[RA]
        offs = i * immed              # j*immed for a ST
    elif svctx.ldstmode == unitstride:
        # unit stride mode
        srcbase = ireg[RA]
        offs = immed + (i * op_width) # j*op_width for ST
    elif RA.isvec:
        # quirky Vector indexed mode but with an immediate
        srcbase = ireg[RA+i]
        offs = immed;
    else
        # standard scalar mode (but predicated)
        # no stride multiplier means VSPLAT mode
        srcbase = ireg[RA]
        offs = immed

    # compute EA
    EA = srcbase + offs
    # load from memory
    ireg[RT+j] <= MEM[EA];
    # check post-increment of EA
    if postinc: EA = srcbase + immed;
    # update RA?
    if RAupdate: ireg[RAupdate+u] = EA;
    if (!RT.isvec)
        break # destination scalar, end now
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RT.isvec) j++;
```
302
303 Indexed LD is:
304
```
# format: ldop RT, RA, RB
function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
  ps = get_pred_val(FALSE, RA); # predication on src
  pd = get_pred_val(FALSE, RT); # ... AND on dest
  for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
    # skip nonpredicated RA, RB and RT
    if (RA.isvec) while (!(ps & 1<<i)) i++;
    if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
    if (RB.isvec) while (!(ps & 1<<k)) k++;
    if (RT.isvec) while (!(pd & 1<<j)) j++;
    if svctx.ldstmode == elementstride:
        EA = ireg[RA] + ireg[RB]*j   # register-strided
    else
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
    if RAupdate: ireg[RAupdate+u] = EA
    ireg[RT+j] <= MEM[EA];
    if (!RT.isvec)
        break # destination scalar, end immediately
    if (RA.isvec) i++;
    if (RAupdate.isvec) u++;
    if (RB.isvec) k++;
    if (RT.isvec) j++;
```
329
Note that Element-Strided Mode uses the destination step, `j`: because
both sources must be Scalar as a prerequisite for activating
Element-Stride Mode, the source step (being Scalar) would never advance.
333
Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source. This is because there is room in svp64 to extend RA-as-src
as well as RA-as-dest, both independently as scalar or vector *and*
independently extending their range.
339
*Programmer's note: being able to set RA-as-a-source as separate from
RA-as-a-destination as Scalar is **extremely valuable** once it is
remembered that Simple-V element operations must be in Program Order,
especially in loops, as it saves on multiple address computations. Care
does have to be taken however that RA-as-src is not overwritten by
RA-as-dest unless intentionally desired, especially in element-strided
Mode.*
347
348 ## LD/ST Indexed vs Indexed REMAP
349
350 Unfortunately the word "Indexed" is used twice in completely different
351 contexts, potentially causing confusion.
352
* Instructions of the form `ld RT,RA,RB` have existed in the Power ISA
  since its creation: these are called "LD/ST Indexed" instructions and
  their name and meaning is well-established.
356 * There now exists, in Simple-V, a [[sv/remap]] mode called "Indexed"
357 Mode that can be applied to *any* instruction **including those
358 named LD/ST Indexed**.
359
Whilst allowing REMAP Indexed Mode to be applied to any Vectorised LD/ST
Indexed operation such as `sv.ld *RT,RA,*RB` may be costly in terms of
register reads, or may even be misleadingly labelled as redundant, firstly
the strict application of the RISC Paradigm that Simple-V follows makes
it awkward to consider *preventing* the application of Indexed REMAP to
such operations, and secondly the two are not actually the same at all.
366
Indexed REMAP, as applied to RB in the instruction `sv.ld *RT,RA,*RB`,
effectively performs an *in-place* re-ordering of the offsets, RB.
To achieve the same effect without Indexed REMAP would require taking
a *copy* of the Vector of offsets starting at RB, manually explicitly
reordering them, and finally using the copy of re-ordered offsets in a
non-REMAP'ed `sv.ld`. Using non-strided LD as an example, the following
pseudocode shows what actually occurs (the pseudocode for `indexed_remap`
may be found in [[sv/remap]]):
375
```
# sv.ld *RT,RA,*RB with Index REMAP applied to RB
for i in 0..VL-1:
    if remap.indexed:
        rb_idx = indexed_remap(i)  # remap
    else:
        rb_idx = i                 # use the index as-is
    EA = GPR(RA) + GPR(RB+rb_idx)
    GPR(RT+i) = MEM(EA, 8)
```
386
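For comparison, a rough sketch (in the same pseudocode style, using a
hypothetical scratch Vector starting at register RS) of what software
would have to do without Indexed REMAP:

```
# copy and explicitly reorder the offsets into a scratch Vector at RS...
for i in 0..VL-1:
    GPR(RS+i) = GPR(RB + indexed_remap(i))
# ...then use the reordered copy in a plain (non-REMAP'ed) sv.ld
for i in 0..VL-1:
    EA = GPR(RA) + GPR(RS+i)
    GPR(RT+i) = MEM(EA, 8)
```
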
387 Thus it can be seen that the use of Indexed REMAP saves copying
388 and manual reordering of the Vector of RB offsets.
389
390 ## LD/ST ffirst (Fault-First)
391
LD/ST ffirst treats the first LD/ST in a vector (element 0 if REMAP
is not active) as an ordinary one, with all behaviour with respect to
Interrupts, Exceptions, Page Faults and Memory Management being identical
in every regard to Scalar v3.0 Power ISA LD/ST. However for elements
1 and above, if an exception would occur, then VL is **truncated**
to the previous element: the exception is **not** then raised because
the LD/ST that would otherwise have caused an exception is *required*
to be cancelled. Additionally an implementor may choose to truncate VL
for any arbitrary reason *except for the very first element*.
401
402 ffirst LD/ST to multiple pages via a Vectorised Index base is
403 considered a security risk due to the abuse of probing multiple
404 pages in rapid succession and getting speculative feedback on which
405 pages would fail. Therefore Vector Indexed LD/ST is prohibited
406 entirely, and the Mode bit instead used for element-strided LD/ST.
407 See <https://bugs.libre-soc.org/show_bug.cgi?id=561>
408
```
for(i = 0; i < VL; i++)
    reg[rt + i] = mem[reg[ra] + i * reg[rb]];
```
413
414 High security implementations where any kind of speculative probing of
415 memory pages is considered a risk should take advantage of the fact
416 that implementations may truncate VL at any point, without requiring
417 software to be rewritten and made non-portable. Such implementations may
418 choose to *always* set VL=1 which will have the effect of terminating
419 any speculative probing (and also adversely affect performance), but
420 will at least not require applications to be rewritten.
421
Low-performance simpler hardware implementations may also choose to
always set VL=1 as the bare minimum compliant implementation of LD/ST
Fail-First. It is however critically important to remember that the first
element LD/ST **MUST** be treated as an ordinary LD/ST, i.e. **MUST**
raise exceptions exactly like an ordinary LD/ST.
427
428 For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
429 for any implementation-specific reason. For example: it is perfectly
430 reasonable for implementations to alter VL when ffirst LD or ST operations
431 are initiated on a nonaligned boundary, such that within a loop the
432 subsequent iteration of that loop begins the following ffirst LD/ST
433 operations on an aligned boundary such as the beginning of a cache line,
or beginning of a Virtual Memory page. Likewise, VL may be truncated
simply to reduce workloads or balance resources.
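
Portability under such truncation comes from the structure of a well-formed
ffirst loop, which never assumes the requested VL was honoured: it re-reads
VL after the LD/ST and advances by however many elements were actually
processed. A rough sketch (pseudocode only, not actual assembler):

```
# sketch: software loop tolerant of arbitrary ffirst truncation of VL
remaining = n
while remaining > 0:
    VL = min(MAXVL, remaining)
    VL = ffirst_load(RT, RA, VL)   # hardware may return a truncated VL
    process(RT, VL)                # only VL elements are valid
    RA = RA + VL * op_width        # advance by what was actually loaded
    remaining = remaining - VL
```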
436
437 Vertical-First Mode is slightly strange in that only one element at a time
438 is ever executed anyway. Given that programmers may legitimately choose
439 to alter srcstep and dststep in non-sequential order as part of explicit
440 loops, it is neither possible nor safe to make speculative assumptions
441 about future LD/STs. Therefore, Fail-First LD/ST in Vertical-First is
442 `UNDEFINED`. This is very different from Arithmetic (Data-dependent)
443 FFirst where Vertical-First Mode is fully deterministic, not speculative.
444
445 ## Data-Dependent Fail-First (not Fail/Fault-First)
446
447 Not to be confused with Fail/Fault First, Data-Fail-First performs an
448 additional check on the data, and if the test
449 fails then VL is truncated and further looping terminates.
450 This is precisely the same as Arithmetic Data-Dependent Fail-First,
451 the only difference being that the result comes from the LD/ST
452 rather than from an Arithmetic operation.
453
There is also a crucial difference between Arithmetic and LD/ST
Data-Dependent Fail-First: except for Store-Conditional, a 4-bit Condition
Register Field test is created for testing purposes
*but not stored* (thus there is no RC1 Mode as there is in Arithmetic).
The reason why a CR Field is not stored is that Load/Store, particularly
the Update instructions, is already expensive in register terms,
and adding an extra Vector write would be too costly in hardware.
461
*Programmer's note: Data-Dependent Load with a test may be used to
truncate VL, followed up with a `sv.cmpi` or other operation. The
important aspect is that the Vector Load is truncated on finding, for
example, a NULL pointer.*
466
*Programmer's note: Load-with-Update may be used to update
the register used in Effective Address computation of the
next element. This may be used to perform single-linked-list
walking, where Data-Dependent Fail-First terminates and
truncates the Vector at the first NULL.*
472
In the case of Store operations there is a quirk when VLi (VL inclusive)
is clear. Bear in mind that the criteria is that the truncated Vector of
results, when VLi is clear, must all pass the "test", but when VLi is set
the *current failed test* is permitted to be included. Thus, when VLi=0,
on testing the value to be stored and finding that the test fails, the
actual update (store) to Memory is **not permitted to take place**.
480
481 Additionally, when VLi=0 and a test fails then RA does **not** receive a
482 copy of the Effective Address. Hardware implementations with Out-of-Order
483 Micro-Architectures should use speculative Shadow-Hold and Cancellation
484 when the test fails.
485
486 By contrast if VLi=1 and the test fails, Store may proceed *and then*
487 looping terminates. In this way, when non-Inclusive, the Vector of
488 Truncated results contains only Stores that passed the test (and RA=EA
489 updates if any), and when Inclusive the Vector of Truncated results
490 contains the first-failed data.
491
492 Below is an example of loading the starting addresses of Linked-List
493 nodes. If VLi=1 it will load the NULL pointer into the Vector of results.
494 If however VLi=0 it will *exclude* the NULL pointer by truncating VL to
495 one Element earlier.
496
*Programmer's Note: by setting the RC1 qualifier as well as VLi=1 it is
possible to establish a Predicate Mask such that the first
zero in the predicate will be the NULL pointer.*
500
```
RT = 1   # vec - deliberately overlaps by one with RA
RA = 0   # vec - first one is valid, contains ptr
imm = 8  # offset_of(ptr->next)
for i in range(VL):
    # this part is the Scalar Defined Word (standard scalar ld operation)
    EA = GPR(RA+i) + imm            # ptr + offset(next)
    data = MEM(EA, 8)               # 64-bit address of ptr->next
    GPR(RT+i) = data                # happens to be read on next loop!
    # was a normal vector-ld up to this point. now the Data-Fail-First
    cr_test = conditions(data)
    if Rc=1 or RC1: CR.field(i) = cr_test # only store if Rc=1/RC1
    if cr_test.EQ == testbit:       # check if zero
        if VLi then VL = i+1        # update VL, inclusive
        else        VL = i          # update VL, exclusive current
        break                       # stop looping
```
518
519 **Data-Dependent Fault-First on Store-Conditional (Rc=1)**
520
There are very few instructions that allow Rc=1 for Load/Store: among
them are `stdcx.` and the other Atomic Store-Conditional
instructions. With Simple-V being a loop around Scalar instructions
strictly obeying Scalar Program Order, a Horizontal-First Fail-First loop
on an Atomic Store-Conditional will always fail the second and all
subsequent Store-Conditional instructions, because
Load-Reservation and Store-Conditional are required to be executed
in pairs.
529
By contrast, in Vertical-First Mode it is in fact possible to issue
the pairs, and consequently Vectorised Data-Dependent Fail-First on
Store-Conditional is useful.
533
534 Programmer's note: Care should be taken when VL is truncated in
535 Vertical-First Mode.
536
537 **Future potential**
538
Although Rc=1 on LD/ST is a rare occurrence at present, future versions
of the Power ISA *might* conceivably have Rc=1 LD/ST Scalar instructions, and
with the SVP64 Vectorisation Prefixing being a RISC paradigm that
is fully independent of the Scalar Suffix Defined Words, prohibiting
the possibility of Rc=1 Data-Dependent Mode on future potential LD/ST
operations is not strategically sound.
545
546 ## LOAD/STORE Elwidths <a name="elwidth"></a>
547
Loads and Stores are almost unique in that the Power Scalar ISA
provides a width for the operation (lb, lh, lw, ld); only `extsb` and
others like it similarly provide an explicit operation width. There are
therefore *three* widths involved:
552
553 * operation width (lb=8, lh=16, lw=32, ld=64)
554 * src element width override (8/16/32/default)
555 * destination element width override (8/16/32/default)
556
557 Some care is therefore needed to express and make clear the transformations,
558 which are expressly in this order:
559
560 * Calculate the Effective Address from RA at full width
561 but (on Indexed Load) allow srcwidth overrides on RB
562 * Load at the operation width (lb/lh/lw/ld) as usual
563 * byte-reversal as usual
564 * Non-saturated mode:
565 - zero-extension or truncation from operation width to dest elwidth
566 - place result in destination at dest elwidth
567 * Saturated mode:
568 - Sign-extension or truncation from operation width to dest width
569 - signed/unsigned saturation down to dest elwidth
570
571 In order to respect Power v3.0B Scalar behaviour the memory side
572 is treated effectively as completely separate and distinct from SV
573 augmentation. This is primarily down to quirks surrounding LE/BE and
574 byte-reversal.
575
576 It is rather unfortunately possible to request an elwidth override on
577 the memory side which does not mesh with the overridden operation width:
578 these result in `UNDEFINED` behaviour. The reason is that the effect
579 of attempting a 64-bit `sv.ld` operation with a source elwidth override
580 of 8/16/32 would result in overlapping memory requests, particularly
581 on unit and element strided operations. Thus it is `UNDEFINED` when
582 the elwidth is smaller than the memory operation width. Examples include
583 `sv.lw/sw=16/els` which requests (overlapping) 4-byte memory reads offset
584 from each other at 2-byte intervals. Store likewise is also `UNDEFINED`
585 where the dest elwidth override is less than the operation width.
586
587 Note the following regarding the pseudocode to follow:
588
589 * `scalar identity behaviour` SV Context parameter conditions turn this
590 into a straight absolute fully-compliant Scalar v3.0B LD operation
591 * `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
592 rather than `ld`)
593 * `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
594 a "normal" part of Scalar v3.0B LD
595 * `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
596 as a "normal" part of Scalar v3.0B LD
597 * `svctx` specifies the SV Context and includes VL as well as
598 source and destination elwidth overrides.
599
600 Below is the pseudocode for Unit-Strided LD (which includes Vector
601 capability). Observe in particular that RA, as the base address in both
602 Immediate and Indexed LD/ST, does not have element-width overriding
603 applied to it.
604
605 Note that predication, predication-zeroing, and other modes except
606 saturation have all been removed, for clarity and simplicity:
607
```
# LD not VLD!
# this covers unit stride mode and a type of vector offset
function op_ld(RT, RA, op_width, imm_offs, svctx)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.unit/el-strided:
        # strange vector mode, compute 64 bit address which is
        # not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # unit / element stride mode, compute 64 bit address
        srcbase = get_polymorphed_reg(RA, 64, 0)
        # adjust for unit/el-stride
        srcbase += ....

    # read the underlying memory
    memread <= MEM(srcbase + imm_offs, op_width)

    # check saturation.
    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
```
644
645 Note above that the source elwidth is *not used at all* in LD-immediate.
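
The `clamp` and `adjust_wid` helpers used above are not defined elsewhere
in this document; the following is a minimal illustrative sketch of
plausible definitions, assuming widths are expressed in bits and that the
signed/unsigned choice for saturation comes from the `N` Mode bit:

```
def adjust_wid(value, op_width, dest_width):
    # non-saturated: zero-extend or truncate to the destination elwidth
    return value & ((1 << dest_width) - 1)

def clamp(value, op_width, dest_width):
    # saturated: sign-extend from the operation width, then saturate.
    # N (signed/unsigned saturation) is taken from the Mode field.
    value = sext(value, op_width, 64)
    if N:  # signed saturation
        lo, hi = -(1 << (dest_width - 1)), (1 << (dest_width - 1)) - 1
    else:  # unsigned saturation
        lo, hi = 0, (1 << dest_width) - 1
    return min(max(value, lo), hi)
```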
646
647 For LD/Indexed, the key is that in the calculation of the Effective Address,
648 RA has no elwidth override but RB does. Pseudocode below is simplified
649 for clarity: predication and all modes except saturation are removed:
650
```
# LD not VLD! ld*rx if brev else ld*
function op_ld(RT, RA, RB, op_width, svctx, brev)
  for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL):
    if not svctx.el-strided:
        # RA not polymorphic! elwidth hardcoded to 64 here
        srcbase = get_polymorphed_reg(RA, 64, i)
    else:
        # element stride mode, again RA not polymorphic
        srcbase = get_polymorphed_reg(RA, 64, 0)
    # RB *is* polymorphic
    offs = get_polymorphed_reg(RB, svctx.src_elwidth, i)
    # sign-extend
    if svctx.SEA: offs = sext(offs, svctx.src_elwidth, 64)

    # takes care of (merges) processor LE/BE and ld/ldbrx
    bytereverse = brev XNOR MSR.LE

    # read the underlying memory
    memread <= MEM(srcbase + offs, op_width)

    # optionally performs byteswap at op width
    if (bytereverse):
        memread = byteswap(memread, op_width)

    if svctx.saturation_mode:
        # ... saturation adjustment...
        memread = clamp(memread, op_width, svctx.dest_elwidth)
    else:
        # truncate/extend to over-ridden dest width.
        memread = adjust_wid(memread, op_width, svctx.dest_elwidth)

    # takes care of inserting memory-read (now correctly byteswapped)
    # into regfile underlying LE-defined order, into the right place
    # within the NEON-like register, respecting destination element
    # bitwidth, and the element index (j)
    set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

    # increments both src and dest element indices (no predication here)
    i++;
    j++;
```
693
694 ## Remapped LD/ST
695
696 In the [[sv/remap]] page the concept of "Remapping" is described. Whilst
697 it is expensive to set up (2 64-bit opcodes minimum) it provides a way to
698 arbitrarily perform 1D, 2D and 3D "remapping" of up to 64 elements worth
699 of LDs or STs. The usual interest in such re-mapping is for example in
700 separating out 24-bit RGB channel data into separate contiguous registers.
701 NEON covers this as shown in the diagram below:
702
703 ![Load/Store remap](/openpower/sv/load-store.svg)
704
705 REMAP easily covers this capability, and with dest elwidth overrides
706 and saturation may do so with built-in conversion that would normally
707 require additional width-extension, sign-extension and min/max Vectorised
708 instructions as post-processing stages.
709
710 Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
711 because the generic abstracted concept of "Remapping", when applied to
712 LD/ST, will give that same capability, with far more flexibility.
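
As a purely illustrative sketch (not a worked REMAP setup), the index
schedule needed to de-interleave packed RGB bytes into three
channel-contiguous groups of destination elements is simply:

```
# 4 pixels of packed R,G,B bytes in memory: r0 g0 b0 r1 g1 b1 ...
# memory element i should land in destination element remap_idx
for i in range(12):
    channel, pixel = i % 3, i // 3
    remap_idx = channel * 4 + pixel   # all R's first, then G's, then B's
```

With a REMAP schedule equivalent to `remap_idx` applied to RT on an
`sv.lb`, the loads land in channel-contiguous destination registers.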
713
714 It is worth noting that Pack/Unpack Modes of SVSTATE, which may be
715 established through `svstep`, are also an easy way to perform regular
716 Structure Packing, at the vec2/vec3/vec4 granularity level. Beyond that,
717 REMAP will need to be used.
718
719 **Parallel Reduction REMAP**
720
No REMAP Schedule is prohibited in SVP64 because the RISC-paradigm Prefix
is completely separate from the RISC-paradigm Scalar Defined Words. Although
obscure, there does exist the outside possibility that Parallel Reduction
Schedules applied to LD/ST would find a use in Computer Science.
Readers are invited to contact the authors of this document if one is ever
found.
727
728 --------
729
730 [[!tag standards]]
731
732 \newpage{}