[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>

# Rationale

All Vector ISAs dating back fifty years have extensive and comprehensive
Load and Store operations that go far beyond the capabilities of Scalar
RISC or CISC processors, yet at their heart, on an individual element
basis, they may be found to be no different from RISC Scalar equivalents.

The resource savings from Vector LD/ST are significant and stem from
the fact that one single instruction can trigger a dozen (or, in some
microarchitectures such as Cray or NEC SX Aurora, hundreds) of
element-level Memory accesses.

Additionally, and simply: if the Arithmetic side of an ISA supports
Vector Operations, then in order to keep the ALUs 100% occupied the
Memory infrastructure (and the ISA itself) correspondingly needs Vector
Memory Operations as well.

Vectorised Load and Store also presents an extra dimension (literally)
which creates scenarios unique to Vector applications, that a Scalar
(and even a SIMD) ISA simply never encounters.  SVP64 endeavours to
add such modes without changing the behaviour of the underlying Base
(Scalar) v3.0B operations.

# Modes overview

Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different modes:

* fixed aka "unit" stride (contiguous sequence with no gaps)
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* Speculative fail-first (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).

Also included in SVP64 LD/ST are both signed and unsigned Saturation,
as well as Element-width overrides and Twin-Predication.

*Despite being constructed from Scalar LD/ST, none of these Modes
exist or make sense in any Scalar ISA. They **only** exist in Vector ISAs.*

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk.  Fail-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace, where
strided fail-first (by creating contiguous sequential LDs) does not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either (or, is a quirk). Realistically we need
an alternative table meaning for [[sv/svp64]] mode.  The following modes make sense:

* saturation
* predicate-result (mostly for cache-inhibited LD/ST)
* normal
* fail-first (where Vector Indexed is banned)
* Signed Effective Address computation (Vector Indexed only)

More than that, however, it is necessary to fit the usual Vector ISA
capabilities onto both Power ISA LD/ST with immediate and
LD/ST Indexed. They present subtly different Mode tables.

**LD/ST immediate**

The table for [[sv/svp64]] for `immed(RA)` is:

| 0-1 |  2  |  3   4  |  description               |
| --- | --- |---------|--------------------------- |
| 00  | 0   | zz  els | normal mode                |
| 00  | 1   | rsvd    | reserved                   |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel        |
| 01  | inv | els RC1 | Rc=0: ffirst z/nonz        |
| 10  |   N | zz  els | sat mode: N=0/1 u/s        |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel   |
| 11  | inv | els RC1 | Rc=0: pred-result z/nonz   |

The `els` bit is only relevant when `RA.isvec` is clear: it indicates
whether stride is unit or element:

    if RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
in effect the multiplication of the immediate-offset by zero results
in reading from the exact same memory location.
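The selection rules above, including the zero-immediate case, can be sketched in Python. This is only an illustration: the function name `select_ldst_mode` and the string labels (in particular `"vsplat"`) are hypothetical and not taken from the specification.

```python
def select_ldst_mode(ra_isvec: bool, els: int, immediate: int) -> str:
    """Pick the LD/ST mode for the immed(RA) form (illustrative sketch)."""
    if ra_isvec:
        return "indexed"        # RA is a vector of base addresses
    if els == 0:
        return "unitstride"     # contiguous elements, no gaps
    if immediate != 0:
        return "elementstride"  # elements are `immediate` bytes apart
    # els set with a zero immediate: every element reads the same address.
    # The label "vsplat" is an assumption made for this sketch only.
    return "vsplat"
```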

For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
just the once and be copied, rather than hitting the Data Cache
multiple times with the same memory read at the same location.
Cache-inhibited Loads, by contrast, repeat the read for every element,
which allows memory-mapped peripherals to have multiple
data values read in quick succession and stored in sequentially
numbered registers.

For non-cache-inhibited ST from a vector source onto a scalar
destination: with the Vector
loop effectively creating multiple memory writes to the same location,
we can deduce that the last of these will be the "successful" one. Thus,
implementations are free and clear to optimise out the overwriting STs,
leaving just the last one as the "winner".  Bear in mind that predicate
masks will skip some elements (in source non-zeroing mode).
Cache-inhibited ST operations on the other hand **MUST** write out
a Vector source multiple successive times to the exact same Scalar
destination.
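As a rough model of the ST behaviour just described, the sketch below returns the sequence of values that must actually reach the single scalar destination address. It assumes source non-zeroing predication (masked-out elements are simply skipped); the function name and arguments are hypothetical.

```python
def scalar_dest_store_writes(values, predicate, cache_inhibited):
    """Values written, in order, to the one scalar destination address.

    values:    the Vector source elements
    predicate: integer bitmask, bit i enables element i (non-zeroing)
    """
    active = [v for i, v in enumerate(values) if (predicate >> i) & 1]
    if cache_inhibited:
        # every active element MUST be written, one after the other
        return active
    # ordinary memory: only the last active write is observable
    return active[-1:]
```

With `values=[10, 20, 30, 40]` and `predicate=0b0111` the ordinary store may collapse to the single write `[30]`, whereas the cache-inhibited store performs all of `[10, 20, 30]`.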

Note that there are no immediate versions of cache-inhibited LD/ST.

**LD/ST Indexed**

The modes for the `RA+RB` indexed version are slightly different:

| 0-1 |  2  |  3   4  |  description                 |
| --- | --- |---------|----------------------------- |
| 00  | SEA | dz  sz  | normal mode                  |
| 01  | SEA | dz  sz  | Strided (scalar only source) |
| 10  |   N | dz  sz  | sat mode: N=0/1 u/s          |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel     |
| 11  | inv | zz  RC1 | Rc=0: pred-result z/nonz     |

Vector Indexed Strided Mode is qualified as follows:

    if mode == 0b01 and !RA.isvec and !RB.isvec:
        svctx.ldstmode = elementstride

A summary of the effect of Vectorisation of src or dest:

     imm(RA)  RT.v   RA.v   no stride allowed
     imm(RA)  RT.s   RA.v   no stride allowed
     imm(RA)  RT.v   RA.s   stride-select allowed
     imm(RA)  RT.s   RA.s   not vectorised
     RA,RB    RT.v  {RA|RB}.v UNDEFINED
     RA,RB    RT.s  {RA|RB}.v UNDEFINED
     RA,RB    RT.v  {RA&RB}.s VSPLAT possible. stride selectable
     RA,RB    RT.s  {RA&RB}.s not vectorised

Signed Effective Address computation is only relevant for
Vector Indexed Mode, when elwidth overrides are applied.
The source override applies to RB: if SEA is set, RB is
sign-extended from elwidth bits to the full 64 bits before
being added to RA to calculate the Effective Address.
For other Modes (ffirst, saturate),
all EA computation with elwidth overrides is unsigned.
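A minimal sketch of the Signed Effective Address rule, assuming the RB element has already been read at its overridden width and that the EA wraps at 64 bits. The helper name `indexed_ea` is hypothetical.

```python
MASK64 = (1 << 64) - 1

def indexed_ea(ra: int, rb: int, elwidth: int, sea: bool) -> int:
    """Effective Address for one Vector Indexed element (illustrative)."""
    rb &= (1 << elwidth) - 1              # RB element at source elwidth
    if sea and rb & (1 << (elwidth - 1)):
        rb -= 1 << elwidth                # sign-extend to full width
    return (ra + rb) & MASK64             # otherwise RB is zero-extended
```

For example, an 8-bit RB element of `0xFF` added to `RA=0x1000` gives `0x0FFF` with SEA set (offset of -1) but `0x10FF` with SEA clear (offset of 255).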

Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will
perform **multiple** LD/ST operations, sequentially.  `ldcix` even with
scalar src will read the same memory location *multiple times*, storing
the result in successive Vector destination registers.  This is because
the cache-inhibit instructions are used to read and write memory-mapped
peripherals.
If a genuine cache-inhibited LD-VSPLAT is required then a *scalar*
cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv.

# Vectorisation of Scalar Power ISA v3.0B

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- (RA) + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride.  With there being no way to tell which from the
OpenPOWER v3.0B Scalar opcode alone, the choice is provided instead by
the SV Context.

    # LD not VLD!  format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip nonpredicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else:
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;
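To make the four address patterns of the pseudocode above concrete, here is a simplified Python sketch that only generates Effective Addresses. It assumes no predication, a Vector RT, no RA update, and that the element index advances once per destination element; the function name, the `mode` strings and the dictionary-based register file are all hypothetical.

```python
def immed_load_addresses(regs, ra, ra_isvec, immed, op_width, vl, mode):
    """List the EA of each element for `ldop RT.v, immed(RA)` (sketch).

    regs: dict mapping register number to 64-bit value
    op_width: operation width in bytes (lb=1, lh=2, lw=4, ld=8)
    """
    addrs = []
    for i in range(vl):
        if mode == "elementstride":
            ea = regs[ra] + i * immed             # gaps of `immed` bytes
        elif mode == "unitstride":
            ea = regs[ra] + immed + i * op_width  # contiguous elements
        elif ra_isvec:
            ea = regs[ra + i] + immed             # vector of base addresses
        else:
            ea = regs[ra] + immed                 # same address every time
        addrs.append(ea)
    return addrs
```

With `regs = {5: 0x100, 6: 0x200, 7: 0x300}`, a 4-byte unit-stride load at `8(r5)` with VL=3 touches `0x108, 0x10c, 0x110`, whilst an element-strided load with an immediate of 16 touches `0x100, 0x110, 0x120`.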

Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL;):
        # skip nonpredicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
            EA = ireg[RA] + ireg[RB]*j   # register-strided
        else:
            EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if svctx.ldstmode != elementstride:
            if (!RA.isvec && !RB.isvec)
                break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update" mode
(`ldux`) to be effectively a *completely different* register from
RA-as-a-source.  This is because there is room in svp64 to extend
RA-as-src as well as RA-as-dest, both independently as scalar or vector
*and* independently extending their range.
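The indexed pseudocode can likewise be reduced to an address generator. The sketch below assumes no predication, a Vector RT and no RA update; it mirrors the early exit for the scalar-scalar case. All names are hypothetical.

```python
def indexed_load_addresses(regs, ra, rb, ra_isvec, rb_isvec, vl, elementstride):
    """List the EA of each element for `ldop RT.v, RA, RB` (sketch)."""
    addrs = []
    for j in range(vl):
        if elementstride:
            # register-strided: RB holds the stride, RA the base
            addrs.append(regs[ra] + regs[rb] * j)
            continue
        i = j if ra_isvec else 0   # RA element index
        k = j if rb_isvec else 0   # RB element index
        addrs.append(regs[ra + i] + regs[rb + k])
        if not ra_isvec and not rb_isvec:
            break                  # scalar-scalar: one access only
    return addrs
```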

# LD/ST ffirst

LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
ordinary one.  Exceptions, if any are needed, occur "as normal" exactly as
they would on any Scalar v3.0 Power ISA LD/ST. However for elements 1
and above, if an exception would occur, then VL is **truncated** to the
previous element: the exception is **not** then raised because the
LD/ST that would otherwise have caused an exception is *required* to be cancelled.
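The truncation rule can be modelled in a few lines of Python. In this sketch a dictionary stands in for memory and a missing key stands in for an access that would raise an exception; both are assumptions made purely for illustration, as is the function name.

```python
def ffirst_load(mem, addrs):
    """Fail-first load over a list of element addresses (sketch).

    Returns (loaded_values, new_VL).  Element 0 faults as normal;
    a fault on any later element truncates VL instead.
    """
    results = []
    for i, ea in enumerate(addrs):
        if ea not in mem:               # this access would fault
            if i == 0:
                raise MemoryError(f"fault at {ea:#x}")  # ordinary exception
            break                       # cancel the access, truncate VL
        results.append(mem[ea])
    return results, len(results)
```

With `mem = {0: 1, 8: 2}` and addresses `[0, 8, 16, 24]` the result is `([1, 2], 2)`: VL is truncated to 2 and no exception is raised.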

ffirst LD/ST to multiple pages via a Vectorised Index base is considered
a security risk due to the abuse of probing multiple pages in rapid
succession and getting feedback on which pages would fail.  Therefore
ffirst on Vector Indexed LD/ST is prohibited entirely, and the Mode bit
is instead used for element-strided LD/ST.  See
<https://bugs.libre-soc.org/show_bug.cgi?id=561>

    for(i = 0; i < VL; i++)
        reg[rt + i] = mem[reg[ra] + i * reg[rb]];

High security implementations where any kind of speculative probing
of memory pages is considered a risk should take advantage of the fact that
implementations may truncate VL at any point, without requiring software
to be rewritten and made non-portable. Such implementations may choose
to *always* set VL=1 which will have the effect of terminating any
speculative probing (and also adversely affect performance), but will
at least not require applications to be rewritten.

Low-performance simpler hardware implementations may also
choose to always set VL=1 as the bare minimum compliant implementation of
LD/ST Fail-First. It is however critically important to remember that
the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.
**MUST** raise exceptions exactly like an ordinary LD/ST.

For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value for
any implementation-specific reason. For example: it is perfectly
reasonable for implementations to alter VL when ffirst LD or ST
operations are initiated on a nonaligned boundary, such that within a
loop the subsequent iteration of that loop begins the following ffirst
LD/ST operations on an aligned boundary
such as the beginning of a cache line, or beginning of a Virtual Memory
page. Likewise, VL may be truncated to reduce workloads or balance resources.

Vertical-First Mode is slightly strange in that only one element
at a time is ever executed anyway.  Given that programmers may
legitimately choose to alter srcstep and dststep in non-sequential
order as part of explicit loops, it is neither possible nor
safe to make speculative assumptions about future LD/STs.
Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`.
This is very different from Arithmetic (Data-dependent) FFirst
where Vertical-First Mode is fully deterministic, not speculative.

# LOAD/STORE Elwidths <a name="elwidth"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld).  Only `extsb` and
others like it provide an explicit operation width.  There are therefore
*three* widths involved:

* operation width (lb=8, lh=16, lw=32, ld=64)
* source element width override
* destination element width override

Some care is therefore needed to express and make clear the transformations,
which are expressly in this order:

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
   - zero-extension or truncation from operation width to source elwidth
   - zero-extension or truncation to dest elwidth
* Saturated mode:
   - Sign-extension or truncation from operation width to source elwidth
   - signed/unsigned saturation down to dest elwidth
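One possible reading of the ordered transformations above is sketched below, with all widths in bits. It assumes the source elwidth is never smaller than the operation width (and so rejects that case rather than truncating), and it returns the result as a bit pattern of the destination width. The helper names are hypothetical and the sketch is not the specification's own pseudocode.

```python
def mask(value, bits):
    return value & ((1 << bits) - 1)

def sign_extend(value, bits):
    value = mask(value, bits)
    return value - (1 << bits) if value >> (bits - 1) else value

def load_adjust(memread, op_width, src_elwidth, dest_elwidth, saturate=None):
    """saturate is None, "signed" or "unsigned" (illustrative sketch)."""
    if src_elwidth < op_width:
        raise ValueError("source elwidth below operation width")
    if saturate is None:
        value = mask(memread, op_width)     # zero-extend to source elwidth
        return mask(value, dest_elwidth)    # zero-extend or truncate
    value = sign_extend(memread, op_width)  # sign-extend to source elwidth
    if saturate == "signed":
        lo, hi = -(1 << (dest_elwidth - 1)), (1 << (dest_elwidth - 1)) - 1
    else:
        lo, hi = 0, (1 << dest_elwidth) - 1
    return mask(max(lo, min(hi, value)), dest_elwidth)
```

For a 16-bit load of `0x1234` narrowed to an 8-bit destination, the non-saturated result is the truncated `0x34`, whereas unsigned saturation gives `0xFF` and signed saturation gives `0x7F`.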

In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation.  This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

It is unfortunately possible to request an elwidth override on the memory side which
does not mesh with the operation width: such requests result in `UNDEFINED`
behaviour.  The reason is that the effect of attempting a 64-bit `sv.ld`
operation with a source elwidth override of 8/16/32 would result in
overlapping memory requests, particularly on unit and element strided
operations.  Thus it is `UNDEFINED` when the elwidth is smaller than
the memory operation width. Examples include `sv.lw/sw=16/els` which
requests (overlapping) 4-byte memory reads offset from
each other at 2-byte intervals.  Store likewise is also `UNDEFINED`
where the dest elwidth override is less than the operation width.

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
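The `bytereverse = brev XNOR MSR.LE` step in the pseudocode above can be tried out in isolation. The sketch below assumes, as the pseudocode's comments suggest, that the raw memory read is interpreted in little-endian order before the optional swap; `op_width` is in bytes here, and the function name is hypothetical.

```python
def load_with_brev(data: bytes, op_width: int, brev: bool, msr_le: bool) -> int:
    """Read op_width bytes and apply the ld/ldbrx + LE/BE swap (sketch)."""
    memread = int.from_bytes(data[:op_width], "little")  # raw LE-order read
    bytereverse = (brev == msr_le)                       # brev XNOR MSR.LE
    if bytereverse:
        # byteswap at the operation width
        memread = int.from_bytes(memread.to_bytes(op_width, "little"), "big")
    return memread
```

Given the bytes `12 34 56 78` in memory, a plain 4-byte load yields `0x78563412` on a little-endian processor and `0x12345678` on a big-endian one; the byte-reversed variant gives the opposite result in each case.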

# Remapped LD/ST

In the [[sv/remap]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (a minimum of two 64-bit opcodes) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs.  The usual interest in such re-mapping
is, for example, in separating out 24-bit RGB channel data into separate
contiguous registers.  NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.
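The RGB case amounts to a 2D index transposition applied to the element order of the LDs. The Python sketch below shows only that resulting data movement; it does not model how REMAP itself is set up or encoded, and both function names are hypothetical.

```python
def rgb_remap_indices(n_pixels, channels=3):
    """Source element index for each destination element (sketch).

    Destination order is all R, then all G, then all B.
    """
    return [p * channels + c for c in range(channels) for p in range(n_pixels)]

def structured_load(mem, n_pixels, channels=3):
    """De-interleave packed pixel data into one list per channel."""
    flat = [mem[i] for i in rgb_remap_indices(n_pixels, channels)]
    return [flat[c * n_pixels:(c + 1) * n_pixels] for c in range(channels)]
```

For two pixels stored as `[R0, G0, B0, R1, G1, B1]` the index sequence is `[0, 3, 1, 4, 2, 5]`, so the three channels land in separate contiguous groups.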

Also LD/ST with immediate has a Pack/Unpack option similar to VSX
'vpack' and 'vunpack'.

# notes from lxo

This section covers assembly notation for the immediate and indexed LD/ST.
The summary is that in immediate mode for LD it is not clear that, if the
destination register is Vectorised `RT.v` but the source `imm(RA)` is scalar,
the memory being read is *still a vector load*, known as "unit or element strides".

This anomaly is made clear with the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being implicitly
identical to the above, is prohibited and is a syntax error:

    sv.ld RT.v, imm(RA)

Notes taken from IRC conversation

    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other memory-operand constraints), append .v to it and you're done addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the address is a sum of registers

permutations of vector selection, to identify above asm-syntax:

     imm(RA)  RT.v   RA.v   nonstrided
         sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
           mem@     offs+r#2  offs+(r#2+1)  offs+(r#2+2)
           destreg  r#        r#+1          r#+2
     imm(RA)  RT.s   RA.v   nonstrided
         sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
           (dest r# is scalar) -> VSELECT mode
     imm(RA)  RT.v   RA.s   fixed stride: unit or element
         sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
           mem@r#2  +0   +1   +2
           destreg  r#   r#+1 r#+2
         sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
           mem@r#2  +0 ...   +offs ...  +offs*2
           destreg  r#       r#+1       r#+2
     imm(RA)  RT.s   RA.s   not vectorised
         sv.ld r#, ofst(r#2)

indexed mode:

     RA,RB    RT.v  RA.v  RB.v
        sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
     RA,RB    RT.v  RA.s  RB.v
        sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
     RA,RB    RT.v  RA.v  RB.s
        sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
     RA,RB    RT.v  RA.s  RB.s
        sv.ldx r#.v, r#2, r#3 -> VSPLAT mode
     RA,RB    RT.s  RA.v  RB.v
     RA,RB    RT.s  RA.s  RB.v
     RA,RB    RT.s  RA.v  RB.s
     RA,RB    RT.s  RA.s  RB.s not vectorised