[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>

# Rationale

All Vector ISAs dating back fifty years have extensive and comprehensive
Load and Store operations that go far beyond the capabilities of Scalar
RISC or CISC processors, yet at their heart on an individual element
basis may be found to be no different from RISC Scalar equivalents.

The resource savings from Vector LD/ST are significant and stem from
the fact that one single instruction can trigger a dozen (or, in some
microarchitectures such as Cray or NEC SX Aurora, hundreds of)
element-level Memory accesses.

Additionally, and simply: if the Arithmetic side of an ISA supports
Vector Operations, then in order to keep the ALUs 100% occupied the
Memory infrastructure (and the ISA itself) correspondingly needs Vector
Memory Operations as well.

Vectorised Load and Store also presents an extra dimension (literally)
which creates scenarios unique to Vector applications, that a Scalar
(and even a SIMD) ISA simply never encounters. SVP64 endeavours to
add such modes without changing the behaviour of the underlying Base
(Scalar) v3.0B operations.

# Modes overview

Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different modes (illustrated by the short
address-generation sketch further below):

* fixed aka "unit" stride (contiguous sequence with no gaps)
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* Speculative fail-first (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).

Also included in SVP64 LD/ST is both signed and unsigned Saturation,
as well as Element-width overrides and Twin-Predication.

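The three addressing modes differ only in how each element's Effective
Address is formed. A minimal sketch (Python, purely illustrative and not
part of the specification; predication, update mode and element-width
overrides are omitted, and the indexed case is simplified to a scalar
base plus a vector of register offsets):

    # illustrative only: per-element Effective Address generation by mode
    def effective_addresses(mode, regs, RA, RB, immed, VL, op_width):
        eas = []
        for i in range(VL):
            if mode == "unitstride":
                # contiguous: base + immediate + element-number * operation width
                eas.append(regs[RA] + immed + i * op_width)
            elif mode == "elementstride":
                # regular gaps: base + element-number * immediate (the stride)
                eas.append(regs[RA] + i * immed)
            elif mode == "indexed":
                # gather: base address plus a per-element register offset
                eas.append(regs[RA] + regs[RB + i])
        return eas

    # example: unit-strided 4-byte (lw) loads from base 0x1000, immediate 0
    print(effective_addresses("unitstride", {3: 0x1000}, 3, None, 0, 4, 4))
    # [4096, 4100, 4104, 4108]
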
*Despite being constructed from Scalar LD/ST none of these Modes
exist or make sense in any Scalar ISA. They **only** exist in Vector ISAs*

# Vectorisation of Scalar Power ISA v3.0B

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride. Since there is no way to tell which from the OpenPOWER
v3.0B Scalar opcode alone, the choice is provided instead by the SV
Context.

    # LD not VLD! format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip non-predicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

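As a quick non-normative worked example of the two strided cases above
(values chosen arbitrarily: base 0x2000, immediate 8, `sv.ld` so
op_width is 8, VL=4, all predicate bits set):

    # illustrative check (Python), not part of the spec:
    base, immed, op_width, VL = 0x2000, 8, 8, 4
    unit    = [base + immed + i * op_width for i in range(VL)]
    element = [base + i * immed            for i in range(VL)]
    print([hex(ea) for ea in unit])     # ['0x2008', '0x2010', '0x2018', '0x2020']
    print([hex(ea) for ea in element])  # ['0x2000', '0x2008', '0x2010', '0x2018']
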
Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
        # skip non-predicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
            EA = ireg[RA] + ireg[RB]*j   # register-strided
        else
            EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if svctx.ldstmode != elementstride:
            if (!RA.isvec && !RB.isvec)
                break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source. This is because there is room in svp64 to extend
RA-as-src as well as RA-as-dest, both independently as scalar or vector
*and* independently extending their range.

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk. Fail-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace,
where strided fail-first (by creating contiguous sequential LDs) does
not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either (or, is a quirk). Realistically
we need an alternative table meaning for [[sv/svp64]] mode. The
following modes make sense:

* saturation
* predicate-result (mostly for cache-inhibited LD/ST)
* normal
* fail-first (where Vector Indexed is banned)
* Signed Effective Address computation (Vector Indexed only)

The table for [[sv/svp64]] for `immed(RA)` is:

| 0-1 | 2   | 3 4     | description               |
| --- | --- | ------- | ------------------------- |
| 00  | 0   | zz els  | normal mode               |
| 00  | 1   | rsvd    | reserved                  |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | els RC1 | Rc=0: ffirst z/nonz       |
| 10  | N   | zz els  | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | els RC1 | Rc=0: pred-result z/nonz  |

The `els` bit is only relevant when `RA.isvec` is clear: this indicates
whether stride is unit or element:

    if RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
in effect the multiplication of the immediate-offset by zero results
in reading from the exact same memory location.

For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
just the once and be copied, rather than hitting the Data Cache
multiple times with the same memory read at the same location.
The benefit of a cache-inhibited LD-VSPLAT, by contrast, is that it
allows memory-mapped peripherals to have multiple data values read in
quick succession and stored in sequentially numbered registers.

For non-cache-inhibited ST from a vector source onto a scalar
destination: with the Vector loop effectively creating multiple memory
writes to the same location, we can deduce that the last of these will
be the "successful" one. Thus, implementations are free and clear to
optimise out the overwriting STs, leaving just the last one as the
"winner". Bear in mind that predicate masks will skip some elements
(in source non-zeroing mode). Cache-inhibited ST operations on the
other hand **MUST** write out a Vector source multiple successive times
to the exact same Scalar destination.

Note that there are no immediate versions of cache-inhibited LD/ST.

The modes for the `RA+RB` indexed version are slightly different:

| 0-1 | 2   | 3 4     | description                  |
| --- | --- | ------- | ---------------------------- |
| 00  | SEA | dz sz   | normal mode                  |
| 01  | SEA | dz sz   | Strided (scalar only source) |
| 10  | N   | dz sz   | sat mode: N=0/1 u/s          |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel     |
| 11  | inv | zz RC1  | Rc=0: pred-result z/nonz     |

Vector Indexed Strided Mode is qualified as follows:

    if mode = 0b01 and !RA.isvec and !RB.isvec:
        svctx.ldstmode = elementstride

A summary of the effect of Vectorisation of src or dest:

    imm(RA)  RT.v  RA.v  no stride allowed
    imm(RA)  RT.s  RA.v  no stride allowed
    imm(RA)  RT.v  RA.s  stride-select allowed
    imm(RA)  RT.s  RA.s  not vectorised
    RA,RB    RT.v  {RA|RB}.v  UNDEFINED
    RA,RB    RT.s  {RA|RB}.v  UNDEFINED
    RA,RB    RT.v  {RA&RB}.s  VSPLAT possible. stride selectable
    RA,RB    RT.s  {RA&RB}.s  not vectorised

Signed Effective Address computation is only relevant for
Vector Indexed Mode, when elwidth overrides are applied.
The source override applies to RB: if SEA is set, RB is sign-extended
from elwidth bits to the full 64 bits before being added to RA in order
to calculate the Effective Address. For other Modes (ffirst, saturate),
all EA computation with elwidth overrides is unsigned.

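A minimal non-normative sketch of that sign-extension step (Python; the
function name is invented purely for illustration):

    # sign-extend an elwidth-bit offset to 64 bits (SEA=1), else zero-extend
    def extend_offset(rb_elem, elwidth, sea):
        value = rb_elem & ((1 << elwidth) - 1)
        if sea and (value & (1 << (elwidth - 1))):
            value -= (1 << elwidth)          # interpret the top bit as a sign bit
        return value & ((1 << 64) - 1)       # wrap into 64-bit EA arithmetic

    # e.g. an 8-bit offset of 0xFF is -1 under SEA, a large positive otherwise
    print(hex(extend_offset(0xFF, 8, True)))   # 0xffffffffffffffff
    print(hex(extend_offset(0xFF, 8, False)))  # 0xff
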
Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will
perform **multiple** LD/ST operations, sequentially. Even with a scalar
src, `ldcix` will read the same memory location *multiple times*,
storing the result in successive Vector destination registers. This is
because the cache-inhibit instructions are used to read and write
memory-mapped peripherals. If a genuine cache-inhibited LD-VSPLAT is
required then a *scalar* cache-inhibited LD should be performed,
followed by a VSPLAT-augmented mv.

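Purely as an illustration of the difference in observable behaviour (not
part of the specification; `mem_read` and `mmio_read` are invented
stand-ins for cacheable and cache-inhibited accesses respectively):

    # non-cache-inhibited LD-VSPLAT: one memory read, copied into VL registers
    def ld_vsplat(regs, RT, EA, VL, mem_read):
        value = mem_read(EA)                 # single read...
        for j in range(VL):
            regs[RT + j] = value             # ...replicated to RT..RT+VL-1

    # cache-inhibited (ldcix) with VSPLAT active: VL separate reads of the
    # same address, e.g. draining a memory-mapped peripheral FIFO register
    def ldcix_vsplat(regs, RT, EA, VL, mmio_read):
        for j in range(VL):
            regs[RT + j] = mmio_read(EA)     # one real read per element
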
## LD/ST ffirst

LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
ordinary one. Exceptions occur "as normal". However for elements 1
and above, if an exception would occur, then VL is **truncated** to the
previous element: the exception is **not** then raised because the
LD/ST was effectively speculative.

ffirst LD/ST to multiple pages via a Vectorised Index base is considered
a security risk due to the abuse of probing multiple pages in rapid
succession and getting feedback on which pages would fail. Therefore
Vector Indexed LD/ST is prohibited entirely, and the Mode bit is instead
used for element-strided LD/ST. See
<https://bugs.libre-soc.org/show_bug.cgi?id=561>

    for(i = 0; i < VL; i++)
        reg[rt + i] = mem[reg[ra] + i * reg[rb]];

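A minimal sketch of the truncation rule (Python, non-normative;
`load_would_fault` is an invented stand-in for the MMU/permission check):

    # element 0 must behave exactly like a normal load: the fault is taken.
    # elements 1 and above instead truncate VL and suppress the exception.
    def ffirst_load(regs, RT, eas, VL, load_would_fault, mem_read):
        for i in range(VL):
            if load_would_fault(eas[i]):
                if i == 0:
                    raise MemoryError("ordinary LD exception on element 0")
                return i          # VL truncated: elements 0..i-1 completed
            regs[RT + i] = mem_read(eas[i])
        return VL                 # all elements succeeded, VL unchanged
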
High security implementations where any kind of speculative probing
of memory pages is considered a risk should take advantage of the fact
that implementations may truncate VL at any point, without requiring
software to be rewritten and made non-portable. Such implementations may
choose to *always* set VL=1 which will have the effect of terminating
any speculative probing (and also adversely affect performance), but
will at least not require applications to be rewritten.

Low-performance, simpler hardware implementations may also choose to
always set VL=1 as the bare minimum compliant implementation of
LD/ST Fail-First. It is however critically important to remember that
the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.
**MUST** raise exceptions exactly like an ordinary LD/ST.

For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
for any implementation-specific reason. For example: it is perfectly
reasonable for implementations to alter VL when ffirst LD or ST
operations are initiated on a non-aligned boundary, such that within a
loop the subsequent iteration of that loop begins the following ffirst
LD/ST operations on an aligned boundary such as the beginning of a cache
line, or beginning of a Virtual Memory page. Likewise, to reduce
workloads or balance resources.

Vertical-First Mode is slightly strange in that only one element
at a time is ever executed anyway. Given that programmers may
legitimately choose to alter srcstep and dststep in non-sequential
order as part of explicit loops, it is neither possible nor
safe to make speculative assumptions about future LD/STs.
Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`.
This is very different from Arithmetic (Data-dependent) FFirst
where Vertical-First Mode is fully deterministic, not speculative.

# LOAD/STORE Elwidths <a name="elwidth"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
others like it provide an explicit operation width. There are therefore
*three* widths involved:

* operation width (lb=8, lh=16, lw=32, ld=64)
* source element width override
* destination element width override

Some care is therefore needed to express and make clear the
transformations, which are expressly in this order:

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
  - zero-extension or truncation from operation width to source elwidth
  - zero-extension or truncation to dest elwidth
* Saturated mode:
  - sign-extension or truncation from operation width to source elwidth
  - signed/unsigned saturation down to dest elwidth

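A non-normative sketch of that ordering (Python; byte-reversal is
omitted because it appears in the pseudocode further below, and only the
unsigned-saturation case is shown, the signed case following the same
pattern with sign-extension; the function names are invented):

    def adjust_wid(value, from_width, to_width):
        # truncate when narrowing; zero-extension of a non-negative int is a no-op
        return value & ((1 << to_width) - 1) if to_width < from_width else value

    def transform_load(memread, op_width, src_ew, dest_ew, saturate):
        v = adjust_wid(memread, op_width, src_ew)     # op width -> source elwidth
        if saturate:
            return min(v, (1 << dest_ew) - 1)         # unsigned saturate to dest
        return adjust_wid(v, src_ew, dest_ew)         # else zero-extend/truncate

    # a 32-bit (lw) load of 0x12345678 written to 16-bit destination elements:
    print(hex(transform_load(0x12345678, 32, 32, 16, False)))  # 0x5678 (truncated)
    print(hex(transform_load(0x12345678, 32, 32, 16, True)))   # 0xffff (saturated)
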
In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation. This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

It is unfortunately possible to request an elwidth override on the
memory side which does not mesh with the operation width: these result
in `UNDEFINED` behaviour. The reason is that the effect of attempting a
64-bit `sv.ld` operation with a source elwidth override of 8/16/32 would
result in overlapping memory requests, particularly on unit and element
strided operations. Thus it is `UNDEFINED` when the elwidth is smaller
than the memory operation width. Examples include `sv.lw/sw=16/els`
which requests (overlapping) 4-byte memory reads offset from each other
at 2-byte intervals. Store likewise is also `UNDEFINED` where the dest
elwidth override is less than the operation width.

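The overlap in the `sv.lw/sw=16/els` example can be seen from a quick
non-normative calculation of the byte ranges each element would touch:

    # element-strided lw (4-byte operation) with a 16-bit source elwidth
    # override: offsets advance at 2-byte intervals, so each 4-byte read
    # overlaps the next one
    op_width, interval = 4, 2
    for i in range(4):
        start = i * interval
        print(f"element {i}: reads bytes {start}..{start + op_width - 1}")
    # element 0: reads bytes 0..3
    # element 1: reads bytes 2..5   <- overlaps element 0
    # element 2: reads bytes 4..7
    # element 3: reads bytes 6..9
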
Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant
  (`ldbrx` rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector
capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
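
The `get_polymorphed_reg` / `set_polymorphed_reg` helpers above are
defined normatively in [[sv/svp64]]. Purely as an illustrative sketch of
the underlying idea (the register file viewed as a flat little-endian
byte array, with elements packed at the overridden width, and the
register file passed explicitly here to keep the example runnable):

    # the GPR file viewed as 128 x 8 bytes of little-endian element storage
    def get_polymorphed_reg(regfile, reg, bitwidth, i):
        nbytes = bitwidth // 8
        offs = reg * 8 + i * nbytes              # element i at the overridden width
        return int.from_bytes(regfile[offs:offs + nbytes], "little")

    def set_polymorphed_reg(regfile, reg, bitwidth, j, value):
        nbytes = bitwidth // 8
        offs = reg * 8 + j * nbytes
        regfile[offs:offs + nbytes] = value.to_bytes(nbytes, "little")

    regfile = bytearray(128 * 8)
    # element 5 of a 16-bit vector starting at r3 (spills into r4's bytes)
    set_polymorphed_reg(regfile, 3, 16, 5, 0xBEEF)
    print(hex(get_polymorphed_reg(regfile, 3, 16, 5)))  # 0xbeef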

# Remapped LD/ST

In the [[sv/remap]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs. The usual interest in such re-mapping
is for example in separating out 24-bit RGB channel data into separate
contiguous registers. NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

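To make the RGB example concrete: a structure-packing remap is, in
essence, a stride-3 re-ordering of element indices. A non-normative
Python sketch of the index pattern (not of the actual REMAP encoding):

    # de-interleave N RGB triplets: loads land in three contiguous register
    # groups (all R first, then all G, then all B) instead of R,G,B,R,G,B,...
    def rgb_remap_indices(n_pixels):
        order = []
        for channel in range(3):             # 0=R, 1=G, 2=B
            for pixel in range(n_pixels):
                order.append(pixel * 3 + channel)
        return order

    print(rgb_remap_indices(4))
    # [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]
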
Thus we do not need to provide specialist LD/ST "Structure Packed"
opcodes because the generic abstracted concept of "Remapping", when
applied to LD/ST, will give that same capability, with far more
flexibility.

Also LD/ST with immediate has a Pack/Unpack option similar to VSX
'vpack' and 'vunpack'.

# notes from lxo

This section covers assembly notation for the immediate and indexed
LD/ST. The summary is that, in immediate mode for LD, when the
destination register is Vectorised (`RT.v`) but the source `imm(RA)` is
scalar, it is not obvious that the memory being read is *still a vector
load*, known as "unit or element strides".

This anomaly is made clear with the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being
implicitly identical to the above, is prohibited and is a syntax error:

    sv.ld RT.v, imm(RA)

Notes taken from IRC conversation:

    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other
          memory-operand constraints), append .v to it and you're done
          addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the address
          is a sum of registers

permutations of vector selection, to identify above asm-syntax:

    imm(RA)  RT.v  RA.v  nonstrided
        sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
        mem@    0+r#2  offs+(r#2+1)  offs+(r#2+2)
        destreg r#     r#+1          r#+2
    imm(RA)  RT.s  RA.v  nonstrided
        sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
        (dest r# is scalar) -> VSELECT mode
    imm(RA)  RT.v  RA.s  fixed stride: unit or element
        sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
        mem@r#2  +0   +1    +2
        destreg  r#   r#+1  r#+2
        sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
        mem@r#2  +0   ...   +offs  ...  +offs*2
        destreg  r#   r#+1  r#+2
    imm(RA)  RT.s  RA.s  not vectorised
        sv.ld r#, ofst(r#2)

indexed mode:

    RA,RB  RT.v  RA.v  RB.v
        sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
    RA,RB  RT.v  RA.s  RB.v
        sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
    RA,RB  RT.v  RA.v  RB.s
        sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
    RA,RB  RT.v  RA.s  RB.s
        sv.ldx r#.v, r#2, r#3 -> VSPLAT mode
    RA,RB  RT.s  RA.v  RB.v
    RA,RB  RT.s  RA.s  RB.v
    RA,RB  RT.s  RA.v  RB.s
    RA,RB  RT.s  RA.s  RB.s  not vectorised