[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
* [[simple_v_extension/specification/ld.x]]

# Rationale

All Vector ISAs dating back fifty years have extensive and comprehensive
Load and Store operations that go far beyond the capabilities of Scalar
RISC or CISC processors, yet at their heart on an individual element
basis may be found to be no different from RISC Scalar equivalents.

The resource savings from Vector LD/ST are significant and stem from
the fact that one single instruction can trigger a dozen (or, in some
microarchitectures such as Cray or NEC SX Aurora, hundreds) of
element-level Memory accesses.

Additionally, and simply: if the Arithmetic side of an ISA supports
Vector Operations, then in order to keep the ALUs 100% occupied the
Memory infrastructure (and the ISA itself) correspondingly needs Vector
Memory Operations as well.

Vectorised Load and Store also presents an extra dimension (literally)
which creates scenarios unique to Vector applications, that a Scalar
(and even a SIMD) ISA simply never encounters. SVP64 endeavours to
add such modes without changing the behaviour of the underlying Base
(Scalar) v3.0B operations.

# Modes overview

Vectorisation of Load and Store requires the creation, from the scalar
operations, of a number of different modes:

* fixed stride (contiguous sequence with no gaps) aka "unit" stride
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* Speculative fail-first (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).

Also included in SVP64 LD/ST are both signed and unsigned Saturation,
as well as Element-width overrides and Twin-Predication.
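As a purely illustrative (non-normative) sketch of the first three modes,
the snippet below shows the Effective Addresses that each mode would
generate; helper names such as `unit_stride_eas` are invented for this
example only.

    # illustrative only: Effective Address patterns of the three main modes
    def unit_stride_eas(base, op_width, VL):
        # contiguous: base, base+w, base+2w, ...
        return [base + i * op_width for i in range(VL)]

    def element_stride_eas(base, stride, VL):
        # regularly offset, with gaps: base, base+stride, base+2*stride, ...
        return [base + i * stride for i in range(VL)]

    def vector_indexed_eas(bases, offsets):
        # vector of base addresses plus vector of offsets
        return [b + o for b, o in zip(bases, offsets)]

    print([hex(a) for a in unit_stride_eas(0x1000, 4, 4)])     # 1000 1004 1008 100c
    print([hex(a) for a in element_stride_eas(0x1000, 16, 4)]) # 1000 1010 1020 1030
    print([hex(a) for a in vector_indexed_eas([0x1000] * 4, [0, 64, 8, 32])])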

# Vectorisation of Scalar Power ISA v3.0B

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride. With there being no way to tell which from the
OpenPOWER v3.0B Scalar opcode alone, the choice is provided instead by
the SV Context.

    # LD not VLD! format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, RC, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip non-predicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == shifted: # for FFT/DCT
          # FFT/DCT shifted mode
          if (RA.isvec)
            srcbase = ireg[RA+i]
          else
            srcbase = ireg[RA]
          offs = (i * immed) << RC
        elif svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

    # reverses the bitorder up to "width" bits
    def bitrev(val, VL):
        width = log2(VL)
        result = 0
        for _ in range(width):
            result = (result << 1) | (val & 1)
            val >>= 1
        return result

Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
        # skip nonpredicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          EA = ireg[RA] + ireg[RB]*j # register-strided
        else
          EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end immediately
        if svctx.ldstmode != elementstride:
          if (!RA.isvec && !RB.isvec)
            break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source. This is because there is room in svp64 to extend
RA-as-src as well as RA-as-dest, both independently as scalar or vector
*and* independently extending their range.

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk. Fail-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace,
where strided fail-first (by creating contiguous sequential LDs) does
not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either (or, at best, is a quirk).
Realistically we need an alternative table of meanings for [[sv/svp64]]
mode. The following modes make sense:

* saturation
* predicate-result (mostly for cache-inhibited LD/ST)
* normal
* fail-first (where Vector Indexed is banned)
* Signed Effective Address computation (Vector Indexed only)

Also, given that FFT, DCT and other related algorithms
are of such high importance in so many areas of Computer
Science, a special "shift" mode has been added which
allows part of the immediate to be used instead as RC, a register
which shifts the immediate `DS << GPR(RC)`.
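As a rough, non-normative illustration of the offsets this produces
(mirroring the `offs = (i * immed) << RC` line in the `op_load`
pseudocode above; the helper name here is invented for the example):

    # illustrative only: offset sequence generated by "shift" mode,
    # where the immediate DS is scaled by the element index and then
    # shifted left by the contents of GPR(RC)
    def shifted_offsets(DS, rc_value, VL):
        return [(i * DS) << rc_value for i in range(VL)]

    print(shifted_offsets(8, 2, 4))  # DS=8, GPR(RC)=2: [0, 32, 64, 96]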

The table for [[sv/svp64]] for `immed(RA)` is:

| 0-1 | 2 | 3 4 | description |
| --- | --- |---------|--------------------------- |
| 00 | 0 | dz els | normal mode |
| 00 | 1 | dz shf | shift mode |
| 01 | inv | CR-bit | Rc=1: ffirst CR sel |
| 01 | inv | els RC1 | Rc=0: ffirst z/nonz |
| 10 | N | dz els | sat mode: N=0/1 u/s |
| 11 | inv | CR-bit | Rc=1: pred-result CR sel |
| 11 | inv | els RC1 | Rc=0: pred-result z/nonz |

The `els` bit is only relevant when `RA.isvec` is clear: this indicates
whether stride is unit or element:

    if shifted:
        svctx.ldstmode = shifted
    elif RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
in effect the multiplication of the immediate-offset by zero results
in reading from the exact same memory location.

For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
just the once and be copied, rather than hitting the Data Cache
multiple times with the same memory read at the same location.
Cache-inhibited LD-VSPLAT, on the other hand, allows memory-mapped
peripherals to have multiple data values read in quick succession
and stored in sequentially numbered registers.
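A minimal sketch of the LD-VSPLAT effect (illustrative only, assuming a
toy dict-based memory and a flat integer regfile): with a scalar RA, an
immediate of zero and a vector RT, every element reads the same
location, so a non-cache-inhibited implementation may legitimately read
once and copy.

    # illustrative only: LD-VSPLAT with scalar RA, immed=0, vector RT
    mem = {0x2000: 0x99}       # toy memory model
    ireg = [0] * 32            # toy integer regfile
    RA, RT, VL = 1, 4, 4
    ireg[RA] = 0x2000

    value = mem[ireg[RA] + 0]  # the read occurs just the once...
    for j in range(VL):
        ireg[RT + j] = value   # ...and is copied into RT .. RT+VL-1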

For non-cache-inhibited ST from a vector source onto a scalar
destination: with the Vector loop effectively creating multiple memory
writes to the same location, we can deduce that the last of these will
be the "successful" one. Thus, implementations are free and clear to
optimise out the overwriting STs, leaving just the last one as the
"winner". Bear in mind that predicate masks will skip some elements
(in source non-zeroing mode). Cache-inhibited ST operations on the
other hand **MUST** write out a Vector source multiple successive
times to the exact same Scalar destination.
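The sketch below (illustrative, non-normative) shows why this
optimisation is legal for non-cache-inhibited ST: every iteration
targets the same Effective Address, so only the final element's value
is architecturally visible.

    # illustrative only: vector-source ST to a scalar (unchanging) EA
    mem = {}
    ireg = [0, 0x3000, 0, 0, 10, 20, 30, 40]   # r4..r7 hold the vector
    RS, RA, VL = 4, 1, 4

    for j in range(VL):
        mem[ireg[RA]] = ireg[RS + j]   # same location written VL times

    # only the last write survives, so the earlier STs may be elided
    assert mem[0x3000] == 40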

Note that there are no immediate versions of cache-inhibited LD/ST.

The modes for `RA+RB` indexed version are slightly different:

| 0-1 | 2 | 3 4 | description |
| --- | --- |---------|-------------------------- |
| 00 | SEA | dz sz | normal mode |
| 01 | SEA | dz sz | Strided (scalar only source) |
| 10 | N | dz sz | sat mode: N=0/1 u/s |
| 11 | inv | CR-bit | Rc=1: pred-result CR sel |
| 11 | inv | dz RC1 | Rc=0: pred-result z/nonz |

Vector Indexed Strided Mode is qualified as follows:

    if mode = 0b01 and !RA.isvec and !RB.isvec:
        svctx.ldstmode = elementstride

A summary of the effect of Vectorisation of src or dest:

    imm(RA)  RT.v  RA.v       no stride allowed
    imm(RA)  RT.s  RA.v       no stride allowed
    imm(RA)  RT.v  RA.s       stride-select allowed
    imm(RA)  RT.s  RA.s       not vectorised
    RA,RB    RT.v  {RA|RB}.v  UNDEFINED
    RA,RB    RT.s  {RA|RB}.v  UNDEFINED
    RA,RB    RT.v  {RA&RB}.s  VSPLAT possible. stride selectable
    RA,RB    RT.s  {RA&RB}.s  not vectorised

Signed Effective Address computation is only relevant for
Vector Indexed Mode, when elwidth overrides are applied.
The source override applies to RB: before adding RB to
RA in order to calculate the Effective Address, if SEA is
set then RB is sign-extended from elwidth bits to the full 64
bits. For other Modes (ffirst, saturate),
all EA computation with elwidth overrides is unsigned.
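A sketch of the SEA rule (illustrative only; the `sign_extend` helper is
not part of the specification): with a 16-bit source elwidth override,
an RB element of 0xFFFE contributes -2 to the EA when SEA is set, and
65534 when it is not.

    # illustrative only: Signed Effective Address computation
    def sign_extend(val, bits):
        sign = 1 << (bits - 1)
        return (val & (sign - 1)) - (val & sign)

    def indexed_ea(ra, rb_elem, elwidth, SEA):
        if SEA:
            rb_elem = sign_extend(rb_elem, elwidth)  # e.g. 0xFFFE -> -2
        # without SEA the elwidth-sized RB element is used unsigned
        return (ra + rb_elem) & ((1 << 64) - 1)

    print(hex(indexed_ea(0x5000, 0xFFFE, 16, SEA=True)))   # 0x4ffe
    print(hex(indexed_ea(0x5000, 0xFFFE, 16, SEA=False)))  # 0x14ffe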

Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will
perform **multiple** LD/ST operations, sequentially. `ldcix` even with
a scalar src will read the same memory location *multiple times*,
storing the result in successive Vector destination registers. This is
because the cache-inhibit instructions are used to read and write
memory-mapped peripherals. If a genuine cache-inhibited LD-VSPLAT is
required then a *scalar* cache-inhibited LD should be performed,
followed by a VSPLAT-augmented mv.

## LD/ST ffirst

LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
ordinary one. Exceptions occur "as normal". However for elements 1
and above, if an exception would occur, then VL is **truncated** to the
previous element: the exception is **not** then raised because the
LD/ST was effectively speculative.

ffirst LD/ST to multiple pages via a Vectorised Index base is considered
a security risk due to the abuse of probing multiple pages in rapid
succession and getting feedback on which pages would fail. Therefore
Vector Indexed LD/ST is prohibited entirely, and the Mode bit instead
used for element-strided LD/ST. See
<https://bugs.libre-soc.org/show_bug.cgi?id=561>

    for(i = 0; i < VL; i++)
        reg[rt + i] = mem[reg[ra] + i * reg[rb]];

High security implementations where any kind of speculative probing
of memory pages is considered a risk should take advantage of the fact
that implementations may truncate VL at any point, without requiring
software to be rewritten and made non-portable. Such implementations
may choose to *always* set VL=1 which will have the effect of
terminating any speculative probing (and also adversely affect
performance), but will at least not require applications to be
rewritten.

Simpler low-performance hardware implementations may also choose to
always set VL=1 as the bare minimum compliant implementation of
LD/ST Fail-First. It is however critically important to remember that
the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.
**MUST** raise exceptions exactly like an ordinary LD/ST.
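The following sketch (illustrative only, not the normative pseudocode)
models the Fail-First rule: element 0 is allowed to fault as normal,
but a fault on any later element instead truncates VL to the number of
elements already completed; `would_fault` stands in here for the
implementation's translation/permission check.

    # illustrative only: LD/ST fail-first VL truncation
    def ffirst_load(eas, VL, would_fault, do_load):
        for i in range(VL):
            if would_fault(eas[i]):
                if i == 0:
                    raise MemoryError("element 0 faults as normal")
                return i       # VL truncated; no exception is raised
            do_load(i, eas[i])
        return VL

    # if element 2 would fault, VL becomes 2: only elements 0..1 are loaded
    vl = ffirst_load([0x1000, 0x1008, 0xdead0000, 0x1018], 4,
                     would_fault=lambda ea: ea >= 0xdead0000,
                     do_load=lambda i, ea: None)
    assert vl == 2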

For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
for any implementation-specific reason. For example: it is perfectly
reasonable for implementations to alter VL when ffirst LD or ST
operations are initiated on a nonaligned boundary, such that within a
loop the subsequent iteration of that loop begins subsequent ffirst
LD/ST operations on an aligned boundary such as the beginning of a
cache line, or beginning of a Virtual Memory page. Likewise, to reduce
workloads or balance resources.

Vertical-First Mode is slightly strange in that only one element
at a time is ever executed anyway. Given that programmers may
legitimately choose to alter srcstep and dststep in non-sequential
order as part of explicit loops, it is neither possible nor
safe to make speculative assumptions about future LD/STs.
Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`.
This is very different from Arithmetic (Data-dependent) FFirst
where Vertical-First Mode is deterministic, not speculative.

# LOAD/STORE Elwidths <a name="elwidth"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
others like it provide an explicit operation width. There are therefore
*three* widths involved:

* operation width (lb=8, lh=16, lw=32, ld=64)
* source element width override
* destination element width override

Some care is therefore needed to express and make clear the
transformations, which are expressly in this order:

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
  - zero-extension or truncation from operation width to source elwidth
  - zero/truncation to dest elwidth
* Saturated mode:
  - Sign-extension or truncation from operation width to source width
  - signed/unsigned saturation down to dest elwidth
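As a rough sketch of that ordering (illustrative only: `adjust_wid` and
`saturate` here are placeholder helpers, not the specification's
definitions), this is approximately what happens to a loaded value on
its way to the destination element:

    # illustrative only: width transformations applied after the memory read
    def adjust_wid(val, from_bits, to_bits):
        # zero-extension of an unsigned value is a no-op; narrowing truncates
        return val & ((1 << min(from_bits, to_bits)) - 1)

    def saturate(val, to_bits, signed):
        if signed:
            lo, hi = -(1 << (to_bits - 1)), (1 << (to_bits - 1)) - 1
        else:
            lo, hi = 0, (1 << to_bits) - 1
        return max(lo, min(hi, val))

    memread = 0x1234   # value loaded at operation width (e.g. lh = 16-bit)
    # non-saturated: zero-extend/truncate to src elwidth, then dest elwidth
    print(hex(adjust_wid(adjust_wid(memread, 16, 32), 32, 8)))  # 0x34
    # saturated: unsigned saturation down to an 8-bit dest elwidth
    print(hex(saturate(memread, 8, signed=False)))              # 0xff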

In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation. This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

It is unfortunately possible to request an elwidth override on the
memory side which does not mesh with the operation width: such
requests result in `UNDEFINED` behaviour. The reason is that attempting
a 64-bit `sv.ld` operation with a source elwidth override of 8/16/32
would result in overlapping memory requests, particularly on unit and
element strided operations. Thus it is `UNDEFINED` when the elwidth is
smaller than the memory operation width. Examples include
`sv.lw/sw=16/els` which requests (overlapping) 4-byte memory reads
offset from each other at 2-byte intervals. Store likewise is also
`UNDEFINED` where the dest elwidth override is less than the operation
width.
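To make the overlap concrete, the (purely illustrative) arithmetic below
reproduces the `sv.lw/sw=16/els` example: each element still performs a
4-byte operation-width read, but the 16-bit source elwidth spaces
consecutive reads only 2 bytes apart.

    # illustrative only: why elwidth smaller than op width is UNDEFINED
    op_width = 4      # lw reads 4 bytes
    src_elwidth = 2   # sw=16 override: elements are 2 bytes apart
    base = 0x1000
    reads = [(base + i * src_elwidth, base + i * src_elwidth + op_width - 1)
             for i in range(4)]
    # [(0x1000, 0x1003), (0x1002, 0x1005), (0x1004, 0x1007), (0x1006, 0x1009)]
    # each 4-byte read overlaps the next one by 2 bytes
    print([(hex(lo), hex(hi)) for lo, hi in reads])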

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector
capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
          # strange vector mode, compute 64 bit address which is
          # not polymorphic! elwidth hardcoded to 64 here
          srcbase = get_polymorphed_reg(RA, 64, i)
        else:
          # unit / element stride mode, compute 64 bit address
          srcbase = get_polymorphed_reg(RA, 64, 0)
          # adjust for unit/el-stride
          srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
          memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
          ... saturation adjustment...
        else:
          # truncate/extend to over-ridden source width.
          memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
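The helpers `get_polymorphed_reg` and `set_polymorphed_reg` are
specified in the main [[sv/svp64]] documentation; purely as a reading
aid for the pseudocode above, a simplified (non-normative) model,
treating the register file as a flat little-endian byte array and
ignoring predication, might look like this:

    # simplified, non-normative model of the polymorphic register accessors
    regfile = bytearray(128 * 8)   # 128 x 64-bit GPRs as a flat byte array

    def get_polymorphed_reg(reg, bitwidth, elidx):
        bytewid = bitwidth // 8
        offs = reg * 8 + elidx * bytewid      # element elidx at this width
        return int.from_bytes(regfile[offs:offs + bytewid], "little")

    def set_polymorphed_reg(reg, bitwidth, elidx, value):
        bytewid = bitwidth // 8
        offs = reg * 8 + elidx * bytewid
        regfile[offs:offs + bytewid] = value.to_bytes(bytewid, "little")

    # e.g. four 16-bit elements written to j=0..3 pack into r3
    for j, v in enumerate([0x1111, 0x2222, 0x3333, 0x4444]):
        set_polymorphed_reg(3, 16, j, v)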

# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs. The usual interest in such re-mapping
is for example in separating out 24-bit RGB channel data into separate
contiguous registers. NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.
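Purely as an illustration of what a suitable 2D remap achieves for the
RGB example (the index arithmetic below stands in for the actual
[[sv/remap]] schedule and is not the specification's): the loads walk
memory sequentially, while the remapped destination element index
gathers all R values, then all G, then all B, into contiguous elements.

    # illustrative only: de-interleaving packed RGB bytes via remapping
    mem = [b for px in [(1, 2, 3), (4, 5, 6), (7, 8, 9)] for b in px]
    VL, channels = 9, 3
    dest = [0] * VL

    for i in range(VL):
        # source walks memory linearly; destination index is transposed
        # so that channel data ends up contiguous (RRR GGG BBB)
        j = (i % channels) * (VL // channels) + (i // channels)
        dest[j] = mem[i]

    assert dest == [1, 4, 7, 2, 5, 8, 3, 6, 9]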

# notes from lxo

This section covers assembly notation for the immediate and indexed
LD/ST. The summary is that, in immediate mode for LD, it is not
obvious that when the destination register is Vectorised (`RT.v`) but
the source `imm(RA)` is scalar, the memory being read is *still a
vector load*, known as "unit or element strides".

This anomaly is made clear with the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being
implicitly identical to the above, is prohibited and is a syntax error:

    sv.ld RT.v, imm(RA)

Notes taken from IRC conversation

    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other memory-operand constraints), append .v to it and you're done addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the address is a sum of registers

permutations of vector selection, to identify above asm-syntax:

    imm(RA)  RT.v  RA.v  nonstrided
       sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
       mem@    0+r#2  offs+(r#2+1)  offs+(r#2+2)
       destreg r#     r#+1          r#+2
    imm(RA)  RT.s  RA.v  nonstrided
       sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
       (dest r# is scalar) -> VSELECT mode
    imm(RA)  RT.v  RA.s  fixed stride: unit or element
       sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
       mem@r#2  +0   +1    +2
       destreg  r#   r#+1  r#+2
       sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
       mem@r#2  +0   ...  +offs  ...  +offs*2
       destreg  r#   r#+1        r#+2
    imm(RA)  RT.s  RA.s  not vectorised
       sv.ld r#, ofst(r#2)

indexed mode:

    RA,RB  RT.v  RA.v  RB.v
       sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
    RA,RB  RT.v  RA.s  RB.v
       sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
    RA,RB  RT.v  RA.v  RB.s
       sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
    RA,RB  RT.v  RA.s  RB.s
       sv.ldx r#.v, r#2, r#3 -> VSPLAT mode
    RA,RB  RT.s  RA.v  RB.v
    RA,RB  RT.s  RA.s  RB.v
    RA,RB  RT.s  RA.v  RB.s
    RA,RB  RT.s  RA.s  RB.s  not vectorised