[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
* [[simple_v_extension/specification/ld.x]]

# Rationale

All Vector ISAs dating back fifty years have extensive and comprehensive
Load and Store operations that go far beyond the capabilities of Scalar
RISC or CISC processors, yet at their heart, on an individual element
basis, they may be found to be no different from RISC Scalar equivalents.

The resource savings from Vector LD/ST are significant and stem from
the fact that one single instruction can trigger a dozen (or, in some
microarchitectures such as Cray or NEC SX Aurora, hundreds of)
element-level Memory accesses.

Additionally, and simply: if the Arithmetic side of an ISA supports
Vector Operations, then in order to keep the ALUs 100% occupied the
Memory infrastructure (and the ISA itself) correspondingly needs Vector
Memory Operations as well.

Vectorised Load and Store also presents an extra dimension (literally)
which creates scenarios unique to Vector applications, that a Scalar
(and even a SIMD) ISA simply never encounters.  SVP64 endeavours to
add such modes without changing the behaviour of the underlying Base
(Scalar) v3.0B operations.

# Modes overview

Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different modes:

* fixed stride (contiguous sequence with no gaps), aka "unit" stride
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* Speculative fail-first (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).

Also included in SVP64 LD/ST are both signed and unsigned Saturation,
as well as Element-width overrides and Twin-Predication.

# Vectorisation of Scalar Power ISA v3.0B

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- (RA) + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride.  Because there is no way to tell which from the
OpenPOWER v3.0B Scalar opcode alone, the choice is provided instead by
the SV Context.

    # LD not VLD!  format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, RC, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip non-predicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == shifted: # for FFT/DCT
          # FFT/DCT shifted mode
          if (RA.isvec)
            srcbase = ireg[RA+i]
          else
            srcbase = ireg[RA]
          offs = (i * immed) << RC
        elif svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

    # reverses the bit order of val, over log2(VL) bits
    def bitrev(val, VL):
      width = log2(VL)
      result = 0
      for _ in range(width):
        result = (result << 1) | (val & 1)
        val >>= 1
      return result
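
The address arithmetic of the immediate form can be tried out in isolation. The following is a minimal Python sketch, not part of the specification: it assumes a scalar RA, leaves out predication and RA-update, and uses a hypothetical `shift` parameter to stand in for the value held in the RC register. The function name and the string mode names are illustrative only.

```python
def immediate_mode_addresses(base, immed, op_width, vl, mode, shift=0):
    """Return the Effective Address of each element for a scalar-RA load.

    base, immed, op_width (in bytes) and vl are plain integers; `mode` is
    one of the illustrative strings "unitstride", "elementstride" or
    "shifted".  Predication and RA-update are deliberately left out.
    """
    addresses = []
    for i in range(vl):
        if mode == "shifted":
            offs = (i * immed) << shift      # FFT/DCT shifted mode
        elif mode == "elementstride":
            offs = i * immed                 # regular gaps of `immed` bytes
        elif mode == "unitstride":
            offs = immed + i * op_width      # contiguous, no gaps
        else:
            raise ValueError(f"unknown mode: {mode}")
        addresses.append(base + offs)
    return addresses


unit = immediate_mode_addresses(0x1000, 8, 4, 4, "unitstride")
elstride = immediate_mode_addresses(0x1000, 8, 4, 4, "elementstride")
splat = immediate_mode_addresses(0x1000, 0, 4, 4, "elementstride")
print([hex(a) for a in unit])      # ['0x1008', '0x100c', '0x1010', '0x1014']
print([hex(a) for a in elstride])  # ['0x1000', '0x1008', '0x1010', '0x1018']
print([hex(a) for a in splat])     # ['0x1000', '0x1000', '0x1000', '0x1000']
```

The last call shows why an element-strided immediate of zero behaves as a splat: every element resolves to the same address.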

Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL;):
        # skip non-predicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
            EA = ireg[RA] + ireg[RB]*j   # register-strided
        else
            EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if svctx.ldstmode != elementstride:
            if (!RA.isvec && !RB.isvec)
                break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update" mode
(`ldux`) to be effectively a *completely different* register from
RA-as-a-source.  This is because there is room in svp64 to extend
RA-as-src as well as RA-as-dest, both independently as scalar or vector
*and* independently extending their range.
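
The indexed address generation can likewise be sketched in Python. This is only an illustration of the `op_ldx` loop for a vector RT: `regs` is a hypothetical list standing in for the integer register file, and predication and RA-update are omitted.

```python
def indexed_addresses(regs, ra, rb, vl, ra_isvec, rb_isvec, elementstride=False):
    """Effective Addresses for an RA+RB load into a vector RT.

    `regs` is an illustrative stand-in for the integer register file.
    """
    addresses = []
    i = k = 0
    for j in range(vl):
        if elementstride:
            ea = regs[ra] + regs[rb] * j      # register-strided
        else:
            ea = regs[ra + i] + regs[rb + k]  # indexed address
        addresses.append(ea)
        if not elementstride and not ra_isvec and not rb_isvec:
            break                             # scalar-scalar: one access only
        if ra_isvec:
            i += 1
        if rb_isvec:
            k += 1
    return addresses


regs = [0] * 32
regs[4] = 0x2000               # base address
regs[5] = 0x20                 # stride held in a register
regs[8:12] = [0, 16, 4, 64]    # vector of offsets
gathered = indexed_addresses(regs, 4, 8, 4, ra_isvec=False, rb_isvec=True)
strided = indexed_addresses(regs, 4, 5, 4, ra_isvec=False, rb_isvec=False,
                            elementstride=True)
print([hex(a) for a in gathered])  # ['0x2000', '0x2010', '0x2004', '0x2040']
print([hex(a) for a in strided])   # ['0x2000', '0x2020', '0x2040', '0x2060']
```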

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk.  Fail-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace, whereas
strided fail-first (by creating contiguous sequential LDs) does not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either (or is, at best, a quirk).
Realistically we need an alternative table meaning for [[sv/svp64]] mode.
The following modes make sense:

* saturation
* predicate-result (mostly for cache-inhibited LD/ST)
* normal
* fail-first (where Vector Indexed is banned)
* Signed Effective Address computation (Vector Indexed only)

Also, given that FFT, DCT and other related algorithms
are of such high importance in so many areas of Computer
Science, a special "shift" mode has been added which
allows part of the immediate to be used instead as RC, a register
which shifts the immediate: `DS << GPR(RC)`.

The table for [[sv/svp64]] for `immed(RA)` is:

| 0-1 | 2   | 3 4     | description               |
| --- | --- | ------- | ------------------------- |
| 00  | 0   | dz els  | normal mode               |
| 00  | 1   | dz shf  | shift mode                |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | els RC1 | Rc=0: ffirst z/nonz       |
| 10  | N   | dz els  | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | els RC1 | Rc=0: pred-result z/nonz  |

The `els` bit is only relevant when `RA.isvec` is clear: it indicates
whether the stride is unit or element:

    if shf:
        svctx.ldstmode = shifted
    elif RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
in effect the multiplication of the element index by a zero
immediate-offset results in reading from the exact same memory location.
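
The selection logic can be written as a small Python function, shown here as a sketch rather than as normative behaviour. The argument `shf` stands for the shift-mode bit from the table, and the string `"vsplat"` is a hypothetical label for the fall-through case (element stride requested with a zero immediate), which the pseudocode leaves implicit.

```python
def select_ldst_mode(shf, ra_isvec, els, immediate):
    """Pick the immediate-form LD/ST mode from the decoded SVP64 bits."""
    if shf:
        return "shifted"
    if ra_isvec:
        return "indexed"
    if els == 0:
        return "unitstride"
    if immediate != 0:
        return "elementstride"
    # illustrative label: els set but immediate zero, every element
    # resolves to the same address
    return "vsplat"


print(select_ldst_mode(shf=0, ra_isvec=False, els=0, immediate=8))  # unitstride
print(select_ldst_mode(shf=0, ra_isvec=False, els=1, immediate=8))  # elementstride
print(select_ldst_mode(shf=0, ra_isvec=False, els=1, immediate=0))  # vsplat
```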

For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
just the once and be copied, rather than hitting the Data Cache
multiple times with the same memory read at the same location.
Cache-inhibited Loads, by contrast, repeat the read for every element,
which allows memory-mapped peripherals to have multiple
data values read in quick succession and stored in sequentially
numbered registers.

For non-cache-inhibited ST from a vector source onto a scalar
destination: with the Vector
loop effectively creating multiple memory writes to the same location,
we can deduce that the last of these will be the "successful" one. Thus,
implementations are free and clear to optimise out the overwriting STs,
leaving just the last one as the "winner".  Bear in mind that predicate
masks will skip some elements (in source non-zeroing mode).
Cache-inhibited ST operations on the other hand **MUST** write out
a Vector source multiple successive times to the exact same Scalar
destination.
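
As an illustration of the "last one wins" reasoning, the sketch below lists which element values would actually reach memory for a vector-source, scalar-destination ST. It is an assumption-laden model, not an implementation: `values` and `predicate` are hypothetical names, and the predicate is treated as a simple bitmask in non-zeroing mode.

```python
def stores_to_issue(values, predicate, cache_inhibited):
    """Return the values actually written to the one scalar address.

    `values` are the vector source elements; bit i of `predicate`
    enables element i (masked-out elements are skipped).
    """
    active = [v for i, v in enumerate(values) if (predicate >> i) & 1]
    if cache_inhibited or not active:
        return active        # every enabled element must be written in order
    return active[-1:]       # only the final write is observable


print(stores_to_issue([10, 20, 30, 40], 0b0111, cache_inhibited=False))  # [30]
print(stores_to_issue([10, 20, 30, 40], 0b0111, cache_inhibited=True))   # [10, 20, 30]
```

Note that the "winner" is the last *enabled* element, which is not necessarily element VL-1.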

Note that there are no immediate versions of cache-inhibited LD/ST.

The modes for the `RA+RB` indexed version are slightly different:

| 0-1 | 2   | 3 4    | description                  |
| --- | --- | ------ | ---------------------------- |
| 00  | SEA | dz sz  | normal mode                  |
| 01  | SEA | dz sz  | Strided (scalar only source) |
| 10  | N   | dz sz  | sat mode: N=0/1 u/s          |
| 11  | inv | CR-bit | Rc=1: pred-result CR sel     |
| 11  | inv | dz RC1 | Rc=0: pred-result z/nonz     |

Vector Indexed Strided Mode is qualified as follows:

    if mode == 0b01 and !RA.isvec and !RB.isvec:
        svctx.ldstmode = elementstride

A summary of the effect of Vectorisation of src or dest:

     imm(RA)  RT.v  RA.v       no stride allowed
     imm(RA)  RT.s  RA.v       no stride allowed
     imm(RA)  RT.v  RA.s       stride-select allowed
     imm(RA)  RT.s  RA.s       not vectorised
     RA,RB    RT.v  {RA|RB}.v  UNDEFINED
     RA,RB    RT.s  {RA|RB}.v  UNDEFINED
     RA,RB    RT.v  {RA&RB}.s  VSPLAT possible. stride selectable
     RA,RB    RT.s  {RA&RB}.s  not vectorised

Signed Effective Address computation is only relevant for
Vector Indexed Mode, when elwidth overrides are applied.
The source override applies to RB: if SEA is set, RB is
sign-extended from elwidth bits to the full 64 bits before
being added to RA to calculate the Effective Address.
For other Modes (ffirst, saturate),
all EA computation with elwidth overrides is unsigned.
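
A short Python sketch of the SEA rule follows. It assumes a 64-bit wrap-around on the final addition and takes the already-fetched RB element as an integer; the function and parameter names are illustrative, not taken from the specification.

```python
MASK64 = (1 << 64) - 1


def indexed_ea(ra_value, rb_element, elwidth, sea):
    """RA + RB element, with RB optionally sign-extended from elwidth bits."""
    rb_element &= (1 << elwidth) - 1          # element as held at elwidth
    if sea and rb_element >> (elwidth - 1):   # top bit set: negative offset
        rb_element -= 1 << elwidth
    return (ra_value + rb_element) & MASK64


print(hex(indexed_ea(0x1000, 0xFF, 8, sea=True)))   # 0xfff  (offset of -1)
print(hex(indexed_ea(0x1000, 0xFF, 8, sea=False)))  # 0x10ff (offset of +255)
```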

Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will
perform **multiple** LD/ST operations, sequentially.  `ldcix` even with
scalar src will read the same memory location *multiple times*, storing
the result in successive Vector destination registers.  This is because
the cache-inhibit instructions are used to read and write memory-mapped
peripherals.
If a genuine cache-inhibited LD-VSPLAT is required then a *scalar*
cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv.

## LD/ST ffirst

LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
ordinary one.  Exceptions occur "as normal".  However for elements 1
and above, if an exception would occur, then VL is **truncated** to the
previous element: the exception is **not** then raised because the
LD/ST was effectively speculative.
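
The truncation rule can be modelled in a few lines of Python. This is a toy model under stated assumptions: memory is a hypothetical dictionary keyed by address, and an address missing from it stands in for any access that would raise an exception.

```python
class MemoryFault(Exception):
    """Stands in for the exception an ordinary LD would raise."""


def ffirst_load(memory, addresses):
    """Load from each address; return (values, new_vl).

    A fault on element 0 is raised as normal.  A fault on any later
    element silently truncates VL to the elements loaded so far.
    """
    values = []
    for element, ea in enumerate(addresses):
        if ea not in memory:
            if element == 0:
                raise MemoryFault(hex(ea))
            break
        values.append(memory[ea])
    return values, len(values)


memory = {0x1000: 1, 0x1008: 2}          # 0x1010 onwards would fault
values, new_vl = ffirst_load(memory, [0x1000, 0x1008, 0x1010, 0x1018])
print(values, new_vl)                    # [1, 2] 2
```

Software would then typically loop, restarting from the element at which VL was truncated, so that a genuine fault is eventually taken on an element 0.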

ffirst LD/ST to multiple pages via a Vectorised Index base is considered
a security risk due to the abuse of probing multiple pages in rapid
succession and getting feedback on which pages would fail.  Therefore
Fail-First on Vector Indexed LD/ST is prohibited entirely, and the Mode
bit is instead used for element-strided LD/ST.
See <https://bugs.libre-soc.org/show_bug.cgi?id=561>

    for(i = 0; i < VL; i++)
        reg[rt + i] = mem[reg[ra] + i * reg[rb]];

High security implementations where any kind of speculative probing
of memory pages is considered a risk should take advantage of the fact that
implementations may truncate VL at any point, without requiring software
to be rewritten and made non-portable. Such implementations may choose
to *always* set VL=1 which will have the effect of terminating any
speculative probing (and also adversely affect performance), but will
at least not require applications to be rewritten.

Low-performance simpler hardware implementations may
also choose to (always) set VL=1 as the bare minimum compliant
implementation of LD/ST Fail-First. It is however critically important
to remember that the first element LD/ST **MUST** be treated as an
ordinary LD/ST, i.e. **MUST** raise exceptions exactly like an ordinary
LD/ST.

For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
for any implementation-specific reason. For example: it is perfectly
reasonable for implementations to alter VL when ffirst LD or ST
operations are initiated on a nonaligned boundary, such that within a
loop the subsequent iteration of that loop begins subsequent ffirst
LD/ST operations on an aligned boundary. Likewise, VL may be truncated
to reduce workloads or balance resources.

Vertical-First Mode is slightly strange in that only one element
at a time is ever executed anyway.  Given that programmers may
legitimately choose to alter srcstep and dststep in non-sequential
order as part of explicit loops, it is neither possible nor
safe to make speculative assumptions about future LD/STs.
Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`.
This is very different from Arithmetic (Data-dependent) FFirst
where Vertical-First Mode is deterministic, not speculative.

# LOAD/STORE Elwidths <a name="elwidth"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld).  The only other
instructions to provide an explicit operation width are `extsb` and
others like it.  There are therefore *three* widths involved:

* operation width in bits (lb=8, lh=16, lw=32, ld=64)
* source element width override
* destination element width override

Some care is therefore needed to express and make clear the transformations,
which are expressly in this order:

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
   - zero-extension or truncation from operation width to source elwidth
   - zero-extension or truncation to dest elwidth
* Saturated mode:
   - Sign-extension or truncation from operation width to source elwidth
   - signed/unsigned saturation down to dest elwidth
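
The two width-conversion paths listed above can be sketched in Python as follows. This is one reading of the ordering, not a normative definition: all widths are in bits, the helper names are hypothetical, and it is an assumption here that unsigned saturation clamps a negative (sign-extended) value to zero.

```python
def mask(width):
    return (1 << width) - 1


def to_signed(value, width):
    """Interpret the low `width` bits of value as a two's complement number."""
    value &= mask(width)
    return value - (1 << width) if value >> (width - 1) else value


def adjust_non_saturated(memread, op_width, src_elwidth, dest_elwidth):
    value = memread & mask(op_width)     # as loaded at the operation width
    value &= mask(src_elwidth)           # zero-extend or truncate to source elwidth
    return value & mask(dest_elwidth)    # zero-extend or truncate to dest elwidth


def adjust_saturated(memread, op_width, src_elwidth, dest_elwidth, signed):
    value = to_signed(memread, op_width)     # sign-extend from operation width
    value = to_signed(value, src_elwidth)    # ...or truncate to source elwidth
    if signed:
        lo, hi = -(1 << (dest_elwidth - 1)), (1 << (dest_elwidth - 1)) - 1
    else:
        lo, hi = 0, mask(dest_elwidth)       # assumption: negatives clamp to 0
    return max(lo, min(hi, value)) & mask(dest_elwidth)


print(hex(adjust_non_saturated(0x1234, 16, 16, 8)))           # 0x34
print(hex(adjust_saturated(0x1234, 16, 16, 8, signed=True)))  # 0x7f
print(hex(adjust_saturated(0x8000, 16, 16, 8, signed=True)))  # 0x80
```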

In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation.  This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

It is unfortunately possible to request an elwidth override on the memory
side which does not mesh with the operation width: such requests result in
`UNDEFINED` behaviour.  The reason is that attempting a 64-bit `sv.ld`
operation with a source elwidth override of 8/16/32 would result in
overlapping memory requests, particularly on unit and element strided
operations.  Thus it is `UNDEFINED` when the elwidth is smaller than
the memory operation width. Examples include `sv.lw/sw=16/els` which
requests (overlapping) 4-byte memory reads offset from
each other at 2-byte intervals.  Store likewise is also `UNDEFINED`
where the dest elwidth override is less than the operation width.
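
The overlap is easy to see by listing the byte ranges each element would touch. The sketch below is purely illustrative: the helper names are hypothetical, and it assumes that consecutive elements are spaced by the element width in bytes while each access is still made at the full operation width.

```python
def element_byte_ranges(base, step_bytes, op_bytes, vl):
    """Half-open [start, end) byte range touched by each element access."""
    return [(base + i * step_bytes, base + i * step_bytes + op_bytes)
            for i in range(vl)]


def has_overlap(ranges):
    ordered = sorted(ranges)
    return any(prev_end > start
               for (_, prev_end), (start, _) in zip(ordered, ordered[1:]))


# 4-byte reads (lw) spaced 2 bytes apart (16-bit source elwidth)
bad = element_byte_ranges(0x100, step_bytes=2, op_bytes=4, vl=3)
# 4-byte reads spaced 4 bytes apart
good = element_byte_ranges(0x100, step_bytes=4, op_bytes=4, vl=3)
print(has_overlap(bad))   # True
print(has_overlap(good))  # False
```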

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
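
The byte-reversal decision in the pseudocode (`brev XNOR MSR.LE`) and the swap itself can be sketched in Python. The function names are hypothetical, and `op_width_bytes` is assumed to be the operation width expressed in bytes.

```python
def needs_byteswap(brev, msr_le):
    """XNOR of the two flags: swap when they are equal."""
    return brev == msr_le


def byteswap(value, op_width_bytes):
    """Reverse the byte order of value at the operation width."""
    return int.from_bytes(value.to_bytes(op_width_bytes, "little"), "big")


print(needs_byteswap(brev=False, msr_le=True))   # False: plain ld in LE mode
print(needs_byteswap(brev=True, msr_le=True))    # True:  ldbrx in LE mode
print(hex(byteswap(0x11223344, 4)))              # 0x44332211
```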

# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs.  The usual interest in such re-mapping
is for example in separating out 24-bit RGB channel data into separate
contiguous registers.  NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.
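
To make the RGB example concrete, the following Python sketch shows the kind of index reordering that separates packed pixel data into per-channel runs. It is not the SV REMAP schedule itself (that is defined elsewhere); the function name and the flat destination layout are assumptions made purely for illustration.

```python
def deinterleave_order(num_pixels, channels=3):
    """Destination element index for each sequentially loaded source element."""
    return [(k % channels) * num_pixels + k // channels
            for k in range(num_pixels * channels)]


# four packed RGB pixels: R0 G0 B0 R1 G1 B1 ...
packed = [10, 20, 30, 11, 21, 31, 12, 22, 32, 13, 23, 33]
planar = [0] * len(packed)
for src, dst in enumerate(deinterleave_order(4)):
    planar[dst] = packed[src]
print(planar)  # [10, 11, 12, 13, 20, 21, 22, 23, 30, 31, 32, 33]
```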

# notes from lxo

This section covers assembly notation for the immediate and indexed LD/ST.
The summary is that in immediate mode for LD it is not obvious that, when the
destination register is Vectorised (`RT.v`) but the source `imm(RA)` is scalar,
the memory being read is *still a vector load*, known as "unit or element strides".

This anomaly is made clear with the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being implicitly identical to the above, is prohibited and is a syntax error:

    sv.ld RT.v, imm(RA)

Notes taken from IRC conversation:

    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other memory-operand constraints), append .v to it and you're done addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the address is a sum of registers

Permutations of vector selection, to identify the above asm-syntax:

     imm(RA)  RT.v   RA.v   nonstrided
         sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
           mem@     offs+r#2  offs+(r#2+1)  offs+(r#2+2)
           destreg  r#        r#+1          r#+2
     imm(RA)  RT.s   RA.v   nonstrided
         sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
           (dest r# is scalar) -> VSELECT mode
     imm(RA)  RT.v   RA.s   fixed stride: unit or element
         sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
           mem@r#2  +0   +1   +2
           destreg  r#   r#+1 r#+2
         sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
           mem@r#2  +0 ...   +offs ...  +offs*2
           destreg  r#       r#+1       r#+2
     imm(RA)  RT.s   RA.s   not vectorised
         sv.ld r#, ofst(r#2)

Indexed mode:

     RA,RB    RT.v  RA.v  RB.v
        sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
     RA,RB    RT.v  RA.s  RB.v
        sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
     RA,RB    RT.v  RA.v  RB.s
        sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
     RA,RB    RT.v  RA.s  RB.s
        sv.ldx r#.v, r#2, r#3 -> VSPLAT mode
     RA,RB    RT.s  RA.v  RB.v
     RA,RB    RT.s  RA.s  RB.v
     RA,RB    RT.s  RA.v  RB.s
     RA,RB    RT.s  RA.s  RB.s not vectorised