openpower/sv/ldst.mdwn

[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
* [[simple_v_extension/specification/ld.x]]

# Rationale

All Vector ISAs dating back fifty years have extensive and comprehensive
Load and Store operations that go far beyond the capabilities of Scalar
RISC or CISC processors, yet at their heart, on an individual element
basis, they are no different from their RISC Scalar equivalents.

The resource savings from Vector LD/ST are significant and stem from
the fact that a single instruction can trigger dozens (or, in some
microarchitectures such as Cray or NEC SX Aurora, hundreds) of
element-level Memory accesses.

Additionally, and simply: if the Arithmetic side of an ISA supports
Vector Operations, then in order to keep the ALUs 100% occupied the
Memory infrastructure (and the ISA itself) correspondingly needs Vector
Memory Operations as well.

Vectorised Load and Store also presents an extra dimension (literally)
which creates scenarios unique to Vector applications that a Scalar
(or even a SIMD) ISA simply never encounters.  SVP64 endeavours to
add such modes without changing the behaviour of the underlying Base
(Scalar) v3.0B operations.

# Modes overview

Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different modes:

* fixed stride (contiguous sequence with no gaps) aka "unit" stride
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* Speculative fail-first (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).

Also included in SVP64 LD/ST are both signed and unsigned Saturation,
as well as Element-width overrides and Twin-Predication.

# Vectorisation of Scalar Power ISA v3.0B

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- (RA) + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered and, as demonstrated
by the pseudocode below, the immediate can be used to give unit stride
or element stride.  As there is no way to tell which from the OpenPOWER
v3.0B Scalar opcode alone, the choice is provided instead by the SV Context.

    # LD not VLD!  format - ldop RT, immed(RA)
    # op_width (in bytes): lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, RC, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip nonpredicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == shifted: # for FFT/DCT
          # FFT/DCT shifted mode
          if (RA.isvec)
            srcbase = ireg[RA+i]
          else
            srcbase = ireg[RA]
          offs = (i * immed) << RC
        elif svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

    # reverses the bitorder up to "width" bits
    def bitrev(val, VL):
      width = log2(VL)
      result = 0
      for _ in range(width):
        result = (result << 1) | (val & 1)
        val >>= 1
      return result

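
As a rough illustration of the address patterns in the pseudocode above, the following Python sketch computes only the per-element Effective Addresses for a scalar RA. It ignores predication and RA-update entirely; the function name `immed_eas` and the string mode names are assumptions made for this example and are not part of the specification.

```python
def immed_eas(mode, base, immed, op_width, vl, rc_shift=0):
    """Return the Effective Addresses for element indices 0..vl-1.

    mode     -- "shifted", "elementstride", "unitstride" or anything else
                for the no-stride case (hypothetical names that mirror
                the svctx.ldstmode values in the pseudocode)
    base     -- contents of the scalar RA register
    immed    -- the instruction immediate
    op_width -- operation width in bytes (lb=1, lh=2, lw=4, ld=8)
    rc_shift -- contents of RC, used only in shifted mode
    """
    eas = []
    for i in range(vl):
        if mode == "shifted":
            offs = (i * immed) << rc_shift
        elif mode == "elementstride":
            offs = i * immed
        elif mode == "unitstride":
            offs = immed + i * op_width
        else:
            # no stride multiplier: every element uses the same address
            offs = immed
        eas.append(base + offs)
    return eas


print([hex(ea) for ea in immed_eas("unitstride", 0x1000, 16, 8, 4)])
# ['0x1010', '0x1018', '0x1020', '0x1028']
print([hex(ea) for ea in immed_eas("elementstride", 0x1000, 16, 8, 4)])
# ['0x1000', '0x1010', '0x1020', '0x1030']
```

With the same immediate, unit stride packs elements back to back after a fixed offset, whereas element stride uses the immediate itself as the gap between elements.
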
Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, svctx, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL;):
        # skip nonpredicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
            EA = ireg[RA] + ireg[RB]*j   # register-strided
        else
            EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if svctx.ldstmode != elementstride:
            if (!RA.isvec && !RB.isvec)
                break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

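
The difference between indexed and register-strided addressing in the pseudocode above can be sketched in Python as follows. This is a simplified model that assumes a Vectorised RT and no predication; `ldx_eas` and the list-based register file `ireg` are hypothetical names used only for this example.

```python
def ldx_eas(ireg, ra, rb, vl, ra_isvec, rb_isvec, elementstride=False):
    """Effective Addresses for a Vectorised RT, without predication.

    ireg is a plain list standing in for the integer register file;
    ra and rb are register numbers.
    """
    eas = []
    for idx in range(vl):
        if elementstride:
            # register-strided: scalar RA plus scalar RB times element index
            ea = ireg[ra] + ireg[rb] * idx
        else:
            # indexed: step through RA and/or RB only if marked as vectors
            i = idx if ra_isvec else 0
            k = idx if rb_isvec else 0
            ea = ireg[ra + i] + ireg[rb + k]
        eas.append(ea)
        if not elementstride and not ra_isvec and not rb_isvec:
            break  # scalar-scalar, mirroring the early exit in the pseudocode
    return eas


ireg = [0] * 32
ireg[4] = 0x100                      # scalar base in r4
ireg[8], ireg[9], ireg[10] = 0, 8, 24  # vector of offsets in r8..r10
ireg[12] = 32                        # scalar stride in r12

print(ldx_eas(ireg, 4, 8, 3, False, True))                       # [256, 264, 280]
print(ldx_eas(ireg, 4, 12, 3, False, False, elementstride=True))  # [256, 288, 320]
```
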
Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update" mode
(`ldux`) to be effectively a *completely different* register from
RA-as-a-source.  This is because there is room in svp64 to extend
RA-as-src as well as RA-as-dest, both independently as scalar or vector
*and* independently extending their range.

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk.  Fail-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace, where
strided fail-first (by creating contiguous sequential LDs) does not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either (or is a quirk). Realistically
we need an alternative table meaning for [[sv/svp64]] mode.  The
following modes make sense:

* saturation
* predicate-result (mostly for cache-inhibited LD/ST)
* normal
* fail-first (where Vector Indexed is banned)
* Signed Effective Address computation (Vector Indexed only)

Also, given that FFT, DCT and other related algorithms
are of such high importance in so many areas of Computer
Science, a special "shift" mode has been added which
allows part of the immediate to be used instead as RC, a register
which shifts the immediate: `DS << GPR(RC)`.

The table for [[sv/svp64]] for `immed(RA)` is:

| 0-1 |  2  |  3   4  |  description              |
| --- | --- | ------- | ------------------------- |
| 00  | 0   | dz  els | normal mode               |
| 00  | 1   | dz  shf | shift mode                |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | els RC1 | Rc=0: ffirst z/nonz       |
| 10  | N   | dz  els | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | els RC1 | Rc=0: pred-result z/nonz  |

The `els` bit is only relevant when `RA.isvec` is clear: it indicates
whether the stride is unit or element:

    if shifted:
        svctx.ldstmode = shifted
    elif RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
with element stride the offset is the element index multiplied by the
immediate, so a zero immediate results in every element
reading from the exact same memory location.

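
The selection logic above leaves one case implicit: `els` set with a zero immediate, which is the `LD-VSPLAT` safety-valve. A minimal Python sketch that makes the fall-through explicit is shown here. The function name, the `shift_mode` parameter (standing for whichever special mode is tested first) and the returned strings, including `"splat"`, are assumptions for illustration only.

```python
def select_ldst_mode(shift_mode, ra_isvec, els, immediate):
    """Return a label for the LD/ST mode chosen for an immed(RA) operation."""
    if shift_mode:
        return "shifted"
    if ra_isvec:
        return "indexed"
    if els == 0:
        return "unitstride"
    if immediate != 0:
        return "elementstride"
    # els set, immediate zero: every element uses the same address
    return "splat"


print(select_ldst_mode(False, False, 1, 16))  # elementstride
print(select_ldst_mode(False, False, 1, 0))   # splat
```
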
For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
just once and be copied, rather than hitting the Data Cache
multiple times with the same memory read at the same location.
Cache-inhibited Loads, by contrast, are performed multiple times,
which allows memory-mapped peripherals to have multiple
data values read in quick succession and stored in sequentially
numbered registers.

For a non-cache-inhibited ST from a vector source onto a scalar
destination, the Vector loop effectively creates multiple memory
writes to the same location, and we can deduce that the last of these
will be the "successful" one. Implementations are therefore free to
optimise out the overwritten STs, leaving just the last one as the
"winner".  Bear in mind that predicate masks will skip some elements
(in source non-zeroing mode).
Cache-inhibited ST operations on the other hand **MUST** write out
a Vector source multiple successive times to the exact same Scalar
destination.

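
To make the "last one wins" deduction concrete, the sketch below finds which source element survives when a non-cache-inhibited vector ST targets a single scalar address. It assumes source non-zeroing predication, with the mask held as a Python integer where bit `i` enables element `i`; the function name is hypothetical.

```python
def winning_store_element(pred_mask, vl):
    """Index of the last predicated element below vl, or None if no
    element is enabled (in which case no store takes place)."""
    for i in reversed(range(vl)):
        if pred_mask & (1 << i):
            return i
    return None


print(winning_store_element(0b0101, 4))  # 2: element 3 is masked out
print(winning_store_element(0b1111, 4))  # 3
```

An implementation using this optimisation would issue only the store for the returned element. For cache-inhibited STs the shortcut is not permitted: every enabled element must be written in turn.
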
Note that there are no immediate versions of cache-inhibited LD/ST.

The modes for the `RA+RB` indexed version are slightly different:

| 0-1 |  2  |  3   4  |  description                 |
| --- | --- | ------- | ---------------------------- |
| 00  | SEA | dz  sz  | normal mode                  |
| 01  | SEA | dz  sz  | Strided (scalar only source) |
| 10  | N   | dz  sz  | sat mode: N=0/1 u/s          |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel     |
| 11  | inv | dz  RC1 | Rc=0: pred-result z/nonz     |

Vector Indexed Strided Mode is qualified as follows:

    if mode == 0b01 and !RA.isvec and !RB.isvec:
        svctx.ldstmode = elementstride

A summary of the effect of Vectorisation of src or dest:

     imm(RA)  RT.v   RA.v       no stride allowed
     imm(RA)  RT.s   RA.v       no stride allowed
     imm(RA)  RT.v   RA.s       stride-select allowed
     imm(RA)  RT.s   RA.s       not vectorised
     RA,RB    RT.v  {RA|RB}.v   UNDEFINED
     RA,RB    RT.s  {RA|RB}.v   UNDEFINED
     RA,RB    RT.v  {RA&RB}.s   VSPLAT possible. stride selectable
     RA,RB    RT.s  {RA&RB}.s   not vectorised

Signed Effective Address computation is only relevant for
Vector Indexed Mode, when elwidth overrides are applied.
The source override applies to RB: if SEA is set, RB is
sign-extended from elwidth bits to the full 64 bits before
being added to RA to calculate the Effective Address.
For other Modes (ffirst, saturate),
all EA computation with elwidth overrides is unsigned.

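
The effect of the SEA bit on a single element can be sketched in Python as below. The function name is hypothetical, and wrapping the result to 64 bits is an assumption about how the final address is truncated rather than something stated on this page.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def indexed_ea(ra_val, rb_val, elwidth, sea):
    """Effective Address from RA plus one RB element of `elwidth` bits.

    With sea=True the RB element is sign-extended before the addition;
    otherwise it is treated as unsigned.
    """
    offs = rb_val & ((1 << elwidth) - 1)      # RB element at the source elwidth
    if sea and offs & (1 << (elwidth - 1)):   # top bit set: negative offset
        offs -= 1 << elwidth
    return (ra_val + offs) & MASK64           # assumed 64-bit wrap-around


print(hex(indexed_ea(0x1000, 0xFF, 8, sea=True)))   # 0xfff  (offset of -1)
print(hex(indexed_ea(0x1000, 0xFF, 8, sea=False)))  # 0x10ff (offset of +255)
```
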
Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will
perform **multiple** LD/ST operations, sequentially.  `ldcix` even with
scalar src will read the same memory location *multiple times*, storing
the result in successive Vector destination registers.  This is because
the cache-inhibit instructions are used to read and write memory-mapped
peripherals.
If a genuine cache-inhibited LD-VSPLAT is required then a *scalar*
cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv.

## LD/ST ffirst

ffirst LD/ST to multiple pages via a Vectorised Index base is considered
a security risk, because it allows multiple pages to be probed in rapid
succession with feedback on which pages would fail.  Vector Indexed
LD/ST is therefore prohibited entirely in ffirst mode, and the Mode bit
is instead used for element-strided LD/ST.
See <https://bugs.libre-soc.org/show_bug.cgi?id=561>

    for(i = 0; i < VL; i++)
        reg[rt + i] = mem[reg[ra] + i * reg[rb]];

High security implementations where any kind of speculative probing
of memory pages is considered a risk should take advantage of the fact that
implementations may truncate VL at any point, without requiring software
to be rewritten and made non-portable. Such implementations may choose
to *always* set VL=1 which will have the effect of terminating any
speculative probing (and also adversely affect performance), but will
at least not require applications to be rewritten.

Low-performance simpler hardware implementations may
choose to also set VL=1 as the bare minimum compliant implementation of
LD/ST Fail-First. It is however critically important to remember that
the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.
**MUST** raise exceptions exactly like an ordinary LD/ST.

For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
for any implementation-specific reason. For example: it is perfectly
reasonable for implementations to alter VL when ffirst LD or ST
operations are initiated on a nonaligned boundary, such that within a
loop the subsequent iteration of that loop begins subsequent ffirst
LD/ST operations on an aligned boundary. VL may likewise be altered to
reduce workloads or balance resources.

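
The behaviour described above can be modelled with a small Python sketch of an element-strided fail-first load. Memory is represented by a dictionary whose keys are the valid addresses, and the `MemFault` exception stands in for an ordinary LD exception; both are assumptions of this model, as is truncating VL exactly at the first failing element (an implementation may truncate earlier).

```python
class MemFault(Exception):
    """Stand-in for the exception an ordinary scalar LD would raise."""


def ffirst_ld(mem, base, stride, vl):
    """Element-strided fail-first load.

    Returns (values, new_vl).  A fault on element 0 is raised like an
    ordinary LD; a fault on any later element truncates VL instead.
    """
    values = []
    for i in range(vl):
        ea = base + i * stride
        if ea not in mem:
            if i == 0:
                raise MemFault(ea)   # first element: behaves as a normal LD
            return values, i         # later element: truncate VL, no trap
        values.append(mem[ea])
    return values, vl


mem = {0: 10, 8: 11, 16: 12}        # only three addresses are mapped
print(ffirst_ld(mem, 0, 8, 5))      # ([10, 11, 12], 3)
```
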
# LOAD/STORE Elwidths <a name="elwidth"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld).  Only `extsb` and
others like it also provide an explicit operation width.  There are
therefore *three* widths involved:

* operation width in bits (lb=8, lh=16, lw=32, ld=64)
* source element width override
* destination element width override

Some care is therefore needed to express and make clear the transformations,
which are expressly in this order:

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
   - zero-extension or truncation from operation width to source elwidth
   - zero-extension or truncation to dest elwidth
* Saturated mode:
   - Sign-extension or truncation from operation width to source elwidth
   - signed/unsigned saturation down to dest elwidth

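
The ordering of the width transformations listed above can be sketched in Python as follows. All function names are hypothetical and widths are given in bits. The saturated path follows the list literally (sign-extension first, then saturation), including for unsigned saturation, which is an interpretation of the text rather than a confirmed definition.

```python
def mask(bits):
    return (1 << bits) - 1


def zext_or_trunc(value, from_bits, to_bits):
    """Zero-extend or truncate an unsigned bit pattern."""
    return value & mask(min(from_bits, to_bits))


def sext_or_trunc(value, from_bits, to_bits):
    """Sign-extend or truncate; returns a signed Python integer."""
    bits = min(from_bits, to_bits)
    value &= mask(bits)
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def saturate(value, bits, signed):
    """Clamp a signed Python integer to the range of `bits` bits."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, mask(bits)
    return max(lo, min(hi, value))


def ld_width_adjust(memread, op_width, src_elwidth, dest_elwidth, sat=None):
    """sat is None (non-saturated), "signed" or "unsigned"."""
    if sat is None:
        v = zext_or_trunc(memread, op_width, src_elwidth)
        return zext_or_trunc(v, src_elwidth, dest_elwidth)
    v = sext_or_trunc(memread, op_width, src_elwidth)
    v = saturate(v, dest_elwidth, signed=(sat == "signed"))
    return v & mask(dest_elwidth)   # back to a raw bit pattern


print(hex(ld_width_adjust(0x1234, 16, 16, 8)))                # 0x34
print(hex(ld_width_adjust(0x1234, 16, 32, 8, sat="signed")))  # 0x7f
print(hex(ld_width_adjust(0x8000, 16, 32, 8, sat="signed")))  # 0x80
```
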
In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation.  This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

It is unfortunately possible to request an elwidth override on the
memory side which does not mesh with the operation width: such requests
result in `UNDEFINED` behaviour.  The reason is that attempting a 64-bit
`sv.ld` operation with a source elwidth override of 8/16/32 would result
in overlapping memory requests, particularly on unit and element strided
operations.  Thus it is `UNDEFINED` when the elwidth is smaller than
the memory operation width. Examples include `sv.lw/sw=16/els` which
requests (overlapping) 4-byte memory reads offset from
each other at 2-byte intervals.  Store likewise is also `UNDEFINED`
where the dest elwidth override is less than the operation width.

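
The overlap in the `sv.lw/sw=16/els` example can be checked with a short Python sketch that lists the byte range touched by each element. The function names are hypothetical, and the model assumes ascending addresses spaced at a fixed interval in bytes.

```python
def element_byte_ranges(base, op_width, interval, vl):
    """Byte ranges read by each element: op_width bytes every `interval` bytes."""
    return [range(base + i * interval, base + i * interval + op_width)
            for i in range(vl)]


def has_overlap(ranges):
    """True if any range runs into the start of the next one."""
    return any(a.stop > b.start for a, b in zip(ranges, ranges[1:]))


# 4-byte reads at 2-byte intervals, as in sv.lw/sw=16/els
print(has_overlap(element_byte_ranges(0, 4, 2, 3)))  # True
# 4-byte reads at 4-byte intervals do not overlap
print(has_overlap(element_byte_ranges(0, 4, 4, 3)))  # False
```
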
Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        i++ if False else None  # (no-op placeholder removed below)
        j++;

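
The byte-reversal step of the pseudocode above can be modelled in a few lines of Python. The function names are hypothetical, `op_width` is taken in bytes, and the XNOR is written as an equality test on two booleans.

```python
def needs_byteswap(brev, msr_le):
    """brev XNOR MSR.LE: true when both flags are equal."""
    return bool(brev) == bool(msr_le)


def byteswap(value, op_width):
    """Reverse the byte order of an unsigned value that is op_width bytes wide."""
    return int.from_bytes(value.to_bytes(op_width, "little"), "big")


print(hex(byteswap(0x11223344, 4)))                # 0x44332211
print(needs_byteswap(brev=False, msr_le=True))     # False: plain ld in LE mode
print(needs_byteswap(brev=True, msr_le=True))      # True: ldbrx in LE mode
```

Applying `byteswap` twice returns the original value, which is a quick sanity check for any implementation of this step.
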
# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (a minimum of two 64-bit opcodes) it
provides a way to arbitrarily perform 1D, 2D and 3D "remapping" of up
to 64 elements worth of LDs or STs.  The usual interest in such
re-mapping is, for example, in separating out 24-bit RGB channel data
into separate contiguous registers.  NEON covers this as shown in the
diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.

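
As an illustration of the kind of index transformation involved in the RGB example, the Python sketch below maps each interleaved element to a position in three contiguous per-channel groups. It shows only the arithmetic of the de-interleave; it is not the actual REMAP schedule or its encoding, and the function name is hypothetical.

```python
def rgb_deinterleave_index(idx, pixels):
    """Destination position for interleaved element `idx` (R,G,B,R,G,B,...)
    when the output is laid out as all R, then all G, then all B."""
    channel, pixel = idx % 3, idx // 3
    return channel * pixels + pixel


print([rgb_deinterleave_index(i, 4) for i in range(12)])
# [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
```
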
# Notes from lxo

This section covers assembly notation for the immediate and indexed LD/ST.
The summary is that in immediate mode for LD it is not clear that, if the
destination register is Vectorised `RT.v` but the source `imm(RA)` is scalar,
the memory being read is *still a vector load*, known as "unit or element strides".

This anomaly is made clear with the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being
implicitly identical to the above, is prohibited and is a syntax error:

    sv.ld RT.v, imm(RA)

Notes taken from IRC conversation

    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other memory-operand constraints), append .v to it and you're done addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the address is a sum of registers

Permutations of vector selection, to identify the above asm-syntax:

     imm(RA)  RT.v   RA.v   nonstrided
         sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
           mem@     offs+r#2  offs+(r#2+1)  offs+(r#2+2)
           destreg  r#        r#+1          r#+2
     imm(RA)  RT.s   RA.v   nonstrided
         sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
           (dest r# is scalar) -> VSELECT mode
     imm(RA)  RT.v   RA.s   fixed stride: unit or element
         sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
           mem@r#2  +0   +1   +2
           destreg  r#   r#+1 r#+2
         sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
           mem@r#2  +0 ...   +offs ...  +offs*2
           destreg  r#       r#+1       r#+2
     imm(RA)  RT.s   RA.s   not vectorised
         sv.ld r#, ofst(r#2)

Indexed mode:

     RA,RB    RT.v  RA.v  RB.v
        sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
     RA,RB    RT.v  RA.s  RB.v
        sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
     RA,RB    RT.v  RA.v  RB.s
        sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
     RA,RB    RT.v  RA.s  RB.s
        sv.ldx r#.v, r#2, r#3 -> VSPLAT mode
     RA,RB    RT.s  RA.v  RB.v
     RA,RB    RT.s  RA.s  RB.v
     RA,RB    RT.s  RA.v  RB.s
     RA,RB    RT.s  RA.s  RB.s not vectorised