[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
* [[simple_v_extension/specification/ld.x]]

Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different modes:

* fixed stride (contiguous sequence with no gaps) aka "unit" stride
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* fail-first on the same (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride.  Since there is no way to tell which of the two is
intended from the OpenPOWER v3.0B Scalar opcode alone, the choice is
provided instead by the SV Context.

    # LD not VLD! format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, RC, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip non-predicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == shifted: # for FFT/DCT
          # FFT/DCT shifted mode
          if (RA.isvec)
            srcbase = ireg[RA+i]
          else
            srcbase = ireg[RA]
          offs = (i * immed) << RC
        elif svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

    # reverses the bitorder up to "width" bits
    def bitrev(val, VL):
        width = log2(VL)
        result = 0
        for _ in range(width):
            result = (result << 1) | (val & 1)
            val >>= 1
        return result

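As an illustrative (non-normative) aid, `bitrev` simply mirrors the low
`log2(VL)` bits of an element index, the reordering used by FFT/DCT-style
access patterns.  A runnable equivalent (assuming a power-of-two VL):

    # sketch: runnable equivalent of the bitrev helper above
    def bitrev(val, VL):
        width = VL.bit_length() - 1     # log2(VL) for power-of-two VL
        result = 0
        for _ in range(width):
            result = (result << 1) | (val & 1)
            val >>= 1
        return result

    print([bitrev(i, 8) for i in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]
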
Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
        # skip non-predicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          EA = ireg[RA] + ireg[RB]*j   # register-strided
        else
          EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
          break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source.  This is because there is room in svp64 to extend
RA-as-src as well as RA-as-dest, both independently as scalar or vector
*and* independently extending their range.

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk.  Fail-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace,
whereas strided fail-first (which creates contiguous sequential LDs)
does not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either (or, at best, is a quirk).
Realistically an alternative table meaning is needed for [[sv/svp64]]
mode.  The following modes make sense:

* saturation
* predicate-result (mostly for cache-inhibited LD/ST)
* normal
* fail-first (where Vector Indexed is banned)
* Signed Effective Address computation (Vector Indexed only)

Also, given that FFT, DCT and other related algorithms are of such high
importance in so many areas of Computer Science, a special "shift" mode
has been added.  This mode re-purposes part of the immediate as RC, a
register by which the immediate is shifted: `DS << GPR(RC)`.

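As a non-normative sketch derived from the `op_load` pseudocode above
(assuming the shift amount is held in the GPR selected by RC), the
element offsets in shift mode scale as follows:

    # sketch: element offsets in "shifted" (FFT/DCT) mode, not normative
    immed = 8          # DS immediate from the instruction
    shift = 2          # assumed contents of GPR(RC)
    VL = 4
    for i in range(VL):
        offs = (i * immed) << shift
        print(i, offs)   # 0 -> 0, 1 -> 32, 2 -> 64, 3 -> 96
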
The table for [[sv/svp64]] for `immed(RA)` is:

| 0-1 |  2  |  3   4  | description                |
| --- | --- |---------|----------------------------|
| 00  | 0   | dz  els | normal mode                |
| 00  | 1   | dz  shf | shift mode                 |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel        |
| 01  | inv | els RC1 | Rc=0: ffirst z/nonz        |
| 10  | N   | dz  els | sat mode: N=0/1 u/s        |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel   |
| 11  | inv | els RC1 | Rc=0: pred-result z/nonz   |

The `els` bit is only relevant when `RA.isvec` is clear: it indicates
whether stride is unit or element:

    if shifted:               # "shf" set: FFT/DCT shift mode
        svctx.ldstmode = shifted
    elif RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

An immediate of zero is a safety-valve to allow `LD-VSPLAT`: in effect
the multiplication of the immediate-offset by zero results in reading
from the exact same memory location.

For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur just
the once and be copied, rather than hitting the Data Cache multiple
times with the same memory read at the same location.  Cache-inhibited
`LD-VSPLAT`, by contrast, allows memory-mapped peripherals to have
multiple data values read in quick succession and stored in
sequentially numbered registers.

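As a small sketch (based on the `op_load` pseudocode above, not
normative), an element-strided load with an immediate of zero produces
the same Effective Address for every element, which is what makes the
VSPLAT behaviour possible:

    # sketch: immed=0 collapses every element's EA to the same address
    srcbase = 0x1000   # assumed contents of scalar RA
    immed = 0
    VL = 4
    for i in range(VL):
        EA = srcbase + i * immed   # element-strided offset
        print(i, hex(EA))          # every element reads 0x1000
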
For non-cache-inhibited ST from a Vector source onto a Scalar
destination: with the Vector loop effectively creating multiple memory
writes to the same location, we can deduce that the last of these will
be the "successful" one.  Thus, implementations are free and clear to
optimise out the overwriting STs, leaving just the last one as the
"winner".  Bear in mind that predicate masks will skip some elements
(in source non-zeroing mode).  Cache-inhibited ST operations on the
other hand **MUST** write out a Vector source multiple successive times
to the exact same Scalar destination.

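A rough, non-normative sketch of why that deduction holds: every element
of the Vector source targets the same Effective Address, so only the
final (predicated-in) element's value is observable afterwards:

    # sketch: Vector-source ST to a Scalar destination address
    src = [11, 22, 33, 44]   # Vector source register values
    EA = 0x2000              # assumed Scalar destination address
    mem = {}
    for j in range(len(src)):
        mem[EA] = src[j]     # each write lands on the same EA
    print(mem[EA])           # 44: the last element "wins"
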
Note that there are no immediate versions of cache-inhibited LD/ST.

The modes for the `RA+RB` indexed version are slightly different:

| 0-1 |  2  |  3   4  | description                  |
| --- | --- |---------|------------------------------|
| 00  | SEA | dz  sz  | normal mode                  |
| 01  | SEA | dz  sz  | Strided (scalar only source) |
| 10  | N   | dz  sz  | sat mode: N=0/1 u/s          |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel     |
| 11  | inv | dz  RC1 | Rc=0: pred-result z/nonz     |

Vector Indexed Strided Mode is qualified as follows:

    if mode = 0b01 and !RA.isvec and !RB.isvec:
        svctx.ldstmode = elementstride

A summary of the effect of Vectorisation of src or dest:

    imm(RA)  RT.v   RA.v    no stride allowed
    imm(RA)  RT.s   RA.v    no stride allowed
    imm(RA)  RT.v   RA.s    stride-select allowed
    imm(RA)  RT.s   RA.s    not vectorised
    RA,RB    RT.v  {RA|RB}.v   UNDEFINED
    RA,RB    RT.s  {RA|RB}.v   UNDEFINED
    RA,RB    RT.v  {RA&RB}.s   VSPLAT possible. stride selectable
    RA,RB    RT.s  {RA&RB}.s   not vectorised

Signed Effective Address computation is only relevant for Vector Indexed
Mode, when elwidth overrides are applied.  The source override applies
to RB: before adding to RA in order to calculate the Effective Address,
if SEA is set, RB is sign-extended from elwidth bits to the full 64
bits.  For other Modes (ffirst, saturate), all EA computation is
unsigned.

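The following is a minimal, non-normative sketch of that sign-extension
step (`sign_extend` is an illustrative helper; a source elwidth override
of 16 on RB is assumed):

    # sketch: Signed Effective Address computation, Vector Indexed mode
    def sign_extend(val, elwidth):
        # interpret the low `elwidth` bits of val as a signed quantity
        val &= (1 << elwidth) - 1
        if val & (1 << (elwidth - 1)):
            val -= (1 << elwidth)
        return val

    RA_val  = 0x10000        # assumed base (element of RA)
    RB_el   = 0xFFFC         # 16-bit element of RB, i.e. -4 when signed
    elwidth = 16
    mask64  = (1 << 64) - 1
    EA_unsigned = (RA_val + RB_el) & mask64                          # SEA clear
    EA_signed   = (RA_val + sign_extend(RB_el, elwidth)) & mask64    # SEA set
    print(hex(EA_unsigned), hex(EA_signed))   # 0x1fffc vs 0xfffc
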
Note that cache-inhibited LD/ST (`ldcix`), when VSPLAT is activated,
will perform **multiple** LD/ST operations, sequentially.  `ldcix` even
with a scalar src will read the same memory location *multiple times*,
storing the result in successive Vector destination registers.  This is
because the cache-inhibit instructions are used to read and write
memory-mapped peripherals.  If a genuine cache-inhibited `LD-VSPLAT` is
required then a *scalar* cache-inhibited LD should be performed,
followed by a VSPLAT-augmented mv.

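A sketch of that recommended two-step sequence (illustrative only, not
an encoding; `mem_ci_read` is a hypothetical stand-in for a single
cache-inhibited peripheral access):

    # sketch (not an encoding): genuine cache-inhibited LD-VSPLAT sequence
    def mem_ci_read(addr):
        return 0x5A                  # stand-in for one peripheral read

    EA, VL = 0x4000, 4
    val = mem_ci_read(EA)            # step 1: one scalar ldcix-style access
    vector_dest = [val] * VL         # step 2: VSPLAT-augmented mv into RT..RT+VL-1
    print(vector_dest)
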
## LD/ST ffirst

ffirst LD/ST to multiple pages via a Vectorised Index base is considered
a security risk due to the abuse of probing multiple pages in rapid
succession and getting feedback on which pages would fail.  Therefore
Vector Indexed LD/ST is prohibited entirely, and the Mode bit is instead
used for element-strided LD/ST.
See <https://bugs.libre-soc.org/show_bug.cgi?id=561>

    for(i = 0; i < VL; i++)
        reg[rt + i] = mem[reg[ra] + i * reg[rb]];

High security implementations, where any kind of speculative probing of
memory pages is considered a risk, should take advantage of the fact
that implementations may truncate VL at any point, without requiring
software to be rewritten and made non-portable.  Such implementations
may choose to *always* set VL=1, which will have the effect of
terminating any speculative probing (and also adversely affect
performance), but will at least not require applications to be
rewritten.

Low-performance simpler hardware implementations may also choose to set
VL=1 as the bare minimum compliant implementation of LD/ST Fail-First.
It is however critically important to remember that the first element
LD/ST **MUST** be treated as an ordinary LD/ST, i.e. it **MUST** raise
exceptions exactly like an ordinary LD/ST.

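As a non-normative sketch of that behaviour (`load_elem` is a
hypothetical helper reporting whether an element's access would fault):
the first element faults exactly like an ordinary LD/ST, while a fault
on any later element simply truncates VL:

    # sketch: element-strided fail-first with VL truncation at the first fault
    def ffirst_load(base, stride, VL, load_elem):
        result = []
        for i in range(VL):
            ok, value = load_elem(base + i * stride)   # hypothetical helper
            if not ok:
                if i == 0:
                    raise MemoryError("first element must fault as an ordinary LD")
                return result, i        # truncate VL at the failing element
            result.append(value)
        return result, VL

    # e.g. a page boundary at 0x3000 beyond which accesses fault
    probe = lambda ea: (ea < 0x3000, ea)
    print(ffirst_load(0x2FF0, 8, 8, probe))   # VL truncated to 2
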
# LOAD/STORE Elwidths <a name="elwidth"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld).  Only `extsb` and
others like it provide an explicit operation width.  There are therefore
*three* widths involved:

* operation width (lb=8, lh=16, lw=32, ld=64)
* src element width override
* destination element width override

Some care is therefore needed to express and make clear the
transformations, which are expressly in this order (see the sketch
after this list):

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
  - zero-extension or truncation from operation width to source elwidth
  - zero-extension or truncation to dest elwidth
* Saturated mode:
  - Sign-extension or truncation from operation width to source width
  - signed/unsigned saturation down to dest elwidth

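The following is a minimal, non-normative sketch of the Saturated-mode
chain for a 16-bit `lh` with a destination elwidth override of 8 (the
source elwidth is assumed wide enough to hold the sign-extended value;
`sign_extend` and `saturate_signed` are illustrative helpers):

    # sketch: saturated chain for a 16-bit lh with dest elwidth of 8
    def sign_extend(val, width):
        if val & (1 << (width - 1)):
            val -= (1 << width)
        return val

    def saturate_signed(val, width):
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
        return max(lo, min(hi, val))

    memread = 0x9C40                       # raw 16-bit lh result (-25536 signed)
    widened = sign_extend(memread, 16)     # to source elwidth (assumed 32 or 64)
    result  = saturate_signed(widened, 8)  # signed saturation to dest elwidth 8
    print(result)                          # -128
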
In order to respect OpenPOWER v3.0B Scalar behaviour the memory side is
treated effectively as completely separate and distinct from SV
augmentation.  This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

It is unfortunately possible to request an elwidth override on the
memory side which does not mesh with the operation width: these result
in `UNDEFINED` behaviour.  The reason is that attempting a 64-bit
`sv.ld` operation with a source elwidth override of 8/16/32 would
result in overlapping memory requests, particularly on unit and element
strided operations.  Thus it is `UNDEFINED` when the elwidth is smaller
than the memory operation width.  Examples include `sv.lw/sw=16/els`,
which requests (overlapping) 4-byte memory reads offset from each other
at 2-byte intervals.  Store likewise is also `UNDEFINED` where the dest
elwidth override is less than the operation width.

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant
  (`ldbrx` rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector
capability).

Note that twin predication, predication-zeroing, saturation and other
modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;

# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs.  The usual interest in such re-mapping is,
for example, in separating out 24-bit RGB channel data into separate
contiguous registers.  NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest elwidth overrides
and saturation may do so with built-in conversion that would normally
require additional width-extension, sign-extension and min/max
Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed"
opcodes, because the generic abstracted concept of "Remapping", when
applied to LD/ST, will give that same capability, with far more
flexibility.

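As a rough, non-normative illustration of the access pattern such a
remap expresses for the RGB case (this shows only the index transform,
not the svp64 REMAP encoding):

    # sketch: structure-packed LD of interleaved RGB into three register groups
    mem = [v for px in range(4) for v in (10*px, 10*px+1, 10*px+2)]  # R,G,B,R,G,B,...
    VL = 12                       # four pixels, three channels
    regs = [0] * VL
    for i in range(VL):
        channel, pixel = i % 3, i // 3
        # remap: element i of the LD is redirected so that all R values land
        # in the first group of registers, all G in the second, all B in the third
        regs[channel * 4 + pixel] = mem[i]
    print(regs)   # [R0,R1,R2,R3, G0,G1,G2,G3, B0,B1,B2,B3]
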
# notes from lxo

this section covers assembly notation for the immediate and indexed
LD/ST.  the summary is that in immediate mode for LD it is not clear
that, if the destination register is Vectorised `RT.v` but the source
`imm(RA)` is scalar, the memory being read is *still a vector load*,
known as "unit or element strides".

This anomaly is made clear with the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being
implicitly identical to the above, is prohibited and is a syntax error:

    sv.ld RT.v, imm(RA)

Notes taken from IRC conversation:

    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other memory-operand constraints), append .v to it and you're done addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the address is a sum of registers

permutations of vector selection, to identify above asm-syntax:

    imm(RA)  RT.v  RA.v  nonstrided
        sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
        mem@      0+r#2   offs+(r#2+1)   offs+(r#2+2)
        destreg   r#      r#+1           r#+2
    imm(RA)  RT.s  RA.v  nonstrided
        sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
        (dest r# is scalar) -> VSELECT mode
    imm(RA)  RT.v  RA.s  fixed stride: unit or element
        sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
        mem@r#2   +0      +1      +2
        destreg   r#      r#+1    r#+2
        sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
        mem@r#2   +0      ...     +offs   ...    +offs*2
        destreg   r#      r#+1    r#+2
    imm(RA)  RT.s  RA.s  not vectorised
        sv.ld r#, ofst(r#2)

424
425 RA,RB RT.v RA.v RB.v
426 sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
427 RA,RB RT.v RA.s RB.v
428 sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
429 RA,RB RT.v RA.v RB.s
430 sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
431 RA,RB RT.v RA.s RB.s
432 sv.ldx r#.v, r#2, r#3 -> VSPLAT mode
433 RA,RB RT.s RA.v RB.v
434 RA,RB RT.s RA.s RB.v
435 RA,RB RT.s RA.v RB.s
436 RA,RB RT.s RA.s RB.s not vectorised