openpower/sv/ldst.mdwn

[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
* [[simple_v_extension/specification/ld.x]]

Vectorisation of Load and Store requires the creation, from scalar operations,
of a number of different modes:

* fixed stride (contiguous sequence with no gaps), aka "unit" stride
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* fail-first on the same (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- (RA) + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride.  Because the OpenPOWER v3.0B Scalar opcode alone cannot
distinguish between the two, the choice is provided instead by the SV Context.

    # LD not VLD!  format - ldop RT, immed(RA)
    # op_width (in bytes): lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, RC, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip non-predicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == shifted: # for FFT/DCT
          # FFT/DCT shifted mode
          if (RA.isvec)
            srcbase = ireg[RA+i]
          else
            srcbase = ireg[RA]
          offs = (i * immed) << RC
        elif svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

    # reverses the bit order of val, over width=log2(VL) bits
    def bitrev(val, VL):
      width = log2(VL)
      result = 0
      for _ in range(width):
        result = (result << 1) | (val & 1)
        val >>= 1
      return result
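The address arithmetic of the immediate modes can be tried out with a small Python model. This is only a sketch under stated assumptions: `ld_imm_offsets` and its string mode names are hypothetical, RA is taken to be scalar, and the element index is assumed to advance on every iteration (for a LD with a vector RT this would be the destination index).

```python
from typing import List


def ld_imm_offsets(mode: str, vl: int, immed: int, op_width: int,
                   rc_shift: int = 0) -> List[int]:
    """Byte offsets added to a scalar RA, one per element index.

    mode     -- hypothetical name for svctx.ldstmode
    op_width -- operation width in bytes (lb=1, lh=2, lw=4, ld=8)
    rc_shift -- value of GPR(RC), only used in shifted mode
    """
    offsets = []
    for i in range(vl):
        if mode == "shifted":
            # FFT/DCT mode: the strided offset is shifted by RC
            offs = (i * immed) << rc_shift
        elif mode == "elementstride":
            # the immediate is the stride, there is no base offset
            offs = i * immed
        elif mode == "unitstride":
            # the immediate is the base offset, the stride is op_width
            offs = immed + i * op_width
        else:
            # no stride multiplier: every element reads the same address
            offs = immed
        offsets.append(offs)
    return offsets


print(ld_imm_offsets("unitstride", 4, 16, 8))
print(ld_imm_offsets("elementstride", 4, 24, 8))
```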

Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL;):
        # skip non-predicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
            break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update" mode
(`ldux`) to be effectively a *completely different* register from
RA-as-a-source.  This is because there is room in svp64 to extend RA-as-src
as well as RA-as-dest, both independently as scalar or vector *and*
independently extending their range.
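As an illustration of the indexed form, here is a minimal runnable Python model of the `op_ldx` loop. It is a sketch only: the register file is a plain list, memory is a dictionary keyed by address, the RA-update path is left out, and the names `skip_masked`, `regs`, `mem`, `src_pred` and `dst_pred` are hypothetical. A bounds check is added to the predicate skipping so that an all-zero mask terminates.

```python
from typing import Dict, List


def skip_masked(idx: int, pred: int, vl: int) -> int:
    """Advance idx to the next element whose predicate bit is set."""
    while idx < vl and not (pred >> idx) & 1:
        idx += 1
    return idx


def op_ldx(regs: List[int], mem: Dict[int, int], vl: int,
           rt: int, ra: int, rb: int,
           rt_vec: bool, ra_vec: bool, rb_vec: bool,
           src_pred: int = -1, dst_pred: int = -1) -> None:
    """Indexed load: regs[rt+j] = mem[regs[ra+i] + regs[rb+k]]."""
    i = j = k = 0
    while i < vl and j < vl and k < vl:
        # skip non-predicated RA, RB and RT elements
        if ra_vec:
            i = skip_masked(i, src_pred, vl)
        if rb_vec:
            k = skip_masked(k, src_pred, vl)
        if rt_vec:
            j = skip_masked(j, dst_pred, vl)
        if i >= vl or j >= vl or k >= vl:
            break
        ea = regs[ra + i] + regs[rb + k]  # indexed address
        regs[rt + j] = mem[ea]
        if not rt_vec:
            break  # destination scalar, end immediately
        if not ra_vec and not rb_vec:
            break  # scalar-scalar
        if ra_vec:
            i += 1
        if rb_vec:
            k += 1
        j += 1


# gather four elements: scalar base in r4, vector of offsets in r8..r11
regs = [0] * 32
regs[4] = 0x100
regs[8:12] = [0, 24, 8, 16]
mem = {0x100: 10, 0x108: 11, 0x110: 12, 0x118: 13}
op_ldx(regs, mem, 4, rt=16, ra=4, rb=8,
       rt_vec=True, ra_vec=False, rb_vec=True)
print(regs[16:20])
```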

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk.  Fail-first on Vector Indexed
allows attackers to probe large numbers of pages from userspace, whereas
strided fail-first (by creating contiguous sequential LDs) does not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either (or is a quirk). Realistically we need
an alternative table meaning for [[sv/svp64]] mode.  The following modes make sense:

* saturation
* predicate-result (mostly for cache-inhibited LD/ST)
* normal
* fail-first, where a vector source on RA or RB is banned

Also, given that FFT, DCT and other related algorithms
are of such high importance in so many areas of Computer
Science, a special "shift" mode has been added which
allows part of the immediate to be used instead as RC, a register
which shifts the immediate: `DS << GPR(RC)`.

The table for [[sv/svp64]] for `immed(RA)` is:

| 0-1 |  2  |  3   4  |  description              |
| --- | --- |---------|---------------------------|
| 00  | 0   |  dz els | normal mode               |
| 00  | 1   |  dz shf | shift mode                |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | els RC1 | Rc=0: ffirst z/nonz       |
| 10  | N   | dz  els | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | els RC1 | Rc=0: pred-result z/nonz  |

The `els` bit is only relevant when `RA.isvec` is clear: it indicates
whether the stride is unit or element:

    if shifted:
        svctx.ldstmode = shifted
    elif RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
in effect the multiplication of the immediate-offset by zero results
in reading from the exact same memory location.
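The mode selection can be written as a small Python function. This is a hedged sketch: `select_ldst_mode` is a hypothetical name, the first test is taken to be the FFT/DCT shift-mode bit (`shf`) from the table, and the final fall-through case, which the pseudocode leaves implicit, is labelled `"vsplat"` here on the assumption that a zero immediate with `els` set is the LD-VSPLAT safety-valve described above.

```python
def select_ldst_mode(shf: bool, ra_isvec: bool, els: int, immediate: int) -> str:
    """Pick the LD/ST addressing mode for the immed(RA) form."""
    if shf:
        return "shifted"        # FFT/DCT shift mode
    if ra_isvec:
        return "indexed"        # vector of base addresses plus immediate
    if els == 0:
        return "unitstride"     # stride is the operation width
    if immediate != 0:
        return "elementstride"  # stride is the immediate
    # els set but immediate zero: every element reads the same address
    return "vsplat"


for args in [(False, False, 0, 8), (False, False, 1, 8), (False, False, 1, 0)]:
    print(args, "->", select_ldst_mode(*args))
```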

For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
just the once and be copied, rather than hitting the Data Cache
multiple times with the same memory read at the same location.

For ST from a vector source onto a scalar destination: with the Vector
loop effectively creating multiple memory writes to the same location,
we can deduce that the last of these will be the "successful" one. Thus,
implementations are free to optimise out the overwritten STs,
leaving just the last one as the "winner".  Bear in mind that predicate
masks will skip some elements (in source non-zeroing mode).
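The "last store wins" deduction can be checked with a few lines of Python. This is a sketch with hypothetical names (`naive_scalar_dest_store`, `coalesced_scalar_dest_store`) that assumes source non-zeroing predication, so masked-out elements produce no store at all.

```python
from typing import List, Optional


def naive_scalar_dest_store(values: List[int], pred_mask: int) -> Optional[int]:
    """Issue every predicated store to the same address, in element order."""
    memory_cell = None
    for i, value in enumerate(values):
        if (pred_mask >> i) & 1:
            memory_cell = value  # each store overwrites the previous one
    return memory_cell


def coalesced_scalar_dest_store(values: List[int], pred_mask: int) -> Optional[int]:
    """Issue only the store of the highest-numbered active element."""
    for i in reversed(range(len(values))):
        if (pred_mask >> i) & 1:
            return values[i]
    return None  # no active elements: nothing is written


vec = [11, 22, 33, 44]
print(naive_scalar_dest_store(vec, 0b0111), coalesced_scalar_dest_store(vec, 0b0111))
```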

Note that there are no immediate versions of cache-inhibited LD/ST.

The modes for the `RA+RB` indexed version are slightly different:

| 0-1 |  2  |  3   4  |  description              |
| --- | --- |---------|---------------------------|
| 00  | 0   |  dz  sz | normal mode               |
| 00  | 1   |  rsvd   | reserved                  |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | dz  RC1 | Rc=0: ffirst z/nonz       |
| 10  | N   | dz   sz | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | dz  RC1 | Rc=0: pred-result z/nonz  |

A summary of the effect of Vectorisation of src or dest:

     imm(RA)  RT.v   RA.v   no stride allowed
     imm(RA)  RT.s   RA.v   no stride allowed
     imm(RA)  RT.v   RA.s   stride-select allowed
     imm(RA)  RT.s   RA.s   not vectorised
     RA,RB    RT.v  RA/RB.v ffirst banned
     RA,RB    RT.s  RA/RB.v ffirst banned
     RA,RB    RT.v  RA/RB.s VSPLAT possible
     RA,RB    RT.s  RA/RB.s not vectorised

Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will
perform **multiple** LD/ST operations, sequentially.  `ldcix` even with
scalar src will read the same memory location *multiple times*, storing
the result in successive Vector destination registers.  This is because
the cache-inhibit instructions are used to read and write memory-mapped
peripherals.
If a genuine cache-inhibited LD-VSPLAT is required then a *scalar*
cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv.
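The difference between the two behaviours shows up as soon as the address is a peripheral register with read side-effects. The Python model below is purely illustrative: `MmioFifo`, `vector_ci_load` and `scalar_ci_load_then_splat` are hypothetical names, and the FIFO stands in for any memory-mapped device whose value changes on each read.

```python
from collections import deque
from typing import Iterable, List


class MmioFifo:
    """A memory-mapped data register: every read pops one item."""

    def __init__(self, items: Iterable[int]) -> None:
        self.queue = deque(items)
        self.reads = 0

    def read(self) -> int:
        self.reads += 1
        return self.queue.popleft()


def vector_ci_load(dev: MmioFifo, vl: int) -> List[int]:
    """Vectorised cache-inhibited LD with scalar address: VL separate reads."""
    return [dev.read() for _ in range(vl)]


def scalar_ci_load_then_splat(dev: MmioFifo, vl: int) -> List[int]:
    """Scalar cache-inhibited LD followed by a splatting mv: one read."""
    value = dev.read()
    return [value] * vl


print(vector_ci_load(MmioFifo([1, 2, 3, 4]), 4))
print(scalar_ci_load_then_splat(MmioFifo([7, 8, 9]), 3))
```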

## LD/ST ffirst

ffirst LD/ST to multiple pages via a Vectorised base is considered a
security risk, because it can be abused to probe multiple pages in rapid
succession and get feedback on which pages would fail.  Therefore in
these special circumstances requesting ffirst with a vector base is
instead interpreted as element-strided LD/ST.
See <https://bugs.libre-soc.org/show_bug.cgi?id=561>

# LOAD/STORE Elwidths <a name="ldst"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld).  The only other
instructions that provide an explicit operation width are `extsb` and
others like it.  There are therefore *three* widths involved:

* operation width in bits (lb=8, lh=16, lw=32, ld=64)
* source element width override
* destination element width override

Some care is therefore needed to express and make clear the transformations,
which are expressly in this order:

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
   - zero-extension or truncation from operation width to source elwidth
   - zero-extension or truncation to dest elwidth
* Saturated mode:
   - Sign-extension or truncation from operation width to source width
   - signed/unsigned saturation down to dest elwidth
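The ordering of the width transformations can be sketched in Python as follows. This is an interpretation, not a definitive model: all helper names are hypothetical, byte-reversal is assumed to have been applied already, and in saturated mode the value is assumed to be read as a signed number at the narrower of the operation width and source elwidth before being clamped to the destination range. The result is returned as a bit pattern of `dest_elwidth` bits.

```python
def mask(width: int) -> int:
    """All-ones bit pattern of the given width."""
    return (1 << width) - 1


def sext(value: int, width: int) -> int:
    """Interpret the low `width` bits of value as a signed integer."""
    value &= mask(width)
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def saturate(value: int, width: int, signed: bool) -> int:
    """Clamp value to the signed or unsigned range of `width` bits."""
    if signed:
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    else:
        lo, hi = 0, mask(width)
    return max(lo, min(hi, value))


def ld_element(memread: int, op_width: int, src_elwidth: int,
               dest_elwidth: int, sat: bool = False, signed: bool = False) -> int:
    """Apply the elwidth steps to one loaded value (all widths in bits)."""
    if not sat:
        # zero-extend or truncate: op width -> source elwidth -> dest elwidth
        value = (memread & mask(op_width)) & mask(src_elwidth)
        return value & mask(dest_elwidth)
    # assumption: sign-extend or truncate to the source width first...
    value = sext(memread, min(op_width, src_elwidth))
    # ...then saturate down to the destination element width
    return saturate(value, dest_elwidth, signed) & mask(dest_elwidth)


print(hex(ld_element(0x1234, 16, 16, 8)))
print(hex(ld_element(0x1234, 16, 16, 8, sat=True, signed=True)))
```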

In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation.  This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
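The `brev XNOR MSR.LE` step and the byteswap at operation width can be exercised in isolation. The following Python is a sketch only: `needs_byteswap`, `byteswap` and `ld_memread` are hypothetical names, the operation width is given in bytes, and the raw value is assumed to be the one obtained by reading the bytes in little-endian order, mirroring the "LE-defined order" comment in the pseudocode.

```python
def needs_byteswap(brev: bool, msr_le: bool) -> bool:
    """XNOR of the two flags: swap when they are equal."""
    return brev == msr_le


def byteswap(value: int, op_width_bytes: int) -> int:
    """Reverse the byte order of value at the operation width."""
    return int.from_bytes(value.to_bytes(op_width_bytes, "little"), "big")


def ld_memread(raw: int, op_width_bytes: int, brev: bool, msr_le: bool) -> int:
    """Merge processor endianness and ld/ldbrx into one optional byteswap."""
    if needs_byteswap(brev, msr_le):
        return byteswap(raw, op_width_bytes)
    return raw


print(hex(ld_memread(0x11223344, 4, brev=False, msr_le=True)))
print(hex(ld_memread(0x11223344, 4, brev=True, msr_le=True)))
```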

# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (two 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs.  The usual interest in such re-mapping
is, for example, in separating out 24-bit RGB channel data into separate
contiguous registers.  NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.
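The effect of such a structured load on interleaved RGB bytes can be illustrated with an index calculation in Python. This only shows the resulting element ordering: `deinterleave_indices` and `structured_load` are hypothetical names, and the code makes no claim about how the actual REMAP schedules are encoded.

```python
from typing import List


def deinterleave_indices(n_pixels: int, n_channels: int = 3) -> List[List[int]]:
    """For each channel, the memory element indices that belong to it."""
    return [[pixel * n_channels + channel for pixel in range(n_pixels)]
            for channel in range(n_channels)]


def structured_load(mem: bytes, n_pixels: int, n_channels: int = 3) -> List[List[int]]:
    """Gather interleaved channel data into one contiguous vector per channel."""
    return [[mem[idx] for idx in plane]
            for plane in deinterleave_indices(n_pixels, n_channels)]


# four RGB pixels stored as R0 G0 B0 R1 G1 B1 ...
pixels = bytes([10, 20, 30, 11, 21, 31, 12, 22, 32, 13, 23, 33])
red, green, blue = structured_load(pixels, 4)
print(red, green, blue)
```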

# notes from lxo

This section covers assembly notation for the immediate and indexed LD/ST.
The summary is that in immediate mode for LD it is not obvious that, when the
destination register is Vectorised (`RT.v`) but the source `imm(RA)` is scalar,
the memory being read is *still a vector load*, known as "unit or element strides".

This anomaly is made clear with the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being implicitly identical to the above, is prohibited and is a syntax error:

    sv.ld RT.v, imm(RA)

Notes taken from IRC conversation:

    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other memory-operand constraints), append .v to it and you're done addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the address is a sum of registers

Permutations of vector selection, to identify the above asm-syntax:

     imm(RA)  RT.v   RA.v   nonstrided
         sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
           mem@     0+r#2   offs+(r#2+1)  offs+(r#2+2)
           destreg  r#      r#+1          r#+2
     imm(RA)  RT.s   RA.v   nonstrided
         sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
           (dest r# is scalar) -> VSELECT mode
     imm(RA)  RT.v   RA.s   fixed stride: unit or element
         sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
           mem@r#2  +0   +1   +2
           destreg  r#   r#+1 r#+2
         sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
           mem@r#2  +0 ...   +offs ...  +offs*2
           destreg  r#       r#+1       r#+2
     imm(RA)  RT.s   RA.s   not vectorised
         sv.ld r#, ofst(r#2)

Indexed mode:

     RA,RB    RT.v  RA.v  RB.v
        sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
     RA,RB    RT.v  RA.s  RB.v
        sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
     RA,RB    RT.v  RA.v  RB.s
        sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
     RA,RB    RT.v  RA.s  RB.s
        sv.ldx r#.v, r#2, r#3 -> VSPLAT mode
     RA,RB    RT.s  RA.v  RB.v
     RA,RB    RT.s  RA.s  RB.v
     RA,RB    RT.s  RA.v  RB.s
     RA,RB    RT.s  RA.s  RB.s not vectorised