openpower/sv/ldst.mdwn

[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>

Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different modes:

* fixed stride (contiguous sequence with no gaps), also known as "unit" stride
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* fail-first on the same (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbzx RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lbz RT,D(RA)
    EA <- (RA) + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride.  As there is no way to tell which from the OpenPOWER
v3.0B Scalar opcode alone, the choice is provided instead by the SV Context.

    # LD not VLD!  format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip non-predicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed
        elif svctx.ldstmode == unitstride:
          # unit stride mode: the whole vector starts at immed+RA
          srcbase = ireg[RA]
          offs = immed + i * op_width
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else:
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        # the stride modes step i as the element index even though RA is scalar
        if (RA.isvec || svctx.ldstmode in (elementstride, unitstride)) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

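The address arithmetic of the three immediate modes can be sketched in Python as follows. This is only an illustration, not part of the specification: `ld_addresses` and its string mode names are hypothetical, predication and RA-update are left out, and unit stride is assumed to add the immediate as a starting offset (matching the "whole vector is at ofst+r#" description in the notes further down).

```python
def ld_addresses(ra_vals, immed, op_width, vl, mode):
    """Effective address of each element for sv.ld RT.v, immed(RA).

    ra_vals  -- register values starting at RA (only ra_vals[0] is used
                when RA is scalar)
    op_width -- operation width in bytes (lb=1, lh=2, lw=4, ld=8)
    mode     -- "unit", "element" or "indexed" (hypothetical names)
    """
    if mode == "unit":
        # contiguous: each element follows the previous one in memory
        return [ra_vals[0] + immed + i * op_width for i in range(vl)]
    if mode == "element":
        # the immediate itself is the stride between elements
        return [ra_vals[0] + i * immed for i in range(vl)]
    if mode == "indexed":
        # RA is a vector of base addresses, each offset by the immediate
        return [ra_vals[i] + immed for i in range(vl)]
    raise ValueError(f"unknown mode {mode!r}")


if __name__ == "__main__":
    print([hex(a) for a in ld_addresses([0x1000], 8, 4, 3, "unit")])
    print([hex(a) for a in ld_addresses([0x1000], 8, 4, 3, "element")])
    print([hex(a) for a in ld_addresses([0x100, 0x200, 0x300], 8, 4, 3, "indexed")])
```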
Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL;):
        # skip non-predicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
            break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

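As a rough illustration of the indexed address calculation, the sketch below computes the effective addresses for a Vector destination. It is a simplification under stated assumptions: `ldx_addresses` is a hypothetical helper, a single element index stands in for the separate `i` and `k` counters, one source predicate mask is applied to both RA and RB, and the early-exit conditions of the pseudocode are not modelled.

```python
def ldx_addresses(regs, ra, rb, ra_isvec, rb_isvec, vl, pred=None):
    """Effective addresses for sv.ldx with a vector RT.

    regs -- list modelling the integer register file
    pred -- source predicate mask; None means all elements active
    """
    if pred is None:
        pred = (1 << vl) - 1
    addrs = []
    for idx in range(vl):
        if not (pred >> idx) & 1:
            continue  # masked-out element: skipped entirely
        # a vector operand steps through successive registers,
        # a scalar operand is re-read unchanged for every element
        a = regs[ra + idx] if ra_isvec else regs[ra]
        b = regs[rb + idx] if rb_isvec else regs[rb]
        addrs.append(a + b)
    return addrs


if __name__ == "__main__":
    regs = [10 * n for n in range(32)]
    # RA vector starting at r4, RB scalar r8, element 2 masked out
    print(ldx_addresses(regs, 4, 8, True, False, 4, 0b1011))
```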
Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update" mode (`ldux`) to be effectively a *completely different* register from RA-as-a-source.  This is because there is room in svp64 to extend RA-as-src as well as RA-as-dest, both independently as scalar or vector *and* independently extending their range.

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk.  Fail-first on Vector Indexed
allows attackers to probe large numbers of pages from userspace, where
strided fail-first (by creating contiguous sequential LDs) does not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either (or is, at best, a quirk).
Realistically we need an alternative table meaning for [[sv/svp64]] mode.
The following modes make sense:

* saturation
* predicate-result (mostly for cache-inhibited LD/ST)
* normal
* fail-first, where a vector source on RA or RB is banned

The table for [[sv/svp64]] for `immed(RA)` is:

| 0-1 |  2  |  3   4  | description              |
| --- | --- |---------|------------------------- |
| 00  | els | sz  dz  | normal mode              |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel      |
| 01  | inv | els RC1 | Rc=0: ffirst z/nonz      |
| 10  |   N | sz  els | sat mode: N=0/1 u/s      |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel |
| 11  | inv | els RC1 | Rc=0: pred-result z/nonz |

The `els` bit is only relevant when `RA.isvec` is clear, in which case it
selects whether the stride is unit or element:

    if RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride
    # otherwise (els set, immediate of zero): LD-VSPLAT

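The same decision can be written as a small Python function, which makes the fall-through case explicit. The function name and the returned strings are illustrative only; in particular `"vsplat"` is a label chosen here for the case where `els` is set and the immediate is zero.

```python
def select_ldst_mode(ra_isvec, els, immediate):
    """Pick the LD/ST mode for the immed(RA) form from the SV context bits."""
    if ra_isvec:
        return "indexed"        # els is ignored when RA is a vector
    if els == 0:
        return "unitstride"
    if immediate != 0:
        return "elementstride"
    return "vsplat"             # els set, immediate zero: same address each time


if __name__ == "__main__":
    for args in [(True, 1, 8), (False, 0, 8), (False, 1, 8), (False, 1, 0)]:
        print(args, "->", select_ldst_mode(*args))
```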
An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
in effect, multiplying the element index by an immediate offset of zero
results in every element reading from the exact same memory location.

For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
just the once and be copied, rather than hitting the Data Cache
multiple times with the same memory read at the same location.

For ST from a vector source onto a scalar destination: with the Vector
loop effectively creating multiple memory writes to the same location,
we can deduce that the last of these will be the "successful" one. Thus,
implementations are free and clear to optimise out the overwriting STs,
leaving just the last one as the "winner".  Bear in mind that predicate
masks will skip some elements (in source non-zeroing mode).

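To make the "winner" concrete: with a predicate mask, the surviving ST is the one from the highest-numbered active element below VL. A minimal sketch, assuming a simple bitmask predicate where bit `n` enables element `n` (`winning_store_element` is a hypothetical helper, not a spec function):

```python
def winning_store_element(vl, pred):
    """Index of the source element whose ST is the last to hit memory.

    Returns None when no element below VL is active, in which case
    no store takes place at all.
    """
    active = pred & ((1 << vl) - 1)   # ignore predicate bits at or above VL
    if active == 0:
        return None
    return active.bit_length() - 1    # position of the highest set bit


if __name__ == "__main__":
    print(winning_store_element(4, 0b0111))
    print(winning_store_element(8, 0b00100101))
    print(winning_store_element(4, 0b10000))
```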
Note that there are no immediate versions of cache-inhibited LD/ST.

The modes for the `RA+RB` indexed version are slightly different:

| 0-1 |  2  |  3   4  | description              |
| --- | --- |---------|------------------------- |
| 00  |   0 | sz  dz  | normal mode              |
| 00  | rsv | rsvd    | reserved                 |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel      |
| 01  | inv | sz  RC1 | Rc=0: ffirst z/nonz      |
| 10  |   N | sz  dz  | sat mode: N=0/1 u/s      |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel |
| 11  | inv | sz  RC1 | Rc=0: pred-result z/nonz |

A summary of the effect of Vectorisation of src or dest:

     imm(RA)  RT.v   RA.v   no stride allowed
     imm(RA)  RT.s   RA.v   no stride allowed
     imm(RA)  RT.v   RA.s   stride-select allowed
     imm(RA)  RT.s   RA.s   not vectorised
     RA,RB    RT.v  RA/RB.v ffirst banned
     RA,RB    RT.s  RA/RB.v ffirst banned
     RA,RB    RT.v  RA/RB.s VSPLAT possible
     RA,RB    RT.s  RA/RB.s not vectorised

Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will perform **multiple** LD/ST operations, sequentially.  `ldcix` even with scalar src will read the same memory location *multiple times*, storing the result in successive Vector destination registers.  This is because the cache-inhibit instructions are used to read and write memory-mapped peripherals.
If a genuine cache-inhibited LD-VSPLAT is required then a *scalar*
cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv.

# LOAD/STORE Elwidths <a name="ldst"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld).  The only other
instructions to provide an explicit operation width are `extsb` and
others like it.  There are therefore *three* widths involved:

* operation width in bits (lb=8, lh=16, lw=32, ld=64)
* source element width override
* destination element width override

Some care is therefore needed to express and make clear the transformations,
which are expressly in this order:

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
   - zero-extension or truncation from operation width to source elwidth
   - zero-extension or truncation to dest elwidth
* Saturated mode:
   - Sign-extension or truncation from operation width to source elwidth
   - signed/unsigned saturation down to dest elwidth

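The width transformations listed above can be sketched in Python. This is one possible reading, not a normative definition: `ld_adjust` and its helpers are hypothetical, the value is assumed to be already loaded and byte-reversed, truncation to a narrower source elwidth is assumed to keep the low bits, and the result is returned as a bit pattern at the dest elwidth.

```python
def mask(bits):
    """All-ones value of the given bit width."""
    return (1 << bits) - 1


def to_signed(value, bits):
    """Interpret the low `bits` of value as a two's-complement integer."""
    value &= mask(bits)
    return value - (1 << bits) if value >> (bits - 1) else value


def ld_adjust(memval, op_bits, src_bits, dest_bits, sat=None):
    """Apply source then dest elwidth to a value loaded at op_bits.

    sat is None (non-saturated), "signed" or "unsigned".
    """
    narrow = min(op_bits, src_bits)   # truncation only happens if src < op
    if sat is None:
        # zero-extend or truncate to source elwidth, then to dest elwidth
        return memval & mask(narrow) & mask(dest_bits)
    # sign-extend (or truncate) to source elwidth, held as a signed int
    value = to_signed(memval, narrow)
    if sat == "signed":
        lo, hi = -(1 << (dest_bits - 1)), mask(dest_bits - 1)
    else:
        lo, hi = 0, mask(dest_bits)
    # clamp into the dest range, then return the dest-width bit pattern
    return max(lo, min(hi, value)) & mask(dest_bits)


if __name__ == "__main__":
    # lh of 0xFFEE, source elwidth 32, dest elwidth 8
    print(hex(ld_adjust(0xFFEE, 16, 32, 8)))
    print(hex(ld_adjust(0xFFEE, 16, 32, 8, sat="signed")))
    print(hex(ld_adjust(0x7FFF, 16, 32, 8, sat="signed")))
    print(hex(ld_adjust(0x7FFF, 16, 32, 8, sat="unsigned")))
```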
In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation.  This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;

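The two byte-order helpers used by the pseudocode are simple enough to sketch directly. The Python below is illustrative only: `needs_bytereverse` is a hypothetical name for the `brev XNOR MSR.LE` expression, and `op_width` is assumed here to be in bytes.

```python
def needs_bytereverse(brev, msr_le):
    """brev XNOR MSR.LE: true when both flags agree."""
    return bool(brev) == bool(msr_le)


def byteswap(value, op_width):
    """Reverse the byte order of an op_width-byte unsigned value."""
    return int.from_bytes(value.to_bytes(op_width, "little"), "big")


if __name__ == "__main__":
    print(hex(byteswap(0x11223344, 4)))
    # plain ld on an LE processor: no reversal needed
    print(needs_bytereverse(brev=False, msr_le=True))
    # ldbrx on an LE processor: reversal needed
    print(needs_bytereverse(brev=True, msr_le=True))
```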
# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (a minimum of two 64-bit opcodes) it
provides a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs.  The usual interest in such re-mapping
is, for example, in separating out 24-bit RGB channel data into separate
contiguous registers.  NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.

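The effect of such a remap on interleaved RGB data can be shown as a plain index permutation. The sketch below does not model the actual SV REMAP encoding or setup; `deinterleave_order` is a hypothetical helper that only shows which memory element each successive destination element would need to come from.

```python
def deinterleave_order(n_pixels, n_channels=3):
    """Memory element index to read for each successive destination element,
    so that each channel ends up contiguous."""
    return [pixel * n_channels + ch
            for ch in range(n_channels)
            for pixel in range(n_pixels)]


if __name__ == "__main__":
    mem = ["R0", "G0", "B0", "R1", "G1", "B1", "R2", "G2", "B2"]
    order = deinterleave_order(3)
    print(order)
    print([mem[i] for i in order])
```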
# Notes from lxo

This section covers assembly notation for the immediate and indexed LD/ST.
The summary is that in immediate mode for LD it is not obvious that, when the
destination register is Vectorised (`RT.v`) but the source `imm(RA)` is scalar,
the memory being read is *still a vector load*, known as "unit or element strides".

This anomaly is made clear with the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being implicitly identical to the above, is prohibited and is a syntax error:

    sv.ld RT.v, imm(RA)

Notes taken from IRC conversation:

    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other memory-operand constraints), append .v to it and you're done addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the address is a sum of registers

Permutations of vector selection, to identify the above asm syntax:

     imm(RA)  RT.v   RA.v   nonstrided
         sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
           mem@     offs+r#2   offs+(r#2+1)  offs+(r#2+2)
           destreg  r#         r#+1          r#+2
     imm(RA)  RT.s   RA.v   nonstrided
         sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
           (dest r# is scalar) -> VSELECT mode
     imm(RA)  RT.v   RA.s   fixed stride: unit or element
         sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
           mem@r#2  +0   +1   +2
           destreg  r#   r#+1 r#+2
         sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
           mem@r#2  +0 ...   +offs ...  +offs*2
           destreg  r#       r#+1       r#+2
     imm(RA)  RT.s   RA.s   not vectorised
         sv.ld r#, ofst(r#2)

Indexed mode:

     RA,RB    RT.v  RA.v  RB.v
        sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
     RA,RB    RT.v  RA.s  RB.v
        sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
     RA,RB    RT.v  RA.v  RB.s
        sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
     RA,RB    RT.v  RA.s  RB.s
        sv.ldx r#.v, r#2, r#3 -> VSPLAT mode
     RA,RB    RT.s  RA.v  RB.v
     RA,RB    RT.s  RA.s  RB.v
     RA,RB    RT.s  RA.v  RB.s
     RA,RB    RT.s  RA.s  RB.s not vectorised