* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different modes (a short sketch follows the list):
* fixed stride (contiguous sequence with no gaps) aka "unit" stride
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* fail-first on the same (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).
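
As an informal illustration (not part of the specification), the modes
differ only in how each element's effective address is computed. A minimal
Python sketch, where `base`, `immed`, `op_width`, `bases` and `offsets`
are hypothetical names:

    # illustrative only: effective addresses for each LD/ST mode
    def effective_addrs(mode, base, immed, op_width, VL,
                        bases=None, offsets=None):
        if mode == "unit":      # contiguous: base + immed + i*op_width
            return [base + immed + i * op_width for i in range(VL)]
        if mode == "element":   # regularly offset, with gaps
            return [base + i * immed for i in range(VL)]
        if mode == "indexed":   # vector of bases plus vector of offsets
            return [bases[i] + offsets[i] for i in range(VL)]
        raise ValueError(mode)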
OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT, D(RA)
    EA <- (RA|0) + EXTS(D)
    RT <- MEM(EA)
Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.
Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride. As there is no way to tell which from the OpenPOWER
v3.0B Scalar opcode alone, the choice is provided instead by the SV
Context.
    # LD not VLD!  format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, RC, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip nonpredicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == bitreversed: # for FFT/DCT
          # FFT/DCT bitreversed mode
          if (RA.isvec)
            srcbase = ireg[RA+i]
          else
            srcbase = ireg[RA]
          offs = (bitrev(i, VL) * immed) << RC
        elif svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed
        else:
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA, optionally writing it back for "update" mode
        EA = srcbase + offs
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

    # reverses the bitorder up to "width" bits
    def bitrev(val, VL):
        width = log2(VL)
        result = 0
        for _ in range(width):
            result = (result << 1) | (val & 1)
            val >>= 1
        return result
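
For concreteness, a runnable Python version of `bitrev` (assuming VL is a
power of two) shows the element ordering the FFT/DCT mode produces:

    def bitrev(val, VL):
        width = VL.bit_length() - 1     # log2(VL)
        result = 0
        for _ in range(width):
            result = (result << 1) | (val & 1)
            val >>= 1
        return result

    print([bitrev(i, 8) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]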
Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
        # skip nonpredicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
            break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;
Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update" mode
(`ldux`) to be effectively a *completely different* register from
RA-as-a-source. This is because there is room in svp64 to extend RA-as-src
as well as RA-as-dest, both independently as scalar or vector *and*
independently extending their range.
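
A hedged Python sketch of the consequence (the register numbers and the
`ireg` file below are illustrative, not from the spec): the effective
address written back by an update-form load may land in a register other
than the one the base was read from:

    ireg = [0] * 128
    RA_src, RA_upd = 10, 66     # independently extended register numbers
    ireg[RA_src] = 0x4000
    EA = ireg[RA_src] + 8       # ldu-style effective address
    ireg[RA_upd] = EA           # "update" lands in a different register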
# Determining the LD/ST Modes
A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk. Fail-first on Vector Indexed
allows attackers to probe large numbers of pages from userspace, where
strided fail-first (by creating contiguous sequential LDs) does not.
In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either (or is, at best, a quirk).
Realistically we need an alternative table meaning for [[sv/svp64]] mode.
The following modes make sense:
* saturation
* predicate-result (mostly for cache-inhibited LD/ST)
* normal
* fail-first, where a vector source on RA or RB is banned
The table for [[sv/svp64]] for `immed(RA)` is:
| 0-1 | 2   | 3 4     | description                |
| --- | --- |---------|--------------------------- |
| 00  | 0   | dz els  | normal mode                |
| 00  | 1   | dz rsv  | bitreverse mode (FFT, DCT) |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel        |
| 01  | inv | els RC1 | Rc=0: ffirst z/nonz        |
| 10  | N   | dz els  | sat mode: N=0/1 u/s        |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel   |
| 11  | inv | els RC1 | Rc=0: pred-result z/nonz   |
The `els` bit is only relevant when `RA.isvec` is clear: this indicates
whether stride is unit or element:
    if bitreversed:
        svctx.ldstmode = bitreversed
    elif RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    else:
        svctx.ldstmode = elementstride
An immediate of zero is a safety-valve to allow `LD-VSPLAT`: in effect
the multiplication of the immediate-offset by zero results in reading
from the exact same memory location.
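
A quick hedged check in Python of that arithmetic (element-strided form,
with illustrative values):

    base, immed, VL = 0x1000, 0, 4
    eas = [base + i * immed for i in range(VL)]  # offs = i * immed
    assert eas == [0x1000] * 4   # every element reads the same EA: VSPLAT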
For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
just the once and be copied, rather than hitting the Data Cache
multiple times with the same memory read at the same location.
For ST from a vector source onto a scalar destination: with the Vector
loop effectively creating multiple memory writes to the same location,
we can deduce that the last of these will be the "successful" one. Thus,
implementations are free and clear to optimise out the overwriting STs,
leaving just the last one as the "winner". Bear in mind that predicate
masks will skip some elements (in source non-zeroing mode).
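
A minimal Python sketch of the "last one wins" deduction (values
illustrative):

    mem = {}
    EA, vec = 0x2000, [11, 22, 33]
    for v in vec:          # vector ST, scalar destination address
        mem[EA] = v        # each element overwrites the previous one
    assert mem[EA] == vec[-1]   # only the final (unmasked) ST survives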
Note that there are no immediate versions of cache-inhibited LD/ST.
The modes for the `RA+RB` indexed version are slightly different:
| 0-1 | 2   | 3 4     | description               |
| --- | --- |---------|-------------------------- |
| 00  | 0   | dz sz   | normal mode               |
| 00  | 1   | rsvd    | reserved                  |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | dz RC1  | Rc=0: ffirst z/nonz       |
| 10  | N   | dz sz   | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | dz RC1  | Rc=0: pred-result z/nonz  |
A summary of the effect of Vectorisation of src or dest:
    imm(RA)  RT.v  RA.v     no stride allowed
    imm(RA)  RT.s  RA.v     no stride allowed
    imm(RA)  RT.v  RA.s     stride-select allowed
    imm(RA)  RT.s  RA.s     not vectorised
    RA,RB    RT.v  RA/RB.v  ffirst banned
    RA,RB    RT.s  RA/RB.v  ffirst banned
    RA,RB    RT.v  RA/RB.s  VSPLAT possible
    RA,RB    RT.s  RA/RB.s  not vectorised
Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will
perform **multiple** LD/ST operations, sequentially. `ldcix` even with a
scalar src will read the same memory location *multiple times*, storing
the result in successive Vector destination registers. This is because
the cache-inhibit instructions are used to read and write memory-mapped
peripherals. If a genuine cache-inhibited LD-VSPLAT is required then a
*scalar* cache-inhibited LD should be performed, followed by a
VSPLAT-augmented mv.
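
A hypothetical assembler sketch of that sequence (the mnemonics and
register choices below are illustrative only, not confirmed syntax):

    lbzcix  r3, r0, r10    # one scalar cache-inhibited read of the device
    sv.ori  r8.v, r3, 0    # VSPLAT-style mv: broadcast r3 to r8..r8+VL-1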
# LOAD/STORE Elwidths <a name="ldst"></a>
Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld); in the rest of the
ISA only `extsb` and others like it specify an explicit operation width.
There are therefore *three* widths involved:
* operation width (lb=8, lh=16, lw=32, ld=64)
* src element width override
* destination element width override
Some care is therefore needed to express and make clear the transformations,
which are expressly in this order (a worked sketch follows the list):
* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
  - zero-extension or truncation from operation width to source elwidth
  - zero/truncation to dest elwidth
* Saturated mode:
  - Sign-extension or truncation from operation width to source width
  - signed/unsigned saturation down to dest elwidth
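
A hedged Python sketch of the non-saturated chain (the helper name
`adjust_wid` is borrowed from the pseudocode below; values illustrative):

    def adjust_wid(val, from_w, to_w):
        # zero-extend is a no-op on an unsigned value; truncate by masking
        return val & ((1 << to_w) - 1)

    memread = 0xFFFF                    # lh: operation width 16
    step1 = adjust_wid(memread, 16, 8)  # to source elwidth override (8)
    step2 = adjust_wid(step1, 8, 32)    # to destination elwidth (32)
    assert step2 == 0xFF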
In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation. This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.
Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.
Below is the pseudocode for Unit-Strided LD (which includes Vector
capability). Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:
    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation
        if svctx.saturation_mode:
            ... saturation adjustment ...
        else:
            # truncate/extend to over-ridden source width
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
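
The `brev XNOR MSR.LE` merge rule can be sanity-checked with a few lines
of Python (a sketch, not the specification):

    for brev in (False, True):          # ld vs ldbrx
        for LE in (False, True):        # MSR.LE: BE vs LE
            bytereverse = (brev == LE)  # XNOR
            print(brev, LE, bytereverse)
    # plain ld on LE needs no swap; ldbrx on LE swaps, and vice-versa on BE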
In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs. The usual interest in such re-mapping
is for example in separating out 24-bit RGB channel data into separate
contiguous registers. NEON covers this as shown in the diagram below:
<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >
Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.
Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.
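
For concreteness, a hedged Python sketch of the index arithmetic that a
stride-3 "Structure Packing" remap performs when de-interleaving RGB
(data values illustrative):

    packed = list(range(12))    # r0 g0 b0 r1 g1 b1 ... (4 pixels)
    VL, stride = 4, 3
    r = [packed[i*stride + 0] for i in range(VL)]
    g = [packed[i*stride + 1] for i in range(VL)]
    b = [packed[i*stride + 2] for i in range(VL)]
    assert r == [0, 3, 6, 9]    # one contiguous register group per channel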
# LD/ST Assembly Notation

This section covers assembly notation for the immediate and indexed LD/ST.
The summary is that, in immediate mode for LD, when the destination
register is Vectorised (`RT.v`) but the source `imm(RA)` is scalar, it is
not obvious that the memory being read is *still a vector load*, with unit
or element stride.
This anomaly is made clear with the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being
implicitly identical to the above, is prohibited and is a syntax error:

    sv.ld RT.v, imm(RA)
Notes taken from IRC conversation:
    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other
          memory-operand constraints), append .v to it and you're done
          addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the
          address is a sum of registers)
Permutations of vector selection, to identify the above asm-syntax:
    imm(RA)  RT.v  RA.v  nonstrided
        sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
        mem@ 0+r#2  offs+(r#2+1)  offs+(r#2+2)

    imm(RA)  RT.s  RA.v  nonstrided
        sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
        (dest r# is scalar) -> VSELECT mode

    imm(RA)  RT.v  RA.s  fixed stride: unit or element
        sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
        sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
        mem@r#2  +0 ... +offs ... +offs*2

    imm(RA)  RT.s  RA.s  not vectorised

    RA,RB
        sv.ldx r#.v, r#2, r#3.v   -> whole vector at r#2+r#3
        sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
        sv.ldx r#.v, r#2.v, r#3   -> vector of addresses
        sv.ldx r#.v, r#2, r#3     -> VSPLAT mode

    RA,RB    RT.s  RA.s  RB.s  not vectorised