[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>

Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different access types:

* fixed stride, referred to below as unit stride (contiguous sequence
  with no gaps)
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* fail-first on the same (where it makes sense to do so)

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- (RA) + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride.  Because the OpenPOWER v3.0B Scalar opcode alone
cannot say which of the two is intended, the choice is provided instead
by the SV Context.

    # LD not VLD!  format - ldop RT, immed(RA)
    # op_width in bytes: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip non-predicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = i * op_width
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else:
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

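The difference between the two stride modes in the pseudocode above comes down to what the element index is multiplied by. The following is a minimal Python sketch of just that address arithmetic, not of the full instruction. The function name `ld_imm_addresses` and the string values for `ldstmode` are hypothetical, `base` stands in for the scalar value held in RA, and the sketch assumes that the element index advances once per destination element.

```python
def ld_imm_addresses(base, immed, op_width, vl, ldstmode):
    """Return the effective addresses of one vectorised immed(RA) load.

    Illustrative only: `base` models the scalar contents of RA and
    `op_width` is in bytes (lb=1, lh=2, lw=4, ld=8).
    """
    if ldstmode == "unitstride":
        step = op_width   # elements sit back to back in memory
    elif ldstmode == "elementstride":
        step = immed      # the immediate is reused as the stride
    else:
        raise ValueError("unknown ldstmode: %r" % (ldstmode,))
    # EA = srcbase + offs, with offs = index * step as in the pseudocode
    return [base + idx * step for idx in range(vl)]


unit = ld_imm_addresses(0x1000, 16, 4, 4, "unitstride")
elem = ld_imm_addresses(0x1000, 16, 4, 4, "elementstride")
print([hex(a) for a in unit])
print([hex(a) for a in elem])
```
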
Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL;):
        # skip non-predicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
            break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

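To make the twin-predicated indexed loop easier to experiment with, here is a small Python model that follows the pseudocode above step by step. It is a sketch under stated assumptions rather than a reference implementation: the register file is a plain list, memory is a dict keyed by address, the predicates are passed in as integer bitmasks (`ps`, `pd`), the `RAupdate` path is left out, and the skip loops are bounded by `VL` so that they terminate.

```python
def op_ldx(ireg, mem, RT, RA, RB, VL, rt_vec, ra_vec, rb_vec,
           ps=None, pd=None):
    """Toy model of the indexed LD loop (no RAupdate, no elwidths)."""
    all_on = (1 << VL) - 1
    ps = all_on if ps is None else ps   # source predicate mask
    pd = all_on if pd is None else pd   # destination predicate mask
    i = j = k = 0
    while True:
        # skip masked-out elements, never running past VL
        if ra_vec:
            while i < VL and not (ps >> i) & 1:
                i += 1
        if rb_vec:
            while k < VL and not (ps >> k) & 1:
                k += 1
        if rt_vec:
            while j < VL and not (pd >> j) & 1:
                j += 1
        if i >= VL or j >= VL or k >= VL:
            break
        ea = ireg[RA + i] + ireg[RB + k]   # indexed address
        ireg[RT + j] = mem[ea]
        if not rt_vec:
            break                          # scalar destination
        if not ra_vec and not rb_vec:
            break                          # scalar-scalar
        if ra_vec:
            i += 1
        if rb_vec:
            k += 1
        j += 1                             # rt_vec is true here
    return ireg


regs = [0] * 32
regs[4] = 0x100                  # scalar base in RA
regs[8:12] = [0, 8, 16, 24]      # vector of offsets in RB
memory = {0x100: 11, 0x108: 22, 0x110: 33, 0x118: 44}
op_ldx(regs, memory, RT=16, RA=4, RB=8, VL=4,
       rt_vec=True, ra_vec=False, rb_vec=True)
print(regs[16:20])
```
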
Note in both cases that [[sv/svp64]] allows RA in "update" mode (`ldux`)
to be effectively a completely different register from RA-as-a-source.
This is because there is room in svp64 to extend RA-as-src as well as
RA-as-dest, both independently as scalar or vector *and* independently
extending their range.

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk.  Fail-first on Vector Indexed
allows attackers to probe large numbers of pages from userspace, where
strided fail-first (by creating contiguous sequential LDs) does not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either.  Realistically we need
an alternative meaning for the [[sv/svp64]] mode table, covering:

* saturation
* predicate-result
* normal
* fail-first, where vector source on RA or RB is banned

The table for [[sv/svp64]] for immed(RA) is:

| 0-1 |  2  |  3   4  | description              |
| --- | --- | ------- | ------------------------ |
| 00  | str | sz  dz  | normal mode              |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel      |
| 01  | inv | str RC1 | Rc=0: ffirst z/nonz      |
| 10  |   N | sz  str | sat mode: N=0/1 u/s      |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel |
| 11  | inv | str RC1 | Rc=0: pred-result z/nonz |

The `str` bit is only relevant when `RA.isvec` is clear: it selects
whether the stride is unit or element:

    if RA.isvec:
        svctx.ldstmode = indexed
    elif str == 0:
        svctx.ldstmode = unitstride
    else:
        svctx.ldstmode = elementstride

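Because the `str` bit sits in a different position depending on the mode, it can help to see the immed(RA) table and the stride selection combined into one decoder. The sketch below is a literal reading of the table, nothing more: the function name and the returned strings are hypothetical, the five mode bits are passed as a sequence indexed 0 to 4 exactly as the table numbers them (so no assumption is made about MSB0 versus LSB0 packing), and for the Rc=1 rows, where the table shows no `str` bit, the stride mode is returned as `None`.

```python
def decode_ldst_imm_mode(bits, rc, ra_isvec):
    """Decode the immed(RA) LD/ST mode table into (mode, ldstmode).

    bits: five 0/1 values, bits[0]..bits[4] matching the table columns.
    """
    mode = (bits[0], bits[1])
    if mode == (0, 0):
        name, str_bit = "normal", bits[2]
    elif mode == (1, 0):
        name, str_bit = "saturate", bits[4]
    else:
        name = "ffirst" if mode == (0, 1) else "pred-result"
        # Rc=1: bits 3-4 select a CR bit, so the table lists no str bit
        str_bit = None if rc else bits[3]

    if ra_isvec:
        ldstmode = "indexed"          # str is not relevant
    elif str_bit is None:
        ldstmode = None               # not specified by the table
    elif str_bit == 0:
        ldstmode = "unitstride"
    else:
        ldstmode = "elementstride"
    return name, ldstmode


print(decode_ldst_imm_mode([0, 0, 1, 0, 0], rc=0, ra_isvec=False))
print(decode_ldst_imm_mode([0, 1, 0, 1, 0], rc=0, ra_isvec=True))
```
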
The modes for the RA+RB indexed version are slightly different:

| 0-1 |  2  |  3   4  | description              |
| --- | --- | ------- | ------------------------ |
| 00  |   0 | sz  dz  | normal mode              |
| 00  | rsv | rsvd    | reserved                 |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel      |
| 01  | inv | sz  RC1 | Rc=0: ffirst z/nonz      |
| 10  |   N | sz  dz  | sat mode: N=0/1 u/s      |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel |
| 11  | inv | sz  RC1 | Rc=0: pred-result z/nonz |

A summary of the effect of Vectorisation of src or dest:

    imm(RA)  RT.v   RA.v   no stride allowed
    imm(RA)  RT.s   RA.v   no stride allowed
    imm(RA)  RT.v   RA.s   stride-select needed
    imm(RA)  RT.s   RA.s   not vectorised
    RA,RB    RT.v  RA/RB.v ffirst banned
    RA,RB    RT.s  RA/RB.v ffirst banned
    RA,RB    RT.v  RA/RB.s VSPLAT possible
    RA,RB    RT.s  RA/RB.s not vectorised

Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will
perform **multiple** LD/ST operations, sequentially.  `ldcix` even with
scalar src will read the same memory location *multiple times*, storing
the result in successive Vector destination registers.  This is because
the cache-inhibit instructions are used to read and write memory-mapped
peripherals.

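The reason repeated reads matter can be shown with a toy peripheral. The sketch below is purely illustrative: `FifoPort` is a hypothetical memory-mapped FIFO whose read has a side effect (it pops the next value), and `splat_load_uncached` models a cache-inhibited VSPLAT load that performs one real access per destination element instead of reading once and copying.

```python
class FifoPort:
    """Hypothetical memory-mapped FIFO: every read pops the next value."""

    def __init__(self, values):
        self.values = list(values)
        self.reads = 0

    def read(self):
        self.reads += 1
        return self.values.pop(0)


def splat_load_uncached(port, vl):
    # one bus access per destination element, performed in order
    return [port.read() for _ in range(vl)]


port = FifoPort([7, 8, 9, 10])
dest = splat_load_uncached(port, 4)
print(dest, port.reads)
```
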
# LOAD/STORE Elwidths <a name="ldst"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld).  Apart from these,
only `extsb` and others like it provide an explicit operation width.
There are therefore *three* widths involved:

* operation width in bits (lb=8, lh=16, lw=32, ld=64)
* source element width override
* destination element width override

Some care is therefore needed to express and make clear the transformations,
which are expressly in this order:

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
   - zero-extension or truncation from operation width to source elwidth
   - zero-extension or truncation to dest elwidth
* Saturated mode:
   - Sign-extension or truncation from operation width to source width
   - signed/unsigned saturation down to dest elwidth

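The ordering of the width transformations listed above can be sketched in Python as follows. This is a literal reading of the list, offered as an illustration only: the helper names (`mask`, `to_signed`, `ld_convert`) are hypothetical, all widths are in bits, byte-reversal is assumed to have already happened, and in saturated mode the result is returned as a Python integer (possibly negative) instead of a bit pattern.

```python
def mask(bits):
    return (1 << bits) - 1


def to_signed(value, bits):
    """Reinterpret the low `bits` bits of value as a signed number."""
    value &= mask(bits)
    return value - (1 << bits) if value >> (bits - 1) else value


def ld_convert(raw, op_bits, src_bits, dest_bits,
               saturate=False, signed=False):
    raw &= mask(op_bits)              # value as loaded at operation width
    if not saturate:
        v = raw & mask(src_bits)      # zero-extend/truncate to source elwidth
        return v & mask(dest_bits)    # zero-extend/truncate to dest elwidth
    # saturated mode: sign-extend or truncate to the source width ...
    v = to_signed(raw, op_bits)
    if src_bits < op_bits:
        v = to_signed(v, src_bits)
    # ... then clamp into the destination range
    if signed:
        lo, hi = -(1 << (dest_bits - 1)), (1 << (dest_bits - 1)) - 1
    else:
        lo, hi = 0, mask(dest_bits)
    return max(lo, min(hi, v))


print(hex(ld_convert(0x1234, 16, 32, 8)))
print(ld_convert(0x7FFF, 16, 32, 8, saturate=True, signed=True))
```
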
In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation.  This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight, fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;

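Two of the helpers used in the pseudocode above are simple enough to sketch directly. The Python below is an illustration, not the project's implementation: `byteswap` is assumed here to take `op_width` in bytes and an unsigned value that fits in that many bytes, and `needs_bytereverse` is a hypothetical name for the `brev XNOR MSR.LE` expression, which is true whenever the two flags agree.

```python
def byteswap(value, op_width):
    """Reverse the byte order of an unsigned op_width-byte value."""
    return int.from_bytes(value.to_bytes(op_width, "little"), "big")


def needs_bytereverse(brev, msr_le):
    # XNOR of two flags: true when both are set or both are clear
    return bool(brev) == bool(msr_le)


print(hex(byteswap(0x11223344, 4)))
print(needs_bytereverse(True, True), needs_bytereverse(True, False))
```
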
# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (a minimum of two 64-bit opcodes) it
provides a way to arbitrarily perform 1D, 2D and 3D "remapping" of up
to 64 elements worth of LDs or STs.  The usual interest in such
re-mapping is, for example, in separating out 24-bit RGB channel data
into separate contiguous registers.  NEON covers this as shown in the
diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.
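
As a rough picture of what such a re-mapping achieves for the RGB case, the following Python sketch computes which interleaved memory element ends up in which destination slot. It is not the REMAP algorithm from [[sv/propagation]]; the function names are hypothetical and the index order (channel-major, pixel-minor) is simply the 2D transpose that a structured load of this kind performs.

```python
def deinterleave_indices(n_pixels, channels=3):
    """Memory element index for each destination slot, channel by channel."""
    return [pixel * channels + ch
            for ch in range(channels)
            for pixel in range(n_pixels)]


def structured_load(mem_elements, n_pixels, channels=3):
    """Gather interleaved elements into one contiguous list per channel."""
    order = deinterleave_indices(n_pixels, channels)
    flat = [mem_elements[idx] for idx in order]
    return [flat[ch * n_pixels:(ch + 1) * n_pixels]
            for ch in range(channels)]


pixels = ["r0", "g0", "b0", "r1", "g1", "b1"]
print(deinterleave_indices(2))
print(structured_load(pixels, 2))
```
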