openpower/sv/ldst.mdwn

   1 [[!tag standards]]
   2
   3 # SV Load and Store
   4
   5 Links:
   6
   7 * <https://bugs.libre-soc.org/show_bug.cgi?id=561>
   8 * <https://bugs.libre-soc.org/show_bug.cgi?id=572>
   9 * <https://bugs.libre-soc.org/show_bug.cgi?id=571>
  10 * <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
  11 * <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
  12
  13 Vectorisation of Load and Store requires creation, from scalar operations,
  14 a number of different types:
  15
  16 * fixed stride (contiguous sequence with no gaps)
  17 * element strided (sequential but regularly offset, with gaps)
  18 * vector indexed (vector of base addresses and vector of offsets)
  19 * fail-first on the same (where it makes sense to do so)
  20
  21 OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
  22 [[isa/fixedstore]] pseudocode to be of the form:
  23
  24     lbux RT, RA, RB
  25     EA <- (RA) + (RB)
  26     RT <- MEM(EA)
  27
  28 and for immediate variants:
  29
  30     lb RT,D(RA)
  31     EA <- RA + EXTS(D)
  32     RT <- MEM(EA)
  33
  34 Thus in the first example, the source registers may each be independently
  35 marked as scalar or vector, and likewise the destination; in the second
  36 example only the one source and one dest may be marked as scalar or
  37 vector.
  38
  39 Thus we can see that Vector Indexed may be covered, and, as demonstrated
  40 with the pseudocode below, the immediate can be set to the element width
  41 in order to give unit or element stride.  With there being no way to tell which from the Scalar opcode, the choice is provided instead by the SV Context.
  42
  43     # LD not VLD!  format - ldop RT, immed(RA)
  44     # op_width: lb=1, lh=2, lw=4, ld=8
  45     op_load(RT, RA, op_width, immed, svctx, RAupdate):
  46       ps = get_pred_val(FALSE, RA); # predication on src
  47       pd = get_pred_val(FALSE, RT); # ... AND on dest
  48       for (i=0, j=0, u=0; i < VL && j < VL;):
  49         # skip nonpredicates elements
  50         if (RA.isvec) while (!(ps & 1<<i)) i++;
  51         if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
  52         if (RT.isvec) while (!(pd & 1<<j)) j++;
  53         if svctx.ldstmode == elementstride:
  54           # element stride mode
  55           srcbase = ireg[RA]
  56           offs = i * immed
  57         elif svctx.ldstmode == unitstride:
  58           # unit stride mode
  59           srcbase = ireg[RA]
  60           offs = i * op_width
  61         elif RA.isvec:
  62           # quirky Vector indexed mode but with an immediate
  63           srcbase = ireg[RA+i]
  64           offs = immed;
  65         else
  66           # standard scalar mode (but predicated)
  67           # no stride multiplier means VSPLAT mode
  68           srcbase = ireg[RA]
  69           offs = immed
  70
  71         # compute EA
  72         EA = srcbase + offs
  73         # update RA?
  74         if RAupdate: ireg[RAupdate+u] = EA;
  75         # load from memory
  76         ireg[RT+j] <= MEM[EA];
  77         if (!RT.isvec)
  78             break # destination scalar, end now
  79         if (RA.isvec) i++;
  80         if (RAupdate.isvec) u++;
  81         if (RT.isvec) j++;
  82
  83 Indexed LD is:
  84
  85     # format: ldop RT, RA, RB
  86     function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
  87       ps = get_pred_val(FALSE, RA); # predication on src
  88       pd = get_pred_val(FALSE, RT); # ... AND on dest
  89       for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
  90         # skip nonpredicated RA, RB and RT
  91         if (RA.isvec) while (!(ps & 1<<i)) i++;
  92         if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
  93         if (RB.isvec) while (!(ps & 1<<k)) k++;
  94         if (RT.isvec) while (!(pd & 1<<j)) j++;
  95         EA = ireg[RA+i] + ireg[RB+k] # indexed address
  96         if RAupdate: ireg[RAupdate+u] = EA
  97         ireg[RT+j] <= MEM[EA];
  98         if (!RT.isvec)
  99             break # destination scalar, end immediately
 100         if (!RA.isvec && !RB.isvec)
 101             break # scalar-scalar
 102         if (RA.isvec) i++;
 103         if (RAupdate.isvec) u++;
 104         if (RB.isvec) k++;
 105         if (RT.isvec) j++;
 106
 107 Note in both cases that [[sv/svp64]] allows RA in "update" mode (`ldux`) to be effectively a completely different register from RA-as-a-source.  This because there is room in svp64 to extend RA-as-src as well as RA-as-dest, both independently as scalar or vector *and* independently extending their range.
 108
 109 # Determining the LD/ST Modes
 110
 111 A minor complication (caused by the retro-fitting of modern Vector
 112 features to a Scalar ISA) is that certain features do not exactly make
 113 sense or are considered a security risk.  Fail-first on Vector Indexed
 114 allows attackers to probe large numbers of pages from userspace, where
 115 strided fail-first (by creating contiguous sequential LDs) does not.
 116
 117 In addition, even in other modes, Vector source RA makes no sense for
 118 computing offsets, and reduce mode even less.  Realistically we need
 119 an alternative table meaning for [[sv/svp64]] mode.
 120
 121 TODO
 122
 123     in all cases:
 124      - vector immed(RA) nonsense.
 125      - unit-stride/el-stride needed on immed(RA)
 126
 127     modes for immed(RA) version:
 128
 129     * saturation
 130     * predicate-result?
 131     * normal
 132     * fail-first
 133       - vector RA is "banned"
 134
 135 | 0-1 |  2  |  3   4  |  description              |
 136 | --- | --- |---------|-------------------------- |
 137 | 00  | str |  sz  dz | normal mode               |
 138 | 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
 139 | 01  | inv | str RC1 |  Rc=0: ffirst z/nonz |
 140 | 10  |   N | sz  str |  sat mode: N=0/1 u/s |
 141 | 11  | inv | CR-bit  |  Rc=1: pred-result CR sel |
 142 | 11  | inv | str RC1 |  Rc=0: pred-result z/nonz |
 143
 144 The `str` bit is only relevant when `RA.isvec` is clear: this indicates
 145 whether stride is unit or element:
 146
 147     if RA.isvec:
 148         svctx.ldstmode = indexed
 149     elif str == 0:
 150         svctx.ldstmode = unitstride
 151     else:
 152         svctx.ldstmode = elementstride
 153
 154 Thr modes for RA+RB indexed version are slightly different:
 155
 156     * saturation
 157     * predicate-result
 158     * normal
 159     * fail-first
 160       - vector RA or RB is "banned"
 161
 162
 163 | 0-1 |  2  |  3   4  |  description              |
 164 | --- | --- |---------|-------------------------- |
 165 | 00  |   0 |  sz  dz | normal mode                      |
 166 | 00  | rsv |  rsvd   | reserved                     |
 167 | 01  | inv | CR-bit  | Rc=1: ffirst CR sel              |
 168 | 01  | inv | sz  RC1 |  Rc=0: ffirst z/nonz |
 169 | 10  |   N | sz   dz |  sat mode: N=0/1 u/s |
 170 | 11  | inv | CR-bit  |  Rc=1: pred-result CR sel |
 171 | 11  | inv | sz  RC1 |  Rc=0: pred-result z/nonz |
 172
 173      imm(RA)  RT.v   RA.v   no stride allowed
 174      imm(RA)  RY.s   RA.v   no stride allowed
 175      imm(RA)  RT.v   RA.s   stride-select needed
 176      imm(RA)  RT.s   RA.s   not vectorised
 177      RA,RB    RT.v  RA/RB.v ffirst banned
 178      RA,RB    RT.s  RA/RB.v ffirst banned
 179      RA,RB    RT.v  RA/RB.s vsplat activated
 180      RA,RB    RT.s  RA/RB.s not vectirised
 181
 182 # LOAD/STORE Elwidths <a name="ldst"></a>
 183
 184 Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
 185 provides a width for the operation (lb, lh, lw, ld).  Only `extsb` and
 186 others like it provide an explicit operation width.  There are therefore
 187 *three* widths involved:
 188
 189 * operation width (lb=8, lh=16, lw=32, ld=64)
 190 s src elelent width override
 191 * destination element width override
 192
 193 Some care is therefore needed to express and make clear the transformations,
 194 which are expressly in this order:
 195
 196 * Load at the operation width (lb/lh/lw/ld) as usual
 197 * byte-reversal as usual
 198 * Non-saturated mode:
 199    - zero-extension or truncation from operation width to source elwidth
 200    - zero/truncation to dest elwidth
 201 * Saturated mode:
 202    - Sign-extension or truncation from operation width to source width
 203    - signed/unsigned saturation down to dest elwidth
 204
 205 In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
 206 is treated effectively as completely separate and distinct from SV
 207 augmentation.  This is primarily down to quirks surrounding LE/BE and
 208 byte-reversal in OpenPOWER.
 209
 210 Note the following regarding the pseudocode to follow:
 211
 212 * `scalar identity behaviour` SV Context parameter conditions turn this
 213   into a straight absolute fully-compliant Scalar v3.0B LD operation
 214 * `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
 215   rather than `ld`)
 216 * `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
 217   a "normal" part of Scalar v3.0B LD
 218 * `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
 219   as a "normal" part of Scalar v3.0B LD
 220 * `svctx` specifies the SV Context and includes VL as well as
 221   source and destination elwidth overrides.
 222
 223 Below is the pseudocode for Unit-Strided LD (which includes Vector capability).
 224
 225 Note that twin predication, predication-zeroing, saturation
 226 and other modes have all been removed, for clarity and simplicity:
 227
 228     # LD not VLD! (ldbrx if brev=True)
 229     # this covers unit stride mode and a type of vector offset
 230     function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
 231       for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):
 232
 233         if not svctx.unit/el-strided:
 234             # strange vector mode, compute 64 bit address which is
 235             # not polymorphic! elwidth hardcoded to 64 here
 236             srcbase = get_polymorphed_reg(RA, 64, i)
 237         else:
 238             # unit / element stride mode, compute 64 bit address
 239             srcbase = get_polymorphed_reg(RA, 64, 0)
 240             # adjust for unit/el-stride
 241             srcbase += ....
 242
 243         # takes care of (merges) processor LE/BE and ld/ldbrx
 244         bytereverse = brev XNOR MSR.LE
 245
 246         # read the underlying memory
 247         memread <= mem[srcbase + imm_offs];
 248
 249         # optionally performs byteswap at op width
 250         if (bytereverse):
 251             memread = byteswap(memread, op_width)
 252
 253
 254         # check saturation.
 255         if svpctx.saturation_mode:
 256             ... saturation adjustment...
 257         else:
 258             # truncate/extend to over-ridden source width.
 259             memread = adjust_wid(memread, op_width, svctx.src_elwidth)
 260
 261         # takes care of inserting memory-read (now correctly byteswapped)
 262         # into regfile underlying LE-defined order, into the right place
 263         # within the NEON-like register, respecting destination element
 264         # bitwidth, and the element index (j)
 265         set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)
 266
 267         # increments both src and dest element indices (no predication here)
 268         i++;
 269         j++;
 270
 271 # Remapped LD/ST
 272
 273 In the [[sv/propagation]] page the concept of "Remapping" is described.
 274 Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
 275 a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
 276 elements worth of LDs or STs.  The usual interest in such re-mapping
 277 is for example in separating out 24-bit RGB channel data into separate
 278 contiguous registers.  NEON covers this as shown in the diagram below:
 279
 280 <img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >
 281
 282 Remap easily covers this capability, and with dest
 283 elwidth overrides and saturation may do so with built-in conversion that
 284 would normally require additional width-extension, sign-extension and
 285 min/max Vectorised instructions as post-processing stages.
 286
 287 Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
 288 because the generic abstracted concept of "Remapping", when applied to
 289 LD/ST, will give that same capability, with far more flexibility.