[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>

Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different access types:

* unit stride (contiguous sequence with no gaps)
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* fail-first on the above (where it makes sense to do so)

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- (RA) + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be set to the element width
in order to give unit or element stride.  Since there is no way to tell
which from the Scalar opcode, the choice is provided instead by the
SV Context.

    # LD not VLD!
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip non-predicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = i * op_width
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else:
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        # i doubles as the stride counter: advance it in stride modes too
        if (RA.isvec || svctx.ldstmode == unitstride ||
            svctx.ldstmode == elementstride) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

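The address computation in `op_load` can be exercised with a small Python
sketch.  It is an illustration only: predication, RA-update and the memory
access itself are left out, `effective_addresses` and its parameters are
hypothetical names, and it assumes that the stride counter advances once
per element.

```python
def effective_addresses(mode, ireg, ra, immed, op_width, vl):
    """Return the list of EAs for an immediate-form LD over vl elements.

    mode is one of "unitstride", "elementstride", "indexed", "scalar".
    ireg is a plain list standing in for the integer register file.
    """
    eas = []
    for i in range(vl):
        if mode == "elementstride":
            srcbase, offs = ireg[ra], i * immed
        elif mode == "unitstride":
            srcbase, offs = ireg[ra], i * op_width
        elif mode == "indexed":
            # vector RA: each element supplies its own base address
            srcbase, offs = ireg[ra + i], immed
        else:
            # scalar RA, no stride multiplier: same address every time
            srcbase, offs = ireg[ra], immed
        eas.append(srcbase + offs)
    return eas


ireg = [0] * 32
ireg[5:9] = [0x1000, 0x2000, 0x3000, 0x4000]
for mode in ("unitstride", "elementstride", "indexed", "scalar"):
    # lw (op_width=4) with an immediate of 16, VL=4
    print(mode, [hex(ea) for ea in effective_addresses(mode, ireg, 5, 16, 4, 4)])
```

With `lw` and an immediate of 16, unit stride steps by 4 bytes and element
stride by 16 bytes, whilst the indexed form adds the same immediate to each
of four independent base addresses.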
Indexed LD is:

    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL;):
        # skip non-predicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
            break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

Note in both cases that [[sv/svp64]] allows RA in "update" mode (`ldux`)
to be effectively a completely different register from RA-as-a-source.
This is because there is room in svp64 to extend RA-as-src as well as
RA-as-dest, both independently as scalar or vector *and* independently
extending their range.

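How the separate source and destination predicates interact in the indexed
form is easiest to see by running it.  The following Python sketch mirrors
the loop structure of `op_ldx` but only records which destination element
receives which EA; RA-update and the memory read are omitted.  The name
`indexed_eas` and the explicit bounds checks on the skip loops are
additions made for this illustration.

```python
def indexed_eas(ireg, ra, rb, ra_isvec, rb_isvec, rt_isvec, ps, pd, vl):
    """Return (dest_element, EA) pairs for an indexed LD.

    ps and pd are integer bitmasks: bit n set means element n is active.
    """
    out = []
    i = j = k = 0
    while True:
        # skip masked-out elements, never stepping past VL
        while ra_isvec and i < vl and not (ps >> i) & 1:
            i += 1
        while rb_isvec and k < vl and not (ps >> k) & 1:
            k += 1
        while rt_isvec and j < vl and not (pd >> j) & 1:
            j += 1
        if i >= vl or j >= vl or k >= vl:
            break
        out.append((j, ireg[ra + i] + ireg[rb + k]))
        if not rt_isvec:
            break  # scalar destination: one result only
        if not ra_isvec and not rb_isvec:
            break  # scalar-scalar: nothing left to iterate
        if ra_isvec:
            i += 1
        if rb_isvec:
            k += 1
        if rt_isvec:
            j += 1
    return out


regs = [0] * 32
regs[4] = 0x100                 # scalar RA: base address
regs[8:12] = [0, 8, 16, 24]     # vector RB: offsets
# source mask 0b1011 skips offset element 2; all destinations active
pairs = indexed_eas(regs, 4, 8, False, True, True, 0b1011, 0b1111, 4)
print([(j, hex(ea)) for j, ea in pairs])
```

In this run the masked-out offset element is skipped and the three
remaining loads are packed into destination elements 0, 1 and 2.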
# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk.  Fail-first on Vector Indexed
allows attackers to probe large numbers of pages from userspace, where
strided fail-first (by creating contiguous sequential LDs) does not.

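The difference between the two cases can be illustrated with a small Python
sketch.  It assumes the usual fail-first convention, as used by the
fault-only-first loads in the RISC-V Vector spec linked at the top of this
page: a fault on element 0 traps normally, whilst a fault on any later
element truncates VL instead.  `ffirst_load`, `PAGE` and the `mapped_pages`
set are hypothetical stand-ins, not part of SV.

```python
PAGE = 4096


def ffirst_load(addresses, mapped_pages):
    """Fail-first load over a list of EAs.

    Returns (addresses_loaded, new_vl).  A fault on element 0 is raised
    as usual; a fault on any later element truncates VL instead.
    """
    loaded = []
    for n, ea in enumerate(addresses):
        if ea // PAGE not in mapped_pages:
            if n == 0:
                raise MemoryError(f"fault at {ea:#x}")
            break  # truncate: elements n onwards are not performed
        loaded.append(ea)
    return loaded, len(loaded)


# strided: can only walk forwards from a base that must itself be mapped
strided = [0x1ff8 + 8 * n for n in range(4)]
print(ffirst_load(strided, {1}))

# indexed: the caller names arbitrary pages, and the returned VL reveals
# the first unmapped one without any trap being taken
probes = [0x1000, 0x7000, 0x9000]
print(ffirst_load(probes, {1, 7}))
```

The strided run stops at the boundary of the page adjoining its base,
whereas the indexed run reports on pages chosen freely by the caller, which
is the probing risk described above.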
In addition, even in other modes, Vector source RA makes no sense for
computing offsets, and reduce mode even less.  Realistically we need
an alternative table meaning for [[sv/svp64]] mode.

TODO

    in all cases:
     - vector immed(RA) nonsense.
     - unit-stride/el-stride needed on immed(RA)

    modes for immed(RA) version:

    * saturation
    * predicate-result?
    * normal
    * fail-first
      - vector RA is "banned"

| 0-1 |  2  |  3   4  |  description              |
| --- | --- |---------|-------------------------- |
| 00  | str |  sz  dz | normal mode               |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | str RC1 | Rc=0: ffirst z/nonz       |
| 10  |   N | sz  str | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | str RC1 | Rc=0: pred-result z/nonz  |

The `str` bit is only relevant when `RA.isvec` is clear: this indicates
whether stride is unit or element:

    if RA.isvec:
        svctx.ldstmode = indexed
    elif str == 0:
        svctx.ldstmode = unitstride
    else:
        svctx.ldstmode = elementstride

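As a quick check of the selection rule, here is a minimal Python sketch.
The function names and the string mode names are hypothetical; the second
helper simply pairs each mode with the byte step that the `op_load`
pseudocode uses for it.

```python
def select_ldstmode(ra_isvec, str_bit):
    """Choose the LD/ST mode from RA.isvec and the `str` bit."""
    if ra_isvec:
        return "indexed"  # str is ignored when RA is a vector
    return "elementstride" if str_bit else "unitstride"


def element_step(mode, op_width, immed):
    """Bytes between successive elements, or None when each element
    takes its base address from its own RA register (indexed)."""
    steps = {"unitstride": op_width, "elementstride": immed, "indexed": None}
    return steps[mode]


for ra_isvec in (False, True):
    for str_bit in (0, 1):
        mode = select_ldstmode(ra_isvec, str_bit)
        # ld (op_width=8) with an immediate of 24
        print(ra_isvec, str_bit, mode, element_step(mode, 8, 24))
```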
The modes for the RA+RB indexed version are slightly different:

    * saturation
    * predicate-result
    * normal
    * fail-first
      - vector RA or RB is "banned"

| 0-1 |  2  |  3   4  |  description              |
| --- | --- |---------|-------------------------- |
| 00  |   0 |  sz  dz | normal mode               |
| 00  | rsv |  rsvd   | reserved                  |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | sz  RC1 | Rc=0: ffirst z/nonz       |
| 10  |   N | sz   dz | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | sz  RC1 | Rc=0: pred-result z/nonz  |

     imm(RA)  RT.v   RA.v   no stride allowed
     imm(RA)  RT.s   RA.v   no stride allowed
     imm(RA)  RT.v   RA.s   stride-select needed
     imm(RA)  RT.s   RA.s   not vectorised
     RA,RB    RT.v  RA/RB.v ffirst banned
     RA,RB    RT.s  RA/RB.v ffirst banned
     RA,RB    RT.v  RA/RB.s vsplat activated
     RA,RB    RT.s  RA/RB.s not vectorised

# LOAD/STORE Elwidths <a name="ldst"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld).  The only other
instructions that provide an explicit operation width are `extsb` and
others like it.  There are therefore *three* widths involved:

* operation width in bits (lb=8, lh=16, lw=32, ld=64)
* source element width override
* destination element width override

Some care is therefore needed to express and make clear the transformations,
which are expressly in this order:

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
   - zero-extension or truncation from operation width to source elwidth
   - zero-extension or truncation to dest elwidth
* Saturated mode:
   - sign-extension or truncation from operation width to source elwidth
   - signed/unsigned saturation down to dest elwidth

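The width steps in the list above (after the load and byte-reversal) can be
sketched in Python as follows.  This is a minimal illustration, not the
specification: `ld_width_adjust`, `mask` and `to_signed` are hypothetical
helpers, all widths are in bits, and signed results are returned as Python
negative integers rather than as bit patterns.

```python
def mask(bits):
    return (1 << bits) - 1


def to_signed(value, bits):
    """Interpret the low `bits` of value as two's complement."""
    value &= mask(bits)
    return value - (1 << bits) if value >> (bits - 1) else value


def ld_width_adjust(memread, op_width, src_elwidth, dest_elwidth,
                    saturate=False, signed=False):
    """Apply the elwidth steps to a raw op_width-bit value from memory."""
    if not saturate:
        # zero-extend or truncate to source elwidth, then to dest elwidth
        value = memread & mask(op_width) & mask(src_elwidth)
        return value & mask(dest_elwidth)
    # sign-extend (or truncate) from operation width to source elwidth
    value = to_signed(to_signed(memread, op_width), src_elwidth)
    # then clamp into the signed or unsigned range of the dest elwidth
    if signed:
        lo, hi = -(1 << (dest_elwidth - 1)), (1 << (dest_elwidth - 1)) - 1
    else:
        lo, hi = 0, mask(dest_elwidth)
    return max(lo, min(hi, value))


# lh (16-bit) load, source elwidth 32, destination elwidth 8
print(hex(ld_width_adjust(0x1234, 16, 32, 8)))
print(ld_width_adjust(0x1234, 16, 32, 8, saturate=True))
print(ld_width_adjust(0x8000, 16, 32, 8, saturate=True, signed=True))
```

The same halfword therefore ends up truncated to its low byte in
non-saturated mode, but clamped to the largest (or smallest) representable
8-bit value in the saturated modes.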
In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation.  This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

Note the following regarding the pseudocode to follow:

* when the SV Context parameters meet the `scalar identity behaviour`
  conditions, this becomes a straight, fully-compliant Scalar v3.0B
  LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;

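The byte-reversal decision in the pseudocode above merges two independent
inputs, so it is worth tabulating.  The Python sketch below does so; the
names `needs_byteswap` and `byteswap` are chosen here to match the
pseudocode, and it assumes `op_width` is given in bytes.

```python
def needs_byteswap(brev, msr_le):
    """bytereverse = brev XNOR MSR.LE (true when the two inputs agree)."""
    return brev == msr_le


def byteswap(value, op_width):
    """Reverse the byte order of an op_width-byte value."""
    return int.from_bytes(value.to_bytes(op_width, "little"), "big")


for brev in (False, True):
    for msr_le in (False, True):
        opcode = "ldbrx" if brev else "ld"
        endian = "LE" if msr_le else "BE"
        print(opcode, endian, "swap" if needs_byteswap(brev, msr_le) else "no swap")

print(hex(byteswap(0x11223344, 4)))
```

A plain `ld` on an LE processor and `ldbrx` on a BE processor both leave the
bytes untouched, which fits the pseudocode's treatment of the regfile as
being in LE-defined order.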
# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (a minimum of two 64-bit opcodes) it
provides a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to
64 elements worth of LDs or STs.  A typical use of such re-mapping is
separating out 24-bit RGB channel data into separate contiguous
registers.  NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.
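
The RGB case amounts to a 2D index transposition applied to the element
schedule.  The Python sketch below shows only that index arithmetic: it
does not model the actual Remap encoding from [[sv/propagation]], and
`rgb_remap_schedule` and `load_deinterleaved` are hypothetical names.

```python
def rgb_remap_schedule(pixels, channels=3):
    """Yield (dest_index, src_offset) pairs so that channel c of pixel p,
    found at byte p*channels + c in memory, lands at element c*pixels + p."""
    for c in range(channels):
        for p in range(pixels):
            yield c * pixels + p, p * channels + c


def load_deinterleaved(mem, base, pixels, channels=3):
    """Perform one byte-wide element load per scheduled index pair."""
    dest = [0] * (pixels * channels)
    for dst, src in rgb_remap_schedule(pixels, channels):
        dest[dst] = mem[base + src]
    return dest


# four packed pixels: R0 G0 B0 R1 G1 B1 ...
mem = bytes(range(12))
out = load_deinterleaved(mem, 0, 4)
print(out[0:4], out[4:8], out[8:12])
```

Each group of four destination elements then holds one channel, which is
the arrangement the NEON structured load produces in the diagram above.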