# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>

Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different access types (a minimal sketch of
each is given after the list):

* fixed stride (contiguous sequence with no gaps)
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)

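The following Python sketch (illustrative only, not part of the
specification; all names and values are invented for the example) shows the
effective addresses each of the three access types generates:

    # Illustrative sketch: effective-address generation for the three
    # vectorised LD/ST access types, for VL elements of "elwidth" bytes.

    def unit_stride_eas(base, elwidth, VL):
        # fixed stride: contiguous sequence with no gaps
        return [base + i * elwidth for i in range(VL)]

    def element_stride_eas(base, stride, VL):
        # element strided: sequential but regularly offset, with gaps
        return [base + i * stride for i in range(VL)]

    def vector_indexed_eas(bases, offsets):
        # vector indexed: vector of base addresses plus vector of offsets
        return [b + o for b, o in zip(bases, offsets)]

    # example: 4 doubleword (8-byte) elements
    print([hex(a) for a in unit_stride_eas(0x1000, 8, 4)])
    # ['0x1000', '0x1008', '0x1010', '0x1018']
    print([hex(a) for a in element_stride_eas(0x1000, 32, 4)])
    # ['0x1000', '0x1020', '0x1040', '0x1060']
    print([hex(a) for a in vector_indexed_eas([0x1000, 0x2000], [8, 16])])
    # ['0x1008', '0x2010']
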
OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be set to the element width
in order to give unit stride.

At the minimum however it is possible to provide unit stride and vector
mode, as follows:

    # LD not VLD!
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, update):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        # skip non-predicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if RA.isvec:
            # indirect mode (multi mode)
            srcbase = ireg[RA+i]
            offs = immed;
        else:
            srcbase = ireg[RA]
            if svctx.ldstmode == elementstride:
                # element stride mode
                offs = i * immed
            elif svctx.ldstmode == unitstride:
                # unit stride mode
                offs = i * op_width
            else:
                # standard scalar mode (but predicated)
                # no stride multiplier means VSPLAT mode
                offs = immed
        # compute EA
        EA = srcbase + offs
        # update RA?
        if update: ireg[RA+i] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RT.isvec) j++;

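The predicate-skipping at the top of that loop is worth isolating. The
following standalone Python sketch (illustrative, not normative; the
function and parameter names are invented) shows how the source and
destination element indices advance independently past masked-out elements:

    # Illustrative twin-predication index advancement: each predicate
    # independently skips its own masked-out elements, pairing source
    # element i with destination element j even when the masks differ.

    def twin_pred_pairs(src_pred, dst_pred, VL):
        pairs = []
        i = j = 0
        while i < VL and j < VL:
            # skip non-predicated source and destination elements
            while i < VL and not (src_pred & (1 << i)):
                i += 1
            while j < VL and not (dst_pred & (1 << j)):
                j += 1
            if i < VL and j < VL:
                pairs.append((i, j))  # element i is loaded into element j
                i += 1
                j += 1
        return pairs

    # VL=8: source mask selects elements 0,2,4,6; dest mask selects 1,3,5,7
    print(twin_pred_pairs(0b01010101, 0b10101010, 8))
    # [(0, 1), (2, 3), (4, 5), (6, 7)]
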
Indexed LD is:

    # LD not VLD!
    function op_ldx(RT, RA, RB, update=False)
      rdv = map_dest_extra(RT);
      rsv = map_src_extra(RA);
      rso = map_src_extra(RB);
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0; i < VL && j < VL && k < VL):
        # skip non-predicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[rsv+i] + ireg[rso+k] # indexed address
        if update: ireg[rsv+i] = EA
        ireg[rdv+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
            break # scalar-scalar
        if (RA.isvec) i++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

# LOAD/STORE Elwidths <a name="ldst"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
others like it provide an explicit operation width. In order to fit the
different types of LD/ST Modes into SV the src elwidth field is used to
select that Mode, and the actual src elwidth is implicitly the same as
the operation width. We then still apply Twin Predication but using:

* operation width (lb=8, lh=16, lw=32, ld=64) as src elwidth
* destination element width override

Saturation (and other transformations) occur on the value loaded from
memory as if it were of "infinite bitwidth", sign-extended (if Saturation
requests signed) from the source width (lb, lh, lw, ld), followed then
by the actual Saturation to the destination width.

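A non-normative Python sketch of that ordering, for a signed saturating
halfword load (lh, operation width 16) into an 8-bit destination element
(the helper names are invented for the example):

    # Sketch: sign-extend from the operation width as if of "infinite
    # bitwidth", then saturate to the destination element width.

    def sign_extend(value, width):
        sign_bit = 1 << (width - 1)
        return (value & (sign_bit - 1)) - (value & sign_bit)

    def signed_saturate(value, width):
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
        return max(lo, min(hi, value))

    def signed_saturate_load(mem_value, op_width_bits, dest_elwidth_bits):
        extended = sign_extend(mem_value, op_width_bits)
        return signed_saturate(extended, dest_elwidth_bits)

    print(signed_saturate_load(0x7FFF, 16, 8))  # +32767 saturates to 127
    print(signed_saturate_load(0x8000, 16, 8))  # -32768 saturates to -128
    print(signed_saturate_load(0xFFF0, 16, 8))  # -16 fits, stays -16
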
In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation. This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if RA.isvec:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit stride mode, compute the address
            srcbase = ireg[RA] + i * op_width;

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # now truncate/extend to over-ridden width.
        if not svctx.saturation_mode:
            memread = adjust_wid(memread, op_width, svctx.dest_elwidth)
        else:
            ... saturation adjustment...

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;

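The `brev XNOR MSR.LE` merge can be tabulated with a short Python sketch
(illustrative only): the byteswap is applied exactly when `brev` and
`MSR.LE` are equal:

    # Truth table for "bytereverse = brev XNOR MSR.LE", plus an example swap.

    def bytereverse(brev, msr_le):
        return not (brev ^ msr_le)  # XNOR

    def byteswap(value, op_width_bytes):
        # reverse the byte order of a value of the given width in bytes
        return int.from_bytes(value.to_bytes(op_width_bytes, "little"), "big")

    for brev in (False, True):
        for msr_le in (False, True):
            op = "ldbrx" if brev else "ld"
            mode = "LE" if msr_le else "BE"
            print(f"{op} on {mode}: byteswap={bytereverse(brev, msr_le)}")

    print(hex(byteswap(0x11223344, 4)))  # 0x44332211 (lw-sized value swapped)
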
When RA is marked as Vectorised the mode switches to an anomalous
version similar to Indexed. The element indices increment to select a
64 bit base address, effectively as if the src elwidth was hard-set to
"default". The important thing to note is that `i*op_width` is *not*
added on to the base address unless RA is marked as a scalar address.

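A minimal Python sketch (names invented, illustrative only) contrasting the
two `srcbase` computations above: scalar RA gets the unit-stride multiplier,
whereas vectorised RA supplies a full 64-bit base address per element with
no stride added:

    # Sketch of the srcbase selection in op_ld above.

    def srcbase_for_element(ra_isvec, ra, op_width, i):
        if ra_isvec:
            # vector RA: element i is itself the 64-bit base address
            return ra[i]
        # scalar RA: unit stride from the single base address
        return ra + i * op_width

    print(hex(srcbase_for_element(False, 0x1000, 8, 3)))              # 0x1018
    print(hex(srcbase_for_element(True, [0x1000, 0x9000, 0x2000], 8, 1)))  # 0x9000
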
# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs. The usual interest in such re-mapping
is for example in separating out 24-bit RGB channel data into separate
contiguous registers. Remap easily covers this capability, and with
elwidth overrides and saturation may do so with built-in conversion that
would normally require sign-extension and min/max Vectorised instructions.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that capability.
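
As a purely conceptual Python sketch (not the actual REMAP encoding or any
real API; all names are invented for the illustration), the following shows
the index re-mapping effect involved in de-interleaving packed 8-bit R,G,B
triplets into three contiguous element groups, the job that would otherwise
require dedicated "Structure Packed" LD/ST opcodes:

    # Conceptual sketch only: de-interleave packed 8-bit R,G,B triplets using
    # an index "remap", the effect a remapped elwidth-8 LD would achieve.

    def remap_rgb(packed, num_pixels):
        # packed is a flat list [R0, G0, B0, R1, G1, B1, ...]; destination
        # element i of each channel group gathers source index pixel*3+channel
        r = [packed[i * 3 + 0] for i in range(num_pixels)]
        g = [packed[i * 3 + 1] for i in range(num_pixels)]
        b = [packed[i * 3 + 2] for i in range(num_pixels)]
        return r, g, b

    pixels = [10, 20, 30,  11, 21, 31,  12, 22, 32,  13, 23, 33]
    print(remap_rgb(pixels, 4))
    # ([10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33])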