* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
Vectorisation of Load and Store requires the creation, from scalar operations,
of a number of different access types:
* fixed stride (contiguous sequence with no gaps)
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
OpenPOWER Load/Store operations may be seen from the [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:
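The indexed form may be sketched as follows (an illustrative reconstruction
in the style of the pseudocode below, not the exact [[isa/fixedload]] text):

    function op_ldx(RT, RA, RB) # LD not VLD!
       ireg[RT] <= MEM[ireg[RA] + ireg[RB]];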
and for immediate variants:
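A corresponding sketch for the immediate form (again illustrative):

    function op_ld(RT, RA, immed) # LD not VLD!
       ireg[RT] <= MEM[ireg[RA] + immed];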
Thus in the first example the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and the one destination may be marked as scalar
or vector.
Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be set to the element width
in order to give unit stride.

At the minimum, however, it is possible to provide unit stride and vector
mode, as follows:
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, update):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        # skip nonpredicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        # indirect mode (multi mode)
        if RA.isvec:
          # vector of base addresses, fixed immediate offset
          srcbase = ireg[RA+i]
          offs = immed
        elif svctx.ldstmode == elementstride:
          # element-strided: offset steps by the immediate
          srcbase = ireg[RA]
          offs = i * immed
        elif svctx.ldstmode == unitstride:
          # unit stride: contiguous elements of op_width bytes
          srcbase = ireg[RA]
          offs = immed + (i * op_width)
        else:
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed
        EA = srcbase + offs
        # update RA? load from memory
        if update: ireg[RA+i] = EA;
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end now
        if (RA.isvec) i++;
        if (RT.isvec) j++;
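As a cross-check, the address arithmetic of the four cases above can be
modelled in a few lines of Python (`effective_addrs` is a hypothetical
helper using the same mode names as the pseudocode; purely illustrative):

```python
# Hypothetical model of the EA computation in op_load above.
# base_vec models a Vectorised RA (one base address per element).
def effective_addrs(base, immed, op_width, vl, mode, base_vec=None):
    eas = []
    for i in range(vl):
        if mode == "indexed":          # RA.isvec: per-element base
            eas.append(base_vec[i] + immed)
        elif mode == "elementstride":  # regular gaps of `immed` bytes
            eas.append(base + i * immed)
        elif mode == "unitstride":     # contiguous, op_width apart
            eas.append(base + immed + i * op_width)
        else:                          # scalar base, scalar offset: VSPLAT
            eas.append(base + immed)
    return eas

# lw (op_width=4), unit stride: four contiguous words
print(effective_addrs(0x1000, 0, 4, 4, "unitstride"))
# → [4096, 4100, 4104, 4108]
```

Note how an element stride equal to `op_width` degenerates to unit stride,
which is why the immediate can stand in for the element width.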
    function op_ldx(RT, RA, RB, update=False) # LD not VLD!
      rdv = map_dest_extra(RT);
      rsv = map_src_extra(RA);
      rso = map_src_extra(RB);
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0; i < VL && j < VL && k < VL):
        # skip nonpredicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[rsv+i] + ireg[rso+k] # indexed address
        if update: ireg[rsv+i] = EA
        ireg[rdv+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
          break # both sources scalar: no more elements to compute
        if (RA.isvec) i++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;
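The predicate-skipping in the loops above can be illustrated with a small
Python sketch (`twin_pred_pairs` is a hypothetical helper; it pairs up the
active source and destination element indices the same way the inner
`while` loops do):

```python
# Hypothetical sketch of twin-predication index pairing: i skips
# zero bits of the src predicate, j skips zero bits of the dest one.
def twin_pred_pairs(ps, pd, vl):
    pairs, i, j = [], 0, 0
    while i < vl and j < vl:
        while i < vl and not (ps >> i) & 1:
            i += 1  # skip masked-out source elements
        while j < vl and not (pd >> j) & 1:
            j += 1  # skip masked-out destination elements
        if i < vl and j < vl:
            pairs.append((i, j))
            i, j = i + 1, j + 1
    return pairs

print(twin_pred_pairs(0b1010, 0b0110, 4))  # → [(1, 1), (3, 2)]
```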
# LOAD/STORE Elwidths <a name="ldst"></a>
Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld); only `extsb` and
others like it provide an explicit operation width elsewhere. In order
to fit the different types of LD/ST Modes into SV, the src elwidth field
is used to select the Mode, and the actual src elwidth is implicitly the
same as the operation width. We then still apply Twin Predication, but
using:
* operation width (lb=8, lh=16, lw=32, ld=64) as src elwidth
* destination element width override
Saturation (and other transformations) occurs on the value loaded from
memory as if it were of "infinite bitwidth": sign-extended (if Saturation
requests signed) from the source width (lb, lh, lw, ld), followed by
the actual Saturation to the destination width.
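A short Python sketch of this rule (the helper name is illustrative, not
from the spec): sign-extend from the operation width as if infinitely
wide, then clamp to the signed destination range:

```python
# Hypothetical model: signed saturation of a loaded value.
def sat_load(raw, op_width_bits, dest_bits):
    # 1. sign-extend from the operation width ("infinite bitwidth")
    if raw & (1 << (op_width_bits - 1)):
        raw -= 1 << op_width_bits
    # 2. saturate to the signed destination range
    lo = -(1 << (dest_bits - 1))
    hi = (1 << (dest_bits - 1)) - 1
    return max(lo, min(hi, raw))

# lh (16-bit) value 0x8000 is -32768; an 8-bit dest clamps to -128
print(sat_load(0x8000, 16, 8))  # → -128
```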
In order to respect OpenPOWER v3.0B Scalar behaviour, the memory side
is treated as effectively completely separate and distinct from SV
augmentation. This is primarily due to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.
Note the following regarding the pseudocode to follow:
* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  destination elwidth overrides
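The `brev` handling can be stated compactly: byte-reversal occurs when
`brev` and `MSR.LE` agree, i.e. an XNOR, merging `ld`/`ldbrx` behaviour
with the processor endianness (sketched below in Python; the function
name is illustrative):

```python
# Sketch: byte-reversal happens when brev and MSR.LE agree (XNOR),
# because the regfile byte order is LE-defined.
def bytereverse(brev, msr_le):
    return not (brev ^ msr_le)

# ld (brev=False) on a little-endian core: no swap needed
print(bytereverse(False, True))   # → False
# ldbrx (brev=True) on a little-endian core: swap
print(bytereverse(True, True))    # → True
```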
Below is the pseudocode for Unit-Strided LD (which includes Vector
capability). Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:
    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):
        if RA.isvec:
          # strange vector mode, compute 64 bit address which is
          # not polymorphic! elwidth hardcoded to 64 here
          srcbase = get_polymorphed_reg(RA, 64, i)
        else:
          # unit stride mode, compute the address
          srcbase = ireg[RA] + i * op_width;

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
          memread = byteswap(memread, op_width)

        # now truncate/extend to over-ridden width.
        if not svctx.saturation_mode:
          memread = adjust_wid(memread, op_width, svctx.dest_elwidth)
        else:
          ... saturation adjustment...

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
When RA is marked as Vectorised the mode switches to an anomalous
version similar to Indexed. The element indices increment to select a
64-bit base address, effectively as if the src elwidth were hard-set to
"default". The important thing to note is that `i*op_width` is *not*
added on to the base address unless RA is marked as a scalar address.
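This distinction can be sketched in Python (a hypothetical helper
mirroring the two branches of the pseudocode above):

```python
# Hypothetical sketch: with a Vectorised RA each element i selects a
# full 64-bit base address and i*op_width is NOT added; with a
# scalar RA the addresses step by op_width (unit stride).
def ld_addrs(ra, ra_is_vec, op_width, imm_offs, vl):
    if ra_is_vec:
        return [ra[i] + imm_offs for i in range(vl)]      # ra: list of bases
    return [ra + imm_offs + i * op_width for i in range(vl)]

print(ld_addrs([0x1000, 0x9000, 0x2000], True, 8, 0, 3))
# → [4096, 36864, 8192]  (bases used as-is, arbitrary order allowed)
print(ld_addrs(0x1000, False, 8, 0, 3))
# → [4096, 4104, 4112]   (unit stride of op_width)
```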
In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements' worth of LDs or STs. The usual interest in such re-mapping is,
for example, in separating out 24-bit RGB channel data into separate
contiguous registers. Remap easily covers this capability, and with
elwidth overrides and saturation may do so with built-in conversion that
would normally require sign-extension and min/max Vectorised instructions.
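The RGB example can be illustrated with a plain-Python index remap
(this models only the 1D index transform, not the SV Remap opcodes
themselves):

```python
# Illustrative 1D remap: gather interleaved R,G,B bytes into three
# contiguous channel-major groups, as the text describes.
def remap_rgb(data, npixels):
    # destination slot d takes its value from pixel d % npixels,
    # channel d // npixels of the interleaved source
    return [data[(d % npixels) * 3 + d // npixels]
            for d in range(npixels * 3)]

rgb = [1, 2, 3, 11, 12, 13, 21, 22, 23]  # (R,G,B) per pixel
print(remap_rgb(rgb, 3))
# → [1, 11, 21, 2, 12, 22, 3, 13, 23]
```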
Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes,
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that capability.