* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
Vectorisation of Load and Store requires the creation, from scalar operations, of a number of different access types:

* fixed stride (contiguous sequence with no gaps)
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
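
As a rough illustration (plain Python; the function and parameter names here are invented for the example, not taken from any spec), the three types generate effective addresses as follows:

    # illustration only: the sequence of effective addresses (EAs)
    # that each of the three access types generates
    def fixed_stride(base, elwidth, VL):
        # contiguous sequence with no gaps: the step is the element width
        return [base + i * elwidth for i in range(VL)]

    def element_strided(base, stride, VL):
        # sequential but regularly offset: the step may exceed the element width
        return [base + i * stride for i in range(VL)]

    def vector_indexed(bases, offsets):
        # per-element base address plus per-element offset
        return [b + o for b, o in zip(bases, offsets)]

    print([hex(ea) for ea in fixed_stride(0x1000, 4, 4)])
    # ['0x1000', '0x1004', '0x1008', '0x100c']
    print([hex(ea) for ea in element_strided(0x1000, 16, 4)])
    # ['0x1000', '0x1010', '0x1020', '0x1030']
    print([hex(ea) for ea in vector_indexed([0x1000, 0x2000], [8, 12])])
    # ['0x1008', '0x200c']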
OpenPOWER Load/Store operations may be seen from the [[isa/fixedload]] and [[isa/fixedstore]] pseudocode to be of the form:
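
    op_load (RT, RA, RB)   # RT <- MEM(ireg[RA] + ireg[RB])
    op_store(RS, RA, RB)   # MEM(ireg[RA] + ireg[RB]) <- RS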
and for immediate variants:
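
    op_load (RT, RA, immed)  # RT <- MEM(ireg[RA] + EXTS(immed))
    op_store(RS, RA, immed)  # MEM(ireg[RA] + EXTS(immed)) <- RS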
Thus in the first example the source registers may each be independently marked as scalar or vector, and likewise the destination; in the second example only the one source and the one destination may be so marked.

Thus we can see that Vector Indexed may be covered and, as demonstrated with the pseudocode below, the immediate can be set to the element width in order to give unit stride.

At the minimum, however, it is possible to provide unit stride and vector mode, as follows:

    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, update):
      rdv = map_dest_extra(RT); # possible REMAP
      rsv = map_src_extra(RA); # possible REMAP
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        # skip non-predicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        # indirect mode (multi mode)
        if (RA.isvec):
          srcbase = ireg[rsv+i]
        else:
          srcbase = ireg[rsv]
        if svctx.ldstmode == elementstride:
          # element stride mode: the immediate is the stride
          offs = i * immed
        elif svctx.ldstmode == unitstride:
          # unit stride mode: step by the operation width
          offs = i * op_width
        else:
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          offs = immed
        # compute the effective address
        EA = srcbase + offs
        # update RA? load from memory
        if (update): ireg[rsv+i] = EA;
        ireg[rdv+j] <= MEM[EA];
        if (!RT.isvec):
          break # destination scalar, end now
        if (RA.isvec) i++;
        if (RT.isvec) j++;
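
As a worked example (values purely illustrative): with `lw` (op_width=4), VL=4 and a scalar RA holding 0x1000, unit stride generates EAs 0x1000, 0x1004, 0x1008, 0x100c; element stride with immed=16 generates 0x1000, 0x1010, 0x1020, 0x1030; and with no stride mode set, a scalar RA combined with a vector RT loads the same EA into every destination element (VSPLAT).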
    function op_ldx(RT, RA, RB, update=False) # LD not VLD!
      rdv = map_dest_extra(RT);
      rsv = map_src_extra(RA);
      rso = map_src_extra(RB);
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0; i < VL && j < VL && k < VL;):
        # skip non-predicated elements of RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[rsv+i] + ireg[rso+k] # indexed address
        if (update): ireg[rsv+i] = EA
        ireg[rdv+j] <= MEM[EA];
        if (!RT.isvec):
          break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec):
          break # scalar-scalar, end immediately
        if (RA.isvec) i++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;
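
Again as an illustration: with VL=3, a Vectorised RA starting at r8 and a Vectorised RB starting at r16, the EAs are (r8)+(r16), (r9)+(r17) and (r10)+(r18); marking RB as scalar instead gives (r8)+(r16), (r9)+(r16), (r10)+(r16), i.e. a constant offset applied to a vector of base addresses.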
# LOAD/STORE Elwidths <a name="ldst"></a>
Loads and Stores are almost unique in that the OpenPOWER Scalar ISA provides a width for the operation (lb, lh, lw, ld); only `extsb` and others like it otherwise provide an explicit operation width. In order to fit the different types of LD/ST Modes into SV, the src elwidth field is used to select the Mode, and the actual src elwidth is implicitly the same as the operation width. We then still apply Twin Predication, but using:
* operation width (lb=8, lh=16, lw=32, ld=64) as src elwidth
* destination element width override
Saturation (and other transformations) occur on the value loaded from memory as if it were of "infinite bitwidth": it is sign-extended (if Saturation requests signed) from the source width (lb, lh, lw, ld), followed by the actual Saturation to the destination width.
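
A minimal sketch of this rule in plain Python (the helper names are invented for the example; Python's arbitrary-precision integers stand in for the "infinite bitwidth" intermediate):

    def sign_extend(value, width):
        # interpret the low `width` bits of value as a signed number
        value &= (1 << width) - 1
        sign_bit = 1 << (width - 1)
        return (value ^ sign_bit) - sign_bit

    def saturate_signed(value, width):
        # clamp an arbitrary-precision value to the signed `width`-bit range
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
        return max(lo, min(hi, value))

    # a 32-bit lw with a 16-bit destination elwidth override, signed saturation:
    memread = 0x00018000              # raw word as read from memory
    v = sign_extend(memread, 32)      # 98304, as if of infinite bitwidth
    print(saturate_signed(v, 16))     # 32767: clamped to the destination width

Without saturation the same value would simply be truncated or extended to the destination width (the role of `adjust_wid` in the pseudocode below).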
In order to respect OpenPOWER v3.0B Scalar behaviour the memory side is treated effectively as completely separate and distinct from SV augmentation. This is primarily down to quirks surrounding LE/BE and byte-reversal in OpenPOWER.

Note the following regarding the pseudocode to follow:
* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  destination elwidth overrides.
Below is the pseudocode for Unit-Strided LD (which includes Vector capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if (RA.isvec):
          # strange vector mode, compute 64 bit address which is
          # not polymorphic! elwidth hardcoded to 64 here
          srcbase = get_polymorphed_reg(RA, 64, i)
        else:
          # unit stride mode, compute the address
          srcbase = ireg[RA] + i * op_width;

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
          memread = byteswap(memread, op_width)

        # now truncate or extend to the over-ridden destination width
        if not svctx.saturation_mode:
          memread = adjust_wid(memread, op_width, svctx.dest_elwidth)
        else:
          ... saturation adjustment...

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

        # increment both src and dest element indices (no predication here)
        i++;
        j++;
When RA is marked as Vectorised, the mode switches to an anomalous version similar to Indexed. The element indices increment and select a full 64-bit base address from each element of RA, effectively as if the src elwidth were hard-set to "default". The important thing to note is that `i*op_width` is *not* added on to the base address unless RA is marked as a scalar address.
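
For example (illustrative values only): with VL=2 and a Vectorised RA starting at r8, the two EAs are (r8)+imm_offs and (r9)+imm_offs, each element of RA being read as a complete 64-bit address; `op_width` plays no part in the address calculation in this mode.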