* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
Vectorisation of Load and Store requires the creation, from scalar operations,
of a number of different access types:
* fixed stride (contiguous sequence with no gaps)
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
OpenPOWER Load/Store operations may be seen from the [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:
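The indexed form may be sketched as follows (an illustrative reconstruction
in the style of the pseudocode below, not the exact [[isa/fixedload]] text):

    function op_ldx(RT, RA, RB) # LD not VLD!
       ireg[RT] <= MEM[ireg[RA] + ireg[RB]];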
and for immediate variants:
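A corresponding sketch for the immediate form (again illustrative):

    function op_ld(RT, RA, immed) # LD not VLD!
       ireg[RT] <= MEM[ireg[RA] + immed];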
Thus in the first example the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and the one destination may be marked as scalar
or vector.
Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be set to the element width
in order to give unit stride.

At the minimum, however, it is possible to provide unit stride and vector
mode, as follows:
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, update):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (int i = 0, int j = 0; i < VL && j < VL;):
        # skip nonpredicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        # indirect mode (multi mode)
        if RA.isvec:
          # vector of base addresses, fixed immediate offset
          srcbase = ireg[RA+i]
          offs = immed
        elif svctx.ldstmode == elementstride:
          # element-strided: offset steps by the immediate
          srcbase = ireg[RA]
          offs = i * immed
        elif svctx.ldstmode == unitstride:
          # unit stride: contiguous elements of op_width bytes
          srcbase = ireg[RA]
          offs = immed + (i * op_width)
        else:
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed
        EA = srcbase + offs
        # update RA? load from memory
        if update: ireg[RA+i] = EA;
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end now
        if (RA.isvec) i++;
        if (RT.isvec) j++;
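As a cross-check, the address arithmetic of the four cases above can be
modelled in a few lines of Python (`effective_addrs` is a hypothetical
helper using the same mode names as the pseudocode; purely illustrative):

```python
# Hypothetical model of the EA computation in op_load above.
# base_vec models a Vectorised RA (one base address per element).
def effective_addrs(base, immed, op_width, vl, mode, base_vec=None):
    eas = []
    for i in range(vl):
        if mode == "indexed":          # RA.isvec: per-element base
            eas.append(base_vec[i] + immed)
        elif mode == "elementstride":  # regular gaps of `immed` bytes
            eas.append(base + i * immed)
        elif mode == "unitstride":     # contiguous, op_width apart
            eas.append(base + immed + i * op_width)
        else:                          # scalar base, scalar offset: VSPLAT
            eas.append(base + immed)
    return eas

# lw (op_width=4), unit stride: four contiguous words
print(effective_addrs(0x1000, 0, 4, 4, "unitstride"))
# → [4096, 4100, 4104, 4108]
```

Note how an element stride equal to `op_width` degenerates to unit stride,
which is why the immediate can stand in for the element width.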
    function op_ldx(RT, RA, RB, update=False) # LD not VLD!
      rdv = map_dest_extra(RT);
      rsv = map_src_extra(RA);
      rso = map_src_extra(RB);
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0; i < VL && j < VL && k < VL):
        # skip nonpredicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[rsv+i] + ireg[rso+k] # indexed address
        if update: ireg[rsv+i] = EA
        ireg[rdv+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
          break # both sources scalar: no more elements to compute
        if (RA.isvec) i++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;
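The predicate-skipping in the loops above can be illustrated with a small
Python sketch (`twin_pred_pairs` is a hypothetical helper; it pairs up the
active source and destination element indices the same way the inner
`while` loops do):

```python
# Hypothetical sketch of twin-predication index pairing: i skips
# zero bits of the src predicate, j skips zero bits of the dest one.
def twin_pred_pairs(ps, pd, vl):
    pairs, i, j = [], 0, 0
    while i < vl and j < vl:
        while i < vl and not (ps >> i) & 1:
            i += 1  # skip masked-out source elements
        while j < vl and not (pd >> j) & 1:
            j += 1  # skip masked-out destination elements
        if i < vl and j < vl:
            pairs.append((i, j))
            i, j = i + 1, j + 1
    return pairs

print(twin_pred_pairs(0b1010, 0b0110, 4))  # → [(1, 1), (3, 2)]
```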
# LOAD/STORE Elwidths <a name="ldst"></a>
Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld); only `extsb` and
others like it provide an explicit operation width elsewhere. In order
to fit the different types of LD/ST Modes into SV, the src elwidth field
is used to select the Mode, and the actual src elwidth is implicitly the
same as the operation width. We then still apply Twin Predication, but
using:
* operation width (lb=8, lh=16, lw=32, ld=64) as src elwidth
* destination element width override
Saturation (and other transformations) occurs on the value loaded from
memory as if it were of "infinite bitwidth": sign-extended (if Saturation
requests signed) from the source width (lb, lh, lw, ld), followed by
the actual Saturation to the destination width.
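A short Python sketch of this rule (the helper name is illustrative, not
from the spec): sign-extend from the operation width as if infinitely
wide, then clamp to the signed destination range:

```python
# Hypothetical model: signed saturation of a loaded value.
def sat_load(raw, op_width_bits, dest_bits):
    # 1. sign-extend from the operation width ("infinite bitwidth")
    if raw & (1 << (op_width_bits - 1)):
        raw -= 1 << op_width_bits
    # 2. saturate to the signed destination range
    lo = -(1 << (dest_bits - 1))
    hi = (1 << (dest_bits - 1)) - 1
    return max(lo, min(hi, raw))

# lh (16-bit) value 0x8000 is -32768; an 8-bit dest clamps to -128
print(sat_load(0x8000, 16, 8))  # → -128
```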
In order to respect OpenPOWER v3.0B Scalar behaviour, the memory side
is treated as effectively completely separate and distinct from SV
augmentation. This is primarily due to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.
Note the following regarding the pseudocode to follow:
* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  destination elwidth overrides
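The `brev` handling can be stated compactly: byte-reversal occurs when
`brev` and `MSR.LE` agree, i.e. an XNOR, merging `ld`/`ldbrx` behaviour
with the processor endianness (sketched below in Python; the function
name is illustrative):

```python
# Sketch: byte-reversal happens when brev and MSR.LE agree (XNOR),
# because the regfile byte order is LE-defined.
def bytereverse(brev, msr_le):
    return not (brev ^ msr_le)

# ld (brev=False) on a little-endian core: no swap needed
print(bytereverse(False, True))   # → False
# ldbrx (brev=True) on a little-endian core: swap
print(bytereverse(True, True))    # → True
```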
Below is the pseudocode for Unit-Strided LD (which includes Vector
capability). Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:
    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):
        if RA.isvec:
          # strange vector mode, compute 64 bit address which is
          # not polymorphic! elwidth hardcoded to 64 here
          srcbase = get_polymorphed_reg(RA, 64, i)
        else:
          # unit stride mode, compute the address
          srcbase = ireg[RA] + i * op_width;

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
          memread = byteswap(memread, op_width)

        # now truncate/extend to over-ridden width.
        if not svctx.saturation_mode:
          memread = adjust_wid(memread, op_width, svctx.dest_elwidth)
        else:
          ... saturation adjustment...

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
When RA is marked as Vectorised the mode switches to an anomalous
version similar to Indexed. The element indices increment to select a
64-bit base address, effectively as if the src elwidth were hard-set to
"default". The important thing to note is that `i*op_width` is *not*
added on to the base address unless RA is marked as a scalar address.
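This distinction can be sketched in Python (a hypothetical helper
mirroring the two branches of the pseudocode above):

```python
# Hypothetical sketch: with a Vectorised RA each element i selects a
# full 64-bit base address and i*op_width is NOT added; with a
# scalar RA the addresses step by op_width (unit stride).
def ld_addrs(ra, ra_is_vec, op_width, imm_offs, vl):
    if ra_is_vec:
        return [ra[i] + imm_offs for i in range(vl)]      # ra: list of bases
    return [ra + imm_offs + i * op_width for i in range(vl)]

print(ld_addrs([0x1000, 0x9000, 0x2000], True, 8, 0, 3))
# → [4096, 36864, 8192]  (bases used as-is, arbitrary order allowed)
print(ld_addrs(0x1000, False, 8, 0, 3))
# → [4096, 4104, 4112]   (unit stride of op_width)
```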
In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements' worth of LDs or STs. The usual interest in such re-mapping is,
for example, in separating out 24-bit RGB channel data into separate
contiguous registers. Remap easily covers this capability, and with
elwidth overrides and saturation may do so with built-in conversion that
would normally require sign-extension and min/max Vectorised instructions.
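The RGB example can be illustrated with a plain-Python index remap
(this models only the 1D index transform, not the SV Remap opcodes
themselves):

```python
# Illustrative 1D remap: gather interleaved R,G,B bytes into three
# contiguous channel-major groups, as the text describes.
def remap_rgb(data, npixels):
    # destination slot d takes its value from pixel d % npixels,
    # channel d // npixels of the interleaved source
    return [data[(d % npixels) * 3 + d // npixels]
            for d in range(npixels * 3)]

rgb = [1, 2, 3, 11, 12, 13, 21, 22, 23]  # (R,G,B) per pixel
print(remap_rgb(rgb, 3))
# → [1, 11, 21, 2, 12, 22, 3, 13, 23]
```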
Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes,
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that capability.