* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>

Vectorisation of Load and Store requires the creation, from the scalar
operations, of a number of different modes:

* fixed stride (contiguous sequence with no gaps)
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* fail-first on the same (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).

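The difference between the three main addressing modes can be sketched
in a few lines of Python. This is purely illustrative (the function
names are invented here, not part of the spec): each mode is just a
different formula for the per-element effective address.

```python
# Illustrative sketch (not normative): per-element effective-address
# generation for the three main SV LD/ST addressing modes.

def ea_unit_stride(ra, immed, op_width, i):
    # fixed (unit) stride: contiguous sequence with no gaps
    return ra + immed + i * op_width

def ea_element_stride(ra, immed, i):
    # element stride: sequential but regularly offset, with gaps
    return ra + i * immed

def ea_vector_indexed(ra_vec, rb_vec, i):
    # vector indexed: vector of base addresses plus vector of offsets
    return ra_vec[i] + rb_vec[i]

# four 4-byte (lw-sized) unit-stride loads starting at 0x1000
assert [ea_unit_stride(0x1000, 0, 4, i) for i in range(4)] == \
       [0x1000, 0x1004, 0x1008, 0x100C]
# same base, element stride of 16 bytes between elements
assert [ea_element_stride(0x1000, 16, i) for i in range(4)] == \
       [0x1000, 0x1010, 0x1020, 0x1030]
```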
OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- (RA|0) + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride. As there is no way to tell which from the OpenPOWER
v3.0B Scalar opcode alone, the choice is provided instead by the SV
Context.

    # LD not VLD! format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip nonpredicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for a ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed
        else:
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed
        # compute EA, optionally writing it back for "update" mode
        EA = srcbase + offs
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory into the destination element
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

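The "skip nonpredicated elements" while-loops above are worth isolating:
each element index simply advances past every zero bit in its predicate
mask before the operation proceeds. A minimal Python sketch (the
function name is invented for illustration):

```python
# Illustrative sketch (not normative) of the predicate-skip while-loop
# used in the LD/ST pseudocode: advance the element index until its
# predicate bit is set.

def next_active(pred, idx):
    # pred is a bitmask; bit i set means element i is active
    while not (pred & (1 << idx)):
        idx += 1
    return idx

# predicate 0b1010: only elements 1 and 3 are active
assert next_active(0b1010, 0) == 1
assert next_active(0b1010, 2) == 3
```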
    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
        # skip nonpredicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
            break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

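When RA and RB are both Vectors the loop above amounts to a classic
gather: one load per (RA[i], RB[k]) address pair. A minimal sketch,
assuming unpredicated operation and a dictionary standing in for
byte-addressed memory (both assumptions are illustrative only):

```python
# Illustrative sketch (not normative): sv.ldx with vector RA and
# vector RB gathers one element per base+offset pair, exactly as the
# EA = ireg[RA+i] + ireg[RB+k] line in the pseudocode computes.

def op_ldx_gather(mem, ra_vec, rb_vec):
    # mem is a stand-in for byte-addressed memory
    return [mem[a + b] for a, b in zip(ra_vec, rb_vec)]

mem = {0x100: 7, 0x208: 9, 0x310: 11}
assert op_ldx_gather(mem,
                     [0x100, 0x200, 0x300],
                     [0x00,  0x08,  0x10]) == [7, 9, 11]
```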
Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source. This is because there is room in svp64 to extend
RA-as-src as well as RA-as-dest, both independently as scalar or vector
*and* independently extending their range.

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk. Fail-first on Vector Indexed
allows attackers to probe large numbers of pages from userspace, where
strided fail-first (by creating contiguous sequential LDs) does not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either (or, is a quirk). Realistically
we need an alternative table meaning for [[sv/svp64]] mode. The
following modes make sense:

* normal
* predicate-result (mostly for cache-inhibited LD/ST)
* saturation
* fail-first, where a vector source on RA or RB is banned

The table for [[sv/svp64]] for `immed(RA)` is:

| 0-1 | 2   | 3 4     | description               |
| --- | --- | ------- | ------------------------- |
| 00  | str | sz dz   | normal mode               |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | els RC1 | Rc=0: ffirst z/nonz       |
| 10  | N   | sz els  | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | els RC1 | Rc=0: pred-result z/nonz  |

The `els` bit is only relevant when `RA.isvec` is clear: this indicates
whether stride is unit or element:

    if RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

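The mode-selection priority above can be expressed directly as a small
Python function (a sketch only, mirroring the pseudocode rather than
defining it): a Vector RA forces indexed mode regardless of `els`, and
only then does the `els` bit choose between unit and element stride.

```python
# Illustrative sketch (not normative) of ldstmode selection:
# RA.isvec takes priority, then els, then a non-zero immediate.

def select_ldstmode(ra_isvec, els, immediate):
    if ra_isvec:
        return "indexed"
    elif els == 0:
        return "unitstride"
    elif immediate != 0:
        return "elementstride"

assert select_ldstmode(True,  1, 16) == "indexed"
assert select_ldstmode(False, 0, 16) == "unitstride"
assert select_ldstmode(False, 1, 16) == "elementstride"
```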
The modes for the `RA+RB` indexed version are slightly different:

| 0-1 | 2   | 3 4     | description               |
| --- | --- | ------- | ------------------------- |
| 00  | 0   | sz dz   | normal mode               |
| 00  | rsv | rsvd    | reserved                  |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | sz RC1  | Rc=0: ffirst z/nonz       |
| 10  | N   | sz dz   | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | sz RC1  | Rc=0: pred-result z/nonz  |

A summary of the effect of Vectorisation of src or dest:

    imm(RA)  RT.v  RA.v     no stride allowed
    imm(RA)  RT.s  RA.v     no stride allowed
    imm(RA)  RT.v  RA.s     stride-select needed
    imm(RA)  RT.s  RA.s     not vectorised
    RA,RB    RT.v  RA/RB.v  ffirst banned
    RA,RB    RT.s  RA/RB.v  ffirst banned
    RA,RB    RT.v  RA/RB.s  VSPLAT possible
    RA,RB    RT.s  RA/RB.s  not vectorised

Note that when VSPLAT is activated, cache-inhibited LD/ST (`ldcix`)
will perform **multiple** LD/ST operations, sequentially. Even with a
scalar src, `ldcix` will read the same memory location *multiple times*,
storing the result in successive Vector destination registers. This is
because the cache-inhibit instructions are used to read and write
memory-mapped peripherals. If a genuine VSPLAT is required then a
scalar cache-inhibited LD should be performed, followed by a
VSPLAT-augmented mv.

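Why the repeated reads matter can be shown with a small sketch: a
memory-mapped peripheral such as a FIFO may return a *different* value
on every read of the same address, so VL sequential cache-inhibited
reads drain VL entries (the `PeripheralFIFO` class is a hypothetical
stand-in, invented for illustration):

```python
# Illustrative sketch (not normative): a cache-inhibited LD with a
# scalar address and Vector destination performs VL real reads of the
# SAME location, which is exactly what a peripheral FIFO wants.

class PeripheralFIFO:
    def __init__(self, values):
        self.values = list(values)
    def read(self):
        # each read of the (single) register drains one FIFO entry
        return self.values.pop(0)

def ldcix_vsplat(fifo, vl):
    # one real memory access per destination element
    return [fifo.read() for _ in range(vl)]

fifo = PeripheralFIFO([10, 20, 30, 40])
assert ldcix_vsplat(fifo, 4) == [10, 20, 30, 40]
```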
# LOAD/STORE Elwidths <a name="ldst"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
others like it provide an explicit operation width. There are therefore
*three* widths involved:

* operation width (lb=8, lh=16, lw=32, ld=64)
* src element width override
* destination element width override

Some care is therefore needed to express and make clear the transformations,
which are expressly in this order:

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
    - zero-extension or truncation from operation width to source elwidth
    - zero/truncation to dest elwidth
* Saturated mode:
    - Sign-extension or truncation from operation width to source width
    - signed/unsigned saturation down to dest elwidth

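The non-saturated chain can be sketched in Python to make the ordering
concrete (the function names are invented here; zero-extension of an
unsigned value is a no-op, so each step reduces to a mask):

```python
# Illustrative sketch (not normative) of the non-saturated LD width
# chain: load at operation width, adjust to source elwidth, then
# adjust again to destination elwidth.

def adjust_wid(value, from_bits, to_bits):
    # zero-extend (no-op on unsigned values) or truncate (mask)
    return value & ((1 << to_bits) - 1)

def ld_width_chain(memval, op_width, src_ew, dest_ew):
    v = adjust_wid(memval, op_width, src_ew)
    return adjust_wid(v, src_ew, dest_ew)

# lw (32-bit) load, source elwidth 16, destination elwidth 8:
assert ld_width_chain(0x12345678, 32, 16, 8) == 0x78
```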
In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation. This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant
  (`ldbrx` versus `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector
capability). Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase = srcbase + ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;

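The `brev XNOR MSR.LE` merge is compact enough to deserve a worked
sketch: the single XNOR collapses the four ld/ldbrx-times-LE/BE cases
into one "perform byteswap" flag. A minimal Python illustration (the
function names are invented here):

```python
# Illustrative sketch (not normative): merging the ld/ldbrx choice
# with the processor endian mode into one byteswap decision.

def byteswap_needed(brev, msr_le):
    # XNOR: true when brev and MSR.LE agree
    return not (brev ^ msr_le)

def byteswap(value, op_width_bytes):
    # reverse the byte order of an op_width-sized value
    return int.from_bytes(value.to_bytes(op_width_bytes, "big"),
                          "little")

# ldbrx (brev=True) on a little-endian machine byteswaps...
assert byteswap_needed(True, True) is True
# ...while plain ld (brev=False) on the same machine does not
assert byteswap_needed(False, True) is False
assert byteswap(0x12345678, 4) == 0x78563412
```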
In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs. The usual interest in such re-mapping
is for example in separating out 24-bit RGB channel data into separate
contiguous registers. NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.

    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other memory-operand constraints), append .v to it and you're done addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the address is a sum of registers

Permutations of vector selection, to identify the asm syntax above:

    imm(RA)  RT.v  RA.v  nonstrided
        sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
    imm(RA)  RT.s  RA.v  nonstrided
        sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
    imm(RA)  RT.v  RA.s  fixed stride: unit or element
        sv.ld r#.v, ofst(r#2).v -> the whole vector is at ofst+r#2
        sv.ld/els r#.v, ofst(r#2).v -> the vector is at ofst*elidx+r#2
            mem 0 ... offs ... offs*2
    imm(RA)  RT.s  RA.s  not vectorised

    sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
    sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
    sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
    sv.ldx r#.v, r#2, r#3 -> VSPLAT mode

    RA,RB  RT.s  RA.s  RB.s  not vectorised