* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
* [[simple_v_extension/specification/ld.x]]

Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different modes:

* fixed stride (contiguous sequence with no gaps) aka "unit" stride
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* fail-first on each of the above (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

and for immediate variants:

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride. Because there is no way to tell which is intended from
the OpenPOWER v3.0B Scalar opcode alone, the choice is provided instead
by the SV Context.

    # LD not VLD! format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, RC, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip non-predicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == shifted: # for FFT/DCT
          # FFT/DCT shifted mode
          if (RA.isvec)
            srcbase = ireg[RA+i]
          else
            srcbase = ireg[RA]
          offs = (i * immed) << RC
        elif svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

    # reverses the bitorder up to "width" bits
    def bitrev(val, width):
      result = 0
      for _ in range(width):
        result = (result << 1) | (val & 1)
        val >>= 1
      return result
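
The address arithmetic of the immediate-form load can be modelled in a few lines of Python. This is only a sketch under stated assumptions, not the reference implementation: predication and RA-update are left out, `ea_list` and its parameter names are hypothetical, `regs` is a plain list standing in for the integer register file, and the choice of base register in shifted mode when RA is a vector is an assumption.

```python
def ea_list(regs, ra, ra_isvec, immed, op_width, mode, vl, rc=0):
    """Effective address of each element, ignoring predication.

    mode: "shifted", "elementstride", "unitstride" or None.
    None falls back to indexed (RA vector) or scalar/VSPLAT (RA scalar).
    """
    eas = []
    for i in range(vl):
        if mode == "shifted":
            # assumption: per-element base only if RA is a vector
            srcbase = regs[ra + i] if ra_isvec else regs[ra]
            offs = (i * immed) << rc
        elif mode == "elementstride":
            srcbase, offs = regs[ra], i * immed
        elif mode == "unitstride":
            srcbase, offs = regs[ra], immed + i * op_width
        elif ra_isvec:
            srcbase, offs = regs[ra + i], immed   # vector of bases
        else:
            srcbase, offs = regs[ra], immed       # same address every time
        eas.append(srcbase + offs)
    return eas


regs = [0] * 32
regs[5] = 0x1000
print(ea_list(regs, 5, False, 8, 4, "unitstride", 4))     # [4104, 4108, 4112, 4116]
print(ea_list(regs, 5, False, 8, 4, "elementstride", 4))  # [4096, 4104, 4112, 4120]
print(ea_list(regs, 5, False, 8, 4, None, 4))             # [4104, 4104, 4104, 4104]
```

With the same opcode and immediate, only the mode held in the SV Context changes which addresses are touched.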

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL;):
        # skip nonpredicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
            break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update" mode (`ldux`) to be effectively a *completely different* register from RA-as-a-source. This is because there is room in svp64 to extend RA-as-src as well as RA-as-dest, both independently as scalar or vector *and* independently extending their range.
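
To see how the source and destination predicates interact in the indexed form, the loop can be modelled in Python. This is a rough sketch rather than a definitive model: `op_ldx_model` is a hypothetical name, RA-update is omitted, `regs` is a list and `mem` a dict standing in for the register file and memory, and the skip loops are bounded by VL so that the scan always terminates.

```python
def op_ldx_model(regs, mem, rt, ra, rb, vl,
                 rt_vec=True, ra_vec=False, rb_vec=True,
                 ps=None, pd=None):
    """Indexed load loop; ps and pd are source/dest predicate bitmasks."""
    all_ones = (1 << vl) - 1
    ps = all_ones if ps is None else ps
    pd = all_ones if pd is None else pd
    i = j = k = 0
    while True:
        # skip masked-out elements
        while ra_vec and i < vl and not (ps >> i) & 1:
            i += 1
        while rb_vec and k < vl and not (ps >> k) & 1:
            k += 1
        while rt_vec and j < vl and not (pd >> j) & 1:
            j += 1
        if i >= vl or j >= vl or k >= vl:
            break
        ea = regs[ra + i] + regs[rb + k]   # indexed address
        regs[rt + j] = mem[ea]
        if not rt_vec:
            break                          # destination scalar
        if not ra_vec and not rb_vec:
            break                          # scalar-scalar
        i += 1 if ra_vec else 0
        k += 1 if rb_vec else 0
        j += 1
    return regs


regs = [0] * 32
regs[4] = 100                    # scalar base in RA
regs[8:12] = [0, 8, 16, 24]      # vector of offsets in RB
mem = {100: 11, 108: 22, 116: 33, 124: 44}
op_ldx_model(regs, mem, rt=16, ra=4, rb=8, vl=4, ps=0b1011)
print(regs[16:20])               # [11, 22, 44, 0]
```

Because the source mask skips offset element 2 while the destination mask is all ones, the three loaded values land in consecutive destination registers.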

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk. Fail-first on Vector Indexed
allows attackers to probe large numbers of pages from userspace, where
strided fail-first (by creating contiguous sequential LDs) does not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either (or is a quirk). Realistically we
need an alternative table meaning for [[sv/svp64]] mode. The following
modes make sense:

* saturation
* predicate-result (mostly for cache-inhibited LD/ST)
* normal
* fail-first, where a vector source on RA or RB is banned

Also, given that FFT, DCT and other related algorithms
are of such high importance in so many areas of Computer
Science, a special "shift" mode has been added which
allows part of the immediate to be used instead as RC, a register
whose contents shift the immediate: `DS << GPR(RC)`.

The table for [[sv/svp64]] for `immed(RA)` is:

| 0-1 |  2  |  3   4  | description               |
| --- | --- |---------|---------------------------|
| 00  | 0   |  dz els | normal mode               |
| 00  | 1   |  dz shf | shift mode                |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | els RC1 | Rc=0: ffirst z/nonz       |
| 10  | N   | dz els  | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | els RC1 | Rc=0: pred-result z/nonz  |

The `els` bit is only relevant when `RA.isvec` is clear: it indicates
whether the stride is unit or element:

    if shf:
        svctx.ldstmode = shifted
    elif RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
in effect, multiplying the element index by an immediate offset of zero
results in every element reading from the exact same memory location.
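
As a rough illustration, the mode selection including the zero-immediate safety-valve might be written as below. This is a sketch under assumptions: `ldst_mode` and the string mode names are hypothetical, `shf` and `els` stand for the corresponding bits of the `immed(RA)` table, and the priority order of the tests is an assumption consistent with `els` only mattering when RA is scalar.

```python
def ldst_mode(shf, ra_isvec, els, immediate):
    """Pick the LD/ST addressing mode for the immed(RA) form."""
    if shf:
        return "shifted"
    if ra_isvec:
        return "indexed"          # els is ignored when RA is a vector
    if els == 0:
        return "unitstride"
    if immediate != 0:
        return "elementstride"
    return "vsplat"               # els set, immediate zero: same address


print(ldst_mode(False, False, 0, 8))  # unitstride
print(ldst_mode(False, False, 1, 8))  # elementstride
print(ldst_mode(False, False, 1, 0))  # vsplat
```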

For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
just the once and be copied, rather than hitting the Data Cache
multiple times with the same memory read at the same location.
Cache-inhibited Loads, by contrast, are issued once per element, which
allows memory-mapped peripherals to have multiple data values read in
quick succession and stored in sequentially numbered registers.

For non-cache-inhibited ST from a vector source onto a scalar
destination: with the Vector loop effectively creating multiple memory
writes to the same location, we can deduce that the last of these will be
the "successful" one. Thus, implementations are free to optimise out the
overwritten STs, leaving just the last one as the "winner". Bear in mind
that predicate masks will skip some elements (in source non-zeroing mode).
Cache-inhibited ST operations on the other hand **MUST** write out
a Vector source multiple successive times to the exact same Scalar
destination.

Note that there are no immediate versions of cache-inhibited LD/ST.
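
The "last one wins" deduction for a non-cache-inhibited ST with a vector source and scalar destination can be sketched as follows. The function name `winning_store` is hypothetical and the predicate is modelled as a plain bitmask in source non-zeroing mode.

```python
def winning_store(values, pred_mask):
    """Value left in memory after the vector loop, or None if all masked."""
    winner = None
    for i, v in enumerate(values):
        if (pred_mask >> i) & 1:   # masked-out elements are skipped
            winner = v
    return winner


print(winning_store([10, 20, 30, 40], 0b1111))  # 40
print(winning_store([10, 20, 30, 40], 0b0111))  # 30
```

An implementation that issues only this single write is indistinguishable from one that issues all of them, which is exactly why the optimisation is not permitted for cache-inhibited ST.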

The modes for the `RA+RB` indexed version are slightly different:

| 0-1 |  2  |  3   4  | description               |
| --- | --- |---------|---------------------------|
| 00  | 0   |  dz  sz | normal mode               |
| 00  | 1   |  rsvd   | reserved                  |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | dz  RC1 | Rc=0: ffirst z/nonz       |
| 10  | N   | dz   sz | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | dz  RC1 | Rc=0: pred-result z/nonz  |

A summary of the effect of Vectorisation of src or dest:

    imm(RA)  RT.v   RA.v       no stride allowed
    imm(RA)  RT.s   RA.v       no stride allowed
    imm(RA)  RT.v   RA.s       stride-select allowed
    imm(RA)  RT.s   RA.s       not vectorised
    RA,RB    RT.v   RA/RB.v    ffirst banned
    RA,RB    RT.s   RA/RB.v    ffirst banned
    RA,RB    RT.v   RA/RB.s    VSPLAT possible
    RA,RB    RT.s   RA/RB.s    not vectorised

Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will perform **multiple** LD/ST operations, sequentially. `ldcix` even with scalar src will read the same memory location *multiple times*, storing the result in successive Vector destination registers. This is because the cache-inhibit instructions are used to read and write memory-mapped peripherals.
If a genuine cache-inhibited LD-VSPLAT is required then a *scalar*
cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv.

ffirst LD/ST to multiple pages via a Vectorised base is considered a security risk: it can be abused to probe multiple pages in rapid succession and obtain feedback on which pages would fail. Therefore, in these special circumstances, requesting ffirst with a vector base is instead interpreted as element-strided LD/ST. See <https://bugs.libre-soc.org/show_bug.cgi?id=561>

# LOAD/STORE Elwidths <a name="elwidth"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld). The only other
instructions to provide an explicit operation width are `extsb` and
others like it. There are therefore *three* widths involved:

* operation width, in bits (lb=8, lh=16, lw=32, ld=64)
* source element width override
* destination element width override

Some care is therefore needed to express and make clear the transformations,
which are expressly in this order:

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
   - zero-extension or truncation from operation width to source elwidth
   - zero-extension or truncation to dest elwidth
* Saturated mode:
   - Sign-extension or truncation from operation width to source elwidth
   - signed/unsigned saturation down to dest elwidth
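
The ordering of these width conversions can be made concrete with a small Python model. It is one possible reading of the steps above, not a normative definition: `ld_convert`, `mask` and `sign_extend` are hypothetical helper names, all widths are in bits, and the value passed in is assumed to have already been loaded and byte-reversed as usual.

```python
def mask(bits):
    return (1 << bits) - 1


def sign_extend(val, bits):
    """Interpret the low `bits` of val as a two's complement number."""
    val &= mask(bits)
    return val - (1 << bits) if val >> (bits - 1) else val


def ld_convert(memval, op_bits, src_bits, dest_bits,
               saturate=False, signed=False):
    if not saturate:
        val = memval & mask(op_bits)
        val &= mask(src_bits)        # zero-extend or truncate to source
        return val & mask(dest_bits) # zero-extend or truncate to dest
    # sign-extend (or truncate) from operation width to source elwidth
    val = sign_extend(memval, min(op_bits, src_bits))
    if signed:
        lo, hi = -(1 << (dest_bits - 1)), (1 << (dest_bits - 1)) - 1
    else:
        lo, hi = 0, mask(dest_bits)
    # clamp into the destination range, return as a bit pattern
    return max(lo, min(hi, val)) & mask(dest_bits)


print(hex(ld_convert(0xFFFF, 16, 32, 8)))                        # 0xff
print(hex(ld_convert(0x0123, 16, 32, 8, saturate=True)))         # 0xff
print(hex(ld_convert(0xFF00, 16, 32, 8, saturate=True, signed=True)))  # 0x80
```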

In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation. This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

It is unfortunately possible to request an elwidth override on the memory
side which does not mesh with the operation width: such requests result in
`UNDEFINED` behaviour. The reason is that attempting a 64-bit `sv.ld`
operation with a source elwidth override of 8/16/32 would result in
overlapping memory requests, particularly on unit and element strided
operations. Thus it is `UNDEFINED` when the elwidth is smaller than
the memory operation width. Examples include `sv.lw/sw=16/els` which
requests (overlapping) 4-byte memory reads offset from
each other at 2-byte intervals. Store likewise is also `UNDEFINED`
where the dest elwidth override is less than the operation width.
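
The overlap can be shown with a few lines of Python. This sketch simply lists the half-open byte range touched by each element, assuming (as in the `sv.lw/sw=16/els` example) that successive accesses are spaced by the element width; `byte_ranges` and `overlaps` are hypothetical names and a positive stride is assumed.

```python
def byte_ranges(base, op_bytes, stride_bytes, vl):
    """Half-open [start, end) byte range touched by each element."""
    return [(base + i * stride_bytes, base + i * stride_bytes + op_bytes)
            for i in range(vl)]


def overlaps(ranges):
    """True if any access runs into the start of the next one."""
    return any(a_end > b_start
               for (_, a_end), (b_start, _) in zip(ranges, ranges[1:]))


r = byte_ranges(0, op_bytes=4, stride_bytes=2, vl=3)
print(r)            # [(0, 4), (2, 6), (4, 8)]
print(overlaps(r))  # True
print(overlaps(byte_ranges(0, op_bytes=4, stride_bytes=4, vl=3)))  # False
```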

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ...

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
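
The `brev XNOR MSR.LE` step merges the processor endianness with the choice of `ld` or `ldbrx` into a single decision. A minimal Python sketch of just that step follows; `byteswap` here is a stand-in written for illustration, `load_value` is a hypothetical name, `op_width` is taken in bytes, and the raw value is assumed to have been read from memory in little-endian order.

```python
def byteswap(val, op_width):
    """Reverse the byte order of an op_width-byte unsigned value."""
    return int.from_bytes(val.to_bytes(op_width, "little"), "big")


def load_value(raw, brev, msr_le, op_width):
    # XNOR of two booleans is simply equality
    bytereverse = (brev == msr_le)
    return byteswap(raw, op_width) if bytereverse else raw


print(hex(load_value(0x11223344, brev=False, msr_le=True, op_width=4)))  # 0x11223344
print(hex(load_value(0x11223344, brev=True, msr_le=True, op_width=4)))   # 0x44332211
```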

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (two 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs. The usual interest in such re-mapping
is, for example, in separating out 24-bit RGB channel data into separate
contiguous registers. NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.
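
The RGB case amounts to a 2D index transformation applied to the element order. The following Python sketch illustrates only the idea of such a reordering; it does not model the actual REMAP setup instructions or hardware, and `deinterleave_indices` is a hypothetical name.

```python
def deinterleave_indices(n_pixels, n_channels=3):
    """Memory element index to read for each destination element,
    so that all of channel 0 comes first, then channel 1, and so on."""
    return [p * n_channels + c
            for c in range(n_channels)
            for p in range(n_pixels)]


packed = [10, 20, 30, 11, 21, 31, 12, 22, 32, 13, 23, 33]  # R G B R G B ...
idx = deinterleave_indices(4)
print(idx)                        # [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]
print([packed[i] for i in idx])   # [10, 11, 12, 13, 20, 21, 22, 23, 30, 31, 32, 33]
```

Swapping the two loops in the comprehension gives the identity ordering, which is the sense in which a single generic remapping step subsumes a family of dedicated structure-packing opcodes.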

This section covers assembly notation for the immediate and indexed LD/ST.
In summary: in immediate mode for LD, when the destination register is
Vectorised (`RT.v`) but the source `imm(RA)` is scalar, it is not obvious
that the memory being read is *still a vector load*, known as "unit or
element strides".

This anomaly is made clear with the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being implicitly identical to the above, is prohibited and is a syntax error:

    sv.ld RT.v, imm(RA)

Notes taken from IRC conversation

    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other memory-operand constraints), append .v to it and you're done addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the address is a sum of registers

Permutations of vector selection, to identify above asm-syntax:

    imm(RA)  RT.v   RA.v       nonstrided
        sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
          mem@     0+r#2   offs+(r#2+1)  offs+(r#2+2)
    imm(RA)  RT.s   RA.v       nonstrided
        sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
          (dest r# is scalar) -> VSELECT mode
    imm(RA)  RT.v   RA.s       fixed stride: unit or element
        sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
        sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
          mem@r#2  +0   ...   +offs   ...  +offs*2
    imm(RA)  RT.s   RA.s       not vectorised
        sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
        sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
        sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
        sv.ldx r#.v, r#2, r#3 -> VSPLAT mode
    RA,RB    RT.s   RA.s RB.s  not vectorised