[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>

Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different modes:

* fixed stride (contiguous sequence with no gaps) aka "unit" stride
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* fail-first on the same (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride.  With there being no way to tell which from the
OpenPOWER v3.0B Scalar opcode alone, the choice is provided instead by
the SV Context.

    # LD not VLD! format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip non-predicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = i * op_width
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
        # skip non-predicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
            break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source.  This is because there is room in svp64 to extend
RA-as-src as well as RA-as-dest, both independently as scalar or vector
*and* independently extending their range.

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk.  Fail-first on Vector Indexed
allows attackers to probe large numbers of pages from userspace, where
strided fail-first (by creating contiguous sequential LDs) does not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
Vector source RA makes no sense either (or, is a quirk).  Realistically
we need an alternative table meaning for [[sv/svp64]] mode.  The
following modes make sense:

* saturation
* predicate-result (mostly for cache-inhibited LD/ST)
* normal
* fail-first, where a vector source on RA or RB is banned

The table for [[sv/svp64]] for `immed(RA)` is:

| 0-1 |  2  |  3   4  | description               |
| --- | --- |---------|----------------------------|
| 00  | els | sz  dz  | normal mode                |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel        |
| 01  | inv | els RC1 | Rc=0: ffirst z/nonz        |
| 10  |  N  | sz  els | sat mode: N=0/1 u/s        |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel   |
| 11  | inv | els RC1 | Rc=0: pred-result z/nonz   |

The `els` bit is only relevant when `RA.isvec` is clear: this indicates
whether stride is unit or element:

    if RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

An immediate of zero is a safety-valve to allow `LD-VSPLAT`: in effect
the multiplication of the immediate-offset by zero results in reading
from the exact same memory location.

For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
just the once and be copied, rather than hitting the Data Cache
multiple times with the same memory read at the same location.
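
As an illustration only (not part of the specification), a minimal
sketch of the intended `LD-VSPLAT` semantics, reusing the `ireg`, `MEM`
and `VL` names from the pseudocode above.  Because the address never
advances, a conforming implementation may legally perform the memory
read once and copy the result:

    # sketch only: LD-VSPLAT (immed == 0, scalar RA, vector RT)
    EA = ireg[RA] + 0           # immediate-offset multiplied by zero
    val = MEM[EA]               # one read suffices (non-cache-inhibited)
    for j in range(VL):
        ireg[RT + j] = val      # splat the single value across the dest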

For ST from a vector source onto a scalar destination: with the Vector
loop effectively creating multiple memory writes to the same location,
we can deduce that the last of these will be the "successful" one.  Thus,
implementations are free and clear to optimise out the overwriting STs,
leaving just the last one as the "winner".  Bear in mind that predicate
masks will skip some elements (in source non-zeroing mode).
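
A sketch (illustrative only, reusing the predicate `ps` from the earlier
pseudocode and a hypothetical vector store source `RS`) of why the
optimisation is legal: every non-masked element writes to the identical
EA, so only the final surviving write is architecturally observable:

    # sketch only: vector source ST to a scalar (non-advancing) address
    EA = ireg[RA]                   # scalar RA: same address for every element
    last = None
    for i in range(VL):
        if not (ps & (1 << i)):     # predicate mask skips some elements
            continue
        last = ireg[RS + i]         # value that element i would store
    if last is not None:
        MEM[EA] = last              # only the last store is observable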

Note that there are no immediate versions of cache-inhibited LD/ST.

The modes for `RA+RB` indexed version are slightly different:

| 0-1 |  2  |  3   4  | description               |
| --- | --- |---------|----------------------------|
| 00  |  0  | sz  dz  | normal mode                |
| 00  | rsv |  rsvd   | reserved                   |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel        |
| 01  | inv | sz  RC1 | Rc=0: ffirst z/nonz        |
| 10  |  N  | sz  dz  | sat mode: N=0/1 u/s        |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel   |
| 11  | inv | sz  RC1 | Rc=0: pred-result z/nonz   |

A summary of the effect of Vectorisation of src or dest:

    imm(RA)  RT.v  RA.v      no stride allowed
    imm(RA)  RT.s  RA.v      no stride allowed
    imm(RA)  RT.v  RA.s      stride-select allowed
    imm(RA)  RT.s  RA.s      not vectorised
    RA,RB    RT.v  RA/RB.v   ffirst banned
    RA,RB    RT.s  RA/RB.v   ffirst banned
    RA,RB    RT.v  RA/RB.s   VSPLAT possible
    RA,RB    RT.s  RA/RB.s   not vectorised
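
The table above can also be read as a simple legality/behaviour lookup.
The following is purely illustrative (the name `ldst_summary` is
invented here, not part of svp64) of how an assembler or simulator
might encode it:

    # sketch only: behaviour keyed by (form, dest-is-vector, src-is-vector)
    ldst_summary = {
        ("imm(RA)", True,  True ): "no stride allowed",
        ("imm(RA)", False, True ): "no stride allowed",
        ("imm(RA)", True,  False): "stride-select allowed",
        ("imm(RA)", False, False): "not vectorised",
        ("RA,RB",   True,  True ): "ffirst banned",
        ("RA,RB",   False, True ): "ffirst banned",
        ("RA,RB",   True,  False): "VSPLAT possible",
        ("RA,RB",   False, False): "not vectorised",
    }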

Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will
perform **multiple** LD/ST operations, sequentially.  `ldcix` even with
scalar src will read the same memory location *multiple times*, storing
the result in successive Vector destination registers.  This is because
the cache-inhibit instructions are used to read and write memory-mapped
peripherals.  If a genuine cache-inhibited LD-VSPLAT is required then a
*scalar* cache-inhibited LD should be performed, followed by a
VSPLAT-augmented mv, as sketched below.
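
Illustrative sketch only (reusing the pseudocode naming; `MEM_ci`,
standing in for a cache-inhibited access, is an assumption made here)
of that two-step sequence, giving exactly one peripheral access:

    # sketch only: genuine cache-inhibited LD-VSPLAT
    val = MEM_ci[ireg[RA]]          # single scalar cache-inhibited read
    for j in range(VL):
        ireg[RT + j] = val          # VSPLAT-style mv into the vector dest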

# LOAD/STORE Elwidths <a name="ldst"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld).  Only `extsb` and
others like it provide an explicit operation width.  There are therefore
*three* widths involved:

* operation width (lb=8, lh=16, lw=32, ld=64)
* src element width override
* destination element width override

Some care is therefore needed to express and make clear the
transformations, which are expressly in this order (a worked sketch
follows the list):

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
    - zero-extension or truncation from operation width to source elwidth
    - zero/truncation to dest elwidth
* Saturated mode:
    - Sign-extension or truncation from operation width to source width
    - signed/unsigned saturation down to dest elwidth

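For the non-saturated case the two width adjustments amount to nothing
more than truncation by masking (widening is a no-op, the value being
treated as unsigned).  A minimal sketch, with widths given in bits and
`adjust_wid` matching the helper name used in the pseudocode further
below (the surrounding names are illustrative only):

    # sketch only: non-saturated two-stage width adjustment
    def adjust_wid(val, from_wid, to_wid):
        # zero-extension of an unsigned value is a no-op; narrowing truncates
        return val & ((1 << to_wid) - 1) if to_wid < from_wid else val

    memread = adjust_wid(memread, op_width, src_elwidth)       # op width -> src elwidth
    result  = adjust_wid(memread, src_elwidth, dest_elwidth)   # src elwidth -> dest elwidth
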
In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation.  This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector
capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;

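The helpers `get_polymorphed_reg` and `set_polymorphed_reg` are not
defined on this page.  The following is an illustrative sketch only,
assuming a byte-addressable, LE-ordered view of the register file (the
name `regfile_bytes` is invented here): element `j` at the overridden
element width is read from, or inserted into, the underlying bytes,
spilling naturally into subsequent registers when elements are narrow:

    # sketch only: polymorphic element access into an LE-ordered regfile view
    def get_polymorphed_reg(reg, elwidth, j):
        bytes_per_el = elwidth // 8
        offs = reg * 8 + j * bytes_per_el         # byte offset of element j
        val = 0
        for b in range(bytes_per_el):             # assemble LE bytes
            val |= regfile_bytes[offs + b] << (8 * b)
        return val

    def set_polymorphed_reg(reg, elwidth, j, val):
        bytes_per_el = elwidth // 8
        offs = reg * 8 + j * bytes_per_el
        for b in range(bytes_per_el):             # scatter LE bytes back
            regfile_bytes[offs + b] = (val >> (8 * b)) & 0xff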

# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs.  The usual interest in such re-mapping
is for example in separating out 24-bit RGB channel data into separate
contiguous registers.  NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.

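As an illustration only of the kind of re-ordering involved (generic
structure-packing index arithmetic, not the actual [[sv/remap]]
encoding): a remap over a packed RGB byte stream lands each channel in
its own contiguous run of destination elements:

    # sketch only: structure-packed LD of N RGB triplets, one channel per run
    N = 4                                  # number of pixels (VL = 3*N)
    for i in range(3 * N):
        channel, pixel = i // N, i % N     # destination grouped by channel
        EA = ireg[RA] + pixel * 3 + channel
        ireg[RT + i] = MEM[EA]             # registers end up RRRR GGGG BBBB
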
# notes from lxo

this section covers assembly notation for the immediate and indexed LD/ST.
the summary is that, in immediate mode for LD, if the destination register
is Vectorised (`RT.v`) but the source `imm(RA)` is scalar, it is not
immediately clear that the memory being read is *still a vector load*,
known as "unit or element strides".

This anomaly is made clear with the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being
implicitly identical to the above, is prohibited and is a syntax error:

    sv.ld RT.v, imm(RA)

Notes taken from IRC conversation

    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other memory-operand constraints), append .v to it and you're done addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the address is a sum of registers)

permutations of vector selection, to identify above asm-syntax:

    imm(RA)  RT.v  RA.v  nonstrided
        sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
        mem@      0+r#2   offs+(r#2+1)   offs+(r#2+2)
        destreg   r#      r#+1           r#+2
    imm(RA)  RT.s  RA.v  nonstrided
        sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
        (dest r# is scalar) -> VSELECT mode
    imm(RA)  RT.v  RA.s  fixed stride: unit or element
        sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
        mem@r#2   +0      +1      +2
        destreg   r#      r#+1    r#+2
        sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
        mem@r#2   +0      ...     +offs   ...    +offs*2
        destreg   r#              r#+1           r#+2
    imm(RA)  RT.s  RA.s  not vectorised
        sv.ld r#, ofst(r#2)

indexed mode:

    RA,RB  RT.v  RA.v  RB.v
        sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
    RA,RB  RT.v  RA.s  RB.v
        sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
    RA,RB  RT.v  RA.v  RB.s
        sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
    RA,RB  RT.v  RA.s  RB.s
        sv.ldx r#.v, r#2, r#3 -> VSPLAT mode
    RA,RB  RT.s  RA.v  RB.v
    RA,RB  RT.s  RA.s  RB.v
    RA,RB  RT.s  RA.v  RB.s
    RA,RB  RT.s  RA.s  RB.s  not vectorised