[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
* [[simple_v_extension/specification/ld.x]]

Vectorisation of Load and Store requires the creation, from the scalar
operations, of a number of different modes:

* fixed stride (contiguous sequence with no gaps) aka "unit" stride
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* fail-first on the same (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride.  With there being no way to tell which from the
OpenPOWER v3.0B Scalar opcode alone, the choice is provided instead by
the SV Context.

    # LD not VLD! format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, RC, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip non-predicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == shifted: # for FFT/DCT
          # FFT/DCT shifted mode
          if (RA.isvec)
            srcbase = ireg[RA+i]
          else
            srcbase = ireg[RA]
          offs = (i * immed) << RC
        elif svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

    # reverses the bitorder up to "width" bits
    def bitrev(val, VL):
        width = log2(VL)
        result = 0
        for _ in range(width):
            result = (result << 1) | (val & 1)
            val >>= 1
        return result

Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
        # skip non-predicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
            break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source.  This is because there is room in svp64 to extend
RA-as-src as well as RA-as-dest, both independently as scalar or vector
*and* independently extending their range.

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk.  Fail-first on Vector Indexed
allows attackers to probe large numbers of pages from userspace, where
strided fail-first (by creating contiguous sequential LDs) does not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
Vector source RA makes no sense either (or, is a quirk).  Realistically
we need an alternative table meaning for [[sv/svp64]] mode.  The
following modes make sense:

* saturation
* predicate-result (mostly for cache-inhibited LD/ST)
* normal
* fail-first, where a vector source on RA or RB is banned
* Signed Effective Address computation (Vector Indexed only)

Also, given that FFT, DCT and other related algorithms are of such high
importance in so many areas of Computer Science, a special "shift" mode
has been added which allows part of the immediate to be used instead
as RC, a register which shifts the immediate `DS << GPR(RC)`.

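As a rough worked example of the shifted-mode offset calculation used in
the `op_load` pseudocode above (`offs = (i * immed) << RC`), with the
immediate and shift values chosen purely for illustration:

    # shifted mode: the immediate is scaled per element, then shifted by GPR(RC)
    immed, shift = 8, 2          # illustrative values only
    for i in range(4):           # VL = 4
        offs = (i * immed) << shift
        # i = 0,1,2,3  ->  offs = 0, 32, 64, 96
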
The table for [[sv/svp64]] for `immed(RA)` is:

| 0-1 |  2  |  3 4    | description                |
| --- | --- | ------- | -------------------------- |
| 00  | 0   | dz els  | normal mode                |
| 00  | 1   | dz shf  | shift mode                 |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel        |
| 01  | inv | els RC1 | Rc=0: ffirst z/nonz        |
| 10  | N   | dz els  | sat mode: N=0/1 u/s        |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel   |
| 11  | inv | els RC1 | Rc=0: pred-result z/nonz   |

The `els` bit is only relevant when `RA.isvec` is clear: this indicates
whether stride is unit or element:

    if bitreversed:
        svctx.ldstmode = bitreversed
    elif RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
in effect the multiplication of the immediate-offset by zero results
in reading from the exact same memory location.

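A minimal sketch of the effect, using an illustrative python model of
memory and the integer regfile: with the immediate at zero every element
computes the same EA, so one memory value is splatted across the vector
destination.

    # LD-VSPLAT sketch: immed = 0 means EA = srcbase + i*0 = srcbase for all i
    mem = {0x2000: 0x99}               # illustrative memory model
    ireg = [0] * 32
    RA, RT, VL, immed = 5, 8, 4, 0
    ireg[RA] = 0x2000
    for i in range(VL):
        EA = ireg[RA] + i * immed      # always 0x2000
        ireg[RT + i] = mem[EA]         # same value copied into RT .. RT+VL-1
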
For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
just the once and be copied, rather than hitting the Data Cache
multiple times with the same memory read at the same location.
Cache-inhibited LD-VSPLAT, on the other hand, performs the read
multiple times, which allows memory-mapped peripherals to have multiple
data values read in quick succession and stored in sequentially
numbered registers.

For non-cache-inhibited ST from a vector source onto a scalar
destination: with the Vector loop effectively creating multiple memory
writes to the same location, we can deduce that the last of these will
be the "successful" one.  Thus, implementations are free and clear to
optimise out the overwriting STs, leaving just the last one as the
"winner".  Bear in mind that predicate masks will skip some elements
(in source non-zeroing mode).  Cache-inhibited ST operations on the
other hand **MUST** write out a Vector source multiple successive times
to the exact same Scalar destination.

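A sketch of why the optimisation is legitimate in the non-cache-inhibited
case (illustrative python model, predication reduced to a plain bitmask):
only the last *active* element's value is observable at the scalar
destination afterwards.

    # vector-source ST to a scalar destination address: every active element
    # writes the same EA, so the last active element "wins"
    mem = {}
    ireg = [0, 11, 22, 33, 44]     # r1..r4 hold the vector source
    RS, EA, VL = 1, 0x3000, 4
    pred = 0b0111                  # element 3 skipped (source non-zeroing)
    for i in range(VL):
        if pred & (1 << i):
            mem[EA] = ireg[RS + i]
    # mem[EA] == 33: an implementation may legitimately skip the earlier STs
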
Note that there are no immediate versions of cache-inhibited LD/ST.

The modes for `RA+RB` indexed version are slightly different:

| 0-1 |  2  |  3 4    | description               |
| --- | --- | ------- | ------------------------- |
| 00  | SEA | dz sz   | Signed Effective Address  |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | dz RC1  | Rc=0: ffirst z/nonz       |
| 10  | N   | dz sz   | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | dz RC1  | Rc=0: pred-result z/nonz  |

A summary of the effect of Vectorisation of src or dest:

    imm(RA)  RT.v  RA.v  no stride allowed
    imm(RA)  RT.s  RA.v  no stride allowed
    imm(RA)  RT.v  RA.s  stride-select allowed
    imm(RA)  RT.s  RA.s  not vectorised
    RA,RB    RT.v  {RA|RB}.v  ffirst banned
    RA,RB    RT.s  {RA|RB}.v  ffirst banned
    RA,RB    RT.v  {RA&RB}.s  VSPLAT possible
    RA,RB    RT.s  {RA&RB}.s  not vectorised

Signed Effective Address computation is only relevant for Vector Indexed
Mode, when elwidth overrides are applied.  The source override applies
to RB: if SEA is set, RB is sign-extended from elwidth bits to the full
64 bits before being added to RA in order to calculate the Effective
Address.  For other Modes (ffirst, saturate), all EA computation is
unsigned.

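A short sketch of the RB adjustment, assuming an 8-bit source elwidth
override purely for illustration:

    # Signed Effective Address: sign-extend the RB element from elwidth to 64 bits
    def sext(val, bits):
        sign = 1 << (bits - 1)
        return (val & (sign - 1)) - (val & sign)

    elwidth = 8
    RA_base, RB_elem = 0x10000, 0xF0   # RB element is -16 in 8-bit two's complement
    mask64 = (1 << 64) - 1
    EA_unsigned = (RA_base + RB_elem) & mask64                 # SEA=0: 0x100F0
    EA_signed   = (RA_base + sext(RB_elem, elwidth)) & mask64  # SEA=1: 0x0FFF0
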
Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will
perform **multiple** LD/ST operations, sequentially.  `ldcix` even with
a scalar src will read the same memory location *multiple times*, storing
the result in successive Vector destination registers.  This is because
the cache-inhibit instructions are used to read and write memory-mapped
peripherals.  If a genuine cache-inhibited LD-VSPLAT is required then
a *scalar* cache-inhibited LD should be performed, followed by a
VSPLAT-augmented mv.

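A sketch contrasting the two behaviours (illustrative python model with
a read counter): cache-inhibited VSPLAT issues VL separate reads to the
peripheral, whereas the scalar-LD-then-splat sequence touches it only
once.

    reads = 0
    def periph_read(addr):             # illustrative memory-mapped peripheral
        global reads
        reads += 1
        return 0x5A
    VL, RT, ireg = 4, 8, [0] * 32
    # cache-inhibited LD with VSPLAT active: VL separate bus reads
    for i in range(VL):
        ireg[RT + i] = periph_read(0xC0000000)
    assert reads == VL
    # genuine LD-VSPLAT: one scalar cache-inhibited LD, then a splat mv
    value = periph_read(0xC0000000)    # a single bus read
    for i in range(VL):
        ireg[RT + i] = value
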
## LD/ST ffirst

ffirst LD/ST to multiple pages via a Vectorised Index base is considered
a security risk due to the abuse of probing multiple pages in rapid
succession and getting feedback on which pages would fail.  Therefore
in these special circumstances requesting ffirst on Indexed LD/ST is
instead interpreted as element-strided LD/ST.  See
<https://bugs.libre-soc.org/show_bug.cgi?id=561>

    for(i = 0; i < VL; i++)
        reg[rt + i] = mem[reg[ra] + i * reg[rb]];

High security implementations where any kind of speculative probing
of memory pages is considered a risk should take advantage of the fact
that implementations may truncate VL at any point, without requiring
software to be rewritten and made non-portable.  Such implementations
may choose to *always* set VL=1 which will have the effect of terminating
any speculative probing (and also adversely affect performance), but will
at least not require applications to be rewritten.

# LOAD/STORE Elwidths <a name="elwidth"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld).  Only `extsb` and
others like it provide an explicit operation width.  There are therefore
*three* widths involved:

* operation width (lb=8, lh=16, lw=32, ld=64)
* source element width override
* destination element width override

Some care is therefore needed to express and make clear the
transformations, which are expressly in this order:

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
  - zero-extension or truncation from operation width to source elwidth
  - zero-extension or truncation to dest elwidth
* Saturated mode:
  - Sign-extension or truncation from operation width to source width
  - signed/unsigned saturation down to dest elwidth

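A minimal sketch of the non-saturated ordering; the `adjust_wid` helper
name is borrowed from the pseudocode further below and the widths are
chosen purely for illustration:

    # non-saturated LD order: read at op_width, byte-reverse, then
    # zero-extend/truncate to source elwidth, then to destination elwidth
    def adjust_wid(val, from_bits, to_bits):
        return val & ((1 << min(from_bits, to_bits)) - 1)   # zero-ext/truncate

    op_width, src_elwidth, dest_elwidth = 16, 64, 8   # lh, default src, byte dest
    memread = 0x1234                                  # value after byte-reversal
    memread = adjust_wid(memread, op_width, src_elwidth)      # still 0x1234
    result  = adjust_wid(memread, src_elwidth, dest_elwidth)  # truncated to 0x34
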
In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation.  This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

It is unfortunately possible to request an elwidth override on the
memory side which does not mesh with the operation width: these result
in `UNDEFINED` behaviour.  The reason is that the effect of attempting
a 64-bit `sv.ld` operation with a source elwidth override of 8/16/32
would result in overlapping memory requests, particularly on unit and
element strided operations.  Thus it is `UNDEFINED` when the elwidth
is smaller than the memory operation width.  Examples include
`sv.lw/sw=16/els` which requests (overlapping) 4-byte memory reads
offset from each other at 2-byte intervals.  Store likewise is also
`UNDEFINED` where the dest elwidth override is less than the operation
width.

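A short worked illustration of the `sv.lw/sw=16/els` case: the
element-strided offsets advance by 2 bytes while each `lw` still reads
4 bytes, so consecutive reads overlap.

    # element stride at the overridden 2-byte width, 4-byte operation width
    op_width_bytes, stride_bytes = 4, 2
    for i in range(4):
        start = i * stride_bytes
        print(f"element {i}: bytes {start}..{start + op_width_bytes - 1}")
    # element 0: bytes 0..3, element 1: bytes 2..5, ... -> overlapping reads
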
Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant
  (`ldbrx` rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector
capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;

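The helpers `get_polymorphed_reg` and `set_polymorphed_reg` are defined
elsewhere in the SV specification; as a rough stand-in so the pseudocode
above can be followed, one might model the integer regfile as a flat
little-endian byte array (an assumption made purely for illustration):

    # rough stand-in for the element-width-aware regfile accessors
    regfile = bytearray(128 * 8)          # 128 x 64-bit entries, LE byte order

    def get_polymorphed_reg(reg, bitwidth, offs):
        nbytes = bitwidth // 8
        start = reg * 8 + offs * nbytes
        return int.from_bytes(regfile[start:start + nbytes], "little")

    def set_polymorphed_reg(reg, bitwidth, offs, val):
        nbytes = bitwidth // 8
        start = reg * 8 + offs * nbytes
        regfile[start:start + nbytes] = val.to_bytes(nbytes, "little")
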
# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs.  The usual interest in such re-mapping
is for example in separating out 24-bit RGB channel data into separate
contiguous registers.  NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest elwidth overrides
and saturation may do so with built-in conversion that would normally
require additional width-extension, sign-extension and min/max
Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed"
opcodes because the generic abstracted concept of "Remapping", when
applied to LD/ST, will give that same capability, with far more
flexibility.

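As a sketch of the idea (index arithmetic only, not the remap instruction
encoding itself): de-interleaving packed 8-bit RGB so each channel lands
in its own contiguous run of elements is simply a stride-3 re-ordering
of element indices.

    # structure packing as an index remap: element i of channel c comes from
    # packed element i*3 + c (4 pixels of 8-bit RGB, purely illustrative)
    packed = [f"{c}{i}" for i in range(4) for c in "RGB"]   # R0 G0 B0 R1 ...
    channels = {c: [packed[i * 3 + k] for i in range(4)]
                for k, c in enumerate("RGB")}
    # channels == {'R': ['R0'..'R3'], 'G': ['G0'..'G3'], 'B': ['B0'..'B3']}
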
# notes from lxo

This section covers assembly notation for the immediate and indexed
LD/ST.  The summary is that, in immediate mode for LD, it is not
immediately clear that when the destination register is Vectorised
(`RT.v`) but the source `imm(RA)` is scalar, the memory being read is
*still a vector load*, known as "unit or element strides".

This anomaly is made clear with the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being
implicitly identical to the above, is prohibited and is a syntax error:

    sv.ld RT.v, imm(RA)

Notes taken from IRC conversation

    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other memory-operand constraints), append .v to it and you're done addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the address is a sum of registers)

permutations of vector selection, to identify above asm-syntax:

    imm(RA)  RT.v  RA.v  nonstrided
        sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
        mem@      0+r#2   offs+(r#2+1)   offs+(r#2+2)
        destreg   r#      r#+1           r#+2
    imm(RA)  RT.s  RA.v  nonstrided
        sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
        (dest r# is scalar) -> VSELECT mode
    imm(RA)  RT.v  RA.s  fixed stride: unit or element
        sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
        mem@r#2   +0      +1      +2
        destreg   r#      r#+1    r#+2
        sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
        mem@r#2   +0      ...     +offs   ...   +offs*2
        destreg   r#      r#+1    r#+2
    imm(RA)  RT.s  RA.s  not vectorised
        sv.ld r#, ofst(r#2)

indexed mode:

    RA,RB    RT.v  RA.v  RB.v
        sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
    RA,RB    RT.v  RA.s  RB.v
        sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
    RA,RB    RT.v  RA.v  RB.s
        sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
    RA,RB    RT.v  RA.s  RB.s
        sv.ldx r#.v, r#2, r#3 -> VSPLAT mode
    RA,RB    RT.s  RA.v  RB.v
    RA,RB    RT.s  RA.s  RB.v
    RA,RB    RT.s  RA.v  RB.s
    RA,RB    RT.s  RA.s  RB.s  not vectorised