[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
* [[simple_v_extension/specification/ld.x]]

Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different modes:

* fixed stride (contiguous sequence with no gaps) aka "unit" stride
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* fail-first on the same (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride.  Since there is no way to tell which of the two is
intended from the OpenPOWER v3.0B Scalar opcode alone, the choice is
provided instead by the SV Context.

    # LD not VLD! format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, RC, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip non-predicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == shifted: # for FFT/DCT
          # FFT/DCT shifted mode
          if (RA.isvec)
            srcbase = ireg[RA+i]
          else
            srcbase = ireg[RA]
          offs = (i * immed) << RC
        elif svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

    # reverses the bitorder up to "width" bits
    def bitrev(val, VL):
        width = log2(VL)
        result = 0
        for _ in range(width):
            result = (result << 1) | (val & 1)
            val >>= 1
        return result

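As an illustrative (non-normative) aid, `bitrev` simply mirrors the low
`log2(VL)` bits of an element index, the reordering used by FFT/DCT-style
access patterns.  A runnable equivalent (assuming a power-of-two VL):

    # sketch: runnable equivalent of the bitrev helper above
    def bitrev(val, VL):
        width = VL.bit_length() - 1     # log2(VL) for power-of-two VL
        result = 0
        for _ in range(width):
            result = (result << 1) | (val & 1)
            val >>= 1
        return result

    print([bitrev(i, 8) for i in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]
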
Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
        # skip non-predicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          EA = ireg[RA] + ireg[RB]*j   # register-strided
        else
          EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
          break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source.  This is because there is room in svp64 to extend
RA-as-src as well as RA-as-dest, both independently as scalar or vector
*and* independently extending their range.

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk.  Fail-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace,
whereas strided fail-first (which creates contiguous sequential LDs)
does not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either (or, at best, is a quirk).
Realistically an alternative table meaning is needed for [[sv/svp64]]
mode.  The following modes make sense:

* saturation
* predicate-result (mostly for cache-inhibited LD/ST)
* normal
* fail-first (where Vector Indexed is banned)
* Signed Effective Address computation (Vector Indexed only)

Also, given that FFT, DCT and other related algorithms are of such high
importance in so many areas of Computer Science, a special "shift" mode
has been added.  This mode re-purposes part of the immediate as RC, a
register by which the immediate is shifted: `DS << GPR(RC)`.

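As a non-normative sketch derived from the `op_load` pseudocode above
(assuming the shift amount is held in the GPR selected by RC), the
element offsets in shift mode scale as follows:

    # sketch: element offsets in "shifted" (FFT/DCT) mode, not normative
    immed = 8          # DS immediate from the instruction
    shift = 2          # assumed contents of GPR(RC)
    VL = 4
    for i in range(VL):
        offs = (i * immed) << shift
        print(i, offs)   # 0 -> 0, 1 -> 32, 2 -> 64, 3 -> 96
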
The table for [[sv/svp64]] for `immed(RA)` is:

| 0-1 |  2  |  3   4  | description                |
| --- | --- |---------|----------------------------|
| 00  | 0   | dz  els | normal mode                |
| 00  | 1   | dz  shf | shift mode                 |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel        |
| 01  | inv | els RC1 | Rc=0: ffirst z/nonz        |
| 10  | N   | dz  els | sat mode: N=0/1 u/s        |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel   |
| 11  | inv | els RC1 | Rc=0: pred-result z/nonz   |

The `els` bit is only relevant when `RA.isvec` is clear: it indicates
whether stride is unit or element:

    if shifted:               # "shf" set: FFT/DCT shift mode
        svctx.ldstmode = shifted
    elif RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

An immediate of zero is a safety-valve to allow `LD-VSPLAT`: in effect
the multiplication of the immediate-offset by zero results in reading
from the exact same memory location.

For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur just
the once and be copied, rather than hitting the Data Cache multiple
times with the same memory read at the same location.  Cache-inhibited
`LD-VSPLAT`, by contrast, allows memory-mapped peripherals to have
multiple data values read in quick succession and stored in
sequentially numbered registers.

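As a small sketch (based on the `op_load` pseudocode above, not
normative), an element-strided load with an immediate of zero produces
the same Effective Address for every element, which is what makes the
VSPLAT behaviour possible:

    # sketch: immed=0 collapses every element's EA to the same address
    srcbase = 0x1000   # assumed contents of scalar RA
    immed = 0
    VL = 4
    for i in range(VL):
        EA = srcbase + i * immed   # element-strided offset
        print(i, hex(EA))          # every element reads 0x1000
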
For non-cache-inhibited ST from a Vector source onto a Scalar
destination: with the Vector loop effectively creating multiple memory
writes to the same location, we can deduce that the last of these will
be the "successful" one.  Thus, implementations are free and clear to
optimise out the overwriting STs, leaving just the last one as the
"winner".  Bear in mind that predicate masks will skip some elements
(in source non-zeroing mode).  Cache-inhibited ST operations on the
other hand **MUST** write out a Vector source multiple successive times
to the exact same Scalar destination.

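A rough, non-normative sketch of why that deduction holds: every element
of the Vector source targets the same Effective Address, so only the
final (predicated-in) element's value is observable afterwards:

    # sketch: Vector-source ST to a Scalar destination address
    src = [11, 22, 33, 44]   # Vector source register values
    EA = 0x2000              # assumed Scalar destination address
    mem = {}
    for j in range(len(src)):
        mem[EA] = src[j]     # each write lands on the same EA
    print(mem[EA])           # 44: the last element "wins"
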
Note that there are no immediate versions of cache-inhibited LD/ST.

The modes for the `RA+RB` indexed version are slightly different:

| 0-1 |  2  |  3   4  | description                  |
| --- | --- |---------|------------------------------|
| 00  | SEA | dz  sz  | normal mode                  |
| 01  | SEA | dz  sz  | Strided (scalar only source) |
| 10  | N   | dz  sz  | sat mode: N=0/1 u/s          |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel     |
| 11  | inv | dz  RC1 | Rc=0: pred-result z/nonz     |

Vector Indexed Strided Mode is qualified as follows:

    if mode = 0b01 and !RA.isvec and !RB.isvec:
        svctx.ldstmode = elementstride

A summary of the effect of Vectorisation of src or dest:

    imm(RA)  RT.v   RA.v    no stride allowed
    imm(RA)  RT.s   RA.v    no stride allowed
    imm(RA)  RT.v   RA.s    stride-select allowed
    imm(RA)  RT.s   RA.s    not vectorised
    RA,RB    RT.v  {RA|RB}.v   UNDEFINED
    RA,RB    RT.s  {RA|RB}.v   UNDEFINED
    RA,RB    RT.v  {RA&RB}.s   VSPLAT possible. stride selectable
    RA,RB    RT.s  {RA&RB}.s   not vectorised

Signed Effective Address computation is only relevant for Vector Indexed
Mode, when elwidth overrides are applied.  The source override applies
to RB: before adding to RA in order to calculate the Effective Address,
if SEA is set, RB is sign-extended from elwidth bits to the full 64
bits.  For other Modes (ffirst, saturate), all EA computation is
unsigned.

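The following is a minimal, non-normative sketch of that sign-extension
step (`sign_extend` is an illustrative helper; a source elwidth override
of 16 on RB is assumed):

    # sketch: Signed Effective Address computation, Vector Indexed mode
    def sign_extend(val, elwidth):
        # interpret the low `elwidth` bits of val as a signed quantity
        val &= (1 << elwidth) - 1
        if val & (1 << (elwidth - 1)):
            val -= (1 << elwidth)
        return val

    RA_val  = 0x10000        # assumed base (element of RA)
    RB_el   = 0xFFFC         # 16-bit element of RB, i.e. -4 when signed
    elwidth = 16
    mask64  = (1 << 64) - 1
    EA_unsigned = (RA_val + RB_el) & mask64                          # SEA clear
    EA_signed   = (RA_val + sign_extend(RB_el, elwidth)) & mask64    # SEA set
    print(hex(EA_unsigned), hex(EA_signed))   # 0x1fffc vs 0xfffc
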
Note that cache-inhibited LD/ST (`ldcix`), when VSPLAT is activated,
will perform **multiple** LD/ST operations, sequentially.  `ldcix` even
with a scalar src will read the same memory location *multiple times*,
storing the result in successive Vector destination registers.  This is
because the cache-inhibit instructions are used to read and write
memory-mapped peripherals.  If a genuine cache-inhibited `LD-VSPLAT` is
required then a *scalar* cache-inhibited LD should be performed,
followed by a VSPLAT-augmented mv.

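A sketch of that recommended two-step sequence (illustrative only, not
an encoding; `mem_ci_read` is a hypothetical stand-in for a single
cache-inhibited peripheral access):

    # sketch (not an encoding): genuine cache-inhibited LD-VSPLAT sequence
    def mem_ci_read(addr):
        return 0x5A                  # stand-in for one peripheral read

    EA, VL = 0x4000, 4
    val = mem_ci_read(EA)            # step 1: one scalar ldcix-style access
    vector_dest = [val] * VL         # step 2: VSPLAT-augmented mv into RT..RT+VL-1
    print(vector_dest)
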
## LD/ST ffirst

ffirst LD/ST to multiple pages via a Vectorised Index base is considered
a security risk due to the abuse of probing multiple pages in rapid
succession and getting feedback on which pages would fail.  Therefore
Vector Indexed LD/ST is prohibited entirely, and the Mode bit is instead
used for element-strided LD/ST.
See <https://bugs.libre-soc.org/show_bug.cgi?id=561>

    for(i = 0; i < VL; i++)
        reg[rt + i] = mem[reg[ra] + i * reg[rb]];

High security implementations, where any kind of speculative probing of
memory pages is considered a risk, should take advantage of the fact
that implementations may truncate VL at any point, without requiring
software to be rewritten and made non-portable.  Such implementations
may choose to *always* set VL=1, which will have the effect of
terminating any speculative probing (and also adversely affect
performance), but will at least not require applications to be
rewritten.

Low-performance simpler hardware implementations may also choose to set
VL=1 as the bare minimum compliant implementation of LD/ST Fail-First.
It is however critically important to remember that the first element
LD/ST **MUST** be treated as an ordinary LD/ST, i.e. it **MUST** raise
exceptions exactly like an ordinary LD/ST.

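As a non-normative sketch of that behaviour (`load_elem` is a
hypothetical helper reporting whether an element's access would fault):
the first element faults exactly like an ordinary LD/ST, while a fault
on any later element simply truncates VL:

    # sketch: element-strided fail-first with VL truncation at the first fault
    def ffirst_load(base, stride, VL, load_elem):
        result = []
        for i in range(VL):
            ok, value = load_elem(base + i * stride)   # hypothetical helper
            if not ok:
                if i == 0:
                    raise MemoryError("first element must fault as an ordinary LD")
                return result, i        # truncate VL at the failing element
            result.append(value)
        return result, VL

    # e.g. a page boundary at 0x3000 beyond which accesses fault
    probe = lambda ea: (ea < 0x3000, ea)
    print(ffirst_load(0x2FF0, 8, 8, probe))   # VL truncated to 2
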
# LOAD/STORE Elwidths <a name="elwidth"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld).  Only `extsb` and
others like it provide an explicit operation width.  There are therefore
*three* widths involved:

* operation width (lb=8, lh=16, lw=32, ld=64)
* src element width override
* destination element width override

Some care is therefore needed to express and make clear the
transformations, which are expressly in this order (see the sketch
after this list):

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
  - zero-extension or truncation from operation width to source elwidth
  - zero-extension or truncation to dest elwidth
* Saturated mode:
  - Sign-extension or truncation from operation width to source width
  - signed/unsigned saturation down to dest elwidth

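The following is a minimal, non-normative sketch of the Saturated-mode
chain for a 16-bit `lh` with a destination elwidth override of 8 (the
source elwidth is assumed wide enough to hold the sign-extended value;
`sign_extend` and `saturate_signed` are illustrative helpers):

    # sketch: saturated chain for a 16-bit lh with dest elwidth of 8
    def sign_extend(val, width):
        if val & (1 << (width - 1)):
            val -= (1 << width)
        return val

    def saturate_signed(val, width):
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
        return max(lo, min(hi, val))

    memread = 0x9C40                       # raw 16-bit lh result (-25536 signed)
    widened = sign_extend(memread, 16)     # to source elwidth (assumed 32 or 64)
    result  = saturate_signed(widened, 8)  # signed saturation to dest elwidth 8
    print(result)                          # -128
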
In order to respect OpenPOWER v3.0B Scalar behaviour the memory side is
treated effectively as completely separate and distinct from SV
augmentation.  This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

It is unfortunately possible to request an elwidth override on the
memory side which does not mesh with the operation width: these result
in `UNDEFINED` behaviour.  The reason is that attempting a 64-bit
`sv.ld` operation with a source elwidth override of 8/16/32 would
result in overlapping memory requests, particularly on unit and element
strided operations.  Thus it is `UNDEFINED` when the elwidth is smaller
than the memory operation width.  Examples include `sv.lw/sw=16/els`,
which requests (overlapping) 4-byte memory reads offset from each other
at 2-byte intervals.  Store likewise is also `UNDEFINED` where the dest
elwidth override is less than the operation width.

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant
  (`ldbrx` rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector
capability).

Note that twin predication, predication-zeroing, saturation and other
modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;

# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs.  The usual interest in such re-mapping is,
for example, in separating out 24-bit RGB channel data into separate
contiguous registers.  NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest elwidth overrides
and saturation may do so with built-in conversion that would normally
require additional width-extension, sign-extension and min/max
Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed"
opcodes, because the generic abstracted concept of "Remapping", when
applied to LD/ST, will give that same capability, with far more
flexibility.

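As a rough, non-normative illustration of the access pattern such a
remap expresses for the RGB case (this shows only the index transform,
not the svp64 REMAP encoding):

    # sketch: structure-packed LD of interleaved RGB into three register groups
    mem = [v for px in range(4) for v in (10*px, 10*px+1, 10*px+2)]  # R,G,B,R,G,B,...
    VL = 12                       # four pixels, three channels
    regs = [0] * VL
    for i in range(VL):
        channel, pixel = i % 3, i // 3
        # remap: element i of the LD is redirected so that all R values land
        # in the first group of registers, all G in the second, all B in the third
        regs[channel * 4 + pixel] = mem[i]
    print(regs)   # [R0,R1,R2,R3, G0,G1,G2,G3, B0,B1,B2,B3]
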
# notes from lxo

this section covers assembly notation for the immediate and indexed
LD/ST.  the summary is that in immediate mode for LD it is not clear
that, if the destination register is Vectorised `RT.v` but the source
`imm(RA)` is scalar, the memory being read is *still a vector load*,
known as "unit or element strides".

This anomaly is made clear with the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being
implicitly identical to the above, is prohibited and is a syntax error:

    sv.ld RT.v, imm(RA)

Notes taken from IRC conversation:

    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other memory-operand constraints), append .v to it and you're done addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the address is a sum of registers

permutations of vector selection, to identify above asm-syntax:

    imm(RA)  RT.v  RA.v  nonstrided
        sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
        mem@      0+r#2   offs+(r#2+1)   offs+(r#2+2)
        destreg   r#      r#+1           r#+2
    imm(RA)  RT.s  RA.v  nonstrided
        sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
        (dest r# is scalar) -> VSELECT mode
    imm(RA)  RT.v  RA.s  fixed stride: unit or element
        sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
        mem@r#2   +0      +1      +2
        destreg   r#      r#+1    r#+2
        sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
        mem@r#2   +0      ...     +offs   ...    +offs*2
        destreg   r#      r#+1    r#+2
    imm(RA)  RT.s  RA.s  not vectorised
        sv.ld r#, ofst(r#2)

424
425 RA,RB RT.v RA.v RB.v
426 sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
427 RA,RB RT.v RA.s RB.v
428 sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
429 RA,RB RT.v RA.v RB.s
430 sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
431 RA,RB RT.v RA.s RB.s
432 sv.ldx r#.v, r#2, r#3 -> VSPLAT mode
433 RA,RB RT.s RA.v RB.v
434 RA,RB RT.s RA.s RB.v
435 RA,RB RT.s RA.v RB.s
436 RA,RB RT.s RA.s RB.s not vectorised