[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
* [[simple_v_extension/specification/ld.x]]

Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different modes:

* fixed stride (contiguous sequence with no gaps) aka "unit" stride
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* fail-first variants of the above (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- (RA) + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride. Since there is no way to tell which is intended from
the OpenPOWER v3.0B Scalar opcode alone, the choice is provided instead
by the SV Context.

    # LD not VLD! format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, RC, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip non-predicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == shifted: # for FFT/DCT
          # FFT/DCT shifted mode
          if (RA.isvec)
            srcbase = ireg[RA+i]
          else
            srcbase = ireg[RA]
          # shift amount is the value held in register RC
          offs = (i * immed) << ireg[RC]
        elif svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

    # reverses the bitorder up to "width" bits
    def bitrev(val, VL):
      width = int(log2(VL)) # number of index bits
      result = 0
      for _ in range(width):
        result = (result << 1) | (val & 1)
        val >>= 1
      return result

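As a rough illustration of the unit-stride and element-stride offset calculations in `op_load`, the following Python sketch lists the effective addresses produced for a scalar RA with no elements masked out. The function names are invented for this example, and the sketch assumes the element index advances by one per element; predication, RA update and the shifted mode are left out.

```python
def unit_stride_eas(base, immed, op_width, vl):
    """EA = base + immed + i * op_width, with op_width in bytes."""
    return [base + immed + i * op_width for i in range(vl)]


def element_stride_eas(base, immed, vl):
    """EA = base + i * immed; an immed of zero re-reads one address."""
    return [base + i * immed for i in range(vl)]


if __name__ == "__main__":
    # four word-sized (4-byte) elements starting at 0x1000 + 8
    print([hex(ea) for ea in unit_stride_eas(0x1000, 8, 4, 4)])
    # three elements spaced 24 bytes apart
    print([hex(ea) for ea in element_stride_eas(0x1000, 24, 3)])
```
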
Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL;):
        # skip nonpredicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
            break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
            break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

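The indexed loop can be exercised with a small executable model. This is only a sketch: the register file is a plain Python list, memory is a dict keyed by address, predicates are integer bitmasks, and RA update and element-width overrides are omitted. The Python signature, with its explicit `*_isvec` flags and predicate arguments, is an assumption made for the example.

```python
def op_ldx(ireg, mem, RT, RA, RB, VL, rt_isvec, ra_isvec, rb_isvec,
           ps=None, pd=None):
    """Toy model of the indexed LD loop (no RA update, no elwidths)."""
    all_ones = (1 << VL) - 1
    ps = all_ones if ps is None else ps  # source predicate mask
    pd = all_ones if pd is None else pd  # destination predicate mask
    i = j = k = 0
    while i < VL and j < VL and k < VL:
        # skip masked-out source and destination elements
        if ra_isvec:
            while i < VL and not (ps >> i) & 1:
                i += 1
        if rb_isvec:
            while k < VL and not (ps >> k) & 1:
                k += 1
        if rt_isvec:
            while j < VL and not (pd >> j) & 1:
                j += 1
        if i >= VL or j >= VL or k >= VL:
            break
        ea = ireg[RA + i] + ireg[RB + k]  # indexed address
        ireg[RT + j] = mem[ea]
        if not rt_isvec:
            break  # destination scalar, end immediately
        if not ra_isvec and not rb_isvec:
            break  # scalar-scalar
        if ra_isvec:
            i += 1
        if rb_isvec:
            k += 1
        j += 1
    return ireg


if __name__ == "__main__":
    regs = [0] * 32
    regs[8] = 0x100                   # scalar base in RA
    regs[16:20] = [0, 8, 32, 16]      # vector of offsets in RB
    memory = {0x100: 11, 0x108: 22, 0x120: 33, 0x110: 44}
    op_ldx(regs, memory, RT=24, RA=8, RB=16, VL=4,
           rt_isvec=True, ra_isvec=False, rb_isvec=True)
    print(regs[24:28])
```
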
Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update" mode (`ldux`) to be effectively a *completely different* register from RA-as-a-source. This is because there is room in svp64 to extend RA-as-src as well as RA-as-dest, both independently as scalar or vector *and* independently extending their range.

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk. Fail-first on Vector Indexed
allows attackers to probe large numbers of pages from userspace, where
strided fail-first (by creating contiguous sequential LDs) does not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either (or is a quirk). Realistically,
LD/ST needs its own interpretation of the [[sv/svp64]] mode table.
The following modes make sense:

* saturation
* predicate-result (mostly for cache-inhibited LD/ST)
* normal
* fail-first, where a vector source on RA or RB is banned

Also, given that FFT, DCT and other related algorithms
are of such high importance in so many areas of Computer
Science, a special "shift" mode has been added which
allows part of the immediate to be used instead as RC, a register
whose value is the amount by which the immediate is shifted:
`DS << GPR(RC)`.

The table for [[sv/svp64]] for `immed(RA)` is:

| 0-1 |  2  |  3   4  | description               |
| --- | --- |---------|-------------------------- |
| 00  | 0   | dz els  | normal mode               |
| 00  | 1   | dz shf  | shift mode                |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | els RC1 | Rc=0: ffirst z/nonz       |
| 10  | N   | dz els  | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | els RC1 | Rc=0: pred-result z/nonz  |

The `els` bit is only relevant when `RA.isvec` is clear: this indicates
whether stride is unit or element:

    if shifted: # FFT/DCT shift mode
        svctx.ldstmode = shifted
    elif RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

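The same priority order can be written as a small Python function, which makes the fall-through case explicit. The function name, the string mode names and the `"vsplat"` label for the final case are choices made for this sketch only; the label follows this page's description of a zero immediate as `LD-VSPLAT`.

```python
def select_ldst_mode(shifted, ra_isvec, els, immediate):
    """Return the LD/ST mode name for an immed(RA) operation."""
    if shifted:
        return "shifted"        # FFT/DCT shift mode takes priority
    if ra_isvec:
        return "indexed"        # vector RA: els is not relevant
    if els == 0:
        return "unitstride"
    if immediate != 0:
        return "elementstride"
    return "vsplat"             # els set, zero immediate: same address


if __name__ == "__main__":
    print(select_ldst_mode(False, False, 0, 16))
    print(select_ldst_mode(False, False, 1, 16))
    print(select_ldst_mode(False, False, 1, 0))
```
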
An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
in effect the element index is multiplied by an immediate-offset of zero,
so every element reads from the exact same memory location.

For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
just the once and be copied, rather than hitting the Data Cache
multiple times with the same memory read at the same location.

For ST from a vector source onto a scalar destination: with the Vector
loop effectively creating multiple memory writes to the same location,
we can deduce that the last of these will be the "successful" one. Thus,
implementations are free and clear to optimise out the overwriting STs,
leaving just the last one as the "winner". Bear in mind that predicate
masks will skip some elements (in source non-zeroing mode), so the
"winner" is the last element that is not masked out.

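The "last write wins" deduction can be checked with a short Python sketch. It assumes that elements are stored in increasing element order and that nothing else observes the location between the individual writes; the function names are invented for the example, and memory is modelled as a dict.

```python
def naive_vector_to_scalar_store(mem, ea, src, pred_mask):
    """Store every active element to the same address, in element order."""
    for i, value in enumerate(src):
        if (pred_mask >> i) & 1:
            mem[ea] = value


def optimised_vector_to_scalar_store(mem, ea, src, pred_mask):
    """Store only the last active element: the 'winner'."""
    for i in reversed(range(len(src))):
        if (pred_mask >> i) & 1:
            mem[ea] = src[i]
            return


if __name__ == "__main__":
    a, b = {}, {}
    data = [10, 20, 30, 40]
    naive_vector_to_scalar_store(a, 0x2000, data, 0b0110)
    optimised_vector_to_scalar_store(b, 0x2000, data, 0b0110)
    print(a == b, a)
```
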
Note that there are no immediate versions of cache-inhibited LD/ST.

The modes for the `RA+RB` indexed version are slightly different:

| 0-1 |  2  |  3   4  | description               |
| --- | --- |---------|-------------------------- |
| 00  | 0   | dz sz   | normal mode               |
| 00  | 1   | rsvd    | reserved                  |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | dz RC1  | Rc=0: ffirst z/nonz       |
| 10  | N   | dz sz   | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | dz RC1  | Rc=0: pred-result z/nonz  |

A summary of the effect of Vectorisation of src or dest:

    imm(RA)  RT.v  RA.v     no stride allowed
    imm(RA)  RT.s  RA.v     no stride allowed
    imm(RA)  RT.v  RA.s     stride-select allowed
    imm(RA)  RT.s  RA.s     not vectorised
    RA,RB    RT.v  RA/RB.v  ffirst banned
    RA,RB    RT.s  RA/RB.v  ffirst banned
    RA,RB    RT.v  RA/RB.s  VSPLAT possible
    RA,RB    RT.s  RA/RB.s  not vectorised

Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will
perform **multiple** LD/ST operations, sequentially. `ldcix` even with
scalar src will read the same memory location *multiple times*, storing
the result in successive Vector destination registers. This is because
the cache-inhibit instructions are used to read and write memory-mapped
peripherals.
If a genuine cache-inhibited LD-VSPLAT is required then a *scalar*
cache-inhibited LD should be performed, followed by a VSPLAT-augmented mv.

## LD/ST ffirst

ffirst LD/ST to multiple pages via a Vectorised base is considered a
security risk, because it allows multiple pages to be probed in rapid
succession with feedback on which pages would fail. Therefore in these
special circumstances requesting ffirst with a vector base is instead
interpreted as element-strided LD/ST.
See <https://bugs.libre-soc.org/show_bug.cgi?id=561>

# LOAD/STORE Elwidths <a name="ldst"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld). The only other
instructions that provide an explicit operation width are `extsb` and
others like it. There are therefore *three* widths involved:

* operation width, in bits (lb=8, lh=16, lw=32, ld=64)
* source element width override
* destination element width override

Some care is therefore needed to express and make clear the transformations,
which are expressly in this order:

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
   - zero-extension or truncation from operation width to source elwidth
   - zero-extension or truncation to dest elwidth
* Saturated mode:
   - Sign-extension or truncation from operation width to source width
   - signed/unsigned saturation down to dest elwidth

In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation. This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

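The ordering of the width transformations listed above can be sketched in Python, with all widths given in bits and values held as unsigned bit patterns. Every helper name here is invented for illustration. How unsigned saturation treats a sign-extended value is not spelled out on this page, so the sketch assumes it clamps the unsigned reading of the source-width pattern.

```python
def mask(width):
    return (1 << width) - 1


def zext_or_trunc(value, from_width, to_width):
    """Zero-extend or truncate a from_width-bit pattern to to_width bits."""
    return value & mask(min(from_width, to_width))


def sext_or_trunc(value, from_width, to_width):
    """Sign-extend or truncate a from_width-bit pattern to to_width bits."""
    value &= mask(from_width)
    if to_width > from_width and value >> (from_width - 1):
        value |= mask(to_width) ^ mask(from_width)  # replicate sign bit
    return value & mask(to_width)


def saturate(value, src_width, dest_width, signed):
    """Clamp a src_width-bit pattern into the dest_width range."""
    value &= mask(src_width)
    if signed:
        if value >> (src_width - 1):
            value -= 1 << src_width  # reinterpret as a negative number
        lo, hi = -(1 << (dest_width - 1)), (1 << (dest_width - 1)) - 1
    else:
        lo, hi = 0, mask(dest_width)  # assumption: unsigned reading
    return max(lo, min(hi, value)) & mask(dest_width)


def ld_element(memread, op_width, src_elwidth, dest_elwidth,
               saturated=False, signed=True):
    """Apply the elwidth steps to a value already loaded and byte-swapped."""
    if saturated:
        v = sext_or_trunc(memread, op_width, src_elwidth)
        return saturate(v, src_elwidth, dest_elwidth, signed)
    v = zext_or_trunc(memread, op_width, src_elwidth)
    return zext_or_trunc(v, src_elwidth, dest_elwidth)


if __name__ == "__main__":
    # halfword 0x8000, source elwidth 32, saturated down to signed 8 bits
    print(hex(ld_element(0x8000, 16, 32, 8, saturated=True)))
```
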
Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;

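The byte-reversal step in `op_ld` can be tried out in isolation with the Python sketch below. It assumes `op_width` is given in bytes here and that `brev` and `MSR.LE` are passed as booleans; `needs_bytereverse` is a name invented for the example, standing in for the `brev XNOR MSR.LE` expression.

```python
def needs_bytereverse(brev, msr_le):
    """brev XNOR MSR.LE: true when both flags are equal."""
    return brev == msr_le


def byteswap(value, op_width):
    """Reverse the byte order of value across op_width bytes."""
    return int.from_bytes(value.to_bytes(op_width, "little"), "big")


if __name__ == "__main__":
    memread = 0x1234
    if needs_bytereverse(brev=True, msr_le=True):
        memread = byteswap(memread, 2)
    print(hex(memread))
```
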
# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (a minimum of two 64-bit opcodes) it
provides a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs. The usual interest in such re-mapping
is for example in separating out 24-bit RGB channel data into separate
contiguous registers. NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.

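The RGB example amounts to a permutation of element indices, which the Python sketch below spells out. It is not the SV REMAP encoding or its setup sequence: `deinterleave_index` and `remapped_load` are names invented here, and the sketch only shows where each interleaved element would land if the channels are to end up contiguous.

```python
def deinterleave_index(i, channels, pixels):
    """Map interleaved element i to its slot in channel-planar order."""
    return (i % channels) * pixels + (i // channels)


def remapped_load(mem, channels=3):
    """Load interleaved elements so that each channel becomes contiguous."""
    pixels = len(mem) // channels
    dest = [None] * (channels * pixels)
    for i in range(channels * pixels):
        dest[deinterleave_index(i, channels, pixels)] = mem[i]
    return dest


if __name__ == "__main__":
    print(remapped_load(["r0", "g0", "b0", "r1", "g1", "b1"]))
```
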
# notes from lxo

This section covers assembly notation for the immediate and indexed LD/ST.
The summary is that in immediate mode for LD it is not clear that, if the
destination register is Vectorised (`RT.v`) but the source `imm(RA)` is scalar,
the memory being read is *still a vector load*, known as "unit or element strides".

This anomaly is made clear with the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being implicitly identical to the above, is prohibited and is a syntax error:

    sv.ld RT.v, imm(RA)

Notes taken from IRC conversation:

    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other memory-operand constraints), append .v to it and you're done addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the address is a sum of registers

Permutations of vector selection, matched to the asm syntax above:

    imm(RA)  RT.v RA.v nonstrided
        sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
          mem@     0+r#2   offs+(r#2+1)  offs+(r#2+2)
          destreg  r#      r#+1          r#+2
    imm(RA)  RT.s RA.v nonstrided
        sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
          (dest r# is scalar) -> VSELECT mode
    imm(RA)  RT.v RA.s fixed stride: unit or element
        sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
          mem@r#2  +0      +1            +2
          destreg  r#      r#+1          r#+2
        sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
          mem@r#2  +0 ...  +offs ...     +offs*2
          destreg  r#      r#+1          r#+2
    imm(RA)  RT.s RA.s not vectorised
        sv.ld r#, ofst(r#2)

indexed mode:

    RA,RB  RT.v RA.v RB.v
        sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
    RA,RB  RT.v RA.s RB.v
        sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
    RA,RB  RT.v RA.v RB.s
        sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
    RA,RB  RT.v RA.s RB.s
        sv.ldx r#.v, r#2, r#3 -> VSPLAT mode
    RA,RB  RT.s RA.v RB.v
    RA,RB  RT.s RA.s RB.v
    RA,RB  RT.s RA.v RB.s
    RA,RB  RT.s RA.s RB.s not vectorised