[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>

Vectorisation of Load and Store requires the creation, from the scalar
operations, of a number of different modes:

* fixed stride (contiguous sequence with no gaps) aka "unit" stride
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* fail-first on the same (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride.  With there being no way to tell which from the
OpenPOWER v3.0B Scalar opcode alone, the choice is provided instead by
the SV Context.

    # LD not VLD! format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, RC, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip nonpredicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == bitreversed:  # for FFT/DCT
          # FFT/DCT bitreversed mode
          if (RA.isvec)
            srcbase = ireg[RA+i]
          else
            srcbase = ireg[RA]
          offs = (bitrev(i, VL) * immed) << RC
        elif svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed              # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

    # reverses the bitorder up to "width" bits
    def bitrev(val, VL):
      width = log2(VL)
      result = 0
      for _ in range(width):
        result = (result << 1) | (val & 1)
        val >>= 1
      return result
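
As a quick sanity check of `bitrev`, here is the same algorithm as
directly runnable Python (a sketch, assuming VL is a power of two so
that `log2(VL)` is exact).  With VL=8 the element indices 0..7 come out
in the classic FFT butterfly order:

    # illustrative only: directly-runnable version of the bitrev above
    def bitrev(val, VL):
        width = VL.bit_length() - 1   # log2(VL) for power-of-two VL
        result = 0
        for _ in range(width):
            result = (result << 1) | (val & 1)
            val >>= 1
        return result

    print([bitrev(i, 8) for i in range(8)])
    # prints [0, 4, 2, 6, 1, 5, 3, 7]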

Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
        # skip nonpredicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
          break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source.  This is because there is room in svp64 to extend
RA-as-src as well as RA-as-dest, both independently as scalar or vector
*and* independently extending their range.

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk.  Fail-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace,
where strided fail-first (by creating contiguous sequential LDs) does
not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either (or, is a quirk).  Realistically
we need an alternative table meaning for [[sv/svp64]] mode.  The
following modes make sense:

* saturation
* predicate-result (mostly for cache-inhibited LD/ST)
* normal
* fail-first, where a vector source on RA or RB is banned

The table for [[sv/svp64]] for `immed(RA)` is:

| 0-1 |  2  |  3   4  |  description               |
| --- | --- |---------|--------------------------- |
| 00  | 0   | dz  els | normal mode                |
| 00  | 1   | dz  rsv | bitreverse mode (FFT, DCT) |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel        |
| 01  | inv | els RC1 | Rc=0: ffirst z/nonz        |
| 10  | N   | dz  els | sat mode: N=0/1 u/s        |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel   |
| 11  | inv | els RC1 | Rc=0: pred-result z/nonz   |

The `els` bit is only relevant when `RA.isvec` is clear: this indicates
whether the stride is unit or element:

    if bitreversed:
        svctx.ldstmode = bitreversed
    elif RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

An immediate of zero is a safety-valve to allow `LD-VSPLAT`: in effect
the multiplication of the immediate-offset by zero results in reading
from the exact same memory location.

For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur just
the once and be copied, rather than hitting the Data Cache multiple
times with the same memory read at the same location.
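
A minimal sketch of that optimisation, in the same pseudocode style as
above (illustrative only, not normative): the location is read once and
the value broadcast to every predicated destination element:

    # LD-VSPLAT: RA scalar, immediate of zero, RT vector
    ea  = ireg[RA]                 # EA is identical for every element
    val = MEM[ea]                  # single Data-Cache access
    for j in range(VL):
        if pd & (1 << j):          # respect destination predication
            ireg[RT+j] = val       # broadcast, no further memory reads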

For ST from a vector source onto a scalar destination: with the Vector
loop effectively creating multiple memory writes to the same location,
we can deduce that the last of these will be the "successful" one.  Thus,
implementations are free and clear to optimise out the overwriting STs,
leaving just the last one as the "winner".  Bear in mind that predicate
masks will skip some elements (in source non-zeroing mode).
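
A sketch of that deduction, again in illustrative pseudocode (here `RS`
names the store source register): the store loop collapses to the last
predicated source element:

    # ST, vector source RS, scalar EA: only the last active element survives
    last = None
    for i in range(VL):
        if ps & (1 << i):          # source predication (non-zeroing mode)
            last = i               # remember the final active element
    if last is not None:
        MEM[EA] = ireg[RS+last]    # one write replaces up to VL writes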

Note that there are no immediate versions of cache-inhibited LD/ST.

The modes for the `RA+RB` indexed version are slightly different:

| 0-1 |  2  |  3   4  |  description              |
| --- | --- |---------|-------------------------- |
| 00  | 0   | dz  sz  | normal mode               |
| 00  | 1   | rsvd    | reserved                  |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | dz  RC1 | Rc=0: ffirst z/nonz       |
| 10  | N   | dz  sz  | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | dz  RC1 | Rc=0: pred-result z/nonz  |

A summary of the effect of Vectorisation of src or dest:

    imm(RA)  RT.v   RA.v    no stride allowed
    imm(RA)  RT.s   RA.v    no stride allowed
    imm(RA)  RT.v   RA.s    stride-select allowed
    imm(RA)  RT.s   RA.s    not vectorised
    RA,RB    RT.v  RA/RB.v  ffirst banned
    RA,RB    RT.s  RA/RB.v  ffirst banned
    RA,RB    RT.v  RA/RB.s  VSPLAT possible
    RA,RB    RT.s  RA/RB.s  not vectorised

Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will
perform **multiple** LD/ST operations, sequentially.  `ldcix` even with
a scalar src will read the same memory location *multiple times*, storing
the result in successive Vector destination registers.  This is because
the cache-inhibit instructions are used to read and write memory-mapped
peripherals.  If a genuine cache-inhibited LD-VSPLAT is required then a
*scalar* cache-inhibited LD should be performed, followed by a
VSPLAT-augmented mv.
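
A sketch of that two-step sequence, in the same illustrative pseudocode
(`MEM_ci` here is a stand-in for a cache-inhibited memory access, not a
defined symbol):

    # step 1: one scalar cache-inhibited read (ldcix): exactly one bus access
    val = MEM_ci[EA]
    # step 2: VSPLAT-augmented mv: broadcast the scalar into the vector
    # destination with no further memory accesses
    for j in range(VL):
        ireg[RT+j] = val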

# LOAD/STORE Elwidths <a name="ldst"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld).  Only `extsb` and
others like it provide an explicit operation width.  There are therefore
*three* widths involved:

* operation width (lb=8, lh=16, lw=32, ld=64)
* src element width override
* destination element width override

Some care is therefore needed to express and make clear the
transformations, which are expressly in this order (a sketch follows
the list):

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
  - zero-extension or truncation from operation width to source elwidth
  - zero/truncation to dest elwidth
* Saturated mode:
  - Sign-extension or truncation from operation width to source width
  - signed/unsigned saturation down to dest elwidth

In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation.  This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector
capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
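
The helpers `get_polymorphed_reg` and `set_polymorphed_reg` are not
defined in this section.  A minimal sketch of the intended behaviour,
under the assumption (for illustration only) that the register file may
be viewed as a flat little-endian byte array into which elements are
packed at the overridden element width:

    # regfile viewed as a byte-addressable LE array: 128 regs x 8 bytes
    regbytes = bytearray(128 * 8)

    def get_polymorphed_reg(reg, elwidth, element):
        bytewid = elwidth // 8
        offs = reg * 8 + element * bytewid   # elements pack from reg upwards
        return int.from_bytes(regbytes[offs:offs+bytewid], "little")

    def set_polymorphed_reg(reg, elwidth, element, value):
        bytewid = elwidth // 8
        offs = reg * 8 + element * bytewid
        mask = (1 << elwidth) - 1
        regbytes[offs:offs+bytewid] = (value & mask).to_bytes(bytewid, "little")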

# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs.  The usual interest in such re-mapping
is for example in separating out 24-bit RGB channel data into separate
contiguous registers.  NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.
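
A sketch of the index arithmetic involved in such a de-interleave (plain
illustrative pseudocode, not the [[sv/remap]] encoding itself): packed
3-channel RGB elements are redirected so that each channel lands in its
own contiguous group of destination elements:

    # 4 RGB pixels packed in memory: R0 G0 B0 R1 G1 B1 R2 G2 B2 R3 G3 B3
    packed = list(range(12))        # stand-in for the 12 loaded elements
    VL, stride = 12, 3              # 3 channels per structure
    regs = [0] * VL
    for i in range(VL):
        # element i of the LD is redirected to destination index:
        regs[(i % stride) * (VL // stride) + (i // stride)] = packed[i]
    # regs now holds R0 R1 R2 R3, G0 G1 G2 G3, B0 B1 B2 B3 (channel-grouped)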

# notes from lxo

This section covers assembly notation for the immediate and indexed LD/ST.
The summary is that in immediate mode for LD it is not clear that, if the
destination register is Vectorised (`RT.v`) but the source `imm(RA)` is
scalar, the memory being read is *still a vector load*, known as "unit or
element strides".

This anomaly is made clear with the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being
implicitly identical to the above, is prohibited and is a syntax error:

    sv.ld RT.v, imm(RA)

Notes taken from IRC conversation

    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other
          memory-operand constraints), append .v to it and you're done
          addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the
          address is a sum of registers

permutations of vector selection, to identify above asm-syntax:

    imm(RA)  RT.v  RA.v  nonstrided
        sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
        mem@    0+r#2   offs+(r#2+1)   offs+(r#2+2)
        destreg r#      r#+1           r#+2
    imm(RA)  RT.s  RA.v  nonstrided
        sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
        (dest r# is scalar) -> VSELECT mode
    imm(RA)  RT.v  RA.s  fixed stride: unit or element
        sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
        mem@r#2   +0    +1     +2
        destreg   r#    r#+1   r#+2
        sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
        mem@r#2   +0    ...    +offs  ...  +offs*2
        destreg   r#    r#+1   r#+2
    imm(RA)  RT.s  RA.s  not vectorised
        sv.ld r#, ofst(r#2)

indexed mode:

    RA,RB  RT.v  RA.v  RB.v
        sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
    RA,RB  RT.v  RA.s  RB.v
        sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
    RA,RB  RT.v  RA.v  RB.s
        sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
    RA,RB  RT.v  RA.s  RB.s
        sv.ldx r#.v, r#2, r#3 -> VSPLAT mode
    RA,RB  RT.s  RA.v  RB.v
    RA,RB  RT.s  RA.s  RB.v
    RA,RB  RT.s  RA.v  RB.s
    RA,RB  RT.s  RA.s  RB.s  not vectorised