[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
* [[simple_v_extension/specification/ld.x]]

# Rationale

All Vector ISAs dating back fifty years have extensive and comprehensive
Load and Store operations that go far beyond the capabilities of Scalar
RISC or CISC processors, yet at their heart on an individual element
basis may be found to be no different from RISC Scalar equivalents.

The resource savings from Vector LD/ST are significant and stem from
the fact that one single instruction can trigger a dozen (or, in some
microarchitectures such as Cray or NEC SX Aurora, hundreds) of
element-level Memory accesses.

Additionally, and simply: if the Arithmetic side of an ISA supports
Vector Operations, then in order to keep the ALUs 100% occupied the
Memory infrastructure (and the ISA itself) correspondingly needs Vector
Memory Operations as well.

Vectorised Load and Store also presents an extra dimension (literally)
which creates scenarios unique to Vector applications, that a Scalar
(and even a SIMD) ISA simply never encounters. SVP64 endeavours to
add such modes without changing the behaviour of the underlying Base
(Scalar) v3.0B operations.

# Modes overview

Vectorisation of Load and Store requires the creation, from the scalar
operations, of a number of different modes:

* fixed stride (contiguous sequence with no gaps) aka "unit" stride
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* Speculative fail-first (where it makes sense to do so)
* Structure Packing (covered in SV by [[sv/remap]]).

Also included in SVP64 LD/ST are both signed and unsigned Saturation,
as well as Element-width overrides and Twin-Predication.
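As a purely illustrative (non-normative) sketch of the first three modes,
the snippet below shows the Effective Addresses that each mode would
generate; helper names such as `unit_stride_eas` are invented for this
example only.

    # illustrative only: Effective Address patterns of the three main modes
    def unit_stride_eas(base, op_width, VL):
        # contiguous: base, base+w, base+2w, ...
        return [base + i * op_width for i in range(VL)]

    def element_stride_eas(base, stride, VL):
        # regularly offset, with gaps: base, base+stride, base+2*stride, ...
        return [base + i * stride for i in range(VL)]

    def vector_indexed_eas(bases, offsets):
        # vector of base addresses plus vector of offsets
        return [b + o for b, o in zip(bases, offsets)]

    print([hex(a) for a in unit_stride_eas(0x1000, 4, 4)])     # 1000 1004 1008 100c
    print([hex(a) for a in element_stride_eas(0x1000, 16, 4)]) # 1000 1010 1020 1030
    print([hex(a) for a in vector_indexed_eas([0x1000] * 4, [0, 64, 8, 32])])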

# Vectorisation of Scalar Power ISA v3.0B

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride. With there being no way to tell which from the
OpenPOWER v3.0B Scalar opcode alone, the choice is provided instead by
the SV Context.

    # LD not VLD! format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, RC, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip non-predicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == shifted: # for FFT/DCT
          # FFT/DCT shifted mode
          if (RA.isvec)
            srcbase = ireg[RA+i]
          else
            srcbase = ireg[RA]
          offs = (i * immed) << RC
        elif svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed # j*immed for a ST
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = immed + (i * op_width) # j*op_width for ST
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;

    # reverses the bitorder up to "width" bits
    def bitrev(val, VL):
        width = log2(VL)
        result = 0
        for _ in range(width):
            result = (result << 1) | (val & 1)
            val >>= 1
        return result

Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
        # skip nonpredicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          EA = ireg[RA] + ireg[RB]*j # register-strided
        else
          EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end immediately
        if svctx.ldstmode != elementstride:
          if (!RA.isvec && !RB.isvec)
            break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

Note in both cases that [[sv/svp64]] allows RA-as-a-dest in "update"
mode (`ldux`) to be effectively a *completely different* register from
RA-as-a-source. This is because there is room in svp64 to extend
RA-as-src as well as RA-as-dest, both independently as scalar or vector
*and* independently extending their range.

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk. Fail-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace,
where strided fail-first (by creating contiguous sequential LDs) does
not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either (or, at best, is a quirk).
Realistically we need an alternative table of meanings for [[sv/svp64]]
mode. The following modes make sense:

* saturation
* predicate-result (mostly for cache-inhibited LD/ST)
* normal
* fail-first (where Vector Indexed is banned)
* Signed Effective Address computation (Vector Indexed only)

Also, given that FFT, DCT and other related algorithms
are of such high importance in so many areas of Computer
Science, a special "shift" mode has been added which
allows part of the immediate to be used instead as RC, a register
which shifts the immediate `DS << GPR(RC)`.
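As a rough, non-normative illustration of the offsets this produces
(mirroring the `offs = (i * immed) << RC` line in the `op_load`
pseudocode above; the helper name here is invented for the example):

    # illustrative only: offset sequence generated by "shift" mode,
    # where the immediate DS is scaled by the element index and then
    # shifted left by the contents of GPR(RC)
    def shifted_offsets(DS, rc_value, VL):
        return [(i * DS) << rc_value for i in range(VL)]

    print(shifted_offsets(8, 2, 4))  # DS=8, GPR(RC)=2: [0, 32, 64, 96]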

The table for [[sv/svp64]] for `immed(RA)` is:

| 0-1 | 2 | 3 4 | description |
| --- | --- |---------|--------------------------- |
| 00 | 0 | dz els | normal mode |
| 00 | 1 | dz shf | shift mode |
| 01 | inv | CR-bit | Rc=1: ffirst CR sel |
| 01 | inv | els RC1 | Rc=0: ffirst z/nonz |
| 10 | N | dz els | sat mode: N=0/1 u/s |
| 11 | inv | CR-bit | Rc=1: pred-result CR sel |
| 11 | inv | els RC1 | Rc=0: pred-result z/nonz |

The `els` bit is only relevant when `RA.isvec` is clear: this indicates
whether stride is unit or element:

    if shifted:
        svctx.ldstmode = shifted
    elif RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

An immediate of zero is a safety-valve to allow `LD-VSPLAT`:
in effect the multiplication of the immediate-offset by zero results
in reading from the exact same memory location.

For `LD-VSPLAT`, on non-cache-inhibited Loads, the read can occur
just the once and be copied, rather than hitting the Data Cache
multiple times with the same memory read at the same location.
Cache-inhibited LD-VSPLAT, on the other hand, allows memory-mapped
peripherals to have multiple data values read in quick succession
and stored in sequentially numbered registers.
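A minimal sketch of the LD-VSPLAT effect (illustrative only, assuming a
toy dict-based memory and a flat integer regfile): with a scalar RA, an
immediate of zero and a vector RT, every element reads the same
location, so a non-cache-inhibited implementation may legitimately read
once and copy.

    # illustrative only: LD-VSPLAT with scalar RA, immed=0, vector RT
    mem = {0x2000: 0x99}       # toy memory model
    ireg = [0] * 32            # toy integer regfile
    RA, RT, VL = 1, 4, 4
    ireg[RA] = 0x2000

    value = mem[ireg[RA] + 0]  # the read occurs just the once...
    for j in range(VL):
        ireg[RT + j] = value   # ...and is copied into RT .. RT+VL-1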

For non-cache-inhibited ST from a vector source onto a scalar
destination: with the Vector loop effectively creating multiple memory
writes to the same location, we can deduce that the last of these will
be the "successful" one. Thus, implementations are free and clear to
optimise out the overwriting STs, leaving just the last one as the
"winner". Bear in mind that predicate masks will skip some elements
(in source non-zeroing mode). Cache-inhibited ST operations on the
other hand **MUST** write out a Vector source multiple successive
times to the exact same Scalar destination.
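The sketch below (illustrative, non-normative) shows why this
optimisation is legal for non-cache-inhibited ST: every iteration
targets the same Effective Address, so only the final element's value
is architecturally visible.

    # illustrative only: vector-source ST to a scalar (unchanging) EA
    mem = {}
    ireg = [0, 0x3000, 0, 0, 10, 20, 30, 40]   # r4..r7 hold the vector
    RS, RA, VL = 4, 1, 4

    for j in range(VL):
        mem[ireg[RA]] = ireg[RS + j]   # same location written VL times

    # only the last write survives, so the earlier STs may be elided
    assert mem[0x3000] == 40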

Note that there are no immediate versions of cache-inhibited LD/ST.

The modes for `RA+RB` indexed version are slightly different:

| 0-1 | 2 | 3 4 | description |
| --- | --- |---------|-------------------------- |
| 00 | SEA | dz sz | normal mode |
| 01 | SEA | dz sz | Strided (scalar only source) |
| 10 | N | dz sz | sat mode: N=0/1 u/s |
| 11 | inv | CR-bit | Rc=1: pred-result CR sel |
| 11 | inv | dz RC1 | Rc=0: pred-result z/nonz |

Vector Indexed Strided Mode is qualified as follows:

    if mode = 0b01 and !RA.isvec and !RB.isvec:
        svctx.ldstmode = elementstride

A summary of the effect of Vectorisation of src or dest:

    imm(RA)  RT.v  RA.v       no stride allowed
    imm(RA)  RT.s  RA.v       no stride allowed
    imm(RA)  RT.v  RA.s       stride-select allowed
    imm(RA)  RT.s  RA.s       not vectorised
    RA,RB    RT.v  {RA|RB}.v  UNDEFINED
    RA,RB    RT.s  {RA|RB}.v  UNDEFINED
    RA,RB    RT.v  {RA&RB}.s  VSPLAT possible. stride selectable
    RA,RB    RT.s  {RA&RB}.s  not vectorised

Signed Effective Address computation is only relevant for
Vector Indexed Mode, when elwidth overrides are applied.
The source override applies to RB: before adding RB to
RA in order to calculate the Effective Address, if SEA is
set then RB is sign-extended from elwidth bits to the full 64
bits. For other Modes (ffirst, saturate),
all EA computation with elwidth overrides is unsigned.
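A sketch of the SEA rule (illustrative only; the `sign_extend` helper is
not part of the specification): with a 16-bit source elwidth override,
an RB element of 0xFFFE contributes -2 to the EA when SEA is set, and
65534 when it is not.

    # illustrative only: Signed Effective Address computation
    def sign_extend(val, bits):
        sign = 1 << (bits - 1)
        return (val & (sign - 1)) - (val & sign)

    def indexed_ea(ra, rb_elem, elwidth, SEA):
        if SEA:
            rb_elem = sign_extend(rb_elem, elwidth)  # e.g. 0xFFFE -> -2
        # without SEA the elwidth-sized RB element is used unsigned
        return (ra + rb_elem) & ((1 << 64) - 1)

    print(hex(indexed_ea(0x5000, 0xFFFE, 16, SEA=True)))   # 0x4ffe
    print(hex(indexed_ea(0x5000, 0xFFFE, 16, SEA=False)))  # 0x14ffe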

Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will
perform **multiple** LD/ST operations, sequentially. `ldcix` even with
a scalar src will read the same memory location *multiple times*,
storing the result in successive Vector destination registers. This is
because the cache-inhibit instructions are used to read and write
memory-mapped peripherals. If a genuine cache-inhibited LD-VSPLAT is
required then a *scalar* cache-inhibited LD should be performed,
followed by a VSPLAT-augmented mv.

## LD/ST ffirst

LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
ordinary one. Exceptions occur "as normal". However for elements 1
and above, if an exception would occur, then VL is **truncated** to the
previous element: the exception is **not** then raised because the
LD/ST was effectively speculative.

ffirst LD/ST to multiple pages via a Vectorised Index base is considered
a security risk due to the abuse of probing multiple pages in rapid
succession and getting feedback on which pages would fail. Therefore
Vector Indexed LD/ST is prohibited entirely, and the Mode bit instead
used for element-strided LD/ST. See
<https://bugs.libre-soc.org/show_bug.cgi?id=561>

    for(i = 0; i < VL; i++)
        reg[rt + i] = mem[reg[ra] + i * reg[rb]];

High security implementations where any kind of speculative probing
of memory pages is considered a risk should take advantage of the fact
that implementations may truncate VL at any point, without requiring
software to be rewritten and made non-portable. Such implementations
may choose to *always* set VL=1 which will have the effect of
terminating any speculative probing (and also adversely affect
performance), but will at least not require applications to be
rewritten.

Simpler low-performance hardware implementations may also choose to
always set VL=1 as the bare minimum compliant implementation of
LD/ST Fail-First. It is however critically important to remember that
the first element LD/ST **MUST** be treated as an ordinary LD/ST, i.e.
**MUST** raise exceptions exactly like an ordinary LD/ST.
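The following sketch (illustrative only, not the normative pseudocode)
models the Fail-First rule: element 0 is allowed to fault as normal,
but a fault on any later element instead truncates VL to the number of
elements already completed; `would_fault` stands in here for the
implementation's translation/permission check.

    # illustrative only: LD/ST fail-first VL truncation
    def ffirst_load(eas, VL, would_fault, do_load):
        for i in range(VL):
            if would_fault(eas[i]):
                if i == 0:
                    raise MemoryError("element 0 faults as normal")
                return i       # VL truncated; no exception is raised
            do_load(i, eas[i])
        return VL

    # if element 2 would fault, VL becomes 2: only elements 0..1 are loaded
    vl = ffirst_load([0x1000, 0x1008, 0xdead0000, 0x1018], 4,
                     would_fault=lambda ea: ea >= 0xdead0000,
                     do_load=lambda i, ea: None)
    assert vl == 2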

For ffirst LD/STs, VL may be truncated arbitrarily to a nonzero value
for any implementation-specific reason. For example: it is perfectly
reasonable for implementations to alter VL when ffirst LD or ST
operations are initiated on a nonaligned boundary, such that within a
loop the subsequent iteration of that loop begins subsequent ffirst
LD/ST operations on an aligned boundary such as the beginning of a
cache line, or beginning of a Virtual Memory page. Likewise, to reduce
workloads or balance resources.

Vertical-First Mode is slightly strange in that only one element
at a time is ever executed anyway. Given that programmers may
legitimately choose to alter srcstep and dststep in non-sequential
order as part of explicit loops, it is neither possible nor
safe to make speculative assumptions about future LD/STs.
Therefore, Fail-First LD/ST in Vertical-First is `UNDEFINED`.
This is very different from Arithmetic (Data-dependent) FFirst
where Vertical-First Mode is deterministic, not speculative.

# LOAD/STORE Elwidths <a name="elwidth"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
others like it provide an explicit operation width. There are therefore
*three* widths involved:

* operation width (lb=8, lh=16, lw=32, ld=64)
* source element width override
* destination element width override

Some care is therefore needed to express and make clear the
transformations, which are expressly in this order:

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
  - zero-extension or truncation from operation width to source elwidth
  - zero/truncation to dest elwidth
* Saturated mode:
  - Sign-extension or truncation from operation width to source width
  - signed/unsigned saturation down to dest elwidth
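As a rough sketch of that ordering (illustrative only: `adjust_wid` and
`saturate` here are placeholder helpers, not the specification's
definitions), this is approximately what happens to a loaded value on
its way to the destination element:

    # illustrative only: width transformations applied after the memory read
    def adjust_wid(val, from_bits, to_bits):
        # zero-extension of an unsigned value is a no-op; narrowing truncates
        return val & ((1 << min(from_bits, to_bits)) - 1)

    def saturate(val, to_bits, signed):
        if signed:
            lo, hi = -(1 << (to_bits - 1)), (1 << (to_bits - 1)) - 1
        else:
            lo, hi = 0, (1 << to_bits) - 1
        return max(lo, min(hi, val))

    memread = 0x1234   # value loaded at operation width (e.g. lh = 16-bit)
    # non-saturated: zero-extend/truncate to src elwidth, then dest elwidth
    print(hex(adjust_wid(adjust_wid(memread, 16, 32), 32, 8)))  # 0x34
    # saturated: unsigned saturation down to an 8-bit dest elwidth
    print(hex(saturate(memread, 8, signed=False)))              # 0xff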

In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation. This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

It is unfortunately possible to request an elwidth override on the
memory side which does not mesh with the operation width: such
requests result in `UNDEFINED` behaviour. The reason is that attempting
a 64-bit `sv.ld` operation with a source elwidth override of 8/16/32
would result in overlapping memory requests, particularly on unit and
element strided operations. Thus it is `UNDEFINED` when the elwidth is
smaller than the memory operation width. Examples include
`sv.lw/sw=16/els` which requests (overlapping) 4-byte memory reads
offset from each other at 2-byte intervals. Store likewise is also
`UNDEFINED` where the dest elwidth override is less than the operation
width.
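To make the overlap concrete, the (purely illustrative) arithmetic below
reproduces the `sv.lw/sw=16/els` example: each element still performs a
4-byte operation-width read, but the 16-bit source elwidth spaces
consecutive reads only 2 bytes apart.

    # illustrative only: why elwidth smaller than op width is UNDEFINED
    op_width = 4      # lw reads 4 bytes
    src_elwidth = 2   # sw=16 override: elements are 2 bytes apart
    base = 0x1000
    reads = [(base + i * src_elwidth, base + i * src_elwidth + op_width - 1)
             for i in range(4)]
    # [(0x1000, 0x1003), (0x1002, 0x1005), (0x1004, 0x1007), (0x1006, 0x1009)]
    # each 4-byte read overlaps the next one by 2 bytes
    print([(hex(lo), hex(hi)) for lo, hi in reads])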

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector
capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
          # strange vector mode, compute 64 bit address which is
          # not polymorphic! elwidth hardcoded to 64 here
          srcbase = get_polymorphed_reg(RA, 64, i)
        else:
          # unit / element stride mode, compute 64 bit address
          srcbase = get_polymorphed_reg(RA, 64, 0)
          # adjust for unit/el-stride
          srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
          memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
          ... saturation adjustment...
        else:
          # truncate/extend to over-ridden source width.
          memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
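The helpers `get_polymorphed_reg` and `set_polymorphed_reg` are
specified in the main [[sv/svp64]] documentation; purely as a reading
aid for the pseudocode above, a simplified (non-normative) model,
treating the register file as a flat little-endian byte array and
ignoring predication, might look like this:

    # simplified, non-normative model of the polymorphic register accessors
    regfile = bytearray(128 * 8)   # 128 x 64-bit GPRs as a flat byte array

    def get_polymorphed_reg(reg, bitwidth, elidx):
        bytewid = bitwidth // 8
        offs = reg * 8 + elidx * bytewid      # element elidx at this width
        return int.from_bytes(regfile[offs:offs + bytewid], "little")

    def set_polymorphed_reg(reg, bitwidth, elidx, value):
        bytewid = bitwidth // 8
        offs = reg * 8 + elidx * bytewid
        regfile[offs:offs + bytewid] = value.to_bytes(bytewid, "little")

    # e.g. four 16-bit elements written to j=0..3 pack into r3
    for j, v in enumerate([0x1111, 0x2222, 0x3333, 0x4444]):
        set_polymorphed_reg(3, 16, j, v)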

# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs. The usual interest in such re-mapping
is for example in separating out 24-bit RGB channel data into separate
contiguous registers. NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.
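Purely as an illustration of what a suitable 2D remap achieves for the
RGB example (the index arithmetic below stands in for the actual
[[sv/remap]] schedule and is not the specification's): the loads walk
memory sequentially, while the remapped destination element index
gathers all R values, then all G, then all B, into contiguous elements.

    # illustrative only: de-interleaving packed RGB bytes via remapping
    mem = [b for px in [(1, 2, 3), (4, 5, 6), (7, 8, 9)] for b in px]
    VL, channels = 9, 3
    dest = [0] * VL

    for i in range(VL):
        # source walks memory linearly; destination index is transposed
        # so that channel data ends up contiguous (RRR GGG BBB)
        j = (i % channels) * (VL // channels) + (i // channels)
        dest[j] = mem[i]

    assert dest == [1, 4, 7, 2, 5, 8, 3, 6, 9]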

# notes from lxo

This section covers assembly notation for the immediate and indexed
LD/ST. The summary is that, in immediate mode for LD, it is not
obvious that when the destination register is Vectorised (`RT.v`) but
the source `imm(RA)` is scalar, the memory being read is *still a
vector load*, known as "unit or element strides".

This anomaly is made clear with the following notation:

    sv.ld RT.v, imm(RA).v

The following notation, although technically correct due to being
implicitly identical to the above, is prohibited and is a syntax error:

    sv.ld RT.v, imm(RA)

Notes taken from IRC conversation

    <lxo> sv.ld r#.v, ofst(r#).v -> the whole vector is at ofst+r#
    <lxo> sv.ld r#.v, ofst(r#.v) -> r# is a vector of addresses
    <lxo> similarly sv.ldx r#.v, r#, r#.v -> whole vector at r#+r#
    <lxo> whereas sv.ldx r#.v, r#.v, r# -> vector of addresses
    <lxo> point being, you take an operand with the "m" constraint (or other memory-operand constraints), append .v to it and you're done addressing the in-memory vector
    <lxo> as in asm ("sv.ld1 %0.v, %1.v" : "=r"(vec_in_reg) : "m"(vec_in_mem));
    <lxo> (and ld%U1 got mangled into underline; %U expands to x if the address is a sum of registers

permutations of vector selection, to identify above asm-syntax:

    imm(RA)  RT.v  RA.v  nonstrided
       sv.ld r#.v, ofst(r#2.v) -> r#2 is a vector of addresses
       mem@    0+r#2  offs+(r#2+1)  offs+(r#2+2)
       destreg r#     r#+1          r#+2
    imm(RA)  RT.s  RA.v  nonstrided
       sv.ld r#, ofst(r#2.v) -> r#2 is a vector of addresses
       (dest r# is scalar) -> VSELECT mode
    imm(RA)  RT.v  RA.s  fixed stride: unit or element
       sv.ld r#.v, ofst(r#2).v -> whole vector is at ofst+r#2
       mem@r#2  +0   +1    +2
       destreg  r#   r#+1  r#+2
       sv.ld/els r#.v, ofst(r#2).v -> vector at ofst*elidx+r#2
       mem@r#2  +0   ...  +offs  ...  +offs*2
       destreg  r#   r#+1        r#+2
    imm(RA)  RT.s  RA.s  not vectorised
       sv.ld r#, ofst(r#2)

indexed mode:

    RA,RB  RT.v  RA.v  RB.v
       sv.ldx r#.v, r#2, r#3.v -> whole vector at r#2+r#3
    RA,RB  RT.v  RA.s  RB.v
       sv.ldx r#.v, r#2.v, r#3.v -> whole vector at r#2+r#3
    RA,RB  RT.v  RA.v  RB.s
       sv.ldx r#.v, r#2.v, r#3 -> vector of addresses
    RA,RB  RT.v  RA.s  RB.s
       sv.ldx r#.v, r#2, r#3 -> VSPLAT mode
    RA,RB  RT.s  RA.v  RB.v
    RA,RB  RT.s  RA.s  RB.v
    RA,RB  RT.s  RA.v  RB.s
    RA,RB  RT.s  RA.s  RB.s  not vectorised