[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>

Vectorisation of Load and Store requires the creation, from the scalar
operations, of a number of different access types:

* fixed stride (contiguous sequence with no gaps)
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* fail-first on the same (where it makes sense to do so)

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and one dest may be marked as scalar or
vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be used to give unit stride
or element stride. There is no way to tell which from the OpenPOWER
v3.0B Scalar opcode alone, so the choice is provided instead by the
SV Context.

    # LD not VLD! format - ldop RT, immed(RA)
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip nonpredicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = i * op_width
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end now
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RT.isvec) j++;
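
To make the four address-generation cases concrete, here is a minimal
executable Python sketch of just the EA calculation from the loop above.
The function name `ea_immed` and the dict-based integer regfile are
illustrative only, not part of the spec:

    # Minimal sketch (illustrative, not the reference model) of the EA
    # calculation in op_load above.
    def ea_immed(ireg, RA, i, immed, op_width, mode, RA_isvec):
        if mode == "elementstride":
            return ireg[RA] + i * immed     # regular gaps of `immed`
        if mode == "unitstride":
            return ireg[RA] + i * op_width  # contiguous elements
        if RA_isvec:
            return ireg[RA + i] + immed     # vector of bases, fixed offset
        return ireg[RA] + immed             # scalar base: VSPLAT source

    # base address 0x1000 in r5, immed=8, lw (op_width=4)
    ireg = {5: 0x1000, 6: 0x2000, 7: 0x3000}
    print([hex(ea_immed(ireg, 5, i, 8, 4, "elementstride", False))
           for i in range(3)])  # ['0x1000', '0x1008', '0x1010']
    print([hex(ea_immed(ireg, 5, i, 8, 4, "unitstride", False))
           for i in range(3)])  # ['0x1000', '0x1004', '0x1008']
    print([hex(ea_immed(ireg, 5, i, 8, 4, "indexed", True))
           for i in range(3)])  # ['0x1008', '0x2008', '0x3008']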

Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
        # skip nonpredicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
          break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;
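
The "skip nonpredicated elements" idiom used in both loops can be
expressed as a small standalone helper. This sketch (the name
`skip_masked` is hypothetical) shows how an element index is advanced
past clear bits in the predicate mask:

    # Illustrative helper mirroring `while (!(ps & 1<<i)) i++;` above.
    def skip_masked(mask, idx):
        while not (mask >> idx) & 1:
            idx += 1
        return idx

    assert skip_masked(0b10110, 0) == 1  # bit 0 clear, bit 1 set
    assert skip_masked(0b10110, 3) == 4  # bit 3 clear, bit 4 set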

Note in both cases that [[sv/svp64]] allows RA in "update" mode (`ldux`)
to be effectively a completely different register from RA-as-a-source.
This is because there is room in svp64 to extend RA-as-src as well as
RA-as-dest, both independently as scalar or vector *and* independently
extending their range.

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk. Fail-first on Vector Indexed
would allow attackers to probe large numbers of pages from userspace,
where strided fail-first (which creates contiguous sequential LDs)
does not.

In addition, reduce mode makes no sense, and for LD/ST with immediates
a Vector source RA makes no sense either. Realistically we therefore
need an alternative table of meanings for [[sv/svp64]] mode:

* saturation
* predicate-result
* normal
* fail-first, where vector source on RA or RB is banned

The table for [[sv/svp64]] for immed(RA) is:

| 0-1 | 2   | 3 4     | description                |
| --- | --- |---------|--------------------------- |
| 00  | str | sz dz   | normal mode                |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel        |
| 01  | inv | str RC1 | Rc=0: ffirst z/nonz        |
| 10  | N   | sz str  | sat mode: N=0/1 u/s        |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel   |
| 11  | inv | str RC1 | Rc=0: pred-result z/nonz   |

The `str` bit is only relevant when `RA.isvec` is clear: this indicates
whether stride is unit or element:

    if RA.isvec:
        svctx.ldstmode = indexed
    elif str == 0:
        svctx.ldstmode = unitstride
    else:
        svctx.ldstmode = elementstride
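
As a runnable illustration of the rule above (with `str` renamed to
`str_bit` because `str` is a Python builtin), note that the `str` bit
is ignored entirely once `RA.isvec` is set:

    # Sketch of the ldstmode selection above; names are illustrative.
    def select_ldstmode(RA_isvec, str_bit):
        if RA_isvec:
            return "indexed"  # str bit ignored
        return "unitstride" if str_bit == 0 else "elementstride"

    assert select_ldstmode(True, 0) == select_ldstmode(True, 1) == "indexed"
    assert select_ldstmode(False, 0) == "unitstride"
    assert select_ldstmode(False, 1) == "elementstride"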

The modes for the RA+RB indexed version are slightly different:

| 0-1 | 2   | 3 4     | description                |
| --- | --- |---------|--------------------------- |
| 00  | 0   | sz dz   | normal mode                |
| 00  | rsv | rsvd    | reserved                   |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel        |
| 01  | inv | sz RC1  | Rc=0: ffirst z/nonz        |
| 10  | N   | sz dz   | sat mode: N=0/1 u/s        |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel   |
| 11  | inv | sz RC1  | Rc=0: pred-result z/nonz   |

A summary of the effect of Vectorisation of src or dest:

    imm(RA)  RT.v  RA.v     no stride allowed
    imm(RA)  RT.s  RA.v     no stride allowed
    imm(RA)  RT.v  RA.s     stride-select needed
    imm(RA)  RT.s  RA.s     not vectorised
    RA,RB    RT.v  RA/RB.v  ffirst banned
    RA,RB    RT.s  RA/RB.v  ffirst banned
    RA,RB    RT.v  RA/RB.s  VSPLAT possible
    RA,RB    RT.s  RA/RB.s  not vectorised

Note that cache-inhibited LD/ST (`ldcix`) when VSPLAT is activated will
perform **multiple** LD/ST operations, sequentially. `ldcix` even with
a scalar src will read the same memory location *multiple times*,
storing the result in successive Vector destination registers. This is
because the cache-inhibit instructions are used to read and write
memory-mapped peripherals.
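
A sketch of that behaviour, with a hypothetical `mmio_read` callback
standing in for the cache-inhibited memory access:

    # Illustrative only: VL cache-inhibited reads of the *same* EA, each
    # result going to the next destination element (e.g. draining the
    # FIFO register of a memory-mapped peripheral).
    def ldcix_vsplat(mmio_read, EA, VL):
        return [mmio_read(EA) for _ in range(VL)]

    fifo = iter([0x10, 0x20, 0x30, 0x40])
    print(ldcix_vsplat(lambda ea: next(fifo), 0xC000_0000, 4))
    # [16, 32, 48, 64] - four distinct reads despite a single address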

# LOAD/STORE Elwidths <a name="ldst"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
others like it provide an explicit operation width. There are therefore
*three* widths involved:

* operation width (lb=8, lh=16, lw=32, ld=64)
* source element width override
* destination element width override

Some care is therefore needed to express and make clear the
transformations, which are expressly in this order (a worked sketch
follows the list):

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
  - zero-extension or truncation from operation width to source elwidth
  - zero-extension or truncation from source elwidth to dest elwidth
* Saturated mode:
  - sign-extension or truncation from operation width to source elwidth
  - signed/unsigned saturation down to dest elwidth
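
A minimal sketch of the non-saturated ordering, assuming unsigned values
and a simple truncate-or-keep `adjust_wid` (names are illustrative):

    # Illustrative sketch of the non-saturated chain: operation width ->
    # source elwidth -> dest elwidth, all by zero-extend or truncate.
    def adjust_wid(val, from_bits, to_bits):
        if to_bits < from_bits:
            return val & ((1 << to_bits) - 1)  # truncate
        return val  # zero-extension: high bits are already zero

    # lw (32-bit load) with src elwidth 16 and dest elwidth 8:
    v = adjust_wid(0x12345678, 32, 16)  # 0x5678
    v = adjust_wid(v, 16, 8)            # 0x78
    assert v == 0x78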

In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation. This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector
capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_bitwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;
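
`get_polymorphed_reg` and `set_polymorphed_reg` are not defined here.
The following is a hedged model of them only: a flat little-endian
bytearray regfile with 8 bytes per underlying 64-bit register,
illustrating the element packing, not the actual SV register file
implementation:

    # Illustrative model only: elwidth-sized elements packed from the
    # start of register `reg`, in the regfile's underlying LE order.
    def set_polymorphed_reg(regfile, reg, elwidth, j, value):
        nbytes = elwidth // 8
        offs = reg * 8 + j * nbytes
        regfile[offs:offs + nbytes] = value.to_bytes(nbytes, "little")

    def get_polymorphed_reg(regfile, reg, elwidth, i):
        nbytes = elwidth // 8
        offs = reg * 8 + i * nbytes
        return int.from_bytes(regfile[offs:offs + nbytes], "little")

    regfile = bytearray(128 * 8)  # 128 x 64-bit registers
    for j, v in enumerate([0x11, 0x22, 0x33, 0x44]):
        set_polymorphed_reg(regfile, 3, 16, j, v)  # 16-bit elements
    assert get_polymorphed_reg(regfile, 3, 64, 0) == 0x0044_0033_0022_0011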

# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs. The usual interest in such re-mapping
is for example in separating out 24-bit RGB channel data into separate
contiguous registers. NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.
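
As a concrete illustration, here is a sketch of an index schedule that
de-interleaves packed RGB bytes into three contiguous runs, equivalent
in effect to the structured load in the diagram. The remap formula is
illustrative only, not the definitive SV remap encoding:

    # Illustrative remap schedule: destination element i pulls from
    # source element (pixel * nchannels + channel), turning RGBRGB...
    # in memory into RRR...GGG...BBB... in the registers.
    def rgb_remap(i, npixels=4, nchannels=3):
        channel, pixel = divmod(i, npixels)
        return pixel * nchannels + channel

    packed = b"RGB" * 4                      # memory: R G B R G B ...
    regs = bytes(packed[rgb_remap(i)] for i in range(12))
    assert regs == b"RRRRGGGGBBBB"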