[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>
Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different access types (sketched in the code
below the list):

* fixed stride (contiguous sequence with no gaps)
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* fail-first on the same (where it makes sense to do so)

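A minimal sketch of how the Effective Addresses differ between the first
three of these modes (names such as `base`, `stride` and `bases` are
illustrative only, not from the specification):

    # illustrative only: EA of element i in each addressing mode
    def ea_fixed_stride(base, elwidth_bytes, i):
        return base + i * elwidth_bytes    # contiguous, no gaps

    def ea_element_stride(base, stride, i):
        return base + i * stride           # regular offset, gaps allowed

    def ea_vector_indexed(bases, offsets, i):
        return bases[i] + offsets[i]       # both base and offset are vectors
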
OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- (RA) + EXTS(D)
    RT <- MEM(EA)

Thus in the first example, the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and the one destination may be marked as
scalar or vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
in the pseudocode below, the immediate can be set to the element width
in order to give unit or element stride. Since there is no way to tell
from the Scalar opcode which stride is intended, the choice is provided
instead by the SV Context.

42
43 # LD not VLD! format - ldop RT, immed(RA)
44 # op_width: lb=1, lh=2, lw=4, ld=8
45 op_load(RT, RA, op_width, immed, svctx, RAupdate):
46  ps = get_pred_val(FALSE, RA); # predication on src
47  pd = get_pred_val(FALSE, RT); # ... AND on dest
48  for (i=0, j=0, u=0; i < VL && j < VL;):
49 # skip nonpredicates elements
50 if (RA.isvec) while (!(ps & 1<<i)) i++;
51 if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
52 if (RT.isvec) while (!(pd & 1<<j)) j++;
53 if svctx.ldstmode == elementstride:
54 # element stride mode
55 srcbase = ireg[RA]
56 offs = i * immed
57 elif svctx.ldstmode == unitstride:
58 # unit stride mode
59 srcbase = ireg[RA]
60 offs = i * op_width
61 elif RA.isvec:
62 # quirky Vector indexed mode but with an immediate
63 srcbase = ireg[RA+i]
64 offs = immed;
65 else
66 # standard scalar mode (but predicated)
67 # no stride multiplier means VSPLAT mode
68 srcbase = ireg[RA]
69 offs = immed
70
71 # compute EA
72 EA = srcbase + offs
73 # update RA?
74 if RAupdate: ireg[RAupdate+u] = EA;
75 # load from memory
76 ireg[RT+j] <= MEM[EA];
77 if (!RT.isvec)
78 break # destination scalar, end now
79 if (RA.isvec) i++;
80 if (RAupdate.isvec) u++;
81 if (RT.isvec) j++;
82
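As a worked example (illustrative, not normative): with `op_width=8`
(`ld`), `immed=16` and `VL=4`, element-stride mode produces EAs
`(RA)+0, (RA)+16, (RA)+32, (RA)+48`; unit-stride mode multiplies by the
operation width instead, giving `(RA)+0, (RA)+8, (RA)+16, (RA)+24`; and
the quirky vector-indexed-with-immediate mode produces
`ireg[RA+0]+16, ireg[RA+1]+16, ireg[RA+2]+16, ireg[RA+3]+16`.
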
Indexed LD is:

    # format: ldop RT, RA, RB
    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
        # skip non-predicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
          break # scalar-scalar
        if (RA.isvec) i++;
        if (RAupdate.isvec) u++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

Note in both cases that [[sv/svp64]] allows RA in "update" mode (`ldux`)
to be effectively a completely different register from RA-as-a-source.
This is because there is room in svp64 to extend RA-as-src as well as
RA-as-dest, both independently as scalar or vector *and* independently
extending their range.

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk. Fail-first on Vector Indexed
allows attackers to probe large numbers of pages from userspace, where
strided fail-first (by creating contiguous sequential LDs) does not.

In addition, even in other modes, a Vector source RA makes no sense for
computing offsets, and reduce mode makes even less sense. Realistically
we need an alternative table meaning for [[sv/svp64]] mode.

TODO

in all cases:

* vector immed(RA) makes no sense
* unit-stride/el-stride selection is needed on immed(RA)

modes for the immed(RA) version:

* saturation
* predicate-result?
* normal
* fail-first
  - vector RA is "banned"

| 0-1 | 2   | 3 4     | description               |
| --- | --- |---------|-------------------------- |
| 00  | str | sz  dz  | normal mode               |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | str RC1 | Rc=0: ffirst z/nonz       |
| 10  | N   | sz str  | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | str RC1 | Rc=0: pred-result z/nonz  |

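A minimal sketch of a decoder for this mode field (assumptions: MSB0 bit
numbering for the 5-bit field, and invented Python names; not from the
specification):

    # illustrative decode of the immed(RA) LD/ST mode field (bits 0-4, MSB0)
    def decode_ldst_mode_imm(mode, Rc):
        mm  = (mode >> 3) & 0b11    # bits 0-1: mode select
        b2  = (mode >> 2) & 1       # bit 2: str / inv / N
        b34 = mode & 0b11           # bits 3-4
        if mm == 0b00:
            return ("normal", {"str": b2, "sz": b34 >> 1, "dz": b34 & 1})
        if mm == 0b01:
            if Rc:
                return ("ffirst", {"inv": b2, "CR_bit": b34})
            return ("ffirst", {"inv": b2, "str": b34 >> 1, "RC1": b34 & 1})
        if mm == 0b10:
            return ("saturate", {"N": b2, "sz": b34 >> 1, "str": b34 & 1})
        if Rc:
            return ("pred_result", {"inv": b2, "CR_bit": b34})
        return ("pred_result", {"inv": b2, "str": b34 >> 1, "RC1": b34 & 1})
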
The `str` bit is only relevant when `RA.isvec` is clear: this indicates
whether stride is unit or element:

    if RA.isvec:
        svctx.ldstmode = indexed
    elif str == 0:
        svctx.ldstmode = unitstride
    else:
        svctx.ldstmode = elementstride

The modes for the RA+RB indexed version are slightly different:

* saturation
* predicate-result
* normal
* fail-first
  - vector RA or RB is "banned"

| 0-1 | 2   | 3 4     | description               |
| --- | --- |---------|-------------------------- |
| 00  | 0   | sz  dz  | normal mode               |
| 00  | rsv | rsvd    | reserved                  |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel       |
| 01  | inv | sz RC1  | Rc=0: ffirst z/nonz       |
| 10  | N   | sz  dz  | sat mode: N=0/1 u/s       |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel  |
| 11  | inv | sz RC1  | Rc=0: pred-result z/nonz  |

| format  | dest | srcs    | behaviour            |
|---------|------|---------|----------------------|
| imm(RA) | RT.v | RA.v    | no stride allowed    |
| imm(RA) | RT.s | RA.v    | no stride allowed    |
| imm(RA) | RT.v | RA.s    | stride-select needed |
| imm(RA) | RT.s | RA.s    | not vectorised       |
| RA,RB   | RT.v | RA/RB.v | ffirst banned        |
| RA,RB   | RT.s | RA/RB.v | ffirst banned        |
| RA,RB   | RT.v | RA/RB.s | vsplat activated     |
| RA,RB   | RT.s | RA/RB.s | not vectorised       |

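A minimal sketch of these combinations as a check function (names and
structure are illustrative assumptions, not from the specification):

    # illustrative lookup of the behaviour table above
    def ldst_combination(is_indexed, RT_isvec, src_isvec):
        if not is_indexed:                  # imm(RA) form
            if src_isvec:
                return "no stride allowed"
            return "stride-select needed" if RT_isvec else "not vectorised"
        # RA,RB indexed form
        if src_isvec:
            return "ffirst banned"
        return "vsplat activated" if RT_isvec else "not vectorised"
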
# LOAD/STORE Elwidths <a name="ldst"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld). Only `extsb` and
others like it provide an explicit operation width. There are therefore
*three* widths involved:

* operation width (lb=8, lh=16, lw=32, ld=64)
* source element width override
* destination element width override

Some care is therefore needed to express and make clear the transformations,
which are expressly in this order (a sketch of the width-adjustment chain
follows the list):

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
  - zero-extension or truncation from operation width to source elwidth
  - zero-extension or truncation from source elwidth to dest elwidth
* Saturated mode:
  - sign-extension or truncation from operation width to source elwidth
  - signed/unsigned saturation down to dest elwidth

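The following is a minimal sketch of that chain (helper names and the
exact masking strategy are illustrative assumptions, not from the
specification); widths are in bits:

    # illustrative only: non-saturated chain for one loaded element
    def ld_elem_nonsat(memread, op_width, src_ew, dst_ew):
        v = memread & ((1 << op_width) - 1)  # value at operation width
        v &= (1 << src_ew) - 1               # zero-extend/truncate to src elwidth
        return v & ((1 << dst_ew) - 1)       # zero-extend/truncate to dst elwidth

    # illustrative only: saturated chain, signed variant
    def ld_elem_sat_signed(memread, op_width, src_ew, dst_ew):
        v = memread & ((1 << src_ew) - 1)    # truncate to src elwidth
        if v & (1 << (src_ew - 1)):          # then sign-extend
            v -= 1 << src_ew
        lo, hi = -(1 << (dst_ew - 1)), (1 << (dst_ew - 1)) - 1
        return max(lo, min(hi, v)) & ((1 << dst_ew) - 1)  # saturate to dst
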
In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation. This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant (`ldbrx`
  rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector
capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;

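For illustration, a minimal sketch of what `get_polymorphed_reg` and
`set_polymorphed_reg` might do (an assumption about behaviour, with the
register file passed explicitly as a flat little-endian byte array of
64 x 8 bytes; not the actual implementation):

    # illustrative: elwidth in bits, idx is the element index; elements
    # overlay the regfile bytes and may spill into successive registers
    def get_polymorphed_reg(regfile, reg, elwidth, idx):
        bytes_per = elwidth // 8
        start = reg * 8 + idx * bytes_per
        return int.from_bytes(regfile[start:start + bytes_per], "little")

    def set_polymorphed_reg(regfile, reg, elwidth, idx, value):
        bytes_per = elwidth // 8
        start = reg * 8 + idx * bytes_per
        regfile[start:start + bytes_per] = value.to_bytes(bytes_per, "little")
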
# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs. The usual interest in such re-mapping
is for example in separating out 24-bit RGB channel data into separate
contiguous registers. NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest
elwidth overrides and saturation may do so with built-in conversion that
would normally require additional width-extension, sign-extension and
min/max Vectorised instructions as post-processing stages.

Thus we do not need to provide specialist LD/ST "Structure Packed" opcodes
because the generic abstracted concept of "Remapping", when applied to
LD/ST, will give that same capability, with far more flexibility.
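
To illustrate (an assumption about how a remap could express this, not
normative): de-interleaving packed RGB data is a 3xN transpose of element
indices, so the remap routes loaded element `i` to register-element
position `(i % 3) * N + (i // 3)`:

    # illustrative: remap indices that de-interleave RGBRGB... into
    # N contiguous R elements, then N G elements, then N B elements
    def rgb_remap(N):
        return [(i % 3) * N + (i // 3) for i in range(3 * N)]

    # e.g. rgb_remap(4) == [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]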