[[!tag standards]]

# SV Load and Store

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=561>
* <https://bugs.libre-soc.org/show_bug.cgi?id=572>
* <https://bugs.libre-soc.org/show_bug.cgi?id=571>
* <https://llvm.org/devmtg/2016-11/Slides/Emerson-ScalableVectorizationinLLVMIR.pdf>
* <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-loads-and-stores>

Vectorisation of Load and Store requires the creation, from scalar
operations, of a number of different access types:

* fixed stride (contiguous sequence with no gaps)
* element strided (sequential but regularly offset, with gaps)
* vector indexed (vector of base addresses and vector of offsets)
* fail-first on the same (where it makes sense to do so)

OpenPOWER Load/Store operations may be seen from [[isa/fixedload]] and
[[isa/fixedstore]] pseudocode to be of the form:

    lbux RT, RA, RB
    EA <- (RA) + (RB)
    RT <- MEM(EA)

and for immediate variants:

    lb RT,D(RA)
    EA <- RA + EXTS(D)
    RT <- MEM(EA)

Thus in the first example the source registers may each be independently
marked as scalar or vector, and likewise the destination; in the second
example only the one source and the one destination may be marked as
scalar or vector.

Thus we can see that Vector Indexed may be covered, and, as demonstrated
with the pseudocode below, the immediate can be set to the element width
in order to give unit or element stride. With there being no way to tell
which from the Scalar opcode alone, the choice is provided instead by
the SV Context.

    # LD not VLD!
    # op_width: lb=1, lh=2, lw=4, ld=8
    op_load(RT, RA, op_width, immed, svctx, RAupdate):
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, u=0; i < VL && j < VL;):
        # skip nonpredicated elements
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        if svctx.ldstmode == elementstride:
          # element stride mode
          srcbase = ireg[RA]
          offs = i * immed
        elif svctx.ldstmode == unitstride:
          # unit stride mode
          srcbase = ireg[RA]
          offs = i * op_width
        elif RA.isvec:
          # quirky Vector indexed mode but with an immediate
          srcbase = ireg[RA+i]
          offs = immed;
        else
          # standard scalar mode (but predicated)
          # no stride multiplier means VSPLAT mode
          srcbase = ireg[RA]
          offs = immed

        # compute EA
        EA = srcbase + offs
        # update RA?
        if RAupdate: ireg[RAupdate+u] = EA;
        # load from memory
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end now
        # advance src element index: in the strided modes the offset
        # depends on i, so i must advance even though RA is scalar there
        if (RA.isvec or svctx.ldstmode in (unitstride, elementstride)) i++;
        if (RT.isvec) j++;

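To make the stride distinction concrete, the following standalone Python
sketch (purely illustrative, not part of the specification) prints the
Effective Addresses generated for VL=4 by the element-strided,
unit-strided and VSPLAT cases above, assuming a base address of 0x1000,
an immediate of 8 and `lh` (op_width=2):

    # illustrative sketch: EAs produced by the immediate LD modes above
    def effective_addrs(base, immed, op_width, mode, VL=4):
        addrs = []
        for i in range(VL):
            if mode == "elementstride":
                addrs.append(base + i * immed)      # regular gaps of immed
            elif mode == "unitstride":
                addrs.append(base + i * op_width)   # contiguous elements
            else:
                addrs.append(base + immed)          # scalar RA: VSPLAT
        return [hex(a) for a in addrs]

    print(effective_addrs(0x1000, 8, 2, "elementstride"))
    # ['0x1000', '0x1008', '0x1010', '0x1018']
    print(effective_addrs(0x1000, 8, 2, "unitstride"))
    # ['0x1000', '0x1002', '0x1004', '0x1006']
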
Indexed LD is:

    function op_ldx(RT, RA, RB, RAupdate=False) # LD not VLD!
      ps = get_pred_val(FALSE, RA); # predication on src
      pd = get_pred_val(FALSE, RT); # ... AND on dest
      for (i=0, j=0, k=0, u=0; i < VL && j < VL && k < VL):
        # skip nonpredicated RA, RB and RT
        if (RA.isvec) while (!(ps & 1<<i)) i++;
        if (RAupdate.isvec) while (!(ps & 1<<u)) u++;
        if (RB.isvec) while (!(ps & 1<<k)) k++;
        if (RT.isvec) while (!(pd & 1<<j)) j++;
        EA = ireg[RA+i] + ireg[RB+k] # indexed address
        if RAupdate: ireg[RAupdate+u] = EA
        ireg[RT+j] <= MEM[EA];
        if (!RT.isvec)
          break # destination scalar, end immediately
        if (!RA.isvec && !RB.isvec)
          break # scalar-scalar
        if (RA.isvec) i++;
        if (RB.isvec) k++;
        if (RT.isvec) j++;

Note in both cases that [[sv/svp64]] allows RA in "update" mode (`ldux`)
to be effectively a completely different register from RA-as-a-source.
This is because there is room in svp64 to extend RA-as-src as well as
RA-as-dest, both independently as scalar or vector *and* independently
extending their range.

# Determining the LD/ST Modes

A minor complication (caused by the retro-fitting of modern Vector
features to a Scalar ISA) is that certain features do not exactly make
sense or are considered a security risk. Fail-first on Vector Indexed
allows attackers to probe large numbers of pages from userspace, where
strided fail-first (by creating contiguous sequential LDs) does not.

In addition, even in other modes, Vector source RA makes no sense for
computing offsets, and reduce mode even less. Realistically we need
an alternative table meaning for [[sv/svp64]] mode.

TODO

in all cases:

 - vector immed(RA) makes no sense
 - unit-stride/el-stride is needed on immed(RA)

modes for the immed(RA) version:

* saturation
* predicate-result?
* normal
* fail-first
  - vector RA is "banned"

| 0-1 | 2   | 3 4     | description                |
| --- | --- | ------- | -------------------------- |
| 00  | str | sz dz   | normal mode                |
| 01  | inv | CR-bit  | Rc=1: ffirst CR sel        |
| 01  | inv | str RC1 | Rc=0: ffirst z/nonz        |
| 10  | N   | sz str  | sat mode: N=0/1 u/s        |
| 11  | inv | CR-bit  | Rc=1: pred-result CR sel   |
| 11  | inv | str RC1 | Rc=0: pred-result z/nonz   |

The `str` bit is only relevant when `RA.isvec` is clear: this indicates
whether stride is unit or element:

    if RA.isvec:
        svctx.ldstmode = indexed
    elif str == 0:
        svctx.ldstmode = unitstride
    else:
        svctx.ldstmode = elementstride

The modes for the RA+RB indexed version are slightly different:

* saturation
* predicate-result
* normal
* fail-first
  - vector RA or RB is "banned"

| 0-1 | 2   | 3 4    | description                |
| --- | --- | ------ | -------------------------- |
| 00  | 0   | sz dz  | normal mode                |
| 00  | rsv | rsvd   | reserved                   |
| 01  | inv | CR-bit | Rc=1: ffirst CR sel        |
| 01  | inv | sz RC1 | Rc=0: ffirst z/nonz        |
| 10  | N   | sz dz  | sat mode: N=0/1 u/s        |
| 11  | inv | CR-bit | Rc=1: pred-result CR sel   |
| 11  | inv | sz RC1 | Rc=0: pred-result z/nonz   |

    imm(RA)  RT.v  RA.v     no stride allowed
    imm(RA)  RT.s  RA.v     no stride allowed
    imm(RA)  RT.v  RA.s     stride-select needed
    imm(RA)  RT.s  RA.s     not vectorised
    RA,RB    RT.v  RA/RB.v  ffirst banned
    RA,RB    RT.s  RA/RB.v  ffirst banned
    RA,RB    RT.v  RA/RB.s  vsplat activated
    RA,RB    RT.s  RA/RB.s  not vectorised

# LOAD/STORE Elwidths <a name="ldst"></a>

Loads and Stores are almost unique in that the OpenPOWER Scalar ISA
provides a width for the operation (lb, lh, lw, ld); only `extsb` and
others like it also provide an explicit operation width. There are
therefore *three* widths involved:

* operation width (lb=8, lh=16, lw=32, ld=64)
* src element width override
* destination element width override

Some care is therefore needed to express and make clear the
transformations, which are expressly in this order:

* Load at the operation width (lb/lh/lw/ld) as usual
* byte-reversal as usual
* Non-saturated mode:
  - zero-extension or truncation from operation width to source elwidth
  - zero-extension or truncation from source elwidth to dest elwidth
* Saturated mode:
  - sign-extension or truncation from operation width to source elwidth
  - signed/unsigned saturation down to dest elwidth

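As a rough illustration of the two paths, the Python sketch below uses
the name `adjust_wid` to match the pseudocode further down; `saturate`
and both function bodies are assumptions for the example, not normative
definitions:

    # minimal sketch of the two width-adjustment paths
    def adjust_wid(value, op_width, elwidth):
        # zero-extension and truncation both reduce to masking
        # when the value is viewed as unsigned
        return value & ((1 << elwidth) - 1)

    def saturate(value, elwidth, signed):
        # clamp the value into the range representable at elwidth
        if signed:
            lo, hi = -(1 << (elwidth - 1)), (1 << (elwidth - 1)) - 1
        else:
            lo, hi = 0, (1 << elwidth) - 1
        return max(lo, min(hi, value))

    # lw (op width 32), src elwidth 16, dest elwidth 8, non-saturated:
    loaded = 0x12345678
    print(hex(adjust_wid(adjust_wid(loaded, 32, 16), 16, 8)))  # 0x78
    # the same 16-bit value saturated (signed) down to 8 bits clamps:
    print(hex(saturate(0x5678, 8, True)))                      # 0x7f
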
In order to respect OpenPOWER v3.0B Scalar behaviour the memory side
is treated effectively as completely separate and distinct from SV
augmentation. This is primarily down to quirks surrounding LE/BE and
byte-reversal in OpenPOWER.

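In the pseudocode below this merging is captured by a single XNOR,
`bytereverse = brev XNOR MSR.LE`, whose four combinations work out as:

    brev  MSR.LE  bytereverse
    0     0       1    # plain ld, BE: byteswap performed
    0     1       0    # plain ld, LE: no byteswap
    1     0       0    # ldbrx, BE: no byteswap
    1     1       1    # ldbrx, LE: byteswap performed
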
Note the following regarding the pseudocode to follow:

* `scalar identity behaviour` SV Context parameter conditions turn this
  into a straight absolute fully-compliant Scalar v3.0B LD operation
* `brev` selects whether the operation is the byte-reversed variant
  (`ldbrx` rather than `ld`)
* `op_width` specifies the operation width (`lb`, `lh`, `lw`, `ld`) as
  a "normal" part of Scalar v3.0B LD
* `imm_offs` specifies the immediate offset `ld r3, imm_offs(r5)`, again
  as a "normal" part of Scalar v3.0B LD
* `svctx` specifies the SV Context and includes VL as well as
  source and destination elwidth overrides.

Below is the pseudocode for Unit-Strided LD (which includes Vector
capability).

Note that twin predication, predication-zeroing, saturation
and other modes have all been removed, for clarity and simplicity:

    # LD not VLD! (ldbrx if brev=True)
    # this covers unit stride mode and a type of vector offset
    function op_ld(RT, RA, brev, op_width, imm_offs, svctx)
      for (int i = 0, int j = 0; i < svctx.VL && j < svctx.VL;):

        if not svctx.unit/el-strided:
            # strange vector mode, compute 64 bit address which is
            # not polymorphic! elwidth hardcoded to 64 here
            srcbase = get_polymorphed_reg(RA, 64, i)
        else:
            # unit / element stride mode, compute 64 bit address
            srcbase = get_polymorphed_reg(RA, 64, 0)
            # adjust for unit/el-stride
            srcbase += ....

        # takes care of (merges) processor LE/BE and ld/ldbrx
        bytereverse = brev XNOR MSR.LE

        # read the underlying memory
        memread <= mem[srcbase + imm_offs];

        # optionally performs byteswap at op width
        if (bytereverse):
            memread = byteswap(memread, op_width)

        # check saturation.
        if svctx.saturation_mode:
            ... saturation adjustment...
        else:
            # truncate/extend to over-ridden source width.
            memread = adjust_wid(memread, op_width, svctx.src_elwidth)

        # takes care of inserting memory-read (now correctly byteswapped)
        # into regfile underlying LE-defined order, into the right place
        # within the NEON-like register, respecting destination element
        # bitwidth, and the element index (j)
        set_polymorphed_reg(RT, svctx.dest_elwidth, j, memread)

        # increments both src and dest element indices (no predication here)
        i++;
        j++;

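The accessors `get_polymorphed_reg` and `set_polymorphed_reg` are left
abstract above. The following is a minimal sketch of one plausible
model, assuming (for illustration only) a flat LE-ordered bytearray
regfile of 128 64-bit entries:

    # sketch: regfile modelled as a flat bytearray in LE byte order
    regfile = bytearray(128 * 8)   # assumed 128 x 64-bit GPRs

    def get_polymorphed_reg(reg, elwidth, elem):
        # read element `elem`, `elwidth` bits wide, starting at `reg`
        offs = reg * 8 + elem * (elwidth // 8)
        return int.from_bytes(regfile[offs:offs + elwidth // 8], "little")

    def set_polymorphed_reg(reg, elwidth, elem, value):
        # write element `elem`, `elwidth` bits wide, starting at `reg`
        offs = reg * 8 + elem * (elwidth // 8)
        regfile[offs:offs + elwidth // 8] = \
            value.to_bytes(elwidth // 8, "little")

    # element 2 of a 16-bit elwidth vector starting at r3 lands in the
    # middle of r3's 64-bit storage:
    set_polymorphed_reg(3, 16, 2, 0xBEEF)
    print(hex(get_polymorphed_reg(3, 64, 0)))   # 0xbeef00000000

Note how the elwidth override simply changes the granularity of indexing
into the same underlying bytes, which is what allows LD to deposit
elements at a different width from the operation width.
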
# Remapped LD/ST

In the [[sv/propagation]] page the concept of "Remapping" is described.
Whilst it is expensive to set up (2 64-bit opcodes minimum) it provides
a way to arbitrarily perform 1D, 2D and 3D "remapping" of up to 64
elements worth of LDs or STs. The usual interest in such re-mapping
is for example in separating out 24-bit RGB channel data into separate
contiguous registers. NEON covers this as shown in the diagram below:

<img src="https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/Loading-RGB-data-with-structured-load.png" >

Remap easily covers this capability, and with dest elwidth overrides
and saturation may do so with built-in conversion that would normally
require additional width-extension, sign-extension and min/max
Vectorised instructions as post-processing stages.

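To illustrate the principle in isolation (the index formula below is a
plain 2D transpose standing in for the actual svp64 REMAP encoding),
de-interleaving packed RGB amounts to nothing more than remapping
element indices:

    # sketch: de-interleave 8 packed RGB pixels via a 2D index transpose
    pixels = bytes(range(24))    # stand-in data: R0 G0 B0 R1 G1 B1 ...

    npix, nchan = 8, 3
    # element i of the de-interleaved view reads source element
    # (i % npix) * nchan + (i // npix): a transpose of the 8x3 layout
    deinterleaved = bytes(pixels[(i % npix) * nchan + (i // npix)]
                          for i in range(npix * nchan))

    print(list(deinterleaved[0:npix]))
    # [0, 3, 6, 9, 12, 15, 18, 21]: all the R bytes, now contiguous
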
Thus we do not need to provide specialist LD/ST "Structure Packed"
opcodes because the generic abstracted concept of "Remapping", when
applied to LD/ST, will give that same capability, with far more
flexibility.