# Appendix

* <https://bugs.libre-soc.org/show_bug.cgi?id=574>
* <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>

Table of contents:

[[!toc]]


# Rounding, clamp and saturate

See [[av_opcodes]].

To help ensure that audio quality is not compromised by overflow,
"saturation" is provided, as well as a way to detect when saturation
occurred if desired (Rc=1). When Rc=1 there will be a *vector* of CRs,
one CR per element in the result (Note: this is different from VSX which
has a single CR per block).

When N=0 the result is saturated to within the maximum range of an
unsigned value. For integer ops this will be 0 to 2^elwidth-1. Similar
logic applies to FP operations, with the result being saturated to
maximum rather than returning INF, and the minimum to +0.0.

When N=1 the same occurs except that the result is saturated to the min
or max of a signed result, and for FP to the min and max value rather
than returning +/- INF.

When Rc=1, the CR "overflow" bit is set on the CR associated with the
element, to indicate whether saturation occurred. Note that due to
the hugely detrimental effect it has on parallel processing, XER.SO is
**ignored** completely and is **not** brought into play here. The CR
overflow bit is therefore simply set to zero if saturation did not occur,
and to one if it did.

Note also that saturation on operations that produce a carry output is
prohibited due to the conflicting use of the CR.so bit for storing whether
saturation occurred.

Post-analysis of the Vector of CRs to find out if any given element hit
saturation may be done using a mapreduced CR op (cror), or by using the
new crweird instruction, transferring the relevant CR bits to a scalar
integer and testing it for nonzero. See [[sv/cr_int_predication]].

Note that the operation takes place at the maximum bitwidth (max of
src and dest elwidth) and that truncation occurs to the range of the
dest elwidth.
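The integer case of the saturation rules can be sketched in Python. This is a minimal illustration under stated assumptions, not the normative definition: the names `saturate` and `sat_add_vector` are hypothetical, the `signed` flag stands in for N=1, and the returned list of 0/1 flags stands in for the per-element CR overflow bit written when Rc=1.

```python
def saturate(value, elwidth, signed):
    """Clamp an exact integer result to elwidth bits.

    Returns (clamped_value, overflow_flag); the flag is 1 only if clamping
    changed the value, mirroring the per-element CR overflow bit.
    """
    if signed:   # N=1: signed range
        lo, hi = -(1 << (elwidth - 1)), (1 << (elwidth - 1)) - 1
    else:        # N=0: unsigned range, 0 to 2^elwidth-1
        lo, hi = 0, (1 << elwidth) - 1
    clamped = max(lo, min(hi, value))
    return clamped, int(clamped != value)


def sat_add_vector(a, b, elwidth, signed):
    """Element-wise saturating add: one result and one flag per element."""
    results, cr_ov = [], []
    for x, y in zip(a, b):
        r, ov = saturate(x + y, elwidth, signed)
        results.append(r)
        cr_ov.append(ov)
    return results, cr_ov


results, cr_ov = sat_add_vector([250, 1], [10, 2], elwidth=8, signed=False)
any_saturated = any(cr_ov)  # plays the role of a mapreduced cror
```

Here the exact sum is computed first and only then clamped, which corresponds to the operation taking place at full width before being narrowed to the destination range.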

# Reduce mode

Reduction in SVP64 is deterministic and somewhat of a misnomer. A normal
Vector ISA would have explicit Reduce opcodes with defined characteristics
per operation: in SX Aurora there is even an additional scalar argument
containing the initial reduction value. SVP64 fundamentally has to
utilise *existing* Scalar Power ISA v3.0B operations, which presents some
unique challenges.

The solution turns out to be to simply define reduction as permitting
deterministic element-based schedules to be issued using the base Scalar
operations, and to rely on the underlying microarchitecture to resolve
Register Hazards at the element level. This goes back to
the fundamental principle that SV is nothing more than a Sub-Program-Counter
sitting between Decode and Issue phases.

Microarchitectures *may* take opportunities to parallelise the reduction
but only if in doing so they preserve Program Order at the Element Level.
Opportunities where this is possible include an `OR` operation
or a MIN/MAX operation. For Floating Point, however, parallelising the
reduction is not permitted, because different results would be obtained
if the reduction were not executed in strict sequential order.
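The reason Floating Point is singled out can be shown with a short Python sketch. It compares a strictly sequential reduction with a pairwise (binary tree) one; the helper names and the sample values are illustrative assumptions, and the tree ordering is only one example of what a parallel implementation might do.

```python
from functools import reduce
import operator


def sequential_reduce(op, elements):
    """Strict element order: ((e0 op e1) op e2) op e3 ..."""
    return reduce(op, elements)


def tree_reduce(op, elements):
    """Pairwise (binary tree) order, as a parallel implementation might use."""
    items = list(elements)
    while len(items) > 1:
        paired = [op(items[i], items[i + 1])
                  for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:
            paired.append(items[-1])  # odd element carried to the next level
        items = paired
    return items[0]


values = [1e16, 1.0, -1e16, 1.0]
fp_sequential = sequential_reduce(operator.add, values)
fp_tree = tree_reduce(operator.add, values)
max_sequential = sequential_reduce(max, values)
max_tree = tree_reduce(max, values)
```

With IEEE 754 double precision the two orderings of the addition disagree (`fp_sequential` and `fp_tree` differ), whereas the MAX reduction gives the same answer either way, which is why only the latter kind of operation is a candidate for parallelisation.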

## Scalar result reduce mode

In this mode, which is suited to operations involving carry or overflow,
one register must be identified by the programmer as being the "accumulator".
Scalar reduction is thus categorised by:

* One of the sources is a Vector
* The destination is a scalar
* Optionally, but most usefully, one source register is also the destination
* That the source register type is the same as the destination register
  type identified as the "accumulator". Scalar reduction on `cmp`,
  `setb` or `isel` makes no sense for example because of the mixture
  between CRs and GPRs.

Typical applications include simple operations such as `ADD r3, r10.v,
r3` where, clearly, r3 is being used to accumulate the addition of all
elements in the vector starting at r10.

    # add RT, RA,RB but when RT==RA
    for i in range(VL):
        iregs[RA] += iregs[RB+i] # RT==RA

However, *unless* the operation is marked as "mapreduce", SV ordinarily
**terminates** at the first scalar operation. Only by marking the
operation as "mapreduce" will it continue to issue multiple sub-looped
(element) instructions in `Program Order`.

To perform the loop in reverse order, the ```RG``` (reverse gear) bit
must be set. This is useful for leaving a cumulative suffix sum in
reverse order:

    for i in reversed(range(VL)):
        # RT-1 = RA gives a suffix sum
        iregs[RT+i] = iregs[RA+i] - iregs[RB+i]

Other examples include shift-mask operations where a Vector of inserts
into a single destination register is required, as a way to construct
a value quickly from multiple arbitrary bit-ranges and bit-offsets.
Using the same register as both the source and destination, with Vectors
of different offsets, masks and values to be inserted, has multiple
applications including Video, cryptography and JIT compilation.

Subtract and Divide are still permitted to be executed in this mode,
although from an algorithmic perspective it is strongly discouraged.
It would be better to use addition followed by one final subtract,
or in the case of divide, to get better accuracy, to perform a multiply
cascade followed by a final divide.

Note that single-operand or three-operand scalar-dest reduce is perfectly
well permitted: both still meet the qualifying characteristics that one
source operand can also be the destination, which allows the "accumulator"
to be identified.

If the "accumulator" cannot be identified (that is, none of the sources
is also a destination) the results are **UNDEFINED**. This spares
implementations from needing complex decoding analysis of register fields: it
is thus up to the programmer to ensure that one of the source registers
is also a destination register in order to take advantage of Scalar
Reduce Mode.

If an interrupt or exception occurs in the middle of the scalar mapreduce,
the scalar destination register **MUST** be updated with the current
(intermediate) result, because this is how ```Program Order``` is
preserved (Vector Loops are to be considered to be just another way of
issuing instructions in Program Order). In this way, after return from
interrupt, the scalar mapreduce may continue where it left off. This provides
"precise" exception behaviour.

Note that hardware is perfectly permitted to perform multi-issue
parallel optimisation of the scalar reduce operation: it's just that
as far as the user is concerned, all exceptions and interrupts **MUST**
be precise.
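The resumability requirement can be illustrated with a small Python model of an accumulating add. This is a sketch only: the function name, the `start` and `interrupt_at` parameters and the flat `iregs` list are assumptions made for illustration and do not correspond to actual SPR fields.

```python
def scalar_mapreduce_add(iregs, RT, RB, VL, start=0, interrupt_at=None):
    """Model an accumulating add (RT is both a source and the destination)
    over the vector starting at RB, in mapreduce mode.

    Returns the element index at which execution stopped (VL if complete).
    """
    for i in range(start, VL):
        if interrupt_at is not None and i == interrupt_at:
            # iregs[RT] already holds the intermediate result for all
            # elements before i, so only the index needs to be remembered.
            return i
        iregs[RT] += iregs[RB + i]
    return VL


# Uninterrupted run: accumulator r3 starts at 100, vector at r10 is 1,2,3,4.
regs = [0] * 32
regs[3] = 100
regs[10:14] = [1, 2, 3, 4]
scalar_mapreduce_add(regs, RT=3, RB=10, VL=4)

# Interrupted at element 2, then resumed from the saved index.
regs_irq = [0] * 32
regs_irq[3] = 100
regs_irq[10:14] = [1, 2, 3, 4]
stopped = scalar_mapreduce_add(regs_irq, RT=3, RB=10, VL=4, interrupt_at=2)
intermediate = regs_irq[3]
scalar_mapreduce_add(regs_irq, RT=3, RB=10, VL=4, start=stopped)
```

Because the accumulator is updated element by element, the interrupted and uninterrupted runs end with the same value in the scalar destination.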

## Vector result reduce mode

Vector result reduce mode may utilise the destination vector for
the purposes of storing intermediary results. Interrupts and exceptions
can therefore also be precise. The result will be in the first
non-predicate-masked-out destination element. Note that unlike
Scalar reduce mode, Vector reduce
mode is *not* suited to operations which involve carry or overflow.

Programs **MUST NOT** rely on the contents of the intermediate results:
they may change from hardware implementation to hardware implementation.
Some implementations may perform an incremental update, whilst others
may choose to use the available Vector space for a binary tree reduction.
If an incremental Vector is required (```x[i] = x[i-1] + y[i]```) then
a *straight* SVP64 Vector instruction can be issued, where the source and
destination registers overlap: ```sv.add 1.v, 9.v, 2.v```. Because
respecting ```Program Order``` is mandatory in SVP64, hardware
must detect this case and issue an incremental sequence of scalar
element instructions.

1. limited to single predicated dual src operations (add RT, RA, RB).
   triple source operations are prohibited (such as fma).
2. limited to operations that make sense. divide is excluded, as is
   subtract (X - Y - Z produces different answers depending on the order)
   and asymmetric CRops (crandc, crorc). sane operations:
   multiply, min/max, add, logical bitwise OR, most other CR ops.
   operations that do not have the same source and dest register type are
   also excluded (isel, cmp). operations involving carry or overflow
   (XER.CA / OV) are also prohibited.
3. the destination is a vector but the result is stored, ultimately,
   in the first nonzero predicated element. all other nonzero predicated
   elements are undefined. *this includes the CR vector* when Rc=1.
4. implementations may use any ordering and any algorithm to reduce
   down to a single result. However it must be equivalent to a straight
   application of mapreduce. The destination vector (except masked out
   elements) may be used for storing any intermediate results. these may
   be left in the vector (undefined).
5. CRM applies when Rc=1. When CRM is zero, the CR associated with
   the result is regarded as a "some results met standard CR result
   criteria". When CRM is one, this changes to "all results met standard
   CR criteria".
6. implementations MAY use destoffs as well as srcoffs (see [[sv/sprs]])
   in order to store sufficient state to resume operation should an
   interrupt occur. this is also why implementations are permitted to use
   the destination vector to store intermediary computations.
7. *Predication may be applied*. zeroing mode is not an option. masked-out
   inputs are ignored; masked-out elements in the destination vector are
   unaltered (not used for the purposes of intermediary storage); the
   scalar result is placed in the first available unmasked element.

Pseudocode for the case where RA==RB:

    result = op(iregs[RA], iregs[RA+1])
    CR = analyse(result)
    for i in range(2, VL):
        result = op(result, iregs[RA+i])
        CRnew = analyse(result)
        if Rc == 1:
            if CRM == 0:
                CR = CR | CRnew  # "some results met" the CR criteria
            else:
                CR = CR & CRnew  # "all results met" the CR criteria

TODO: case where RA!=RB which involves first a vector of 2-operand
results followed by a mapreduce on the intermediates.

Note that when SVM is clear and SUBVL!=1 the sub-elements are
*independent*, i.e. they are mapreduced per *sub-element* as a result.
Illustration with a vec2:

    result.x = op(iregs[RA].x, iregs[RA+1].x)
    result.y = op(iregs[RA].y, iregs[RA+1].y)
    for i in range(2, VL):
        result.x = op(result.x, iregs[RA+i].x)
        result.y = op(result.y, iregs[RA+i].y)

Note here that Rc=1 does not make sense when SVM is clear and SUBVL!=1.

When SVM is set and SUBVL!=1, another variant is enabled: horizontal
subvector mode. Example for a vec3:

    for i in range(VL):
        result = op(iregs[RA+i].x, iregs[RA+i].y)
        result = op(result, iregs[RA+i].z)
        iregs[RT+i] = result

In this mode, when Rc=1 the Vector of CRs is as normal: each result
element creates a corresponding CR element.
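The predication rules for Vector result reduce mode (masked-out inputs ignored, masked-out destination elements unaltered, result placed in the first unmasked element) can be sketched as follows. This is an assumption-laden model rather than the specification: the function name is hypothetical, bit `i` of the integer `predicate` is assumed to control element `i`, and a simple sequential order is used even though implementations are free to choose another equivalent ordering.

```python
import operator


def vector_result_reduce(op, iregs, RT, RA, VL, predicate):
    """Reduce the unmasked elements of the vector at RA with `op`.

    The result is written to the first unmasked destination element.
    Returns that element's index, or None if every element is masked out.
    """
    active = [i for i in range(VL) if (predicate >> i) & 1]
    if not active:
        return None
    result = iregs[RA + active[0]]
    for i in active[1:]:
        result = op(result, iregs[RA + i])  # masked-out inputs never read
    iregs[RT + active[0]] = result          # masked-out dest left unaltered
    return active[0]


regs = list(range(32))  # register n initially holds the value n
first = vector_result_reduce(operator.add, regs, RT=16, RA=8, VL=4,
                             predicate=0b1010)
```

In this model the other unmasked destination elements are simply left alone; the text above says their contents are undefined, so programs should not depend on that.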

# Fail-on-first

Data-dependent fail-on-first has two distinct variants: one for LD/ST,
the other for arithmetic operations (actually, CR-driven). Note in each
case the assumption is that vector elements are required to appear to be
executed in sequential Program Order, element 0 being the first.

* LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
  ordinary one. Exceptions occur "as normal". However for elements 1
  and above, if an exception would occur, then VL is **truncated** to the
  previous element.
* Data-driven (CR-driven) fail-on-first activates when Rc=1 or when another
  CR-creating operation (including cmp) produces a result. Similar to
  branch, an analysis of the CR is performed and if the test fails, the
  vector operation terminates and discards all element operations at and
  above the current one, and VL is truncated to either
  the *previous* element or the current one, depending on whether
  VLi (VL "inclusive") is set.

Thus the new VL comprises a contiguous vector of results,
all of which pass the testing criteria (equal to zero, less than zero).

The CR-based data-driven fail-on-first is new and not found in ARM
SVE or RVV. It is extremely useful for reducing instruction count,
but high-performance implementations require speculative execution
involving modifications of VL. An additional mode (RC1=1)
effectively turns what would otherwise be an arithmetic operation
into a type of `cmp`. The CR is stored (and the CR.eq bit tested
against the `inv` field).
If the CR.eq bit is equal to `inv` then the Vector is truncated and
the loop ends.
Note that when RC1=1 the result elements are never stored, only the CRs.

VLi is only available as an option when `Rc=0` (or for instructions
which do not have Rc). When set, the current element is always
also included in the count (the new length that VL will be set to).
This may be useful in combination with `inv` to truncate the Vector
to *exclude* elements that fail a test, or, in the case of implementations
of strncpy, to include the terminating zero.

In CR-based data-driven fail-on-first there is only the option to select
and test one bit of each CR (just as with branch BO). For more complex
tests this may be insufficient. If that is the case, a vectorised crops
(crand, cror) may be used, and ffirst applied to the crop instead of to
the arithmetic vector.

One extremely important aspect of ffirst is:

* LDST ffirst may never set VL equal to zero. This is because on the first
  element an exception must be raised "as normal".
* CR-based data-dependent ffirst on the other hand **can** set VL equal
  to zero. This is the only means in the entirety of SV that VL may be set
  to zero (with the exception of via the SV.STATE SPR). When VL is set to
  zero due to the first element failing the CR bit-test, all subsequent
  vectorised operations are effectively `nops` which is
  *precisely the desired and intended behaviour*.

Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
to a nonzero value for any implementation-specific reason. For example:
it is perfectly reasonable for implementations to alter VL when ffirst
LD or ST operations are initiated on a nonaligned boundary, such that
within a loop the subsequent iteration of that loop begins subsequent
ffirst LD/ST operations on an aligned boundary. VL may likewise be
truncated to reduce workloads or balance resources.

CR-based data-dependent ffirst on the other hand MUST NOT truncate VL
arbitrarily to a length decided by the hardware: VL MUST only be
truncated based explicitly on whether a test fails.
This is because it is a precise test on which algorithms
will rely.
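The effect of CR-driven fail-first on VL, including the VLi option, can be sketched in a few lines of Python. The function name and the `test_passes` callback are hypothetical stand-ins for the single CR bit test; the model only computes the new VL and does not model speculation or the storing of results.

```python
def ffirst_new_vl(elements, test_passes, VLi=False):
    """Return the VL that CR-driven fail-first would leave behind.

    Elements are examined in Program Order. At the first element whose
    test fails, VL is truncated to exclude it, or to include it if VLi.
    """
    for i, value in enumerate(elements):
        if not test_passes(value):
            return i + 1 if VLi else i
    return len(elements)  # no failure: VL is unchanged


def nonzero(byte):
    return byte != 0


# Bytes of a zero-terminated string followed by unrelated data.
data = [ord("h"), ord("i"), 0, ord("x"), ord("x")]
vl_exclusive = ffirst_new_vl(data, nonzero)            # stops before the zero
vl_inclusive = ffirst_new_vl(data, nonzero, VLi=True)  # includes the zero
vl_empty = ffirst_new_vl([0, 1, 2], nonzero)           # first element fails
```

The inclusive case corresponds to the strncpy usage described above, and the last case shows how VL can legitimately become zero.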

## Data-dependent fail-first on CR operations (crand etc)

Operations that actually produce or alter a CR Field as a result
do not also in turn have an Rc=1 mode. However it makes no
sense to try to test the 4 bits of a CR Field for being equal
or not equal to zero. Moreover, the result is already in the
form that is desired: it is a CR field. Therefore,
CR-based operations have their own SVP64 Mode, described
in [[sv/cr_ops]].

There are two primary types of CR operations:

* Those which have a 3-bit operand field (referring to a CR Field)
* Those which have a 5-bit operand (referring to a bit within the
  whole 32-bit CR)

More details can be found in [[sv/cr_ops]].
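The two operand types are related by the scalar Power ISA layout of the CR: eight 4-bit fields, each holding LT, GT, EQ and SO in that order. A minimal Python sketch of that scalar relationship follows; the helper name is hypothetical, and how SVP64 extends these operands for Vectorisation is left to [[sv/cr_ops]].

```python
CR_BIT_NAMES = ("lt", "gt", "eq", "so")


def split_cr_bit_operand(bit_index):
    """Split a 5-bit CR bit operand into (3-bit field number, bit name)."""
    if not 0 <= bit_index < 32:
        raise ValueError("CR bit operand must be in the range 0..31")
    field = bit_index >> 2        # upper 3 bits select the CR Field
    within = bit_index & 0b11     # lower 2 bits select the bit in the field
    return field, CR_BIT_NAMES[within]


field, name = split_cr_bit_operand(4 * 1 + 2)  # written 4*cr1+eq in assembler
```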

# pred-result mode

This mode merges common CR testing with predication, saving on instruction
count. Below is the pseudocode excluding predicate zeroing and elwidth
overrides. Note that the pseudocode for [[sv/cr_ops]] is slightly different.

    for i in range(VL):
        # predication test, skip all masked out elements.
        if predicate_masked_out(i):
            continue
        result = op(iregs[RA+i], iregs[RB+i])
        CRnew = analyse(result) # calculates eq/lt/gt
        # Rc=1 always stores the CR
        if Rc == 1 or RC1:
            crregs[offs+i] = CRnew
        # now test CR, similar to branch
        if RC1 or CRnew[BO[0:1]] != BO[2]:
            continue # test failed: cancel store
        # result optionally stored but CR always is
        iregs[RT+i] = result

The reason for allowing the CR element to be stored is so that
post-analysis of the CR Vector may be carried out. For example:
Saturation may have occurred (and been prevented from updating, by the
test) but it is desirable to know *which* elements fail saturation.

Note that RC1 Mode basically turns all operations into `cmp`. The
calculation is performed but it is only the CR that is written. The
element result is *always* discarded, never written (just like `cmp`).

Note that predication is still respected. Predicate zeroing is slightly
different: elements that fail the CR test *or* are masked out are zero'd.
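A runnable Python rendering of the pred-result behaviour (without predicate zeroing or elwidth overrides) may help make the store-cancelling rule concrete. It is a sketch under assumptions: the function names are hypothetical, Rc=1 is assumed so the CR element is always recorded, the CR Field is modelled as a dictionary of flags, and `cr_bit`/`want` stand in for the BO-selected bit and its required value.

```python
import operator


def analyse(result):
    """Produce the lt/gt/eq flags of a CR Field for an integer result."""
    return {"lt": result < 0, "gt": result > 0, "eq": result == 0}


def pred_result(op, src_a, src_b, dest, predicate, cr_bit, want, rc1=False):
    """Apply `op` element-wise, storing a result only where the CR test passes.

    Returns the list of CR elements (None for masked-out elements).
    """
    crs = [None] * len(src_a)
    for i in range(len(src_a)):
        if not (predicate >> i) & 1:
            continue                  # masked out: no CR, no result
        result = op(src_a[i], src_b[i])
        crs[i] = analyse(result)      # CR element is always stored
        if rc1 or crs[i][cr_bit] != want:
            continue                  # RC1 set, or test failed: cancel store
        dest[i] = result
    return crs


dest = [9, 9, 9, 9]
crs = pred_result(operator.sub, [5, 1, 7, 3], [5, 4, 2, 3], dest,
                  predicate=0b1011, cr_bit="eq", want=True)

dest_rc1 = [9, 9, 9, 9]
crs_rc1 = pred_result(operator.sub, [5, 1, 7, 3], [5, 4, 2, 3], dest_rc1,
                      predicate=0b1011, cr_bit="eq", want=True, rc1=True)
```

Only the elements whose subtraction gives zero are written to `dest`, while `crs` still records the outcome for every unmasked element; with `rc1=True` nothing is written to the destination at all.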

## pred-result mode on CR ops

CR operations (mtcr, crand, cror) may be Vectorised,
predicated, and also have pred-result mode applied to them.
Vectorisation applies to 4-bit CR Fields which are treated as
elements, not the individual bits of the 32-bit CR.
CR ops and how to identify them are described in [[sv/cr_ops]].