1 # Appendix
2
3 * <https://bugs.libre-soc.org/show_bug.cgi?id=574>
4 * <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>
5
6 This is the appendix to [[sv/svp64]], providing explanations of modes
7 etc. leaving the main svp64 page's primary purpose as outlining the
8 instruction format.
9
10 Table of contents:
11
12 [[!toc]]
13
14 # XER, SO and other global flags
15
16 Vector systems are expected to be high performance. This is achieved
17 through parallelism, which requires that elements in the vector be
18 independent. XER SO and other global "accumulation" flags (CR.OV) cause
19 Read-Write Hazards on single-bit global resources, having a significant
20 detrimental effect.
21
22 Consequently in SV, XER.SO and CR.OV behaviour is disregarded (including
23 in `cmp` instructions). XER is simply neither read nor written.
24 This includes when `scalar identity behaviour` occurs. If precise
25 OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1
26 instructions should be used without an SV Prefix.
27
28 An interesting side-effect of this decision is that the OE flag is now
29 free for other uses when SV Prefixing is used.
30
Regarding XER.CA: this does not fit either, because it was designed for a
scalar ISA. Instead, both carry-in and carry-out go into the CR.so bit of
a given Vector element. This provides a means to perform large parallel
batches of Vectorised carry-capable additions. crweird instructions can be
used to transfer the CRs in and out of an integer, where bit manipulation
may be performed to analyse the carry bits (including carry lookahead
propagation) before continuing with further parallel additions.
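
A minimal Python sketch of the per-element carry idea described above
(it is an illustration, not specification pseudocode). The function name
`vector_add_with_carry`, the list `cr_so` standing in for the CR.so bit of
each element, and the `width` parameter are all assumptions made for this
example.

```python
def vector_add_with_carry(a, b, cr_so, width=64):
    """Element-wise add where every element has its own carry bit.

    a, b  : lists of unsigned integers (the two source vectors)
    cr_so : list of 0/1 carry-in bits, one per element; a hypothetical
            stand-in for the CR.so bit of each element's CR field
    Returns (results, carry_out), with one carry-out bit per element.
    """
    mask = (1 << width) - 1
    results, carry_out = [], []
    for x, y, cin in zip(a, b, cr_so):
        total = (x & mask) + (y & mask) + cin
        results.append(total & mask)      # truncated sum
        carry_out.append(total >> width)  # 0 or 1
    return results, carry_out
```

Because no element reads another element's carry, all additions in the
batch are independent; propagating carries between elements is left to a
separate step.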
38
39 # v3.0B/v3.1B relevant instructions
40
41 SV is primarily designed for use as an efficient hybrid 3D GPU / VPU /
42 CPU ISA.
43
44 As mentioned above, OE=1 is not applicable in SV, freeing this bit for
45 alternative uses. Additionally, Vectorisation of the VSX SIMD system
46 likewise makes no sense whatsoever. SV *replaces* VSX and provides,
47 at the very minimum, predication (which VSX was designed without).
48 Thus all VSX Major Opcodes - all of them - are "unused" and must raise
49 illegal instruction exceptions in SV Prefix Mode.
50
Likewise, `lq` (Load Quad) and Load/Store Multiple make no sense to
have: not only are they provided by SV, but the SV alternatives may also
be predicated, making them far better suited to use in function
calls and context-switching.
55
56 Additionally, some v3.0/1 instructions simply make no sense at all in a
57 Vector context: `rfid` falls into this category,
58 as well as `sc` and `scv`. Here there is simply no point
59 trying to Vectorise them: the standard OpenPOWER v3.0/1 instructions
60 should be called instead.
61
62 Fortuitously this leaves several Major Opcodes free for use by SV
63 to fit alternative future instructions. In a 3D context this means
64 Vector Product, Vector Normalise, [[sv/mv.swizzle]], Texture LD/ST
65 operations, and others critical to an efficient, effective 3D GPU and
66 VPU ISA. With such instructions being included as standard in other
67 commercially-successful GPU ISAs it is likewise critical that a 3D
68 GPU/VPU based on svp64 also have such instructions.
69
70 Note however that svp64 is stand-alone and is in no way
71 critically dependent on the existence or provision of 3D GPU or VPU
72 instructions. These should be considered extensions, and their discussion
73 and specification is out of scope for this document.
74
75 Note, again: this is *only* under svp64 prefixing. Standard v3.0B /
76 v3.1B is *not* altered by svp64 in any way.
77
78 ## Major opcode map (v3.0B)
79
80 This table is taken from v3.0B.
81 Table 9: Primary Opcode Map (opcode bits 0:5)
82
        | 000    | 001   | 010   | 011   | 100    | 101    | 110   | 111
    000 |        |       | tdi   | twi   | EXT04  |        |       | mulli | 000
    001 | subfic |       | cmpli | cmpi  | addic  | addic. | addi  | addis | 001
    010 | bc/l/a | EXT17 | b/l/a | EXT19 | rlwimi | rlwinm |       | rlwnm | 010
    011 | ori    | oris  | xori  | xoris | andi.  | andis. | EXT30 | EXT31 | 011
    100 | lwz    | lwzu  | lbz   | lbzu  | stw    | stwu   | stb   | stbu  | 100
    101 | lhz    | lhzu  | lha   | lhau  | sth    | sthu   | lmw   | stmw  | 101
    110 | lfs    | lfsu  | lfd   | lfdu  | stfs   | stfsu  | stfd  | stfdu | 110
    111 | lq     | EXT57 | EXT58 | EXT59 | EXT60  | EXT61  | EXT62 | EXT63 | 111
        | 000    | 001   | 010   | 011   | 100    | 101    | 110   | 111
93
94 ## Suitable for svp64-only
95
96 This is the same table containing v3.0B Primary Opcodes except those that
97 make no sense in a Vectorisation Context have been removed. These removed
98 POs can, *in the SV Vector Context only*, be assigned to alternative
99 (Vectorised-only) instructions, including future extensions.
100
101 Note, again, to emphasise: outside of svp64 these opcodes **do not**
102 change. When not prefixed with svp64 these opcodes **specifically**
103 retain their v3.0B / v3.1B OpenPOWER Standard compliant meaning.
104
        | 000    | 001   | 010   | 011   | 100    | 101    | 110   | 111
    000 |        |       |       |       |        |        |       | mulli | 000
    001 | subfic |       | cmpli | cmpi  | addic  | addic. | addi  | addis | 001
    010 |        |       |       | EXT19 | rlwimi | rlwinm |       | rlwnm | 010
    011 | ori    | oris  | xori  | xoris | andi.  | andis. | EXT30 | EXT31 | 011
    100 | lwz    | lwzu  | lbz   | lbzu  | stw    | stwu   | stb   | stbu  | 100
    101 | lhz    | lhzu  | lha   | lhau  | sth    | sthu   |       |       | 101
    110 | lfs    | lfsu  | lfd   | lfdu  | stfs   | stfsu  | stfd  | stfdu | 110
    111 |        |       | EXT58 | EXT59 |        | EXT61  |       | EXT63 | 111
        | 000    | 001   | 010   | 011   | 100    | 101    | 110   | 111
115
It is important to note that having a v3.0B Scalar opcode whose meaning
differs from its SVP64 counterpart is highly undesirable: the complexity
of the decoder is greatly increased.
119
120 # Single Predication
121
This is a standard mode normally found in Vector ISAs. Every element in
every source Vector and in the destination uses the same bit of one
single predicate mask.

In SVSTATE, for Single-predication, implementors MUST increment both
srcstep and dststep: unlike Twin-Predication, the two must be equal at
all times.
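
As a rough illustration of the rule above, the following Python sketch
applies one predicate mask to a two-source operation. The name
`single_predicated_op` and the choice to leave masked-out destination
elements unchanged (non-zeroing) are assumptions for this example only.

```python
def single_predicated_op(op, src_a, src_b, dest, mask, vl):
    """Apply op element-wise under one predicate mask.

    A single loop counter plays the role of both srcstep and dststep,
    so the two are equal at all times. Bit i of `mask` enables element i.
    Masked-out destination elements are left unchanged (an assumption:
    non-zeroing predication).
    """
    for step in range(vl):              # srcstep == dststep == step
        if (mask >> step) & 1:
            dest[step] = op(src_a[step], src_b[step])
    return dest
```

For instance, with `mask=0b0101` and `vl=4`, only elements 0 and 2 of the
destination are written.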
125
126 # Twin Predication
127
128 This is a novel concept that allows predication to be applied to a single
129 source and a single dest register. The following types of traditional
130 Vector operations may be encoded with it, *without requiring explicit
131 opcodes to do so*
132
133 * VSPLAT (a single scalar distributed across a vector)
134 * VEXTRACT (like LLVM IR [`extractelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#extractelement-instruction))
135 * VINSERT (like LLVM IR [`insertelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#insertelement-instruction))
136 * VCOMPRESS (like LLVM IR [`llvm.masked.compressstore.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics))
137 * VEXPAND (like LLVM IR [`llvm.masked.expandload.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics))
138
139 Those patterns (and more) may be applied to:
140
141 * mv (the usual way that V\* ISA operations are created)
142 * exts\* sign-extension
* rlwinm and other RS-RA shift operations (**note**: excluding
  those that take RA as both a src and dest. These are not
  1-src 1-dest, they are 2-src, 1-dest)
146 * LD and ST (treating AGEN as one source)
147 * FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
148 * Condition Register ops mfcr, mtcr and other similar
149
150 This is a huge list that creates extremely powerful combinations,
151 particularly given that one of the predicate options is `(1<<r3)`
152
153 Additional unusual capabilities of Twin Predication include a back-to-back
154 version of VCOMPRESS-VEXPAND which is effectively the ability to do
155 sequentially ordered multiple VINSERTs. The source predicate selects a
156 sequentially ordered subset of elements to be inserted; the destination
157 predicate specifies the sequentially ordered recipient locations.
This is equivalent to `llvm.masked.compressstore.*` followed by
`llvm.masked.expandload.*`.
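
The patterns above can be sketched in Python as a twin-predicated move
between two vectors. This is a simplified model under stated assumptions:
both operands are vectors, the function name `twin_predicated_mv` is
hypothetical, and scalar sources or destinations (needed for VSPLAT and
VEXTRACT) are not modelled.

```python
def twin_predicated_mv(src, dest, src_mask, dest_mask, vl):
    """Move elements from src to dest with separate predicates.

    srcstep skips over masked-out source elements and dststep skips
    over masked-out destination elements, independently of each other.
    """
    srcstep = dststep = 0
    while True:
        while srcstep < vl and not (src_mask >> srcstep) & 1:
            srcstep += 1
        while dststep < vl and not (dest_mask >> dststep) & 1:
            dststep += 1
        if srcstep >= vl or dststep >= vl:
            break
        dest[dststep] = src[srcstep]
        srcstep += 1
        dststep += 1
    return dest
```

With an all-ones destination mask this behaves as VCOMPRESS, with an
all-ones source mask as VEXPAND, and with both masks sparse it gives the
back-to-back compress-expand (ordered multiple insert) described above.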
162
163 # Rounding, clamp and saturate
164
165 see [[av_opcodes]].
166
167 To help ensure that audio quality is not compromised by overflow,
168 "saturation" is provided, as well as a way to detect when saturation
169 occurred if desired (Rc=1). When Rc=1 there will be a *vector* of CRs,
170 one CR per element in the result (Note: this is different from VSX which
171 has a single CR per block).
172
173 When N=0 the result is saturated to within the maximum range of an
174 unsigned value. For integer ops this will be 0 to 2^elwidth-1. Similar
175 logic applies to FP operations, with the result being saturated to
176 maximum rather than returning INF, and the minimum to +0.0
177
178 When N=1 the same occurs except that the result is saturated to the min
179 or max of a signed result, and for FP to the min and max value rather
180 than returning +/- INF.
181
182 When Rc=1, the CR "overflow" bit is set on the CR associated with the
183 element, to indicate whether saturation occurred. Note that due to
184 the hugely detrimental effect it has on parallel processing, XER.SO is
185 **ignored** completely and is **not** brought into play here. The CR
186 overflow bit is therefore simply set to zero if saturation did not occur,
187 and to one if it did.
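
A small Python sketch of integer saturation as described above, assuming
the hypothetical helper name `saturate` and a boolean `signed` argument
corresponding to N=1 (signed) or N=0 (unsigned). The returned flag models
the per-element CR "overflow" bit written when Rc=1. FP saturation is not
modelled.

```python
def saturate(value, elwidth, signed):
    """Clamp an integer result to the range of the destination element.

    Returns (clamped_value, saturated), where `saturated` is True only
    if clamping changed the value.
    """
    if signed:   # N=1: signed range
        lo, hi = -(1 << (elwidth - 1)), (1 << (elwidth - 1)) - 1
    else:        # N=0: unsigned range, 0 to 2^elwidth-1
        lo, hi = 0, (1 << elwidth) - 1
    clamped = max(lo, min(hi, value))
    return clamped, clamped != value
```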
188
Note also that saturation on operations that produce a carry output is
prohibited, due to the conflicting use of the CR.so bit for recording
whether saturation occurred.

Post-analysis of the Vector of CRs to find out if any given element hit
saturation may be done using a mapreduced CR op (cror), or by using the
new crweird instruction, transferring the relevant CR bits to a scalar
integer and testing it for nonzero. See [[sv/cr_int_predication]].
197
198 Note that the operation takes place at the maximum bitwidth (max of
199 src and dest elwidth) and that truncation occurs to the range of the
200 dest elwidth.
201
202 # Reduce mode
203
204 There are two variants here. The first is when the destination is scalar
205 and at least one of the sources is Vector. The second is more complex
206 and involves map-reduction on vectors.
207
208 The first defining characteristic distinguishing Scalar-dest reduce mode
209 from Vector reduce mode is that Scalar-dest reduce issues VL element
210 operations, whereas Vector reduce mode performs an actual map-reduce
211 (tree reduction): typically `O(VL log VL)` actual computations.
212
The second defining characteristic of scalar-dest reduce mode is that it
is, in simplistic and shallow terms, *serial and sequential in nature*,
whereas the Vector reduce mode is inherently parallelisable.
216
217 The reason why scalar-dest reduce mode is "simplistically" serial and
218 sequential is that in certain circumstances (such as an `OR` operation
219 or a MIN/MAX operation) it may be possible to parallelise the reduction.
220
221 ## Scalar result reduce mode
222
223 In this mode, which is suited to operations involving carry or overflow,
224 one register must be identified by the programmer as being the "accumulator".
225 Scalar reduction is thus categorised by:
226
227 * One of the sources is a Vector
228 * the destination is a scalar
229 * optionally but most usefully when one source register is also the destination
230 * That the source register type is the same as the destination register
231 type identified as the "accumulator". scalar reduction on `cmp`,
232 `setb` or `isel` makes no sense for example because of the mixture
233 between CRs and GPRs.
234
Typical applications include simple operations such as `ADD r3, r10.v,
r3` where, clearly, r3 is being used to accumulate the addition of all
elements in the vector starting at r10.

    # add RT, RA,RB but when RT==RA
    for i in range(VL):
        iregs[RA] += iregs[RB+i] # RT==RA
242
243 However, *unless* the operation is marked as "mapreduce", SV ordinarily
244 **terminates** at the first scalar operation. Only by marking the
245 operation as "mapreduce" will it continue to issue multiple sub-looped
246 (element) instructions in `Program Order`.
247
To perform the loop in reverse order, the ```RG``` (reverse gear) bit
must be set. This is useful for leaving a cumulative suffix sum in
reverse order:

    for i in reversed(range(VL)):  # VL-1 down to 0
        # RT-1 = RA gives a suffix sum
        iregs[RT+i] = iregs[RA+i] - iregs[RB+i]
253
254 Other examples include shift-mask operations where a Vector of inserts
255 into a single destination register is required, as a way to construct
256 a value quickly from multiple arbitrary bit-ranges and bit-offsets.
257 Using the same register as both the source and destination, with Vectors
258 of different offsets masks and values to be inserted has multiple
259 applications including Video, cryptography and JIT compilation.
260
261 Subtract and Divide are still permitted to be executed in this mode,
262 although from an algorithmic perspective it is strongly discouraged.
263 It would be better to use addition followed by one final subtract,
264 or in the case of divide, to get better accuracy, to perform a multiply
265 cascade followed by a final divide.
266
267 Note that single-operand or three-operand scalar-dest reduce is perfectly
268 well permitted: both still meet the qualifying characteristics that one
269 source operand can also be the destination, which allows the "accumulator"
270 to be identified.
271
If the "accumulator" cannot be identified (that is, no source register
is also the destination) the results are **UNDEFINED**. This spares
implementations from needing complex decoding analysis of register
fields: it is thus up to the programmer to ensure that one of the source
registers is also a destination register in order to take advantage of
Scalar Reduce Mode.
278
279 If an interrupt or exception occurs in the middle of the scalar mapreduce,
280 the scalar destination register **MUST** be updated with the current
281 (intermediate) result, because this is how ```Program Order``` is
282 preserved (Vector Loops are to be considered to be just another way of issuing instructions
283 in Program Order). In this way, after return from interrupt,
284 the scalar mapreduce may continue where it left off. This provides
285 "precise" exception behaviour.
286
287 Note that hardware is perfectly permitted to perform multi-issue
288 parallel optimisation of the scalar reduce operation: it's just that
289 as far as the user is concerned, all exceptions and interrupts **MUST**
290 be precise.
291
292 ## Vector result reduce mode
293
294 Vector result reduce mode may utilise the destination vector for
295 the purposes of storing intermediary results. Interrupts and exceptions
296 can therefore also be precise. The result will be in the first
297 non-predicate-masked-out destination element. Note that unlike
298 Scalar reduce mode, Vector reduce
299 mode is *not* suited to operations which involve carry or overflow.
300
301 Programs **MUST NOT** rely on the contents of the intermediate results:
302 they may change from hardware implementation to hardware implementation.
303 Some implementations may perform an incremental update, whilst others
304 may choose to use the available Vector space for a binary tree reduction.
305 If an incremental Vector is required (```x[i] = x[i-1] + y[i]```) then
306 a *straight* SVP64 Vector instruction can be issued, where the source and
307 destination registers overlap: ```sv.add 1.v, 9.v, 2.v```. Due to
308 respecting ```Program Order``` being mandatory in SVP64, hardware should
309 and must detect this case and issue an incremental sequence of scalar
310 element instructions.
311
1. Limited to single-predicated dual-source operations (add RT, RA, RB).
   Triple-source operations (such as fma) are prohibited.
2. Limited to operations that make sense. Divide is excluded, as is
   subtract (X - Y - Z produces different answers depending on the order)
   and asymmetric CRops (crandc, crorc). Sane operations:
   multiply, min/max, add, logical bitwise OR, most other CR ops.
   Operations that do not have the same source and dest register type
   are also excluded (isel, cmp). Operations involving carry or overflow
   (XER.CA / OV) are also prohibited.
3. The destination is a vector but the result is stored, ultimately,
   in the first nonzero predicated element. All other nonzero predicated
   elements are undefined. *This includes the CR vector* when Rc=1.
4. Implementations may use any ordering and any algorithm to reduce
   down to a single result. However it must be equivalent to a straight
   application of mapreduce. The destination vector (except masked-out
   elements) may be used for storing any intermediate results. These may
   be left in the vector (undefined).
5. CRM applies when Rc=1. When CRM is zero, the CR associated with
   the result is regarded as "some results met standard CR result
   criteria". When CRM is one, this changes to "all results met standard
   CR criteria".
6. Implementations MAY use destoffs as well as srcoffs (see [[sv/sprs]])
   in order to store sufficient state to resume operation should an
   interrupt occur. This is also why implementations are permitted to use
   the destination vector to store intermediary computations.
7. *Predication may be applied*. Zeroing mode is not an option. Masked-out
   inputs are ignored; masked-out elements in the destination vector are
   unaltered (not used for the purposes of intermediary storage); the
   scalar result is placed in the first available unmasked element.
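
To illustrate why the rules above restrict reduction to order-insensitive
operations, the Python sketch below compares a straight sequential
reduction with a binary-tree reduction over the unmasked elements. The
function names are hypothetical, at least one element is assumed to be
unmasked, and neither function is claimed to match any particular
hardware implementation.

```python
from functools import reduce

def sequential_reduce(op, vec, mask):
    """Straight left-to-right mapreduce over unmasked elements."""
    active = [x for i, x in enumerate(vec) if (mask >> i) & 1]
    return reduce(op, active)

def tree_reduce(op, vec, mask):
    """Pairwise (binary tree) reduction over unmasked elements."""
    level = [x for i, x in enumerate(vec) if (mask >> i) & 1]
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element carried to next level
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

For add both orderings agree, whereas for subtract they do not, which is
why subtract is excluded from this mode.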
341
342 Pseudocode for the case where RA==RB:
343
    result = op(iregs[RA], iregs[RA+1])
    CR = analyse(result)
    for i in range(2, VL):
        result = op(result, iregs[RA+i])
        CRnew = analyse(result)
        if Rc == 1:
            if CRM:
                CR = CR | CRnew  # bitwise OR
            else:
                CR = CR & CRnew  # bitwise AND
354
355 TODO: case where RA!=RB which involves first a vector of 2-operand
356 results followed by a mapreduce on the intermediates.
357
Note that when SVM is clear and SUBVL!=1 the sub-elements are
*independent*, i.e. they are mapreduced per *sub-element* as a result.
Illustration with a vec2:

    result.x = op(iregs[RA].x, iregs[RA+1].x)
    result.y = op(iregs[RA].y, iregs[RA+1].y)
    for i in range(2, VL):
        result.x = op(result.x, iregs[RA+i].x)
        result.y = op(result.y, iregs[RA+i].y)
367
368 Note here that Rc=1 does not make sense when SVM is clear and SUBVL!=1.
369
370 When SVM is set and SUBVL!=1, another variant is enabled: horizontal
371 subvector mode. Example for a vec3:
372
    for i in range(VL):
        # reduce horizontally across the x, y and z sub-elements
        result = op(iregs[RA+i].x, iregs[RA+i].y)
        result = op(result, iregs[RA+i].z)
        iregs[RT+i] = result
378
379 In this mode, when Rc=1 the Vector of CRs is as normal: each result
380 element creates a corresponding CR element.
381
382 # Fail-on-first
383
Data-dependent fail-on-first has two distinct variants: one for LD/ST,
the other for arithmetic operations (actually, CR-driven). Note that in
each case the assumption is that vector elements are required to appear
to be executed in sequential Program Order, element 0 being the first.
388
389 * LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
390 ordinary one. Exceptions occur "as normal". However for elements 1
391 and above, if an exception would occur, then VL is **truncated** to the
392 previous element.
393 * Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
394 CR-creating operation produces a result (including cmp). Similar to
395 branch, an analysis of the CR is performed and if the test fails, the
396 vector operation terminates and discards all element operations at and
397 above the current one, and VL is truncated to either
398 the *previous* element or the current one, depending on whether
399 VLi (VL "inclusive") is set.
400
401 Thus the new VL comprises a contiguous vector of results,
402 all of which pass the testing criteria (equal to zero, less than zero).
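
The VL truncation rule for the CR-driven variant can be sketched in
Python as follows. The name `ffirst_truncate`, the `test` callback (which
stands in for the CR bit test) and the `vli` flag are assumptions for
illustration.

```python
def ffirst_truncate(results, test, vl, vli=False):
    """Return the new VL after data-dependent fail-first.

    Elements are examined in Program Order. At the first element whose
    test fails, VL is truncated to exclude that element (vli=False) or
    to include it (vli=True). If no test fails, VL is unchanged.
    """
    for i in range(vl):
        if not test(results[i]):
            return i + 1 if vli else i
    return vl
```

For example, testing for nonzero on `[3, 5, 0, 7]` gives a new VL of 2,
or 3 when `vli` is set so that the terminating zero is included.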
403
404 The CR-based data-driven fail-on-first is new and not found in ARM
405 SVE or RVV. It is extremely useful for reducing instruction count,
406 however requires speculative execution involving modifications of VL
407 to get high performance implementations. An additional mode (RC1=1)
408 effectively turns what would otherwise be an arithmetic operation
409 into a type of `cmp`. The CR is stored (and the CR.eq bit tested
410 against the `inv` field).
411 If the CR.eq bit is equal to `inv` then the Vector is truncated and
412 the loop ends.
413 Note that when RC1=1 the result elements are never stored, only the CRs.
414
415 VLi is only available as an option when `Rc=0` (or for instructions
416 which do not have Rc). When set, the current element is always
417 also included in the count (the new length that VL will be set to).
418 This may be useful in combination with "inv" to truncate the Vector
419 to `exclude` elements that fail a test, or, in the case of implementations
420 of strncpy, to include the terminating zero.
421
422 In CR-based data-driven fail-on-first there is only the option to select
423 and test one bit of each CR (just as with branch BO). For more complex
424 tests this may be insufficient. If that is the case, a vectorised crops
425 (crand, cror) may be used, and ffirst applied to the crop instead of to
426 the arithmetic vector.
427
One extremely important aspect of ffirst is:

* LD/ST ffirst may never set VL equal to zero. This is because on the
  first element an exception must be raised "as normal".
* CR-based data-dependent ffirst on the other hand **can** set VL equal
  to zero. This is the only means in the entirety of SV by which VL may
  be set to zero (with the exception of via the SVSTATE SPR). When VL is
  set to zero due to the first element failing the CR bit-test, all
  subsequent vectorised operations are effectively `nops`, which is
  *precisely the desired and intended behaviour*.
438
439 Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
440 to a nonzero value for any implementation-specific reason. For example:
441 it is perfectly reasonable for implementations to alter VL when ffirst
442 LD or ST operations are initiated on a nonaligned boundary, such that
443 within a loop the subsequent iteration of that loop begins subsequent
444 ffirst LD/ST operations on an aligned boundary. Likewise, to reduce
445 workloads or balance resources.
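
A simplified Python model of LD/ST fail-first, assuming memory is a
dictionary of valid addresses and that a missing address stands in for
any exception. The names `MemoryFault` and `ffirst_load` are hypothetical,
and implementation-specific truncation (for alignment or resource
reasons) is not modelled.

```python
class MemoryFault(Exception):
    """Hypothetical stand-in for a LD/ST exception."""

def ffirst_load(memory, addrs, vl):
    """Load up to vl elements, returning (loaded_values, new_vl).

    A fault on element 0 is raised as normal. A fault on any later
    element instead truncates VL to the elements already loaded, so
    the returned VL is never zero.
    """
    loaded = []
    for i in range(vl):
        if addrs[i] not in memory:
            if i == 0:
                raise MemoryFault(f"fault at address {addrs[i]}")
            return loaded, i
        loaded.append(memory[addrs[i]])
    return loaded, vl
```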
446
CR-based data-dependent ffirst on the other hand MUST NOT truncate VL
arbitrarily to a length decided by the hardware: VL MUST only be
truncated based explicitly on whether a test fails.
This is because it is a precise test on which algorithms
will rely.
452
453 ## Data-dependent fail-first on CR operations (crand etc)
454
455 Operations that actually produce or alter CR Field as a result
456 do not also in turn have an Rc=1 mode. However it makes no
457 sense to try to test the 4 bits of a CR Field for being equal
458 or not equal to zero. Moreover, the result is already in the
459 form that is desired: it is a CR field.
460
461 There are two primary different types of CR operations:
462
463 * Those which have a 3-bit operand field (referring to a CR Field)
464 * Those which have a 5-bit operand (referring to a bit within the
465 whole 32-bit CR)
466
Examining these two, as has already been done, it is observed that
the difference may be considered to be that the 5-bit variant
provides additional information about which CR Field bit
(LT, GT, EQ, SO) is to be operated on by the instruction.

Thus, logically, we may set the following rule:

* When a 5-bit CR Result field is used in an instruction, the
  `inv, VLi and RC1` variant of Data-Dependent Fail-First
  must be used, i.e. the bit of the CR field to be tested is
  the one that has just been modified by the operation.
* When a 3-bit CR Result field is used the `inv CRbit` variant
  must be used in order to select which CR Field bit shall
  be tested (LT, GT, EQ, SO).

Examples of each type:

* crand, cror, crnor. These all are 5-bit (BA, BB, BT). The bit
  to be tested against `inv` is the one selected by `BT`.
* mcrf. This has only 3-bit (BF, BFA). In order to select the
  bit to be tested, the alternative FFirst encoding must be used.

This limits sv.mcrf in that it may not use the `VLi` (VL inclusive)
Mode. This is unfortunate but unavoidable due to encoding pressure
on SVP64.
492
493 # pred-result mode
494
This mode merges common CR testing with predication, saving on instruction
count. Below is the pseudocode excluding predicate zeroing and elwidth
overrides. Note that the pseudocode for [[sv/cr_ops]] is slightly different.

    for i in range(VL):
        # predication test, skip all masked out elements.
        if predicate_masked_out(i):
            continue
        result = op(iregs[RA+i], iregs[RB+i])
        CRnew = analyse(result) # calculates eq/lt/gt
        # Rc=1 always stores the CR
        if Rc == 1 or RC1:
            crregs[offs+i] = CRnew
        # now test CR, similar to branch
        if RC1 or CRnew[BO[0:1]] != BO[2]:
            continue # test failed: cancel store
        # result optionally stored but CR always is
        iregs[RT+i] = result
513
514 The reason for allowing the CR element to be stored is so that
515 post-analysis of the CR Vector may be carried out. For example:
516 Saturation may have occurred (and been prevented from updating, by the
517 test) but it is desirable to know *which* elements fail saturation.
518
519 Note that RC1 Mode basically turns all operations into `cmp`. The
520 calculation is performed but it is only the CR that is written. The
521 element result is *always* discarded, never written (just like `cmp`).
522
523 Note that predication is still respected: predicate zeroing is slightly
524 different: elements that fail the CR test *or* are masked out are zero'd.
525
526 ## pred-result mode on CR ops
527
528 Yes, really: CR operations (mtcr, crand, cror) may be Vectorised,
529 predicated, and also pred-result mode applied to it. In this case,
530 the Vectorisation applies to the batch of 4 bits, i.e. it is not the CR
531 individual bits that are treated as the Vector, but the CRs themselves
532 (CR0, CR8, CR9...).
533
Put another way: Vectorised crand uses the higher bits of BA, BB and BT
to select the CR Field: these will increment sequentially as the Vector
loop progresses, whereas the lower 2 bits (selecting one of lt, gt, eq,
so) remain the same.

Thus after each Vectorised operation (crand) a test of the CR result
can in fact be performed. However the only meaningful comparison will
be "eq" or "ne", given that the result is only one bit.
542
543 # CR Operations
544
545 CRs are slightly more involved than INT or FP registers due to the
546 possibility for indexing individual bits (crops BA/BB/BT). Again however
547 the access pattern needs to be understandable in relation to v3.0B / v3.1B
548 numbering, with a clear linear relationship and mapping existing when
549 SV is applied.
550
551 ## CR EXTRA mapping table and algorithm
552
553 Numbering relationships for CR fields are already complex due to being
554 in BE format (*the relationship is not clearly explained in the v3.0B
555 or v3.1B specification*). However with some care and consideration
556 the exact same mapping used for INT and FP regfiles may be applied,
557 just to the upper bits, as explained below. The notation
558 `CR{field number}` is used to indicate access to a particular
559 Condition Register Field (as opposed to the notation `CR[bit]`
560 which accesses one bit of the 32 bit Power ISA v3.0B
561 Condition Register)
562
In OpenPOWER v3.0/1, BT/BA/BB are all 5 bits. The top 3 bits (0:2)
select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits
*in* that CR. The numbering was determined (after 4 months of
analysis and research) to be as follows:

    CR_index = 7-(BA>>2)      # top 3 bits but BE
    bit_index = 3-(BA & 0b11) # low 2 bits but BE
    CR_reg = CR{CR_index}     # get the CR
    # finally get the bit from the CR.
    CR_bit = (CR_reg & (1<<bit_index)) != 0
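
The numbering above can be transcribed directly into runnable Python. In
this sketch the list `cr_fields` (eight 4-bit integers indexed by
`CR_index`) and the function name `read_cr_bit` are assumptions; the
arithmetic is exactly that of the pseudocode.

```python
def read_cr_bit(cr_fields, BA):
    """Return the CR bit selected by the 5-bit operand BA.

    cr_fields: list of eight 4-bit integers, indexed by CR_index as
    computed below (a hypothetical model of the CR{...} notation).
    """
    cr_index = 7 - (BA >> 2)       # top 3 bits but BE
    bit_index = 3 - (BA & 0b11)    # low 2 bits but BE
    cr_reg = cr_fields[cr_index]   # get the CR
    return (cr_reg & (1 << bit_index)) != 0
```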
573
574 When it comes to applying SV, it is the CR\_reg number to which SV EXTRA2/3
575 applies, **not** the CR\_bit portion (bits 3:4):
576
    if extra3_mode:
        spec = EXTRA3
    else:
        spec = EXTRA2<<1 | 0b0
    if spec[0]:
        # vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
        return ((BA >> 2)<<6) | # hi 3 bits shifted up
               (spec[1:2]<<4) | # to make room for these
               (BA & 0b11)      # CR_bit on the end
    else:
        # scalar constructs "00 spec[1:2] BA[0:4]"
        return (spec[1:2] << 5) | BA
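
A runnable Python rendering of the EXTRA2/3 pseudocode above, offered as
a sketch. It assumes `spec` is a 3-bit value in MSB0 numbering, so that
`spec[0]` is its most significant bit and `spec[1:2]` its two least
significant bits; the function name `sv_cr_operand` is hypothetical.

```python
def sv_cr_operand(BA, extra, extra3_mode):
    """Extend a 5-bit CR operand to its SV form using EXTRA2/EXTRA3."""
    spec = extra if extra3_mode else (extra << 1)  # EXTRA2<<1 | 0b0
    is_vector = (spec >> 2) & 1   # spec[0] in MSB0 numbering
    low2 = spec & 0b11            # spec[1:2] in MSB0 numbering
    if is_vector:
        # vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
        return ((BA >> 2) << 6) | (low2 << 4) | (BA & 0b11)
    # scalar constructs "00 spec[1:2] BA[0:4]"
    return (low2 << 5) | BA
```

In both cases the bottom two bits (the CR\_bit selector) pass through
unchanged; only the CR field number is extended.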
589
Thus, for example, to access a given bit for a CR in SV mode, the v3.0B
algorithm to determine CR\_reg is modified as follows:

    CR_index = 7-(BA>>2)      # top 3 bits but BE
    if spec[0]:
        # vector mode, 0-124 increments of 4
        CR_index = (CR_index<<4) | (spec[1:2] << 2)
    else:
        # scalar mode, 0-31 increments of 1
        CR_index = (spec[1:2]<<3) | CR_index
    # same as for v3.0/v3.1 from this point onwards
    bit_index = 3-(BA & 0b11) # low 2 bits but BE
    CR_reg = CR{CR_index}     # get the CR
    # finally get the bit from the CR.
    CR_bit = (CR_reg & (1<<bit_index)) != 0
605
606 Note here that the decoding pattern to determine CR\_bit does not change.
607
Note: high-performance implementations may read/write Vectors of CRs in
batches of aligned 32-bit chunks (CR0-7, CR8-15). This is to greatly
simplify internal design. If instructions are issued where CR Vectors
do not start on a 32-bit aligned boundary, performance may be affected.
612
613 ## CR fields as inputs/outputs of vector operations
614
615 CRs (or, the arithmetic operations associated with them)
616 may be marked as Vectorised or Scalar. When Rc=1 in arithmetic operations that have no explicit EXTRA to cover the CR, the CR is Vectorised if the destination is Vectorised. Likewise if the destination is scalar then so is the CR.
617
618 When vectorized, the CR inputs/outputs are sequentially read/written
619 to 4-bit CR fields. Vectorised Integer results, when Rc=1, will begin
620 writing to CR8 (TBD evaluate) and increase sequentially from there.
621 This is so that:
622
623 * implementations may rely on the Vector CRs being aligned to 8. This
624 means that CRs may be read or written in aligned batches of 32 bits
625 (8 CRs per batch), for high performance implementations.
626 * scalar Rc=1 operation (CR0, CR1) and callee-saved CRs (CR2-4) are not
627 overwritten by vector Rc=1 operations except for very large VL
628 * CR-based predication, from CR32, is also not interfered with
629 (except by large VL).
630
631 However when the SV result (destination) is marked as a scalar by the
632 EXTRA field the *standard* v3.0B behaviour applies: the accompanying
633 CR when Rc=1 is written to. This is CR0 for integer operations and CR1
634 for FP operations.
635
636 Note that yes, the CRs are genuinely Vectorised. Unlike in SIMD VSX which
637 has a single CR (CR6) for a given SIMD result, SV Vectorised OpenPOWER
638 v3.0B scalar operations produce a **tuple** of element results: the
639 result of the operation as one part of that element *and a corresponding
640 CR element*. Greatly simplified pseudocode:
641
    for i in range(VL):
        # calculate the vector result of an add
        iregs[RT+i] = iregs[RA+i] + iregs[RB+i]
        # now calculate CR bits
        CRs{8+i}.eq = iregs[RT+i] == 0
        CRs{8+i}.gt = iregs[RT+i] > 0
        ... etc
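
The same idea as a runnable Python sketch. The function name
`vector_add_rc1`, the dictionary `crs` holding one CR field per element,
and the use of unbounded signed Python integers are assumptions; the
starting CR number of 8 is taken from the text above, where it is marked
as still to be evaluated.

```python
def vector_add_rc1(iregs, crs, RT, RA, RB, VL, cr_base=8):
    """Vectorised add with Rc=1: one result and one CR field per element."""
    for i in range(VL):
        result = iregs[RA + i] + iregs[RB + i]
        iregs[RT + i] = result
        # each element gets its own CR field, starting at cr_base
        crs[cr_base + i] = {
            "lt": result < 0,
            "gt": result > 0,
            "eq": result == 0,
        }
```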

If a "cumulated" CR based analysis of results is desired (a la VSX CR6)
then a followup instruction must be performed, setting "reduce" mode on
the Vector of CRs, using cr ops (crand, crnor) to do so. This provides far
more flexibility in analysing vectors than standard Vector ISAs. Normal
Vector ISAs are typically restricted to "were all results nonzero" and
"were some results nonzero". The application of mapreduce to Vectorised
cr operations allows far more sophisticated analysis, particularly in
conjunction with the new crweird operations: see [[sv/cr_int_predication]].
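
As an illustration of the idea (not of the actual encoding or of how
"reduce" mode sequences the operations), the followup analysis can be
modelled as folding one bit of every CR field with a two-input cr
operation. The helper `cr_reduce`, the dict representation of a CR field
and the lambdas standing in for `crand` and `cror` are all assumptions
made for this sketch.

```python
from functools import reduce


def cr_reduce(crs, bit, op):
    """Fold one named bit of every CR field in `crs` into a single bool."""
    return reduce(op, (cr[bit] for cr in crs))


# stand-ins for the two-input cr operations named in the text
crand = lambda a, b: a and b
cror = lambda a, b: a or b

# a Vector of three CR fields, e.g. produced by an Rc=1 Vectorised add
crs = [
    {"eq": False, "gt": True},
    {"eq": True, "gt": False},
    {"eq": False, "gt": True},
]

all_zero = cr_reduce(crs, "eq", crand)      # were all results zero?
any_zero = cr_reduce(crs, "eq", cror)       # was some result zero?
any_positive = cr_reduce(crs, "gt", cror)   # was some result positive?
```

Because the bit and the operation are chosen independently, the same
mechanism covers more questions than a fixed "all" or "some" test.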

Note in particular that the use of a separate instruction in this way
ensures that high performance multi-issue OoO implementations do not
have the computation of the cumulative analysis CR as a bottleneck and
hindrance, regardless of the length of VL.

(See [[discussion]]: some alternative schemes are described there.)

## Rc=1 when SUBVL!=1

Sub-vectors are effectively a form of SIMD (length 2 to 4). Only 1 bit of
predicate is allocated per subvector; likewise only one CR is allocated
per subvector.

This leaves a conundrum as to how to apply CR computation per subvector,
when normally Rc=1 is exclusively applied to scalar elements. A solution
is to perform a bitwise OR or AND of the subvector tests. Given that
OE is ignored, this field may (when available) be used to select OR or
AND behavior.
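
A minimal sketch of that solution follows. The `use_and` flag is a
hypothetical stand-in for the repurposed OE bit (the text does not say
which value would select which behaviour), and the dict form of the CR
field is likewise an assumption for illustration.

```python
def subvector_cr(results, use_and):
    """Combine the per-element tests of one subvector into one CR field."""
    combine = all if use_and else any
    return {
        "lt": combine(r < 0 for r in results),
        "gt": combine(r > 0 for r in results),
        "eq": combine(r == 0 for r in results),
    }


vec3_result = [0, 3, 0]                       # one SUBVL=3 subvector
cr_or = subvector_cr(vec3_result, use_and=False)
cr_and = subvector_cr(vec3_result, use_and=True)
```

With OR, `eq` and `gt` are both set because some element is zero and some
element is positive; with AND, no bit is set because no test holds for
every element.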

### Table of CR fields

CR[i] is the notation used by the OpenPOWER spec to refer to CR field #i,
so FP instructions with Rc=1 write to CR[1] aka SVCR1_000.

CRs are not stored in SPRs: they are registers in their own right.
Therefore context-switching the full set of CRs involves a Vectorised
mfcr or mtcr, using VL=64, elwidth=8 to do so. This is exactly how
scalar OpenPOWER context-switches CRs: it is just that there are now
more of them.

The 64 SV CRs are arranged similarly to the way the 128 integer registers
are arranged. TODO a python program that auto-generates a CSV file
which can be included in a table, which is in a new page (so as not to
overwhelm this one). [[svp64/cr_names]]

# Register Profiles

**NOTE THIS TABLE SHOULD NO LONGER BE HAND EDITED** see
<https://bugs.libre-soc.org/show_bug.cgi?id=548> for details.

Instructions are broken down by Register Profiles as listed in the
following auto-generated page: [[opcode_regs_deduped]]. "Non-SV"
indicates that the operations with this Register Profile cannot be
Vectorised (mtspr, bc, dcbz, twi).

TODO generate table which will be here [[svp64/reg_profiles]]

# SV pseudocode illustration

## Single-predicated Instruction

Illustration of normal mode add operation: zeroing not included, elwidth
overrides not included. If there is no predicate, it is set to all 1s.

    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      for (i = 0; i < VL; i++)
        STATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
           if (!int_vec[rd].isvec) break;
        if (rd.isvec)  { id += 1; }
        if (rs1.isvec) { irs1 += 1; }
        if (rs2.isvec) { irs2 += 1; }
        if (id == VL or irs1 == VL or irs2 == VL) {
          # end VL hardware loop
          STATE.srcoffs = 0; # reset
          return;
        }

This has several modes:

* RT.v = RA.v RB.v
* RT.v = RA.v RB.s (and RA.s RB.v)
* RT.v = RA.s RB.s
* RT.s = RA.v RB.v
* RT.s = RA.v RB.s (and RA.s RB.v)
* RT.s = RA.s RB.s

All of these may be predicated. Vector-Vector is straightforward.
When one source is a Vector and the other a Scalar, it is clear that
each element of the Vector source should be added to the Scalar source,
each result placed into the Vector (or, if the destination is a scalar,
only the first nonpredicated result).

The one that is not obvious is RT=vector but both RA/RB=scalar.
Here this acts as a "splat scalar result", copying the same result into
all nonpredicated result elements. If a fixed destination scalar was
intended, then an all-Scalar operation should be used.
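
The modes listed above can be exercised with a small Python model of the
`op_add` loop. This is a sketch under stated assumptions: the boolean
flags `rd_vec`, `rs1_vec` and `rs2_vec` are hypothetical stand-ins for
the Vector/Scalar marking of each register, `STATE.srcoffs` is left out,
and Python integers are used without any 64-bit wrapping.

```python
def op_add(ireg, VL, rd, rs1, rs2, rd_vec, rs1_vec, rs2_vec, predval=None):
    """Model of the single-predicated add hardware loop."""
    if predval is None:
        predval = (1 << VL) - 1          # no predicate: all 1s
    id_ = irs1 = irs2 = 0
    for i in range(VL):
        if predval & (1 << i):
            ireg[rd + id_] = ireg[rs1 + irs1] + ireg[rs2 + irs2]
            if not rd_vec:
                break                    # scalar destination: first result only
        # only registers marked as Vector step on to the next element
        if rd_vec:
            id_ += 1
        if rs1_vec:
            irs1 += 1
        if rs2_vec:
            irs2 += 1
    return ireg


# RT.v = RA.s RB.s with predicate 0b0101: "splat" into elements 0 and 2
regs = list(range(32))                   # regs[n] initially holds n
op_add(regs, 4, rd=16, rs1=1, rs2=2,
       rd_vec=True, rs1_vec=False, rs2_vec=False, predval=0b0101)
```

In this run the scalar sum `regs[1] + regs[2]` is written to `regs[16]`
and `regs[18]`, while the masked-out `regs[17]` and `regs[19]` keep their
previous values.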

See <https://bugs.libre-soc.org/show_bug.cgi?id=552>

# Assembly Annotation

Assembly code annotation is required for SV to be able to successfully
mark instructions as "prefixed".

A reasonable (prototype) starting point:

    svp64 [field=value]*

Fields:

* ew=8/16/32 - element width
* sew=8/16/32 - source element width
* vec=2/3/4 - SUBVL
* mode=reduce/satu/sats/crpred
* pred=1\<\<3/r3/~r3/r10/~r10/r30/~r30/lt/gt/le/ge/eq/ne
* spred={reg spec}

This is similar to the x86 "rex" prefix.
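
To make the prototype form concrete, here is a small sketch of how such
an annotation line might be split into fields. It assumes that the
`field=value` pairs are separated by whitespace, which the prototype does
not actually specify, and the helper name `parse_svp64_annotation` is
hypothetical. Values are kept as strings and are not validated against
the list of fields.

```python
def parse_svp64_annotation(text):
    """Split 'svp64 field=value ...' into a dict of field names to values."""
    tokens = text.split()
    if not tokens or tokens[0] != "svp64":
        raise ValueError("annotation must start with 'svp64'")
    fields = {}
    for token in tokens[1:]:
        name, sep, value = token.partition("=")
        if not sep or not value:
            raise ValueError("expected field=value, got %r" % token)
        fields[name] = value
    return fields


fields = parse_svp64_annotation("svp64 ew=16 vec=2 mode=satu pred=~r3")
```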

For actual assembler:

    sv.asmcode/mode.vec{N}.ew=8,sw=16,m={pred},sm={pred} reg.v, src.s

Qualifiers:

* m={pred}: predicate mask mode
* sm={pred}: source-predicate mask mode (only allowed in Twin-predication)
* vec{N}: vec2 OR vec3 OR vec4 - sets SUBVL=2/3/4
* ew={N}: ew=8/16/32 - sets elwidth override
* sw={N}: sw=8/16/32 - sets source elwidth override
* ff={xx}: see fail-first mode
* pr={xx}: see predicate-result mode
* sat{x}: satu / sats - see saturation mode
* mr: see map-reduce mode
* mr.svm: see map-reduce with sub-vector mode
* crm: see map-reduce CR mode
* crm.svm: see map-reduce CR with sub-vector mode
* sz: predication with source-zeroing
* dz: predication with dest-zeroing

For modes:

* pred-result:
  - pm=lt/gt/le/ge/eq/ne/so/ns OR
  - pm=RC1 OR pm=~RC1
* fail-first:
  - ff=lt/gt/le/ge/eq/ne/so/ns OR
  - ff=RC1 OR ff=~RC1
* saturation:
  - sats
  - satu
* map-reduce:
  - mr OR crm: "normal" map-reduce mode or CR-mode.
  - mr.svm OR crm.svm: when vec2/3/4 set, sub-vector mapreduce is enabled

# Proposed Parallel-reduction algorithm

```
/// reference implementation of proposed SimpleV reduction semantics.
///
// reduction operation -- we still use this algorithm even
// if the reduction operation isn't associative or
// commutative.
/// `temp_pred` is a user-visible Vector Condition register
///
/// all input arrays have length `vl`
def reduce(vl, vec, pred):
    step = 1;
    while step < vl
        step *= 2;
        for i in (0..vl).step_by(step)
            other = i + step / 2;
            other_pred = other < vl && pred[other];
            if pred[i] && other_pred
                vec[i] += vec[other];
            else if other_pred
                vec[i] = vec[other];
            pred[i] |= other_pred;

def reduce(vl, vec, pred):
    j = 0
    vi = [] # array of lookup indices to skip nonpredicated
    for i, pbit in enumerate(pred):
        if pbit:
            vi[j] = i
            j += 1
    step = 2
    while step <= vl
        halfstep = step // 2
        for i in (0..vl).step_by(step)
            other = vi[i + halfstep]
            i = vi[i]
            other_pred = other < vl && pred[other]
            if pred[i] && other_pred
                vec[i] += vec[other]
            pred[i] |= other_pred
        step *= 2

```
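
For experimentation, the first variant above can be transcribed into
runnable Python. This is a sketch rather than a reference
implementation: the name `reduce_tree` is hypothetical, addition is used
as the reduction operation, and the function works on copies of `vec`
and `pred` and returns them, whereas the pseudocode updates its
arguments in place.

```python
def reduce_tree(vl, vec, pred):
    """Predicated tree reduction; the result ends up in element 0."""
    vec = list(vec)
    pred = list(pred)
    step = 1
    while step < vl:
        step *= 2
        for i in range(0, vl, step):
            other = i + step // 2
            other_pred = other < vl and pred[other]
            if pred[i] and other_pred:
                vec[i] += vec[other]      # both halves valid: combine them
            elif other_pred:
                vec[i] = vec[other]       # only the other half valid: move it
            pred[i] = pred[i] or other_pred
    return vec, pred


full, _ = reduce_tree(4, [1, 2, 3, 4], [True, True, True, True])
masked, masked_pred = reduce_tree(4, [1, 2, 3, 4], [False, True, True, False])
```

With all elements enabled, `full[0]` holds the sum of the four inputs.
With only elements 1 and 2 enabled, `masked[0]` holds the sum of those
two, and `masked_pred[0]` has been set to record that element 0 now
carries a valid result.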