1 # Appendix
2
3 * <https://bugs.libre-soc.org/show_bug.cgi?id=574>
4 * <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47>
5
This is the appendix to [[sv/svp64]]. It explains the modes and related
details, leaving the main svp64 page free to focus on outlining the
instruction format.
9
10 Table of contents:
11
12 [[!toc]]
13
14 # XER, SO and other global flags
15
Vector systems are expected to be high performance. This is achieved
through parallelism, which requires that elements in the vector be
independent. XER.SO and other global "accumulation" flags (CR.OV) cause
Read-Write Hazards on single-bit global resources, which has a significant
detrimental effect on that parallelism.
21
22 Consequently in SV, XER.SO and CR.OV behaviour is disregarded (including
23 in `cmp` instructions). XER is simply neither read nor written.
24 This includes when `scalar identity behaviour` occurs. If precise
25 OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1
26 instructions should be used without an SV Prefix.
27
28 An interesting side-effect of this decision is that the OE flag is now
29 free for other uses when SV Prefixing is used.
30
31 Regarding XER.CA: this does not fit either: it was designed for a scalar
32 ISA. Instead, both carry-in and carry-out go into the CR.so bit of a given
33 Vector element. This provides a means to perform large parallel batches
34 of Vectorised carry-capable additions. crweird instructions can be used
35 to transfer the CRs in and out of an integer, where bitmanipulation
36 may be performed to analyse the carry bits (including carry lookahead
37 propagation) before continuing with further parallel additions.
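
As a rough illustration of the per-element carry scheme described above,
the following Python sketch models a batch of independent additions in
which each element has its own carry-in and carry-out. The function name
`sv_add_with_carry` and the use of plain lists to stand in for registers
and for the per-element CR.so bits are assumptions made for illustration
only; they are not part of the specification.

```python
def sv_add_with_carry(ra, rb, carry_in, elwidth=64):
    """Element-wise add; every element has a private carry-in/carry-out.

    carry_in / carry_out model the CR.so bit of each element's CR
    (an assumption for illustration, not a normative definition).
    """
    mask = (1 << elwidth) - 1
    results, carry_out = [], []
    for a, b, c in zip(ra, rb, carry_in):
        total = (a & mask) + (b & mask) + (c & 1)
        results.append(total & mask)
        carry_out.append(total >> elwidth)  # 0 or 1
    return results, carry_out


sums, carries = sv_add_with_carry([0xFF, 0x01], [0x01, 0x01], [0, 1],
                                  elwidth=8)
```

Because no element reads or writes a shared flag, all of the additions
in the batch can be issued in parallel; the carry bits can then be
examined afterwards.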
38
39 # v3.0B/v3.1B relevant instructions
40
41 SV is primarily designed for use as an efficient hybrid 3D GPU / VPU /
42 CPU ISA.
43
44 As mentioned above, OE=1 is not applicable in SV, freeing this bit for
45 alternative uses. Additionally, Vectorisation of the VSX SIMD system
46 likewise makes no sense whatsoever. SV *replaces* VSX and provides,
47 at the very minimum, predication (which VSX was designed without).
48 Thus all VSX Major Opcodes - all of them - are "unused" and must raise
49 illegal instruction exceptions in SV Prefix Mode.
50
Likewise, `lq` (Load Quad) and Load/Store Multiple make no sense to
have: not only is their functionality already provided by SV, the SV
alternatives may also be predicated, making them far better suited to
use in function calls and context-switching.
55
56 Additionally, some v3.0/1 instructions simply make no sense at all in a
57 Vector context: `twi` and `tdi` fall into this category, as do branch
58 operations as well as `sc` and `scv`. Here there is simply no point
59 trying to Vectorise them: the standard OpenPOWER v3.0/1 instructions
60 should be called instead.
61
62 Fortuitously this leaves several Major Opcodes free for use by SV
63 to fit alternative future instructions. In a 3D context this means
64 Vector Product, Vector Normalise, [[sv/mv.swizzle]], Texture LD/ST
65 operations, and others critical to an efficient, effective 3D GPU and
66 VPU ISA. With such instructions being included as standard in other
67 commercially-successful GPU ISAs it is likewise critical that a 3D
68 GPU/VPU based on svp64 also have such instructions.
69
70 Note however that svp64 is stand-alone and is in no way
71 critically dependent on the existence or provision of 3D GPU or VPU
72 instructions. These should be considered extensions, and their discussion
73 and specification is out of scope for this document.
74
75 Note, again: this is *only* under svp64 prefixing. Standard v3.0B /
76 v3.1B is *not* altered by svp64 in any way.
77
78 ## Major opcode map (v3.0B)
79
80 This table is taken from v3.0B.
81 Table 9: Primary Opcode Map (opcode bits 0:5)
82
        |  000   |  001  |  010  |  011  |  100   |  101   |  110  |  111
    000 |        |       | tdi   | twi   | EXT04  |        |       | mulli | 000
    001 | subfic |       | cmpli | cmpi  | addic  | addic. | addi  | addis | 001
    010 | bc/l/a | EXT17 | b/l/a | EXT19 | rlwimi | rlwinm |       | rlwnm | 010
    011 | ori    | oris  | xori  | xoris | andi.  | andis. | EXT30 | EXT31 | 011
    100 | lwz    | lwzu  | lbz   | lbzu  | stw    | stwu   | stb   | stbu  | 100
    101 | lhz    | lhzu  | lha   | lhau  | sth    | sthu   | lmw   | stmw  | 101
    110 | lfs    | lfsu  | lfd   | lfdu  | stfs   | stfsu  | stfd  | stfdu | 110
    111 | lq     | EXT57 | EXT58 | EXT59 | EXT60  | EXT61  | EXT62 | EXT63 | 111
        |  000   |  001  |  010  |  011  |  100   |  101   |  110  |  111
93
94 ## Suitable for svp64
95
96 This is the same table containing v3.0B Primary Opcodes except those that
97 make no sense in a Vectorisation Context have been removed. These removed
98 POs can, *in the SV Vector Context only*, be assigned to alternative
99 (Vectorised-only) instructions, including future extensions.
100
101 Note, again, to emphasise: outside of svp64 these opcodes **do not**
102 change. When not prefixed with svp64 these opcodes **specifically**
103 retain their v3.0B / v3.1B OpenPOWER Standard compliant meaning.
104
        |  000   |  001  |  010  |  011  |  100   |  101   |  110  |  111
    000 |        |       |       |       |        |        |       | mulli | 000
    001 | subfic |       | cmpli | cmpi  | addic  | addic. | addi  | addis | 001
    010 |        |       |       | EXT19 | rlwimi | rlwinm |       | rlwnm | 010
    011 | ori    | oris  | xori  | xoris | andi.  | andis. | EXT30 | EXT31 | 011
    100 | lwz    | lwzu  | lbz   | lbzu  | stw    | stwu   | stb   | stbu  | 100
    101 | lhz    | lhzu  | lha   | lhau  | sth    | sthu   |       |       | 101
    110 | lfs    | lfsu  | lfd   | lfdu  | stfs   | stfsu  | stfd  | stfdu | 110
    111 |        |       | EXT58 | EXT59 |        | EXT61  |       | EXT63 | 111
        |  000   |  001  |  010  |  011  |  100   |  101   |  110  |  111
115
116 # Single Predication
117
This is a standard mode normally found in Vector ISAs. Every element in every source Vector and in the destination uses the same bit of one single predicate mask.
119
120 Note however that in SVSTATE, implementors MUST increment both srcstep and dststep, and that the two must be equal at all times.
121
122 # Twin Predication
123
124 This is a novel concept that allows predication to be applied to a single
125 source and a single dest register. The following types of traditional
126 Vector operations may be encoded with it, *without requiring explicit
127 opcodes to do so*
128
129 * VSPLAT (a single scalar distributed across a vector)
130 * VEXTRACT (like LLVM IR [`extractelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#extractelement-instruction))
131 * VINSERT (like LLVM IR [`insertelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#insertelement-instruction))
132 * VCOMPRESS (like LLVM IR [`llvm.masked.compressstore.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics))
133 * VEXPAND (like LLVM IR [`llvm.masked.expandload.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics))
134
135 Those patterns (and more) may be applied to:
136
137 * mv (the usual way that V\* ISA operations are created)
138 * exts\* sign-extension
* rlwinm and other RS-RA shift operations (**note**: excluding
  those that take RA as both a src and dest. These are not
  1-src 1-dest, they are 2-src, 1-dest)
142 * LD and ST (treating AGEN as one source)
143 * FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
144 * Condition Register ops mfcr, mtcr and other similar
145
146 This is a huge list that creates extremely powerful combinations,
147 particularly given that one of the predicate options is `(1<<r3)`
148
149 Additional unusual capabilities of Twin Predication include a back-to-back
150 version of VCOMPRESS-VEXPAND which is effectively the ability to do
151 sequentially ordered multiple VINSERTs. The source predicate selects a
152 sequentially ordered subset of elements to be inserted; the destination
153 predicate specifies the sequentially ordered recipient locations.
154 This is equivalent to
155 `llvm.masked.compressstore.*`
156 followed by
157 `llvm.masked.expandload.*`
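
The way separate source and destination predicates give rise to the
patterns listed above can be sketched in Python. This is a simplified
model only: the name `twin_pred_mv`, the list used as a register file
and the `rs_isvec` / `rd_isvec` flags are assumptions for illustration,
and zeroing and elwidth overrides are left out.

```python
def twin_pred_mv(regs, rd, rs, vl, srcmask, dstmask,
                 rs_isvec=True, rd_isvec=True):
    """Twin-predicated move: source and destination step independently."""
    srcstep = dststep = 0
    while srcstep < vl and dststep < vl:
        # skip masked-out source elements
        if rs_isvec and not (srcmask >> srcstep) & 1:
            srcstep += 1
            continue
        # skip masked-out destination elements
        if rd_isvec and not (dstmask >> dststep) & 1:
            dststep += 1
            continue
        src = rs + (srcstep if rs_isvec else 0)
        dst = rd + (dststep if rd_isvec else 0)
        regs[dst] = regs[src]
        if not rd_isvec:
            break  # scalar destination: a single result (VEXTRACT-like)
        dststep += 1
        if rs_isvec:
            srcstep += 1


def fresh_regs():
    regs = [0] * 16
    regs[0:4] = [10, 11, 12, 13]
    return regs


compress = fresh_regs()
twin_pred_mv(compress, rd=8, rs=0, vl=4, srcmask=0b1010, dstmask=0b1111)

expand = fresh_regs()
twin_pred_mv(expand, rd=8, rs=0, vl=4, srcmask=0b1111, dstmask=0b1010)

splat = fresh_regs()
twin_pred_mv(splat, rd=8, rs=0, vl=4, srcmask=0b1111, dstmask=0b1111,
             rs_isvec=False)
```

In this model a sparse source mask with a full destination mask behaves
like VCOMPRESS, the reverse behaves like VEXPAND, and a scalar source
with a vector destination behaves like VSPLAT.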
158
159
160 # Rounding, clamp and saturate
161
162 see [[av_opcodes]].
163
164 To help ensure that audio quality is not compromised by overflow,
165 "saturation" is provided, as well as a way to detect when saturation
166 occurred if desired (Rc=1). When Rc=1 there will be a *vector* of CRs,
167 one CR per element in the result (Note: this is different from VSX which
168 has a single CR per block).
169
170 When N=0 the result is saturated to within the maximum range of an
171 unsigned value. For integer ops this will be 0 to 2^elwidth-1. Similar
172 logic applies to FP operations, with the result being saturated to
173 maximum rather than returning INF, and the minimum to +0.0
174
175 When N=1 the same occurs except that the result is saturated to the min
176 or max of a signed result, and for FP to the min and max value rather
177 than returning +/- INF.
178
179 When Rc=1, the CR "overflow" bit is set on the CR associated with the
180 element, to indicate whether saturation occurred. Note that due to
181 the hugely detrimental effect it has on parallel processing, XER.SO is
182 **ignored** completely and is **not** brought into play here. The CR
183 overflow bit is therefore simply set to zero if saturation did not occur,
184 and to one if it did.
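
For integer operations, the clamping and the per-element overflow
indication described above might be modelled as follows. This is a
minimal sketch: the helper name `saturate` and its `signed` argument
(standing in for N=1) are assumptions for illustration, and the FP case
is not covered.

```python
def saturate(value, elwidth, signed):
    """Clamp value to the range of an elwidth-bit integer.

    Returns (clamped_value, saturated), where `saturated` models the
    per-element CR "overflow" bit written when Rc=1.
    """
    if signed:   # N=1: signed range
        lo, hi = -(1 << (elwidth - 1)), (1 << (elwidth - 1)) - 1
    else:        # N=0: unsigned range, 0 to 2^elwidth-1
        lo, hi = 0, (1 << elwidth) - 1
    clamped = min(max(value, lo), hi)
    return clamped, clamped != value
```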
185
186 Note also that saturate on operations that produce a carry output are
187 prohibited due to the conflicting use of the CR.so bit for storing if
188 saturation occurred.
189
190 Post-analysis of the Vector of CRs to find out if any given element hit
191 saturation may be done using a mapreduced CR op (cror), or by using the
192 new crweird instruction, transferring the relevant CR bits to a scalar
193 integer and testing it for nonzero. see [[sv/cr_int_predication]]
194
195 Note that the operation takes place at the maximum bitwidth (max of
196 src and dest elwidth) and that truncation occurs to the range of the
197 dest elwidth.
198
199 # Reduce mode
200
201 There are two variants here. The first is when the destination is scalar
202 and at least one of the sources is Vector. The second is more complex
203 and involves map-reduction on vectors.
204
205 The first defining characteristic distinguishing Scalar-dest reduce mode
206 from Vector reduce mode is that Scalar-dest reduce issues VL element
207 operations, whereas Vector reduce mode performs an actual map-reduce
208 (tree reduction): typically `O(VL log VL)` actual computations.
209
210 The second defining characteristic of scalar-dest reduce mode is that it
211 is, in simplistic and shallow terms *serial and sequential in nature*,
212 whereas the Vector reduce mode is definitely inherently paralleliseable.
213
214 The reason why scalar-dest reduce mode is "simplistically" serial and
215 sequential is that in certain circumstances (such as an `OR` operation
216 or a MIN/MAX operation) it may be possible to parallelise the reduction.
217
218 ## Scalar result reduce mode
219
220 In this mode, one register is identified as being the "accumulator".
221 Scalar reduction is thus categorised by:
222
223 * One of the sources is a Vector
224 * the destination is a scalar
225 * optionally but most usefully when one source register is also the destination
226 * That the source register type is the same as the destination register
227 type identified as the "accumulator". scalar reduction on `cmp`,
228 `setb` or `isel` is not possible for example because of the mixture
229 between CRs and GPRs.
230
Typical applications include simple operations such as `ADD r3, r10.v,
r3` where, clearly, r3 is being used to accumulate the addition of all
elements in the vector starting at r10.

    # add RT, RA, RB but when RT==RA
    for i in range(VL):
        iregs[RA] += iregs[RB+i] # RT==RA
238
239 However, *unless* the operation is marked as "mapreduce", SV ordinarily
240 **terminates** at the first scalar operation. Only by marking the
241 operation as "mapreduce" will it continue to issue multiple sub-looped
242 (element) instructions in `Program Order`.
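
The effect of the "mapreduce" marking on a scalar-destination add can be
sketched as below. The function name `scalar_dest_add` and the
`mapreduce` flag are assumptions for illustration; predication and
elwidth overrides are ignored.

```python
def scalar_dest_add(iregs, rt, rb, vl, mapreduce):
    """add RT, RT, RB.v with a scalar RT acting as the accumulator."""
    for i in range(vl):
        iregs[rt] += iregs[rb + i]
        if not mapreduce:
            break  # unmarked: stop after the first scalar result


acc = [0] * 16
acc[3] = 1
acc[10:14] = [1, 2, 3, 4]
scalar_dest_add(acc, rt=3, rb=10, vl=4, mapreduce=True)

single = [0] * 16
single[3] = 1
single[10:14] = [1, 2, 3, 4]
scalar_dest_add(single, rt=3, rb=10, vl=4, mapreduce=False)
```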
243
244 Other examples include shift-mask operations where a Vector of inserts
245 into a single destination register is required, as a way to construct
246 a value quickly from multiple arbitrary bit-ranges and bit-offsets.
247 Using the same register as both the source and destination, with Vectors
248 of different offsets masks and values to be inserted has multiple
249 applications including Video, cryptography and JIT compilation.
250
251 Subtract and Divide are still permitted to be executed in this mode,
252 although from an algorithmic perspective it is strongly discouraged.
253 It would be better to use addition followed by one final subtract,
254 or in the case of divide, to get better accuracy, to perform a multiply
255 cascade followed by a final divide.
256
257 Note that single-operand or three-operand scalar-dest reduce is perfectly
258 well permitted: both still meet the qualifying characteristics that one
259 source operand can also be the destination, which allows the "accumulator"
260 to be identified.
261
262 ## Vector result reduce mode
263
264 1. limited to single predicated dual src operations (add RT, RA, RB).
265 triple source operations are prohibited (fma).
266 2. limited to operations that make sense. divide is excluded, as is
267 subtract (X - Y - Z produces different answers depending on the order)
268 and asymmetric CRops (crandc, crorc). sane operations:
269 multiply, min/max, add, logical bitwise OR, most other CR ops.
   operations that do not have the same source and dest register type are
271 also excluded (isel, cmp). operations involving carry or overflow
272 (XER.CA / OV) are also prohibited.
273 3. the destination is a vector but the result is stored, ultimately,
274 in the first nonzero predicated element. all other nonzero predicated
275 elements are undefined. *this includes the CR vector* when Rc=1
276 4. implementations may use any ordering and any algorithm to reduce
277 down to a single result. However it must be equivalent to a straight
278 application of mapreduce. The destination vector (except masked out
279 elements) may be used for storing any intermediate results. these may
280 be left in the vector (undefined).
281 5. CRM applies when Rc=1. When CRM is zero, the CR associated with
282 the result is regarded as a "some results met standard CR result
283 criteria". When CRM is one, this changes to "all results met standard
284 CR criteria".
285 6. implementations MAY use destoffs as well as srcoffs (see [[sv/sprs]])
286 in order to store sufficient state to resume operation should an
287 interrupt occur. this is also why implementations are permitted to use
288 the destination vector to store intermediary computations
289 7. *Predication may be applied*. zeroing mode is not an option. masked-out
290 inputs are ignored; masked-out elements in the destination vector are
291 unaltered (not used for the purposes of intermediary storage); the
292 scalar result is placed in the first available unmasked element.
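
One possible reading of rules 3, 4 and 7 is sketched below in Python: a
pairwise (tree) reduction over the unmasked elements, with the final
value placed in the first unmasked destination element. The function
name `vector_reduce` and the list-based register file are assumptions
for illustration. An implementation is free to use a different ordering,
provided the result is equivalent.

```python
import operator


def vector_reduce(op, iregs, rt, ra, vl, pred=None):
    """Tree-reduce the unmasked elements of the vector starting at RA."""
    if pred is None:
        pred = (1 << vl) - 1
    active = [i for i in range(vl) if (pred >> i) & 1]
    if not active:
        return  # everything masked out: nothing is written
    vals = [iregs[ra + i] for i in active]
    while len(vals) > 1:
        # combine neighbouring pairs; an odd element is carried forward
        nxt = [op(vals[j], vals[j + 1]) for j in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])
        vals = nxt
    # result goes into the first unmasked destination element;
    # masked-out destination elements are left unaltered
    iregs[rt + active[0]] = vals[0]


regs = list(range(32))
vector_reduce(operator.add, regs, rt=16, ra=0, vl=8, pred=0b11110110)
```

This sketch does not use the destination vector for intermediate
storage, which the rules above permit but do not require.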
293
294 Pseudocode for the case where RA==RB:
295
    result = op(iregs[RA], iregs[RA+1])
    CR = analyse(result)
    for i in range(2, VL):
        result = op(result, iregs[RA+i])
        CRnew = analyse(result)
        if Rc=1:
            if CRM:
                CR = CR bitwise or CRnew
            else:
                CR = CR bitwise AND CRnew
306
307 TODO: case where RA!=RB which involves first a vector of 2-operand
308 results followed by a mapreduce on the intermediates.
309
310 Note that when SVM is clear and SUBVL!=1 the sub-elements are
311 *independent*, i.e. they are mapreduced per *sub-element* as a result.
312 illustration with a vec2:
313
    result.x = op(iregs[RA].x, iregs[RA+1].x)
    result.y = op(iregs[RA].y, iregs[RA+1].y)
    for i in range(2, VL):
        result.x = op(result.x, iregs[RA+i].x)
        result.y = op(result.y, iregs[RA+i].y)
319
320 Note here that Rc=1 does not make sense when SVM is clear and SUBVL!=1.
321
322 When SVM is set and SUBVL!=1, another variant is enabled: horizontal
323 subvector mode. Example for a vec3:
324
    for i in range(VL):
        # combine x and y first, then fold in z
        result = op(iregs[RA+i].x, iregs[RA+i].y)
        result = op(result, iregs[RA+i].z)
        iregs[RT+i] = result
330
331 In this mode, when Rc=1 the Vector of CRs is as normal: each result
332 element creates a corresponding CR element.
333
334 # Fail-on-first
335
Data-dependent fail-on-first has two distinct variants: one for LD/ST,
the other for arithmetic operations (actually, CR-driven). Note in each
case the assumption is that vector elements are required to appear to be
executed in sequential Program Order, element 0 being the first.
340
341 * LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
342 ordinary one. Exceptions occur "as normal". However for elements 1
343 and above, if an exception would occur, then VL is **truncated** to the
344 previous element.
345 * Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
346 CR-creating operation produces a result (including cmp). Similar to
347 branch, an analysis of the CR is performed and if the test fails, the
348 vector operation terminates and discards all element operations at and
349 above the current one, and VL is truncated to the *previous* element.
350 Thus the new VL comprises a contiguous vector of results, all of which
351 pass the testing criteria (equal to zero, less than zero).
352
The CR-based data-driven fail-on-first is new and not found in ARM
SVE or RVV. It is extremely useful for reducing instruction count,
however it requires speculative execution involving modifications of VL
to get high performance implementations. An additional mode (RC1=1)
effectively turns what would otherwise be an arithmetic operation
into a type of `cmp`. The CR is stored (and the CR.eq bit tested).
If the CR.eq bit fails then the Vector is truncated and the loop ends.
Note that when RC1=1 the result elements are never stored, only the CRs.
361
362 In CR-based data-driven fail-on-first there is only the option to select
363 and test one bit of each CR (just as with branch BO). For more complex
364 tests this may be insufficient. If that is the case, a vectorised crops
365 (crand, cror) may be used, and ffirst applied to the crop instead of to
366 the arithmetic vector.
367
368 One extremely important aspect of ffirst is:
369
* LD/ST ffirst may never set VL equal to zero. This is because on the first
  element an exception must be raised "as normal".
372 * CR-based data-dependent ffirst on the other hand **can** set VL equal
373 to zero. This is the only means in the entirety of SV that VL may be set
374 to zero (with the exception of via the SV.STATE SPR). When VL is set
375 zero due to the first element failing the CR bit-test, all subsequent
376 vectorised operations are effectively `nops` which is
377 *precisely the desired and intended behaviour*.
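
The CR-driven variant, including the case where VL becomes zero, can be
sketched as follows. This is a simplified model under stated
assumptions: the function name `ffirst_add` is hypothetical, the test is
supplied as a Python predicate rather than a CR bit selection, and the
CR Vector itself is not modelled.

```python
def ffirst_add(iregs, rt, ra, rb, vl, test):
    """Vector add with data-dependent fail-first; returns the new VL."""
    for i in range(vl):
        result = iregs[ra + i] + iregs[rb + i]
        if not test(result):
            # discard this element and all later ones:
            # VL is truncated to cover only the passing elements
            return i
        iregs[rt + i] = result
    return vl


def nonzero(x):
    return x != 0


regs = [5, 6, 0, 7, 1, 1, 0, 1, 0, 0, 0, 0]
new_vl = ffirst_add(regs, rt=8, ra=0, rb=4, vl=4, test=nonzero)

zeros = [0, 6, 0, 7, 0, 1, 0, 1, 0, 0, 0, 0]
zero_vl = ffirst_add(zeros, rt=8, ra=0, rb=4, vl=4, test=nonzero)
```

In the second call the very first element fails the test, so the
returned VL is zero and any following vectorised operation using that VL
would execute no elements.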
378
379 Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
380 to a nonzero value for any implementation-specific reason. For example:
381 it is perfectly reasonable for implementations to alter VL when ffirst
382 LD or ST operations are initiated on a nonaligned boundary, such that
383 within a loop the subsequent iteration of that loop begins subsequent
384 ffirst LD/ST operations on an aligned boundary. Likewise, to reduce
385 workloads or balance resources.
386
CR-based data-dependent ffirst on the other hand MUST NOT truncate VL
arbitrarily. This is because it is a precise test on which algorithms
will rely.
390
391 # pred-result mode
392
393 This mode merges common CR testing with predication, saving on instruction
394 count. Below is the pseudocode excluding predicate zeroing and elwidth
395 overrides.
396
    for i in range(VL):
        # predication test, skip all masked out elements.
        if predicate_masked_out(i):
            continue
        result = op(iregs[RA+i], iregs[RB+i])
        CRnew = analyse(result) # calculates eq/lt/gt
        # Rc=1 always stores the CR
        if Rc=1 or RC1:
            crregs[offs+i] = CRnew
        # now test CR, similar to branch
        if RC1 or CRnew[BO[0:1]] != BO[2]:
            continue # test failed: cancel store
        # result optionally stored but CR always is
        iregs[RT+i] = result
411
412 The reason for allowing the CR element to be stored is so that
413 post-analysis of the CR Vector may be carried out. For example:
414 Saturation may have occurred (and been prevented from updating, by the
415 test) but it is desirable to know *which* elements fail saturation.
416
417 Note that RC1 Mode basically turns all operations into `cmp`. The
418 calculation is performed but it is only the CR that is written. The
419 element result is *always* discarded, never written (just like `cmp`).
420
421 Note that predication is still respected: predicate zeroing is slightly
422 different: elements that fail the CR test *or* are masked out are zero'd.
423
424 ## pred-result mode on CR ops
425
426 Yes, really: CR operations (mtcr, crand, cror) may be Vectorised,
427 predicated, and also pred-result mode applied to it. In this case,
428 the Vectorisation applies to the batch of 4 bits, i.e. it is not the CR
429 individual bits that are treated as the Vector, but the CRs themselves
430 (CR0, CR8, CR9...)
431
432 Thus after each Vectorised operation (crand) a test of the CR result
433 can in fact be performed.
434
435 # CR Operations
436
437 CRs are slightly more involved than INT or FP registers due to the
438 possibility for indexing individual bits (crops BA/BB/BT). Again however
439 the access pattern needs to be understandable in relation to v3.0B / v3.1B
440 numbering, with a clear linear relationship and mapping existing when
441 SV is applied.
442
443 ## CR EXTRA mapping table and algorithm
444
445 Numbering relationships for CR fields are already complex due to being
446 in BE format (*the relationship is not clearly explained in the v3.0B
447 or v3.1B specification*). However with some care and consideration
448 the exact same mapping used for INT and FP regfiles may be applied,
449 just to the upper bits, as explained below.
450
451 In OpenPOWER v3.0/1, BF/BT/BA/BB are all 5 bits. The top 3 bits (0:2)
452 select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits
453 *in* that CR. The numbering was determined (after 4 months of
454 analysis and research) to be as follows:
455
    CR_index = 7-(BA>>2)      # top 3 bits but BE
    bit_index = 3-(BA & 0b11) # low 2 bits but BE
    CR_reg = CR{CR_index}     # get the CR
    # finally get the bit from the CR.
    CR_bit = (CR_reg & (1<<bit_index)) != 0
461
462 When it comes to applying SV, it is the CR\_reg number to which SV EXTRA2/3
463 applies, **not** the CR\_bit portion (bits 3:4):
464
    if extra3_mode:
        spec = EXTRA3
    else:
        spec = EXTRA2<<1 | 0b0
    if spec[0]:
        # vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
        return ((BA >> 2)<<6) | # hi 3 bits shifted up
               (spec[1:2]<<4) | # to make room for these
               (BA & 0b11)      # CR_bit on the end
    else:
        # scalar constructs "00 spec[1:2] BA[0:4]"
        return (spec[1:2] << 5) | BA
477
Thus, for example, to access a given bit for a CR in SV mode, the v3.0B
algorithm to determine CR\_reg is modified as follows:

    CR_index = 7-(BA>>2)      # top 3 bits but BE
    if spec[0]:
        # vector mode, 0-124 increments of 4
        CR_index = (CR_index<<4) | (spec[1:2] << 2)
    else:
        # scalar mode, 0-31 increments of 1
        CR_index = (spec[1:2]<<3) | CR_index
    # same as for v3.0/v3.1 from this point onwards
    bit_index = 3-(BA & 0b11) # low 2 bits but BE
    CR_reg = CR{CR_index}     # get the CR
    # finally get the bit from the CR.
    CR_bit = (CR_reg & (1<<bit_index)) != 0
493
494 Note here that the decoding pattern to determine CR\_bit does not change.
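
The operand-extension step (the one that returns the widened BA value)
can be written as runnable Python for experimentation. This is a direct
transcription under stated assumptions: the function name
`sv_cr_operand` is hypothetical, and `spec[0]` is taken to be the most
significant of the three spec bits (MSB0 numbering), with `spec[1:2]`
the remaining two.

```python
def sv_cr_operand(ba, extra, extra3_mode):
    """Extend a 5-bit BA/BB/BT field using the EXTRA2/EXTRA3 spec bits."""
    spec = extra if extra3_mode else (extra << 1)
    is_vec = (spec >> 2) & 1   # spec[0] in MSB0 numbering
    lo2 = spec & 0b11          # spec[1:2] in MSB0 numbering
    if is_vec:
        # vector: "BA[0:2] spec[1:2] 00 BA[3:4]"
        return ((ba >> 2) << 6) | (lo2 << 4) | (ba & 0b11)
    # scalar: "00 spec[1:2] BA[0:4]"
    return (lo2 << 5) | ba
```

With all spec bits zero the field is returned unchanged, which
corresponds to the unmodified v3.0B numbering.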
495
Note: high-performance implementations may read/write Vectors of CRs in
batches of aligned 32-bit chunks (CR0-7, CR8-15). This is to greatly
simplify internal design. If instructions are issued where CR Vectors
do not start on a 32-bit aligned boundary, performance may be affected.
500
501 ## CR fields as inputs/outputs of vector operations
502
503 CRs (or, the arithmetic operations associated with them)
504 may be marked as Vectorised or Scalar. When Rc=1 in arithmetic operations that have no explicit EXTRA to cover the CR, the CR is Vectorised if the destination is Vectorised. Likewise if the destination is scalar then so is the CR.
505
506 When vectorized, the CR inputs/outputs are sequentially read/written
507 to 4-bit CR fields. Vectorised Integer results, when Rc=1, will begin
508 writing to CR8 (TBD evaluate) and increase sequentially from there.
509 This is so that:
510
511 * implementations may rely on the Vector CRs being aligned to 8. This
512 means that CRs may be read or written in aligned batches of 32 bits
513 (8 CRs per batch), for high performance implementations.
514 * scalar Rc=1 operation (CR0, CR1) and callee-saved CRs (CR2-4) are not
515 overwritten by vector Rc=1 operations except for very large VL
516 * CR-based predication, from CR32, is also not interfered with
517 (except by large VL).
518
519 However when the SV result (destination) is marked as a scalar by the
520 EXTRA field the *standard* v3.0B behaviour applies: the accompanying
521 CR when Rc=1 is written to. This is CR0 for integer operations and CR1
522 for FP operations.
523
524 Note that yes, the CRs are genuinely Vectorised. Unlike in SIMD VSX which
525 has a single CR (CR6) for a given SIMD result, SV Vectorised OpenPOWER
526 v3.0B scalar operations produce a **tuple** of element results: the
527 result of the operation as one part of that element *and a corresponding
528 CR element*. Greatly simplified pseudocode:
529
    for i in range(VL):
        # calculate the vector result of an add
        iregs[RT+i] = iregs[RA+i] + iregs[RB+i]
        # now calculate CR bits
        CRs{8+i}.eq = iregs[RT+i] == 0
        CRs{8+i}.gt = iregs[RT+i] > 0
        ... etc
534
If a "cumulated" CR based analysis of results is desired (a la VSX CR6)
then a followup instruction must be performed, setting "reduce" mode on
the Vector of CRs, using cr ops (crand, crnor) to do so. This provides far
more flexibility in analysing vectors than standard Vector ISAs. Normal
Vector ISAs are typically restricted to "were all results nonzero" and
"were some results nonzero". The application of mapreduce to Vectorised
cr operations allows far more sophisticated analysis, particularly in
conjunction with the new crweird operations (see [[sv/cr_int_predication]]).
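
As a small illustration of such a follow-up reduction, the Python sketch
below builds a Vector of CRs from a list of results and then reduces one
chosen bit across them. The names `analyse` and `cr_reduce` and the use
of dictionaries for CR fields are assumptions for illustration; the
combining functions stand in for cror-style and crand-style operations.

```python
def analyse(result):
    """Model of the per-element CR produced when Rc=1."""
    return {"lt": result < 0, "gt": result > 0, "eq": result == 0}


def cr_reduce(crs, bit, combine):
    """Reduce one bit of every CR in the vector into a single value."""
    acc = crs[0][bit]
    for cr in crs[1:]:
        acc = combine(acc, cr[bit])
    return acc


crs = [analyse(r) for r in [0, 5, 0]]
any_zero = cr_reduce(crs, "eq", lambda a, b: a or b)    # cror-style
all_zero = cr_reduce(crs, "eq", lambda a, b: a and b)   # crand-style
```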
543
Note in particular that the use of a separate instruction in this way
ensures that high performance multi-issue OoO implementations do not
have the computation of the cumulative analysis CR as a bottleneck and
hindrance, regardless of the length of VL.
548
549 (see [[discussion]]. some alternative schemes are described there)
550
551 ## Rc=1 when SUBVL!=1
552
553 sub-vectors are effectively a form of SIMD (length 2 to 4). Only 1 bit of
554 predicate is allocated per subvector; likewise only one CR is allocated
555 per subvector.
556
This leaves a conundrum as to how to apply CR computation per subvector,
when normally Rc=1 is exclusively applied to scalar elements. A solution
is to perform a bitwise OR or AND of the subvector tests. Given that
OE is ignored, this field may (when available) be used to select OR or
AND behavior.
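
A minimal sketch of the proposed OR/AND combination for the `eq` test of
one subvector is given below. The function name `subvector_cr_eq` and
the `combine_and` flag (standing in for whichever field ends up
selecting the behaviour) are assumptions for illustration.

```python
def subvector_cr_eq(subvec, combine_and):
    """Combine the per-sub-element 'equal to zero' tests into one CR bit."""
    tests = [x == 0 for x in subvec]
    return all(tests) if combine_and else any(tests)
```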
562
563 ### Table of CR fields
564
CR[i] is the notation used by the OpenPOWER spec to refer to CR field #i,
so FP instructions with Rc=1 write to CR[1] aka SVCR1_000.

CRs are not stored in SPRs: they are registers in their own right.
Therefore context-switching the full set of CRs involves a Vectorised
mfcr or mtcr, using VL=64, elwidth=8 to do so. This is exactly how
scalar OpenPOWER context-switches CRs: it is just that there are now
more of them.
573
574 The 64 SV CRs are arranged similarly to the way the 128 integer registers
575 are arranged. TODO a python program that auto-generates a CSV file
576 which can be included in a table, which is in a new page (so as not to
577 overwhelm this one). [[svp64/cr_names]]
578
579 # Register Profiles
580
581 **NOTE THIS TABLE SHOULD NO LONGER BE HAND EDITED** see
582 <https://bugs.libre-soc.org/show_bug.cgi?id=548> for details.
583
584 Instructions are broken down by Register Profiles as listed in the
585 following auto-generated page: [[opcode_regs_deduped]]. "Non-SV"
586 indicates that the operations with this Register Profile cannot be
587 Vectorised (mtspr, bc, dcbz, twi)
588
589 TODO generate table which will be here [[svp64/reg_profiles]]
590
# SV pseudocode illustration
592
593 ## Single-predicated Instruction
594
595 illustration of normal mode add operation: zeroing not included, elwidth
596 overrides not included. if there is no predicate, it is set to all 1s
597
    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      for (i = 0; i < VL; i++)
        STATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
           if (!int_vec[rd].isvec) break;
        if (rd.isvec)  { id += 1; }
        if (rs1.isvec) { irs1 += 1; }
        if (rs2.isvec) { irs2 += 1; }
        if (id == VL or irs1 == VL or irs2 == VL) {
          # end VL hardware loop
          STATE.srcoffs = 0; # reset
          return;
        }
610
611 This has several modes:
612
* RT.v = RA.v RB.v
* RT.v = RA.v RB.s (and RA.s RB.v)
* RT.v = RA.s RB.s
* RT.s = RA.v RB.v
* RT.s = RA.v RB.s (and RA.s RB.v)
* RT.s = RA.s RB.s
615
All of these may be predicated. Vector-Vector is straightforward.
When one source is a Vector and the other a Scalar, it is clear that
each element of the Vector source should be added to the Scalar source,
with each result placed into the Vector (or, if the destination is a scalar,
only the first nonpredicated result).
621
622 The one that is not obvious is RT=vector but both RA/RB=scalar.
623 Here this acts as a "splat scalar result", copying the same result into
624 all nonpredicated result elements. If a fixed destination scalar was
625 intended, then an all-Scalar operation should be used.
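
The vector/scalar combinations discussed above can be tried out with a
Python rendering of the single-predicated add loop. This is a sketch
under stated assumptions: the `*_isvec` keyword arguments replace the
register-marking lookups, SVSTATE context saving is omitted, and the
register file is a plain list.

```python
def op_add(ireg, rd, rs1, rs2, vl, predval=None,
           rd_isvec=True, rs1_isvec=True, rs2_isvec=True):
    """Single-predicated add; no zeroing, no elwidth overrides."""
    if predval is None:
        predval = (1 << vl) - 1  # no predicate: all 1s
    i_d = i_s1 = i_s2 = 0
    for i in range(vl):
        if predval & (1 << i):
            ireg[rd + i_d] = ireg[rs1 + i_s1] + ireg[rs2 + i_s2]
            if not rd_isvec:
                break  # scalar destination: first unmasked result only
        if rd_isvec:
            i_d += 1
        if rs1_isvec:
            i_s1 += 1
        if rs2_isvec:
            i_s2 += 1


vv = list(range(32))
op_add(vv, rd=16, rs1=0, rs2=8, vl=4)             # RT.v = RA.v RB.v

splat = list(range(32))
op_add(splat, rd=16, rs1=0, rs2=8, vl=4,
       rs1_isvec=False, rs2_isvec=False)          # RT.v = RA.s RB.s

first = list(range(32))
op_add(first, rd=16, rs1=0, rs2=8, vl=4,
       predval=0b0100, rd_isvec=False)            # RT.s = RA.v RB.v
```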
626
627 See <https://bugs.libre-soc.org/show_bug.cgi?id=552>
628
629 # Assembly Annotation
630
631 Assembly code annotation is required for SV to be able to successfully
632 mark instructions as "prefixed".
633
634 A reasonable (prototype) starting point:
635
636 svp64 [field=value]*
637
638 Fields:
639
640 * ew=8/16/32 - element width
641 * sew=8/16/32 - source element width
642 * vec=2/3/4 - SUBVL
643 * mode=reduce/satu/sats/crpred
644 * pred=1\<\<3/r3/~r3/r10/~r10/r30/~r30/lt/gt/le/ge/eq/ne
645 * spred={reg spec}
646
647 similar to x86 "rex" prefix.
648
649 For actual assembler:
650
651 sv.asmcode/mode.vec{N}.ew=8,sw=16,m={pred},sm={pred} reg.v, src.s
652
653 Qualifiers:
654
655 * m={pred}: predicate mask mode
656 * sm={pred}: source-predicate mask mode (only allowed in Twin-predication)
657 * vec{N}: vec2 OR vec3 OR vec4 - sets SUBVL=2/3/4
658 * ew={N}: ew=8/16/32 - sets elwidth override
659 * sw={N}: sw=8/16/32 - sets source elwidth override
660 * ff={xx}: see fail-first mode
661 * pr={xx}: see predicate-result mode
662 * sat{x}: satu / sats - see saturation mode
663 * mr: see map-reduce mode
664 * mr.svm see map-reduce with sub-vector mode
665 * crm: see map-reduce CR mode
666 * crm.svm see map-reduce CR with sub-vector mode
667 * sz: predication with source-zeroing
668 * dz: predication with dest-zeroing
669
670 For modes:
671
672 * pred-result:
  - pr=lt/gt/le/ge/eq/ne/so/ns OR
  - pr=RC1 OR pr=~RC1
675 * fail-first
676 - ff=lt/gt/le/ge/eq/ne/so/ns OR
677 - ff=RC1 OR ff=~RC1
678 * saturation:
679 - sats
680 - satu
681 * map-reduce:
682 - mr OR crm: "normal" map-reduce mode or CR-mode.
683 - mr.svm OR crm.svm: when vec2/3/4 set, sub-vector mapreduce is enabled
684