1 # Appendix
2
3 * <https://bugs.libre-soc.org/show_bug.cgi?id=574> Saturation
4 * <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47> Parallel Prefix
5 * <https://bugs.libre-soc.org/show_bug.cgi?id=697> Reduce Modes
6 * <https://bugs.libre-soc.org/show_bug.cgi?id=809> OV sv.addex discussion
7
8 This is the appendix to [[sv/svp64]], providing explanations of modes
9 etc. leaving the main svp64 page's primary purpose as outlining the
10 instruction format.
11
12 Table of contents:
13
14 [[!toc]]
15
16 # XER, SO and other global flags
17
18 Vector systems are expected to be high performance. This is achieved
19 through parallelism, which requires that elements in the vector be
20 independent. XER SO/OV and other global "accumulation" flags (CR.SO) cause
21 Read-Write Hazards on single-bit global resources, having a significant
22 detrimental effect.
23
24 Consequently in SV, XER.SO behaviour is disregarded (including
25 in `cmp` instructions). XER.SO is not read, but XER.OV may be written,
26 breaking the Read-Modify-Write Hazard Chain that complicates
27 microarchitectural implementations.
28 This includes when `scalar identity behaviour` occurs. If precise
29 OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1
30 instructions should be used without an SV Prefix.
31
32 TODO jacob add about OV https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
33
34 Of note here is that XER.SO and OV may already be disregarded in the
35 Power ISA v3.0/1 SFFS (Scalar Fixed and Floating) Compliancy Subset.
36 SVP64 simply makes it mandatory to disregard XER.SO even for other Subsets,
37 but only for SVP64 Prefixed Operations.
38
39 XER.CA/CA32 on the other hand is expected and required to be implemented
40 according to standard Power ISA Scalar behaviour. Interestingly, due
41 to SVP64 being in effect a hardware for-loop around Scalar instructions
42 executing in precise Program Order, a little thought shows that a Vectorised
43 Carry-In-Out add is in effect a Big Integer Add, taking a single bit Carry In
44 and producing, at the end, a single bit Carry out. High performance
45 implementations may exploit this observation to deploy efficient
46 Parallel Carry Lookahead.
47
48 # assume VL=4, this results in 4 sequential ops (below)
49 sv.adde r0.v, r4.v, r8.v
50
51 # instructions that get executed in backend hardware:
52 adde r0, r4, r8 # takes carry-in, produces carry-out
53 adde r1, r5, r9 # takes carry from previous
54 ...
55 adde r3, r7, r11 # likewise
56
57 It can clearly be seen that the carry chains from one
58 64 bit add to the next, the end result being that a
59 256-bit "Big Integer Add" has been performed, and that
60 CA contains the 257th bit. A one-instruction 512-bit Add
61 may be performed by setting VL=8, and a one-instruction
62 1024-bit add by setting VL=16, and so on.
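
The following minimal Python sketch (not part of the specification; the
function name and structure are purely illustrative) models the effect:
VL chained 64-bit `adde` element operations behave as a single VL*64-bit
add with one carry-in and one carry-out.

    # each list entry is one 64-bit element (least-significant first)
    def big_int_add(a, b, carry_in=0):
        mask = (1 << 64) - 1
        result, ca = [], carry_in
        for ai, bi in zip(a, b):        # the SVP64 hardware for-loop
            s = ai + bi + ca            # one scalar "adde" element op
            result.append(s & mask)
            ca = s >> 64                # carry chains into the next element
        return result, ca               # ca is the 257th bit when VL=4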
63
64 # v3.0B/v3.1 relevant instructions
65
66 SV is primarily designed for use as an efficient hybrid 3D GPU / VPU /
67 CPU ISA.
68
69 Vectorisation of the VSX Packed SIMD system
70 makes no sense whatsoever: SV *replaces* VSX and provides,
71 at the very minimum, predication (which VSX was designed without).
72 Thus all VSX Major Opcodes - all of them - are "unused" and must raise
73 illegal instruction exceptions in SV Prefix Mode.
74
75 Likewise, `lq` (Load Quad) and Load/Store Multiple make no sense to
76 retain: not only is their functionality already provided by SV, the SV
77 alternatives may be predicated as well, making them far better suited
78 to use in function calls and context-switching.
79
80 Additionally, some v3.0/1 instructions simply make no sense at all in a
81 Vector context: `rfid` falls into this category,
82 as well as `sc` and `scv`. Here there is simply no point
83 trying to Vectorise them: the standard OpenPOWER v3.0/1 instructions
84 should be called instead.
85
86 Fortuitously this leaves several Major Opcodes free for use by SV
87 to fit alternative future instructions. In a 3D context this means
88 Vector Product, Vector Normalise, [[sv/mv.swizzle]], Texture LD/ST
89 operations, and others critical to an efficient, effective 3D GPU and
90 VPU ISA. With such instructions being included as standard in other
91 commercially-successful GPU ISAs it is likewise critical that a 3D
92 GPU/VPU based on svp64 also have such instructions.
93
94 Note however that svp64 is stand-alone and is in no way
95 critically dependent on the existence or provision of 3D GPU or VPU
96 instructions. These should be considered extensions, and their discussion
97 and specification is out of scope for this document.
98
99 Note, again: this is *only* under svp64 prefixing. Standard v3.0B /
100 v3.1B is *not* altered by svp64 in any way.
101
102 ## Major opcode map (v3.0B)
103
104 This table is taken from v3.0B.
105 Table 9: Primary Opcode Map (opcode bits 0:5)
106
107 ```
108 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
109 000 | | | tdi | twi | EXT04 | | | mulli | 000
110 001 | subfic | | cmpli | cmpi | addic | addic. | addi | addis | 001
111 010 | bc/l/a | EXT17 | b/l/a | EXT19 | rlwimi| rlwinm | | rlwnm | 010
112 011 | ori | oris | xori | xoris | andi. | andis. | EXT30 | EXT31 | 011
113 100 | lwz | lwzu | lbz | lbzu | stw | stwu | stb | stbu | 100
114 101 | lhz | lhzu | lha | lhau | sth | sthu | lmw | stmw | 101
115 110 | lfs | lfsu | lfd | lfdu | stfs | stfsu | stfd | stfdu | 110
116 111 | lq | EXT57 | EXT58 | EXT59 | EXT60 | EXT61 | EXT62 | EXT63 | 111
117 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
118 ```
119
120 ## Suitable for svp64-only
121
122 This is the same table containing v3.0B Primary Opcodes except those that
123 make no sense in a Vectorisation Context have been removed. These removed
124 POs can, *in the SV Vector Context only*, be assigned to alternative
125 (Vectorised-only) instructions, including future extensions.
126 EXT04 retains the scalar `madd*` operations but would have all PackedSIMD
127 (aka VSX) operations removed.
128
129 Note, again, to emphasise: outside of svp64 these opcodes **do not**
130 change. When not prefixed with svp64 these opcodes **specifically**
131 retain their v3.0B / v3.1B OpenPOWER Standard compliant meaning.
132
133 ```
134 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
135 000 | | | | | EXT04 | | | mulli | 000
136 001 | subfic | | cmpli | cmpi | addic | addic. | addi | addis | 001
137 010 | bc/l/a | | | EXT19 | rlwimi| rlwinm | | rlwnm | 010
138 011 | ori | oris | xori | xoris | andi. | andis. | EXT30 | EXT31 | 011
139 100 | lwz | lwzu | lbz | lbzu | stw | stwu | stb | stbu | 100
140 101 | lhz | lhzu | lha | lhau | sth | sthu | | | 101
141 110 | lfs | lfsu | lfd | lfdu | stfs | stfsu | stfd | stfdu | 110
142 111 | | | EXT58 | EXT59 | | EXT61 | | EXT63 | 111
143 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
144 ```
145
146 It is important to note that having a v3.0B Scalar opcode
147 whose meaning differs from its SVP64 counterpart is highly undesirable:
148 the complexity in the decoder is greatly increased.
149
150 # EXTRA Field Mapping
151
152 The purpose of the 9-bit EXTRA field mapping is to mark individual
153 registers (RT, RA, BFA) as either scalar or vector, and to extend
154 their numbering from 0..31 in Power ISA v3.0 to 0..127 in SVP64.
155 Three of the 9 bits may also be used up for a 2nd Predicate (Twin
156 Predication) leaving a mere 6 bits for qualifying registers. As can
157 be seen there is significant pressure on these (and in fact all) SVP64 bits.
158
159 In Power ISA v3.1 prefixing there are bits which describe and classify
160 the prefix in a fashion that is independent of the suffix. MLSS for
161 example. For SVP64 there is insufficient space to make the SVP64 Prefix
162 "self-describing", and consequently every single Scalar instruction
163 had to be individually analysed, by rote, to craft an EXTRA Field Mapping.
164 This process was semi-automated and is described in this section.
165 The final results, which are part of the SVP64 Specification, are here:
166
167 * [[openpower/opcode_regs_deduped]]
168
169 Firstly, every instruction's mnemonic (`add RT, RA, RB`) was analysed
170 from reading the markdown formatted version of the Scalar pseudocode
171 which is machine-readable and found in [[openpower/isatables]]. The
172 analysis gives, by instruction, a "Register Profile". `add RT, RA, RB`
173 for example is given a designation `RM-2R-1W` because it requires
174 two GPR reads and one GPR write.
175
176 Secondly, the total number of registers was added up (2R-1W is 3 registers)
177 and if less than or equal to three then that instruction could be given an
178 EXTRA3 designation. Four or more is given an EXTRA2 designation because
179 there are only 9 bits available.
180
181 Thirdly, the instruction was analysed to see if Twin or Single
182 Predication was suitable. As a general rule this was if there
183 was only a single operand and a single result (`extw` and LD/ST)
184 however it was found that some 2 or 3 operand instructions also
185 qualify. Given that 3 of the 9 bits of EXTRA had to be sacrificed for use
186 in Twin Predication, some compromises were made, here. LDST is
187 Twin but also has 3 operands in some operations, so only EXTRA2 can be used.
188
189 Fourthly, a packing format was decided: for 2R-1W, for example, an EXTRA3
190 indexing could be chosen such
191 that RA would be indexed 0 (EXTRA bits 0-2), RB indexed 1 (EXTRA bits 3-5)
192 and RT indexed 2 (EXTRA bits 6-8). In some cases (LD/ST with update)
193 RA-as-a-source is given a **different** EXTRA index from RA-as-a-result
194 (because it is possible to do, and perceived to be useful). Rc=1
195 co-results (CR0, CR1) are always given the same EXTRA index as their
196 main result (RT, FRT).
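
As an illustrative (non-normative) sketch of the packing principle for an
`RM-2R-1W` instruction such as `add RT, RA, RB`, where `extra3()` is a
hypothetical helper returning the chosen 3-bit scalar/vector spec for an
operand (the ratified tables in [[openpower/opcode_regs_deduped]] remain
authoritative):

    EXTRA = [0] * 9               # the 9-bit EXTRA field
    EXTRA[0:3] = extra3(RA)       # index 0: EXTRA bits 0-2
    EXTRA[3:6] = extra3(RB)       # index 1: EXTRA bits 3-5
    EXTRA[6:9] = extra3(RT)       # index 2: EXTRA bits 6-8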
197
198 Fifthly, in an automated process the results of the analysis
199 were output in CSV Format for use in machine-readable form
200 by sv_analysis.py <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/sv_analysis.py;hb=HEAD>
201
202 This process was laborious but logical, and, crucially, once a
203 decision is made (and ratified) it cannot be reversed.
204 Those qualifying future Power ISA Scalar instructions for SVP64
205 are **strongly** advised to utilise this same process and the same
206 sv_analysis.py program as a canonical method of maintaining the
207 relationships. Alterations to that same program which
208 change a Designation are **prohibited** once it is finalised (ratified
209 through the Power ISA WG Process). Doing so would
210 be similar to deciding that `add` should be changed from X-Form
211 to D-Form.
212
213 # Single Predication
214
215 This is a standard mode normally found in Vector ISAs: every element in every source Vector and in the destination uses the same bit of one single predicate mask.
216
217 In SVSTATE, for Single-predication, implementors MUST increment both srcstep and dststep: unlike Twin-Predication the two must be equal at all times.
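
A minimal sketch of the stepping (zeroing and elwidth overrides omitted;
names are illustrative):

    for i in range(VL):
        SVSTATE.srcstep = i
        SVSTATE.dststep = i              # always kept equal to srcstep
        if predicate[i]:                 # one mask bit covers src and dest
            GPR[RT+i] = OP(GPR[RA+i], GPR[RB+i])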
218
219 # Twin Predication
220
221 This is a novel concept that allows predication to be applied to a single
222 source and a single dest register. The following types of traditional
223 Vector operations may be encoded with it, *without requiring explicit
224 opcodes to do so*
225
226 * VSPLAT (a single scalar distributed across a vector)
227 * VEXTRACT (like LLVM IR [`extractelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#extractelement-instruction))
228 * VINSERT (like LLVM IR [`insertelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#insertelement-instruction))
229 * VCOMPRESS (like LLVM IR [`llvm.masked.compressstore.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics))
230 * VEXPAND (like LLVM IR [`llvm.masked.expandload.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics))
231
232 Those patterns (and more) may be applied to:
233
234 * mv (the usual way that V\* ISA operations are created)
235 * exts\* sign-extension
236 * rlwinm and other RS-RA shift operations (**note**: excluding
237 those that take RA as both a src and dest. These are not
238 1-src 1-dest, they are 2-src, 1-dest)
239 * LD and ST (treating AGEN as one source)
240 * FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
241 * Condition Register ops mfcr, mtcr and other similar
242
243 This is a huge list that creates extremely powerful combinations,
244 particularly given that one of the predicate options is `(1<<r3)`.
245
246 Additional unusual capabilities of Twin Predication include a back-to-back
247 version of VCOMPRESS-VEXPAND which is effectively the ability to do
248 sequentially ordered multiple VINSERTs. The source predicate selects a
249 sequentially ordered subset of elements to be inserted; the destination
250 predicate specifies the sequentially ordered recipient locations.
251 This is equivalent to
252 `llvm.masked.compressstore.*`
253 followed by
254 `llvm.masked.expandload.*`
255 with a single instruction.
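
A simplified sketch of the element stepping for a 1-source, 1-destination
twin-predicated operation such as `sv.mv` (non-zeroing form; elwidth
overrides omitted; names are illustrative):

    srcstep, dststep = 0, 0
    while srcstep < VL and dststep < VL:
        # each side independently skips its own masked-out elements
        while srcstep < VL and not srcmask[srcstep]: srcstep += 1
        while dststep < VL and not dstmask[dststep]: dststep += 1
        if srcstep == VL or dststep == VL: break
        GPR[RT+dststep] = GPR[RA+srcstep]
        srcstep += 1
        dststep += 1

With an all-ones source mask and a single-bit destination mask this gives
VINSERT; swapping the two gives VEXTRACT; general masks on both sides give
the compress/expand patterns listed above.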
256
257 This extreme power and flexibility comes down to the fact that SVP64
258 is not actually a Vector ISA: it is a loop-abstraction-concept that
259 is applied *in general* to Scalar operations, just like the x86
260 `REP` instruction (if put on steroids).
261
262 # Reduce modes
263
264 Reduction in SVP64 is deterministic and somewhat of a misnomer. A normal
265 Vector ISA would have explicit Reduce opcodes with defined characteristics
266 per operation: in SX Aurora there is even an additional scalar argument
267 containing the initial reduction value, and the default is either 0
268 or 1 depending on the specifics of the explicit opcode.
269 SVP64 fundamentally has to
270 utilise *existing* Scalar Power ISA v3.0B operations, which presents some
271 unique challenges.
272
273 The solution turns out to be to simply define reduction as permitting
274 deterministic element-based schedules to be issued using the base Scalar
275 operations, and to rely on the underlying microarchitecture to resolve
276 Register Hazards at the element level. This goes back to
277 the fundamental principle that SV is nothing more than a Sub-Program-Counter
278 sitting between Decode and Issue phases.
279
280 Microarchitectures *may* take opportunities to parallelise the reduction
281 but only if in doing so they preserve Program Order at the Element Level.
282 Opportunities where this is possible include an `OR` operation
283 or a MIN/MAX operation: it may be possible to parallelise the reduction,
284 but for Floating Point it is not permitted due to different results
285 being obtained if the reduction is not executed in strict Program-Sequential
286 Order.
287
288 In essence it becomes the programmer's responsibility to leverage the
289 pre-determined schedules to desired effect.
290
291 ## Scalar result reduction and iteration
292
293 Scalar Reduction per se does not exist: instead it is implemented in SVP64
294 as a simple and natural relaxation of the usual restriction on the Vector
295 Looping which would terminate if the destination was marked as a Scalar.
296 Scalar Reduction by contrast *keeps issuing Vector Element Operations*
297 even though the destination register is marked as scalar.
298 Thus it is up to the programmer to be aware of this, observe some
299 conventions, and thus end up achieving the desired outcome of scalar
300 reduction.
301
302 It is also important to appreciate that there is no
303 actual imposition or restriction on how this mode is utilised: there
304 will therefore be several valuable uses (including Vector Iteration
305 and "Reverse-Gear")
306 and it is up to the programmer to make best use of the
307 (strictly deterministic) capability
308 provided.
309
310 In this mode, which is suited to operations involving carry or overflow,
311 one register must be assigned, by convention by the programmer to be the
312 "accumulator". Scalar reduction is thus categorised by:
313
314 * One of the sources is a Vector
315 * the destination is a scalar
316 * optionally but most usefully when one source scalar register is
317 also the scalar destination (which may be informally termed
318 the "accumulator")
319 * That the source register type is the same as the destination register
320 type identified as the "accumulator". Scalar reduction on `cmp`,
321 `setb` or `isel` makes no sense for example because of the mixture
322 between CRs and GPRs.
323
324 *Note that instructions issued in Scalar reduce mode, such as `setb`,
325 are neither `UNDEFINED` nor prohibited, despite them not making much
326 sense at first glance.
327 Scalar reduce is strictly defined behaviour, and the cost in
328 hardware terms of prohibition of seemingly non-sensical operations is too great.
329 Therefore it is permitted and required to be executed successfully.
330 Implementors **MAY** choose to optimise such instructions in instances
331 where their use results in "extraneous execution", i.e. where it is clear
332 that the sequence of operations, comprising multiple overwrites to
333 a scalar destination **without** cumulative, iterative, or reductive
334 behaviour (no "accumulator"), may discard all but the last element
335 operation. Identification
336 of such is trivial to do for `setb` and `cmp`: the source register type is
337 a completely different register file from the destination.
338 Likewise Scalar reduction when the destination is a Vector
339 is as if the Reduction Mode was not requested.*
340
341 Typical applications include simple operations such as `ADD r3, r10.v,
342 r3` where, clearly, r3 is being used to accumulate the addition of all
343 elements of the vector starting at r10.
344
345 # add RT, RA,RB but when RT==RA
346 for i in range(VL):
347 iregs[RA] += iregs[RB+i] # RT==RA
348
349 However, *unless* the operation is marked as "mapreduce" (`sv.add/mr`)
350 SV ordinarily
351 **terminates** at the first scalar operation. Only by marking the
352 operation as "mapreduce" will it continue to issue multiple sub-looped
353 (element) instructions in `Program Order`.
354
355 To perform the loop in reverse order, the ```RG``` (reverse gear) bit must be set. This may be useful in situations where the results may be different
356 (floating-point) if executed in a different order. Given that there is
357 no actual prohibition on Reduce Mode being applied when the destination
358 is a Vector, the "Reverse Gear" bit turns out to be a way to apply Iterative
359 or Cumulative Vector operations in reverse. `sv.add/rg r3.v, r4.v, r4.v`
360 for example will start at the opposite end of the Vector and push
361 a cumulative series of overlapping add operations into the Execution units of
362 the underlying hardware.
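
For the scalar-accumulator case, a sketch of the effect of setting `RG`
(building on the earlier `RT==RA` example):

    # as before, but the element loop is issued from the highest index
    # downwards, still strictly one element at a time in Program Order
    for i in reversed(range(VL)):
        iregs[RA] += iregs[RB+i]   # RT==RA, the "accumulator"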
363
364 Other examples include shift-mask operations where a Vector of inserts
365 into a single destination register is required (see [[sv/bitmanip]], bmset),
366 as a way to construct
367 a value quickly from multiple arbitrary bit-ranges and bit-offsets.
368 Using the same register as both the source and destination, with Vectors
369 of different offsets masks and values to be inserted has multiple
370 applications including Video, cryptography and JIT compilation.
371
372 # assume VL=4:
373 # * Vector of shift-offsets contained in RC (r12.v)
374 # * Vector of masks contained in RB (r8.v)
375 # * Vector of values to be masked-in in RA (r4.v)
376 # * Scalar destination RT (r0) to receive all mask-offset values
377 sv.bmset/mr r0, r4.v, r8.v, r12.v
378
379 Due to the Deterministic Scheduling,
380 Subtract and Divide are still permitted to be executed in this mode,
381 although from an algorithmic perspective it is strongly discouraged.
382 It would be better to use addition followed by one final subtract,
383 or in the case of divide, to get better accuracy, to perform a multiply
384 cascade followed by a final divide.
385
386 Note that single-operand or three-operand scalar-dest reduce is perfectly
387 well permitted: the programmer may still declare one register, used as
388 both a Vector source and Scalar destination, to be utilised as
389 the "accumulator". In the case of `sv.fmadds` and `sv.maddhw` etc
390 this naturally fits well with the normal expected usage of these
391 operations.
392
393 If an interrupt or exception occurs in the middle of the scalar mapreduce,
394 the scalar destination register **MUST** be updated with the current
395 (intermediate) result, because this is how ```Program Order``` is
396 preserved (Vector Loops are to be considered to be just another way of issuing instructions
397 in Program Order). In this way, after return from interrupt,
398 the scalar mapreduce may continue where it left off. This provides
399 "precise" exception behaviour.
400
401 Note that hardware is perfectly permitted to perform multi-issue
402 parallel optimisation of the scalar reduce operation: it's just that
403 as far as the user is concerned, all exceptions and interrupts **MUST**
404 be precise.
405
406 ## Vector result reduce mode
407
408 Vector Reduce Mode issues a deterministic tree-reduction schedule to the underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
409 (Power ISA v3.0B) operation is leveraged, unmodified, to give the
410 *appearance* and *effect* of Reduction.
411
412 Given that the tree-reduction schedule is deterministic,
413 Interrupts and exceptions
414 can therefore also be precise. The final result will be in the first
415 non-predicate-masked-out destination element, but due again to
416 the deterministic schedule programmers may find uses for the intermediate
417 results.
418
419 When Rc=1 a corresponding Vector of co-resultant CRs is also
420 created. No special action is taken: the result and its CR Field
421 are stored "as usual" exactly as all other SVP64 Rc=1 operations.
422
423 ## Sub-Vector Horizontal Reduction
424
425 Note that when SVM is clear and SUBVL!=1 the sub-elements are
426 *independent*: as a result they are mapreduced per *sub-element*.
427 Illustration with a vec2, assuming RA==RT, e.g. `sv.add/mr/vec2 r4, r4, r16`:
428
429 for i in range(0, VL):
430 # RA==RT in the instruction. does not have to be
431 iregs[RT].x = op(iregs[RT].x, iregs[RB+i].x)
432 iregs[RT].y = op(iregs[RT].y, iregs[RB+i].y)
433
434 Thus logically there is nothing special or unanticipated about
435 `SVM=0`: it is expected behaviour according to standard SVP64
436 Sub-Vector rules.
437
438 By contrast, when SVM is set and SUBVL!=1, a Horizontal
439 Subvector mode is enabled, which behaves very much more
440 like a traditional Vector Processor Reduction instruction.
441 Example for a vec3:
442
443 for i in range(VL):
444 result = iregs[RA+i].x
445 result = op(result, iregs[RA+i].y)
446 result = op(result, iregs[RA+i].z)
447 iregs[RT+i] = result
448
449 In this mode, when Rc=1 the Vector of CRs is as normal: each result
450 element creates a corresponding CR element (for the final, reduced, result).
451
452 # Fail-on-first
453
454 Data-dependent fail-on-first has two distinct variants: one for LD/ST
455 (see [[sv/ldst]]),
456 the other for arithmetic operations (actually, CR-driven)
457 ([[sv/normal]]) and CR operations ([[sv/cr_ops]]).
458 Note in each
459 case the assumption is that vector elements are required to appear to be
460 executed in sequential Program Order, element 0 being the first.
461
462 * LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
463 ordinary one. Exceptions occur "as normal". However for elements 1
464 and above, if an exception would occur, then VL is **truncated** to the
465 previous element.
466 * Data-driven (CR-driven) fail-on-first activates when Rc=1 or other
467 CR-creating operation produces a result (including cmp). Similar to
468 branch, an analysis of the CR is performed and if the test fails, the
469 vector operation terminates and discards all element operations
470 above the current one (and the current one if VLi is not set),
471 and VL is truncated to either
472 the *previous* element or the current one, depending on whether
473 VLi (VL "inclusive") is set.
474
475 Thus the new VL comprises a contiguous vector of results,
476 all of which pass the testing criteria (equal to zero, less than zero).
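
A simplified sketch of the CR-driven truncation (single predication and
elwidth overrides omitted; helper names are illustrative):

    for i in range(VL):
        result, cr = element_op(GPR[RA+i], GPR[RB+i])  # produces a CR Field
        if test_fails(cr):      # the selected CR bit, compared against "inv"
            VL = i              # truncate to the *previous* element...
            # (with VLi set: VL = i+1 and this element is also written back)
            break
        write_back(i, result, cr)   # elements 0..VL-1 all passed the test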
477
478 The CR-based data-driven fail-on-first is new and not found in ARM
479 SVE or RVV. It is extremely useful for reducing instruction count,
480 however requires speculative execution involving modifications of VL
481 to get high performance implementations. An additional mode (RC1=1)
482 effectively turns what would otherwise be an arithmetic operation
483 into a type of `cmp`. The CR is stored (and the CR.eq bit tested
484 against the `inv` field).
485 If the CR.eq bit is equal to `inv` then the Vector is truncated and
486 the loop ends.
487 Note that when RC1=1 the result elements are never stored, only the CRs.
488
489 VLi is only available as an option when `Rc=0` (or for instructions
490 which do not have Rc). When set, the current element is always
491 also included in the count (the new length that VL will be set to).
492 This may be useful in combination with "inv" to truncate the Vector
493 to `exclude` elements that fail a test, or, in the case of implementations
494 of strncpy, to include the terminating zero.
495
496 In CR-based data-driven fail-on-first there is only the option to select
497 and test one bit of each CR (just as with branch BO). For more complex
498 tests this may be insufficient. If that is the case, a vectorised crop
499 (crand, cror) may be used, and ffirst applied to the crop instead of to
500 the arithmetic vector.
501
502 One extremely important aspect of ffirst is:
503
504 * LDST ffirst may never set VL equal to zero. This is because on the first
505 element an exception must be raised "as normal".
506 * CR-based data-dependent ffirst on the other hand **can** set VL equal
507 to zero. This is the only means in the entirety of SV that VL may be set
508 to zero (with the exception of via the SV.STATE SPR). When VL is set
509 zero due to the first element failing the CR bit-test, all subsequent
510 vectorised operations are effectively `nops` which is
511 *precisely the desired and intended behaviour*.
512
513 Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
514 to a nonzero value for any implementation-specific reason. For example:
515 it is perfectly reasonable for implementations to alter VL when ffirst
516 LD or ST operations are initiated on a nonaligned boundary, such that
517 within a loop the subsequent iteration of that loop begins subsequent
518 ffirst LD/ST operations on an aligned boundary. Likewise, to reduce
519 workloads or balance resources.
520
521 CR-based data-dependent fail-first on the other hand MUST NOT truncate VL
522 arbitrarily to a length decided by the hardware: VL MUST only be
523 truncated based explicitly on whether a test fails.
524 This is because it is a precise test on which algorithms
525 will rely.
526
527 ## Data-dependent fail-first on CR operations (crand etc)
528
529 Operations that actually produce or alter a CR Field as a result
530 do not also in turn have an Rc=1 mode. However it makes no
531 sense to try to test the 4 bits of a CR Field for being equal
532 or not equal to zero. Moreover, the result is already in the
533 form that is desired: it is a CR field. Therefore,
534 CR-based operations have their own SVP64 Mode, described
535 in [[sv/cr_ops]].
536
537 There are two primary different types of CR operations:
538
539 * Those which have a 3-bit operand field (referring to a CR Field)
540 * Those which have a 5-bit operand (referring to a bit within the
541 whole 32-bit CR)
542
543 More details can be found in [[sv/cr_ops]].
544
545 # pred-result mode
546
547 Predicate-result merges common CR testing with predication, saving on
548 instruction count. In essence, a Condition Register Field test
549 is performed, and if it fails it is considered to have been
550 *as if* the destination predicate bit was zero.
551 Arithmetic and Logical Pred-result is covered in [[sv/normal]].
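
A simplified sketch (single predication shown; helper names are
illustrative):

    for i in range(VL):
        if not predicate[i]:
            continue
        result, cr = element_op(GPR[RA+i], GPR[RB+i])
        if test_fails(cr):      # the CR Field test selected by the mode
            continue            # behave exactly as if predicate[i] were zero
        GPR[RT+i] = result      # only elements passing both checks are written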
552
553 Pred-result mode may not be applied to CR ops.
554
555 Although CR operations (mtcr, crand, cror) may be Vectorised and
556 predicated, pred-result mode applies only to operations that have
557 an Rc=1 mode, or to which adding an RC1 option makes sense.
558
559 # CR Operations
560
561 CRs are slightly more involved than INT or FP registers due to the
562 possibility for indexing individual bits (crops BA/BB/BT). Again however
563 the access pattern needs to be understandable in relation to v3.0B / v3.1B
564 numbering, with a clear linear relationship and mapping existing when
565 SV is applied.
566
567 ## CR EXTRA mapping table and algorithm
568
569 Numbering relationships for CR fields are already complex due to being
570 in BE format (*the relationship is not clearly explained in the v3.0B
571 or v3.1 specification*). However with some care and consideration
572 the exact same mapping used for INT and FP regfiles may be applied,
573 just to the upper bits, as explained below. The notation
574 `CR{field number}` is used to indicate access to a particular
575 Condition Register Field (as opposed to the notation `CR[bit]`
576 which accesses one bit of the 32 bit Power ISA v3.0B
577 Condition Register)
578
579 `CR{n}` refers to `CR0` when `n=0` and consequently, for CR0-7, is defined, in v3.0B pseudocode, as:
580
581 CR{7-n} = CR[32+n*4:35+n*4]
582
583 For SVP64 the relationship for the sequential
584 numbering of elements is to the CR **fields** within
585 the CR Register, not to individual bits within the CR register.
586
587 In OpenPOWER v3.0/1, BT/BA/BB are all 5 bits. The top 3 bits (0:2)
588 select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits
589 *in* that CR. The numbering was determined (after 4 months of
590 analysis and research) to be as follows:
591
592 CR_index = 7-(BA>>2) # top 3 bits but BE
593 bit_index = 3-(BA & 0b11) # low 2 bits but BE
594 CR_reg = CR{CR_index} # get the CR
595 # finally get the bit from the CR.
596 CR_bit = (CR_reg & (1<<bit_index)) != 0
597
598 When it comes to applying SV, it is the CR\_reg number to which SV EXTRA2/3
599 applies, **not** the CR\_bit portion (bits 3:4):
600
601 if extra3_mode:
602 spec = EXTRA3
603 else:
604 spec = EXTRA2<<1 | 0b0
605 if spec[0]:
606 # vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
607 return ((BA >> 2)<<6) | # hi 3 bits shifted up
608 (spec[1:2]<<4) | # to make room for these
609 (BA & 0b11) # CR_bit on the end
610 else:
611 # scalar constructs "00 spec[1:2] BA[0:4]"
612 return (spec[1:2] << 5) | BA
613
614 Thus, for example, to access a given bit for a CR in SV mode, the v3.0B
615 algorithm to determine CR\_reg is modified as follows:
616
617 CR_index = 7-(BA>>2) # top 3 bits but BE
618 if spec[0]:
619 # vector mode, 0-124 increments of 4
620 CR_index = (CR_index<<4) | (spec[1:2] << 2)
621 else:
622 # scalar mode, 0-32 increments of 1
623 CR_index = (spec[1:2]<<3) | CR_index
624 # same as for v3.0/v3.1 from this point onwards
625 bit_index = 3-(BA & 0b11) # low 2 bits but BE
626 CR_reg = CR{CR_index} # get the CR
627 # finally get the bit from the CR.
628 CR_bit = (CR_reg & (1<<bit_index)) != 0
629
630 Note here that the decoding pattern to determine CR\_bit does not change.
631
632 Note: high-performance implementations may read/write Vectors of CRs in
633 batches of aligned 32-bit chunks (CR0-7, CR8-15). This is to greatly
634 simplify internal design. If instructions are issued where CR Vectors
635 do not start on a 32-bit aligned boundary, performance may be affected.
636
637 ## CR fields as inputs/outputs of vector operations
638
639 CRs (or, the arithmetic operations associated with them)
640 may be marked as Vectorised or Scalar. When Rc=1 in arithmetic operations that have no explicit EXTRA to cover the CR, the CR is Vectorised if the destination is Vectorised. Likewise if the destination is scalar then so is the CR.
641
642 When Vectorised, the CR inputs/outputs are sequentially read/written
643 to 4-bit CR fields. Vectorised Integer results, when Rc=1, will begin
644 writing to CR8 (TBD evaluate) and increase sequentially from there.
645 This is so that:
646
647 * implementations may rely on the Vector CRs being aligned to 8. This
648 means that CRs may be read or written in aligned batches of 32 bits
649 (8 CRs per batch), for high performance implementations.
650 * scalar Rc=1 operation (CR0, CR1) and callee-saved CRs (CR2-4) are not
651 overwritten by vector Rc=1 operations except for very large VL
652 * CR-based predication, from CR32, is also not interfered with
653 (except by large VL).
654
655 However when the SV result (destination) is marked as a scalar by the
656 EXTRA field the *standard* v3.0B behaviour applies: the accompanying
657 CR when Rc=1 is written to. This is CR0 for integer operations and CR1
658 for FP operations.
659
660 Note that yes, the CR Fields are genuinely Vectorised. Unlike in SIMD VSX which
661 has a single CR (CR6) for a given SIMD result, SV Vectorised OpenPOWER
662 v3.0B scalar operations produce a **tuple** of element results: the
663 result of the operation as one part of that element *and a corresponding
664 CR element*. Greatly simplified pseudocode:
665
666 for i in range(VL):
667 # calculate the vector result of an add
668 iregs[RT+i] = iregs[RA+i] + iregs[RB+i]
669 # now calculate CR bits
670 CRs{8+i}.eq = iregs[RT+i] == 0
671 CRs{8+i}.gt = iregs[RT+i] > 0
672 ... etc
673
674 If a "cumulated" CR based analysis of results is desired (a la VSX CR6)
675 then a followup instruction must be performed, setting "reduce" mode on
676 the Vector of CRs, using cr ops (crand, crnor) to do so. This provides far
677 more flexibility in analysing vectors than standard Vector ISAs. Normal
678 Vector ISAs are typically restricted to "were all results nonzero" and
679 "were some results nonzero". The application of mapreduce to Vectorised
680 cr operations allows far more sophisticated analysis, particularly in
681 conjunction with the new crweird operations: see [[sv/cr_int_predication]].
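
As a sketch of the principle (using the same pseudocode conventions as the
example above, not actual assembler), a VSX-CR6-style "were all results
zero?" test becomes a mapreduce over the `eq` bits of the CR Vector
produced by the earlier Rc=1 operation:

    all_zero = 1
    for i in range(VL):
        all_zero = all_zero & CRs{8+i}.eq   # the effect of crand in mapreduce mode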
682
683 Note in particular that the use of a separate instruction in this way
684 ensures that high performance multi-issue OoO implementations do not
685 have the computation of the cumulative analysis CR as a bottleneck and
686 hindrance, regardless of the length of VL.
687
688 Additionally,
689 SVP64 [[sv/branches]] may be used, even when the branch itself is to
690 the following instruction. The combined side-effects of CTR reduction
691 and VL truncation provide several benefits.
692
693 (see [[discussion]]. some alternative schemes are described there)
694
695 ## Rc=1 when SUBVL!=1
696
697 Sub-vectors are effectively a form of Packed SIMD (length 2 to 4). Only 1 bit of
698 predicate is allocated per subvector; likewise only one CR is allocated
699 per subvector.
700
701 This leaves a conundrum as to how to apply CR computation per subvector,
702 when normally Rc=1 is exclusively applied to scalar elements. A solution
703 is to perform a bitwise OR or AND of the subvector tests. Given that
704 OE is ignored in SVP64, this field may (when available) be used to select OR or
705 AND behavior.
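
A sketch of the idea for SUBVL=2 (vec2), using the same pseudocode
conventions as elsewhere in this appendix:

    for i in range(VL):
        x_test = (iregs[RT+i].x == 0)       # per-subelement Rc=1 "eq" test
        y_test = (iregs[RT+i].y == 0)
        CRs{8+i}.eq = x_test | y_test       # OR form; the AND form uses &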
706
707 ### Table of CR fields
708
709 CRn is the notation used by the OpenPower spec to refer to CR field #n,
710 so FP instructions with Rc=1 write to CR1 (n=1).
711
712 CRs are not stored in SPRs: they are registers in their own right.
713 Therefore context-switching the full set of CRs involves a Vectorised
714 mfcr or mtcr, using VL=8 to do so. This is exactly how
715 scalar OpenPOWER context-switches CRs: it is just that there are now
716 more of them.
717
718 The 64 SV CRs are arranged similarly to the way the 128 integer registers
719 are arranged. TODO a python program that auto-generates a CSV file
720 which can be included in a table, which is in a new page (so as not to
721 overwhelm this one). [[svp64/cr_names]]
722
723 # Register Profiles
724
725 **NOTE THIS TABLE SHOULD NO LONGER BE HAND EDITED** see
726 <https://bugs.libre-soc.org/show_bug.cgi?id=548> for details.
727
728 Instructions are broken down by Register Profiles as listed in the
729 following auto-generated page: [[opcode_regs_deduped]]. "Non-SV"
730 indicates that the operations with this Register Profile cannot be
731 Vectorised (mtspr, bc, dcbz, twi)
732
733 TODO generate table which will be here [[svp64/reg_profiles]]
734
735 # SV pseudocode illustration
736
737 ## Single-predicated Instruction
738
739 Illustration of a normal-mode add operation: zeroing not included, elwidth
740 overrides not included. If there is no predicate, it is set to all 1s:
741
742 function op_add(rd, rs1, rs2) # add not VADD!
743 int i, id=0, irs1=0, irs2=0;
744 predval = get_pred_val(FALSE, rd);
745 for (i = 0; i < VL; i++)
746 STATE.srcoffs = i # save context
747 if (predval & 1<<i) # predication uses intregs
748 ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
749 if (!int_vec[rd].isvec) break;
750 if (rd.isvec) { id += 1; }
751 if (rs1.isvec) { irs1 += 1; }
752 if (rs2.isvec) { irs2 += 1; }
753 if (id == VL or irs1 == VL or irs2 == VL)
754 {
755 # end VL hardware loop
756 STATE.srcoffs = 0; # reset
757 return;
758 }
759
760 This has several modes:
761
762 * RT.v = RA.v RB.v
763 * RT.v = RA.v RB.s (and RA.s RB.v)
764 * RT.v = RA.s RB.s
765 * RT.s = RA.v RB.v
766 * RT.s = RA.v RB.s (and RA.s RB.v)
767 * RT.s = RA.s RB.s
768
769 All of these may be predicated. Vector-Vector is straightforward.
770 When one source is a Vector and the other a Scalar, it is clear that
771 each element of the Vector source should be added to the Scalar source,
772 each result placed into the Vector (or, if the destination is a scalar,
773 only the first nonpredicated result).
774
775 The one that is not obvious is RT=vector but both RA/RB=scalar.
776 Here this acts as a "splat scalar result", copying the same result into
777 all nonpredicated result elements. If a fixed destination scalar was
778 intended, then an all-Scalar operation should be used.
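
A sketch of the splat effect, reusing the names from the pseudocode above:

    # RT.v = RA.s + RB.s: the same scalar result lands in every
    # non-masked-out destination element (sources do not advance)
    result = ireg[rs1] + ireg[rs2]
    for i in range(VL):
        if predval & (1 << i):
            ireg[rd + i] = result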
779
780 See <https://bugs.libre-soc.org/show_bug.cgi?id=552>
781
782 # Assembly Annotation
783
784 Assembly code annotation is required for SV to be able to successfully
785 mark instructions as "prefixed".
786
787 A reasonable (prototype) starting point:
788
789 svp64 [field=value]*
790
791 Fields:
792
793 * ew=8/16/32 - element width
794 * sew=8/16/32 - source element width
795 * vec=2/3/4 - SUBVL
796 * mode=reduce/satu/sats/crpred
797 * pred=1\<\<3/r3/~r3/r10/~r10/r30/~r30/lt/gt/le/ge/eq/ne
798 * spred={reg spec}
799
800 This is similar to the x86 "REX" prefix.
801
802 For actual assembler:
803
804 sv.asmcode/mode.vec{N}.ew=8,sw=16,m={pred},sm={pred} reg.v, src.s
805
806 Qualifiers:
807
808 * m={pred}: predicate mask mode
809 * sm={pred}: source-predicate mask mode (only allowed in Twin-predication)
810 * vec{N}: vec2 OR vec3 OR vec4 - sets SUBVL=2/3/4
811 * ew={N}: ew=8/16/32 - sets elwidth override
812 * sw={N}: sw=8/16/32 - sets source elwidth override
813 * ff={xx}: see fail-first mode
814 * pr={xx}: see predicate-result mode
815 * sat{x}: satu / sats - see saturation mode
816 * mr: see map-reduce mode
817 * mr.svm see map-reduce with sub-vector mode
818 * crm: see map-reduce CR mode
819 * crm.svm see map-reduce CR with sub-vector mode
820 * sz: predication with source-zeroing
821 * dz: predication with dest-zeroing
822
823 For modes:
824
825 * pred-result:
826 - pm=lt/gt/le/ge/eq/ne/so/ns OR
827 - pm=RC1 OR pm=~RC1
828 * fail-first
829 - ff=lt/gt/le/ge/eq/ne/so/ns OR
830 - ff=RC1 OR ff=~RC1
831 * saturation:
832 - sats
833 - satu
834 * map-reduce:
835 - mr OR crm: "normal" map-reduce mode or CR-mode.
836 - mr.svm OR crm.svm: when vec2/3/4 set, sub-vector mapreduce is enabled
837
838 # Proposed Parallel-reduction algorithm
839
840 **This algorithm contains a MV operation and may NOT be used. Removal
841 of the MV operation may be achieved by using index-redirection as was
842 achieved in DCT and FFT REMAP**
843
844 ```
845 /// reference implementation of proposed SimpleV reduction semantics.
846 ///
847 // reduction operation -- we still use this algorithm even
848 // if the reduction operation isn't associative or
849 // commutative.
850 XXX VIOLATION OF SVP64 DESIGN PRINCIPLES XXXX
851 /// XXX `pred` is a user-visible Vector Condition register XXXX
852 XXX VIOLATION OF SVP64 DESIGN PRINCIPLES XXXX
853 ///
854 /// all input arrays have length `vl`
855 def reduce(vl, vec, pred):
856 pred = copy(pred) # must not damage predicate
857 step = 1;
858 while step < vl
859 step *= 2;
860 for i in (0..vl).step_by(step)
861 other = i + step / 2;
862 other_pred = other < vl && pred[other];
863 if pred[i] && other_pred
864 vec[i] += vec[other];
865 else if other_pred
866 XXX VIOLATION OF SVP64 DESIGN XXX
867 XXX vec[i] = vec[other]; XXX
868 XXX VIOLATION OF SVP64 DESIGN XXX
869 pred[i] |= other_pred;
870 ```
871
872 The first principle in SVP64 being violated is that SVP64 is a fully-independent
873 Abstraction of hardware-looping in between issue and execute phases
874 that has no relation to the operation it issues. The above pseudocode
875 conditionally changes not only the type of element operation issued
876 (a MV in some cases) but also the number of arguments (2 for a MV).
877 At the very least, for Vertical-First Mode this will result in unanticipated and unexpected behaviour (maximising "surprises" for programmers) in
878 the middle of loops, which would be far too hard to explain.
879
880 The second principle being violated by the above algorithm is the expectation
881 that temporary storage is available for a modified predicate: there is no
882 such space, and predicates are read-only to reduce complexity at the
883 micro-architectural level.
884 SVP64 is founded on the principle that all operations are
885 "re-entrant" with respect to interrupts and exceptions: SVSTATE must
886 be saved and restored alongside PC and MSR, but nothing more. It is perfectly
887 fine to have context-switching back to the operation be somewhat slower,
888 through "reconstruction" of temporary internal state based on what SVSTATE
889 contains, but nothing more.
890
891 An alternative algorithm is therefore required that does not perform MVs,
892 and does not require additional state to be saved on context-switching.
893
894 ```
895 def reduce( vl, vec, pred ):
896 pred = copy(pred) # must not damage predicate
897 j = 0
898 vi = [] # array of lookup indices to skip nonpredicated
899 for i, pbit in enumerate(pred):
900 if pbit:
901 vi[j] = i
902 j += 1
903 step = 2
904 while step <= vl
905 halfstep = step // 2
906 for i in (0..vl).step_by(step)
907 other = vi[i + halfstep]
908 ir = vi[i]
909 other_pred = other < vl && pred[other]
910 if pred[i] && other_pred
911 vec[ir] += vec[other]
912 else if other_pred:
913 vi[ir] = vi[other] # index redirection, no MV
914 pred[ir] |= other_pred # reconstructed on context-switch
915 step *= 2
916 ```
917
918 In this version the need for an explicit MV is made unnecessary by instead
919 leaving elements *in situ*. The internal modifications to the predicate may,
920 due to the reduction being entirely deterministic, be "reconstructed"
921 on a context-switch. This may make some implementations slower.
922
923 *Implementor's Note: many SIMD-based Parallel Reduction Algorithms are
924 implemented in hardware with MVs that ensure lane-crossing is minimised.
925 The mistake which would be catastrophic to SVP64 to make is to then
926 limit the Reduction Sequence for all implementors
927 based solely and exclusively on what one
928 specific internal microarchitecture does.
929 In SIMD ISAs the internal SIMD Architectural design is exposed and imposed on the programmer. Cray-style Vector ISAs on the other hand provide convenient,
930 compact and efficient encodings of abstract concepts.
931 It is the Implementor's responsibility to produce a design
932 that complies with the above algorithm,
933 utilising internal Micro-coding and other techniques to transparently
934 insert MV operations
935 if necessary or desired, to give the level of efficiency or performance
936 required.*
937
938 # Element-width overrides
939
940 Element-width overrides are best illustrated with a packed structure
941 union in the c programming language. The following should be taken
942 literally, and assume always a little-endian layout:
943
944 typedef union {
945 uint8_t b[];
946 uint16_t s[];
947 uint32_t i[];
948 uint64_t l[];
949 uint8_t actual_bytes[8];
950 } el_reg_t;
951
952 el_reg_t int_regfile[128];
953
954 get_polymorphed_reg(reg, bitwidth, offset):
955 el_reg_t res;
956 res.l = 0; // TODO: going to need sign-extending / zero-extending
957 if bitwidth == 8:
958 res.b = int_regfile[reg].b[offset]
959 elif bitwidth == 16:
960 res.s = int_regfile[reg].s[offset]
961 elif bitwidth == 32:
962 res.i = int_regfile[reg].i[offset]
963 elif bitwidth == 64:
964 res.l = int_regfile[reg].l[offset]
965 return res
966
967 set_polymorphed_reg(reg, bitwidth, offset, val):
968 if (!reg.isvec):
969 # not a vector: first element only, overwrites high bits
970 int_regfile[reg].l[0] = val
971 elif bitwidth == 8:
972 int_regfile[reg].b[offset] = val
973 elif bitwidth == 16:
974 int_regfile[reg].s[offset] = val
975 elif bitwidth == 32:
976 int_regfile[reg].i[offset] = val
977 elif bitwidth == 64:
978 int_regfile[reg].l[offset] = val
979
980 In effect the GPR registers r0 to r127 (and corresponding FPRs fp0
981 to fp127) are reinterpreted to be "starting points" in a byte-addressable
982 memory. Vectors - which become just a virtual naming construct - effectively
983 overlap.
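
For example (a sketch using the union above): with an element width of 8
and VL=20, a vector starting at r2 occupies twenty consecutive bytes of
that storage, spilling over into what would, in scalar terms, be r3 and
half of r4:

    # byte-wide elements: element j of a vector starting at r2 lives at
    # byte j of the storage that begins at register 2
    for j in range(20):                       # VL=20, ew=8
        int_regfile[2].b[j] = source_bytes[j] # bytes 0-7 r2, 8-15 r3, 16-19 r4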
984
985 It is extremely important for implementors to note that the only circumstance
986 where upper portions of an underlying 64-bit register are zero'd out is
987 when the destination is a scalar. The ideal register file has byte-level
988 write-enable lines, just like most SRAMs.
989
990 An example ADD operation with predication and element width overrides:
991
992 for (i = 0; i < VL; i++)
993 if (predval & 1<<i) # predication
994 src1 = get_polymorphed_reg(RA, srcwid, irs1)
995 src2 = get_polymorphed_reg(RB, srcwid, irs2)
996 result = src1 + src2 # actual add here
997 set_polymorphed_reg(RT, destwid, ird, result)
998 if (!RT.isvec) break
999 if (RT.isvec) { ird += 1; }
1000 if (RA.isvec) { irs1 += 1; }
1001 if (RB.isvec) { irs2 += 1; }
1002
1003 # Twin (implicit) result operations
1004
1005 Some operations in the Power ISA already target two 64-bit scalar
1006 registers: `lq` for example. Some mathematical algorithms are more
1007 efficient when there are two outputs rather than one, providing
1008 feedback loops between elements. 64-bit multiply
1009 for example actually internally produces a 128 bit result, which clearly
1010 cannot be stored in a single 64 bit register. Some ISAs recommend
1011 "macro op fusion": the practice of setting a convention whereby if
1012 two commonly used instructions (mullo, mulhi) use the same ALU but
1013 one selects the low part of an identical operation and the other
1014 selects the high part, then optimised micro-architectures may
1015 "fuse" those two instructions together, using Micro-coding techniques,
1016 internally.
1017
1018 The practice and convention of macro-op fusion however is not compatible
1019 with SVP64 Horizontal-First, because Horizontal Mode may only
1020 be applied to a single instruction at a time. Thus it becomes
1021 necessary to add explicit more complex single instructions with
1022 more operands than would normally be seen in another ISA. If it
1023 was not for Power ISA already having LD/ST with update as well as
1024 Condition Codes and `lq` this would be hard to justify.
1025
1026 With limited space in the `EXTRA` Field, and Power ISA opcodes
1027 being only 32 bit, 5 operands is quite an ask. `lq` however sets
1028 a precedent: `RTp` stands for "RT pair". In other words the result
1029 is stored in RT and RT+1. For Scalar operations, following this
1030 precedent is perfectly reasonable. In Scalar mode,
1031 `madded` therefore stores the two halves of the 128-bit multiply
1032 into RT and RT+1.
1033
1034 What, then, of `sv.madded`? If the destination is hard-coded to
1035 RT and RT+1 the instruction is not useful when Vectorised because
1036 the output will be overwritten on the next element. The solution
1037 is easy: define the destination registers as RT and RT+MAXVL
1038 respectively. This makes it easy for compilers to statically allocate
1039 registers even when VL changes dynamically.
1040
1041 Bearing in mind that both RT and RT+MAXVL are starting points for Vectors,
1042 and that element-width overrides still have to be taken
1043 into consideration, the starting point for the implicit destination
1044 is best illustrated in pseudocode:
1045
1046 # demo of madded
1047 for (i = 0; i < VL; i++)
1048 if (predval & 1<<i) # predication
1049 src1 = get_polymorphed_reg(RA, srcwid, irs1)
1050 src2 = get_polymorphed_reg(RB, srcwid, irs2)
1051 src3 = get_polymorphed_reg(RC, srcwid, irs3)
1052 result = src1*src2 + src3
1053 destmask = (1<<destwid)-1
1054 # store two halves of result, second half at element offset MAXVL
1055 set_polymorphed_reg(RT, destwid, ird, result&destmask)
1056 set_polymorphed_reg(RT, destwid, ird+MAXVL, result>>destwid)
1057 if (!RT.isvec) break
1058 if (RT.isvec) { ird += 1; }
1059 if (RA.isvec) { irs1 += 1; }
1060 if (RB.isvec) { irs2 += 1; }
1061 if (RC.isvec) { irs3 += 1; }
1062
1063 The significant part here is that the second half is stored
1064 starting not from RT+MAXVL at all: it is the *element* index
1065 that is offset by MAXVL, both starting from RT.
1066
1067 * [[isa/svfixedarith]]
1068 * [[isa/svfparith]]
1069