[[!tag standards]]

# Appendix

* <https://bugs.libre-soc.org/show_bug.cgi?id=574> Saturation
* <https://bugs.libre-soc.org/show_bug.cgi?id=558#c47> Parallel Prefix
* <https://bugs.libre-soc.org/show_bug.cgi?id=697> Reduce Modes
* <https://bugs.libre-soc.org/show_bug.cgi?id=864> parallel prefix simulator
* <https://bugs.libre-soc.org/show_bug.cgi?id=809> OV sv.addex discussion

This is the appendix to [[sv/svp64]], providing explanations of modes
etc., leaving the main svp64 page's primary purpose as outlining the
instruction format.

Table of contents:

[[!toc]]

# Partial Implementations

It is perfectly legal to implement subsets of SVP64 as long as illegal
instruction traps are always raised on unimplemented features,
so that soft-emulation is possible,
even for future revisions of SVP64. With SVP64 being partly controlled
through contextual SPRs, a little care has to be taken.

**All** SPRs
not implemented, including reserved ones for future use, must raise an
illegal
instruction trap if read or written. This allows software the
opportunity to emulate the context created by the given SPR.

See [[sv/compliancy_levels]] for full details.

# XER, SO and other global flags

Vector systems are expected to be high performance. This is achieved
through parallelism, which requires that elements in the vector be
independent. XER SO/OV and other global "accumulation" flags (CR.SO) cause
Read-Write Hazards on single-bit global resources, having a significant
detrimental effect.

Consequently in SV, XER.SO behaviour is disregarded (including
in `cmp` instructions). XER.SO is not read, but XER.OV may be written,
breaking the Read-Modify-Write Hazard Chain that complicates
microarchitectural implementations.
This includes when `scalar identity behaviour` occurs. If precise
OpenPOWER v3.0/1 scalar behaviour is desired then OpenPOWER v3.0/1
instructions should be used without an SV Prefix.

TODO jacob add about OV https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf

Of note here is that XER.SO and OV may already be disregarded in the
Power ISA v3.0/1 SFFS (Scalar Fixed and Floating) Compliancy Subset.
SVP64 simply makes it mandatory to disregard XER.SO even for other Subsets,
but only for SVP64 Prefixed Operations.

XER.CA/CA32 on the other hand is expected and required to be implemented
according to standard Power ISA Scalar behaviour. Interestingly, due
to SVP64 being in effect a hardware for-loop around Scalar instructions
executing in precise Program Order, a little thought shows that a Vectorised
Carry-In-Out add is in effect a Big Integer Add, taking a single bit Carry In
and producing, at the end, a single bit Carry out. High performance
implementations may exploit this observation to deploy efficient
Parallel Carry Lookahead.

    # assume VL=4, this results in 4 sequential ops (below)
    sv.adde r0.v, r4.v, r8.v

    # instructions that get executed in backend hardware:
    adde r0, r4, r8 # takes carry-in, produces carry-out
    adde r1, r5, r9 # takes carry from previous
    ...
    adde r3, r7, r11 # likewise

It can clearly be seen that the carry chains from one
64 bit add to the next, the end result being that a
256-bit "Big Integer Add" has been performed, and that
CA contains the 257th bit. A one-instruction 512-bit Add
may be performed by setting VL=8, and a one-instruction
1024-bit add by setting VL=16, and so on. More on
this in [[openpower/sv/biginteger]].
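
The chained behaviour may be modelled in python (a minimal sketch,
assuming VL=4, no predication and no elwidth overrides; XER.CA is
modelled as a single `ca` bit):

    # sketch: sv.adde r0.v, r4.v, r8.v with VL=4 is a 256-bit add.
    # iregs models the 64-bit GPRs; ca models XER.CA (one bit).
    mask = (1 << 64) - 1
    for i in range(4):                 # VL=4: r0-r3 = r4-r7 + r8-r11
        s = iregs[4 + i] + iregs[8 + i] + ca
        iregs[0 + i] = s & mask        # 64-bit result element
        ca = s >> 64                   # carry chains into the next element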

# v3.0B/v3.1 relevant instructions

SV is primarily designed for use as an efficient hybrid 3D GPU / VPU /
CPU ISA.

Vectorisation of the VSX Packed SIMD system makes no sense whatsoever,
the sole exceptions potentially being any operations with 128-bit
operands such as `vrlq` (Rotate Quad Word) and `xsaddqp` (Scalar
Quad-precision Add).
SV effectively *replaces* the majority of VSX, requiring far fewer
instructions, and provides, at the very minimum, predication
(which VSX was designed without).

Likewise, Load/Store Multiple make no sense to have: not only are they
provided by SV, the SV alternatives may be predicated as well, making
them far better suited to use in function
calls and context-switching.

Additionally, some v3.0/1 instructions simply make no sense at all in a
Vector context: `rfid` falls into this category,
as well as `sc` and `scv`. Here there is simply no point
trying to Vectorise them: the standard OpenPOWER v3.0/1 instructions
should be called instead.

Fortuitously this leaves several Major Opcodes free for use by SV
to fit alternative future instructions. In a 3D context this means
Vector Product, Vector Normalise, [[sv/mv.swizzle]], Texture LD/ST
operations, and others critical to an efficient, effective 3D GPU and
VPU ISA. With such instructions being included as standard in other
commercially-successful GPU ISAs it is likewise critical that a 3D
GPU/VPU based on svp64 also have such instructions.

Note however that svp64 is stand-alone and is in no way
critically dependent on the existence or provision of 3D GPU or VPU
instructions. These should be considered extensions, and their discussion
and specification is out of scope for this document.

Note, again: this is *only* under svp64 prefixing. Standard v3.0B /
v3.1B is *not* altered by svp64 in any way.

## Major opcode map (v3.0B)

This table is taken from v3.0B.
Table 9: Primary Opcode Map (opcode bits 0:5)

```
        | 000    | 001    | 010    | 011    | 100    | 101    | 110    | 111
    000 |        |        | tdi    | twi    | EXT04  |        |        | mulli | 000
    001 | subfic |        | cmpli  | cmpi   | addic  | addic. | addi   | addis | 001
    010 | bc/l/a | EXT17  | b/l/a  | EXT19  | rlwimi | rlwinm |        | rlwnm | 010
    011 | ori    | oris   | xori   | xoris  | andi.  | andis. | EXT30  | EXT31 | 011
    100 | lwz    | lwzu   | lbz    | lbzu   | stw    | stwu   | stb    | stbu  | 100
    101 | lhz    | lhzu   | lha    | lhau   | sth    | sthu   | lmw    | stmw  | 101
    110 | lfs    | lfsu   | lfd    | lfdu   | stfs   | stfsu  | stfd   | stfdu | 110
    111 | lq     | EXT57  | EXT58  | EXT59  | EXT60  | EXT61  | EXT62  | EXT63 | 111
        | 000    | 001    | 010    | 011    | 100    | 101    | 110    | 111
```

## Suitable for svp64-only

This is the same table containing v3.0B Primary Opcodes except those that
make no sense in a Vectorisation Context have been removed. These removed
POs can, *in the SV Vector Context only*, be assigned to alternative
(Vectorised-only) instructions, including future extensions.
EXT04 retains the scalar `madd*` operations but would have all PackedSIMD
(aka VSX) operations removed.

Note, again, to emphasise: outside of svp64 these opcodes **do not**
change. When not prefixed with svp64 these opcodes **specifically**
retain their v3.0B / v3.1B OpenPOWER Standard compliant meaning.

```
        | 000    | 001    | 010    | 011    | 100    | 101    | 110    | 111
    000 |        |        |        |        | EXT04  |        |        | mulli | 000
    001 | subfic |        | cmpli  | cmpi   | addic  | addic. | addi   | addis | 001
    010 | bc/l/a |        |        | EXT19  | rlwimi | rlwinm |        | rlwnm | 010
    011 | ori    | oris   | xori   | xoris  | andi.  | andis. | EXT30  | EXT31 | 011
    100 | lwz    | lwzu   | lbz    | lbzu   | stw    | stwu   | stb    | stbu  | 100
    101 | lhz    | lhzu   | lha    | lhau   | sth    | sthu   |        |       | 101
    110 | lfs    | lfsu   | lfd    | lfdu   | stfs   | stfsu  | stfd   | stfdu | 110
    111 |        |        | EXT58  | EXT59  |        | EXT61  |        | EXT63 | 111
        | 000    | 001    | 010    | 011    | 100    | 101    | 110    | 111
```

It is important to note that having an SVP64 opcode whose meaning
differs from its v3.0B Scalar counterpart is highly undesirable: the
complexity
in the decoder is greatly increased.

# EXTRA Field Mapping

The purpose of the 9-bit EXTRA field mapping is to mark individual
registers (RT, RA, BFA) as either scalar or vector, and to extend
their numbering from 0..31 in Power ISA v3.0 to 0..127 in SVP64.
Three of the 9 bits may also be used up for a 2nd Predicate (Twin
Predication) leaving a mere 6 bits for qualifying registers. As can
be seen there is significant pressure on these (and in fact all) SVP64
bits.

In Power ISA v3.1 prefixing there are bits which describe and classify
the prefix in a fashion that is independent of the suffix. MLSS for
example. For SVP64 there is insufficient space to make the SVP64 Prefix
"self-describing", and consequently every single Scalar instruction
had to be individually analysed, by rote, to craft an EXTRA Field Mapping.
This process was semi-automated and is described in this section.
The final results, which are part of the SVP64 Specification, are here:

* [[openpower/opcode_regs_deduped]]

Firstly, every instruction's mnemonic (`add RT, RA, RB`) was analysed
from reading the markdown formatted version of the Scalar pseudocode
which is machine-readable and found in [[openpower/isatables]]. The
analysis gives, by instruction, a "Register Profile". `add RT, RA, RB`
for example is given a designation `RM-2R-1W` because it requires
two GPR reads and one GPR write.

Secondly, the total number of registers was added up (2R-1W is 3 registers)
and if less than or equal to three then that instruction could be given an
EXTRA3 designation. Four or more is given an EXTRA2 designation because
there are only 9 bits available.

Thirdly, the instruction was analysed to see if Twin or Single
Predication was suitable. As a general rule this was if there
was only a single operand and a single result (`exts*` and LD/ST);
however it was found that some 2- or 3-operand instructions also
qualify. Given that 3 of the 9 bits of EXTRA had to be sacrificed for use
in Twin Predication, some compromises were made here. LDST is
Twin but also has 3 operands in some operations, so only EXTRA2 can be
used.

Fourthly, a packing format was decided: for 2R-1W, for example, an
EXTRA3 indexing was chosen
where RA is indexed 0 (EXTRA bits 0-2), RB indexed 1 (EXTRA bits 3-5)
and RT indexed 2 (EXTRA bits 6-8). In some cases (LD/ST with update)
RA-as-a-source is given a **different** EXTRA index from RA-as-a-result
(because it is possible to do, and perceived to be useful). Rc=1
co-results (CR0, CR1) are always given the same EXTRA index as their
main result (RT, FRT).

Fifthly, in an automated process the results of the analysis
were output in CSV Format for use in machine-readable form
by sv_analysis.py <https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/sv/sv_analysis.py;hb=HEAD>

This process was laborious but logical, and, crucially, once a
decision is made (and ratified) it cannot be reversed.
Those qualifying future Power ISA Scalar instructions for SVP64
are **strongly** advised to utilise this same process and the same
sv_analysis.py program as a canonical method of maintaining the
relationships. Alterations to that same program which
change the Designation are **prohibited** once finalised (ratified
through the Power ISA WG Process). It would
be similar to deciding that `add` should be changed from X-Form
to D-Form.

# Single Predication <a name="1p"> </a>

This is a standard mode normally found in Vector ISAs. Every element
in every source Vector and in the destination uses the same bit of one
single predicate mask.

In SVSTATE, for Single-predication, implementors MUST increment both
srcstep and dststep, but depending on whether sz and/or dz are set,
srcstep and
dststep can still potentially become different indices. Only when sz=dz
is srcstep guaranteed to equal dststep at all times.

Note that in some Mode Formats there is only one flag (zz). This indicates
that *both* sz *and* dz are set to the same value.
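
The progression of srcstep and dststep may be modelled in python (a
minimal sketch: the element operation itself and the zeroing writes are
omitted, and `mask` is a list of VL predicate bits, element 0 first):

    # sketch: srcstep/dststep schedule under single predication.
    # with zeroing clear (sz=0 / dz=0) masked-out elements are skipped;
    # with zeroing set the step walks through them (masked-out sources
    # read as zero, masked-out destinations are written with zero).
    def stepping(VL, mask, sz, dz):
        srcstep, dststep = 0, 0
        while True:
            if not sz:   # src skips masked-out elements
                while srcstep < VL and not mask[srcstep]: srcstep += 1
            if not dz:   # dst skips masked-out elements
                while dststep < VL and not mask[dststep]: dststep += 1
            if srcstep >= VL or dststep >= VL:
                break    # loop ends when either step reaches VL
            yield srcstep, dststep
            srcstep += 1
            dststep += 1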

Example 1:

* VL=4
* mask=0b1101
* sz=1, dz=0

The following schedule for srcstep and dststep will occur:

| srcstep | dststep | comment |
| ---- | ----- | -------- |
| 0 | 0 | both mask[src=0] and mask[dst=0] are 1 |
| 1 | 2 | sz=1 but dz=0: dst skips mask[1], src does not |
| 2 | 3 | mask[src=2] and mask[dst=3] are 1 |
| end | end | loop has ended because dst reached VL-1 |

Example 2:

* VL=4
* mask=0b1101
* sz=0, dz=1

The following schedule for srcstep and dststep will occur:

| srcstep | dststep | comment |
| ---- | ----- | -------- |
| 0 | 0 | both mask[src=0] and mask[dst=0] are 1 |
| 2 | 1 | sz=0 but dz=1: src skips mask[1], dst does not |
| 3 | 2 | mask[src=3] and mask[dst=2] are 1 |
| end | end | loop has ended because src reached VL-1 |

In both these examples it is crucial to note that despite there being
a single predicate mask, with sz and dz being different, srcstep and
dststep are being requested to react differently.

Example 3:

* VL=4
* mask=0b1101
* sz=0, dz=0

The following schedule for srcstep and dststep will occur:

| srcstep | dststep | comment |
| ---- | ----- | -------- |
| 0 | 0 | both mask[src=0] and mask[dst=0] are 1 |
| 2 | 2 | sz=0 and dz=0: both src and dst skip mask[1] |
| 3 | 3 | mask[src=3] and mask[dst=3] are 1 |
| end | end | loop has ended because src and dst reached VL-1 |

Here, both srcstep and dststep remain in lockstep because sz=dz=0.

# EXTRA Pack/Unpack Modes

The pack/unpack concept of VSX `vpack` is abstracted out as a Sub-Vector
reordering Schedule, named `RM-2P-1S1D-PU`.
The usual RM-2P-1S1D is reduced from EXTRA3 to EXTRA2, making
room for 2 extra bits that enable either "packing" or "unpacking"
on the subvectors vec2/3/4.

Illustrating a
"normal" SVP64 operation with `SUBVL!=1` (assuming no elwidth overrides):

    def index():
        for i in range(VL):
            for j in range(SUBVL):
                yield i*SUBVL+j

    for idx in index():
        operation_on(RA+idx)

For pack/unpack (again, no elwidth overrides):

    # yield the element offsets with SUBVL as the outer loop
    # (pack/unpack) or as the usual inner loop
    def index_p(outer):
        if outer:
            for j in range(SUBVL):
                for i in range(VL):
                    yield i*SUBVL+j
        else:
            for i in range(VL):
                for j in range(SUBVL):
                    yield i*SUBVL+j

    # walk through both source and dest indices simultaneously
    for src_idx, dst_idx in zip(index_p(PACK), index_p(UNPACK)):
        move_operation(RT+dst_idx, RA+src_idx)

Python's `yield` is used here for simplicity and clarity.
The two Finite State Machines for the generation of the source
and destination element offsets progress incrementally in
lock-step.
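
A worked example for VL=2 with vec3 (SUBVL=3), `PACK_en` set and
`UNPACK_en` clear (derived by hand from the pseudocode above):

    # index_p(True)  yields 0, 3, 1, 4, 2, 5   (source offsets)
    # index_p(False) yields 0, 1, 2, 3, 4, 5   (dest offsets)
    # element moves performed, in order:
    #   RT+0 <- RA+0    RT+1 <- RA+3    RT+2 <- RA+1
    #   RT+3 <- RA+4    RT+4 <- RA+2    RT+5 <- RA+5
    # i.e. the x/y/z subelements of the two vec3 source elements are
    # gathered ("packed") into RT as all x, then all y, then all z.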

Setting of both `PACK_en` and `UNPACK_en` is neither prohibited nor
`UNDEFINED` because the reordering is fully deterministic, and
additional REMAP reordering may be applied. For Matrix this would
give potentially up to 4 Dimensions of reordering.

Pack/Unpack applies to mv operations and some other single-source
single-destination operations such as Indexed LD/ST and extsw.
[[sv/mv.swizzle]] has a slightly different pseudocode algorithm
for Vertical-First Mode.

# Twin Predication <a name="2p"> </a>

This is a novel concept that allows predication to be applied to a single
source and a single dest register. The following types of traditional
Vector operations may be encoded with it, *without requiring explicit
opcodes to do so*:

* VSPLAT (a single scalar distributed across a vector)
* VEXTRACT (like LLVM IR [`extractelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#extractelement-instruction))
* VINSERT (like LLVM IR [`insertelement`](https://releases.llvm.org/11.0.0/docs/LangRef.html#insertelement-instruction))
* VCOMPRESS (like LLVM IR [`llvm.masked.compressstore.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics))
* VEXPAND (like LLVM IR [`llvm.masked.expandload.*`](https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics))

Those patterns (and more) may be applied to:

* mv (the usual way that V\* ISA operations are created)
* exts\* sign-extension
* rlwinm and other RS-RA shift operations (**note**: excluding
  those that take RA as both a src and dest. These are not
  1-src 1-dest, they are 2-src, 1-dest)
* LD and ST (treating AGEN as one source)
* FP fclass, fsgn, fneg, fabs, fcvt, frecip, fsqrt etc.
* Condition Register ops mfcr, mtcr and other similar

This is a huge list that creates extremely powerful combinations,
particularly given that one of the predicate options is `(1<<r3)`.

Additional unusual capabilities of Twin Predication include a back-to-back
version of VCOMPRESS-VEXPAND which is effectively the ability to do
sequentially ordered multiple VINSERTs. The source predicate selects a
sequentially ordered subset of elements to be inserted; the destination
predicate specifies the sequentially ordered recipient locations.
This is equivalent to
`llvm.masked.compressstore.*`
followed by
`llvm.masked.expandload.*`
with a single instruction (see the sketch below).

This extreme power and flexibility comes down to the fact that SVP64
is not actually a Vector ISA: it is a loop-abstraction-concept that
is applied *in general* to Scalar operations, just like the x86
`REP` prefix (if put on steroids).
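
A minimal python sketch of twin-predicated `sv.mv` (zeroing and elwidth
overrides omitted; `smask`/`dmask` are the source and destination
predicates as lists of bits):

    # sketch: twin-predicated move, the basis of VCOMPRESS-VEXPAND.
    # masked-out source elements are skipped over (compress);
    # masked-out destination elements are skipped over (expand).
    def twin_mv(VL, smask, dmask, RA, RT, iregs):
        srcstep, dststep = 0, 0
        while True:
            while srcstep < VL and not smask[srcstep]: srcstep += 1
            while dststep < VL and not dmask[dststep]: dststep += 1
            if srcstep >= VL or dststep >= VL:
                break
            iregs[RT + dststep] = iregs[RA + srcstep]
            srcstep += 1
            dststep += 1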

# Reduce modes

Reduction in SVP64 is deterministic and somewhat of a misnomer. A normal
Vector ISA would have explicit Reduce opcodes with defined characteristics
per operation: in SX Aurora there is even an additional scalar argument
containing the initial reduction value, and the default is either 0
or 1 depending on the specifics of the explicit opcode.
SVP64 fundamentally has to
utilise *existing* Scalar Power ISA v3.0B operations, which presents some
unique challenges.

The solution turns out to be to simply define reduction as permitting
deterministic element-based schedules to be issued using the base Scalar
operations, and to rely on the underlying microarchitecture to resolve
Register Hazards at the element level. This goes back to
the fundamental principle that SV is nothing more than a Sub-Program-Counter
sitting between Decode and Issue phases.

Microarchitectures *may* take opportunities to parallelise the reduction
but only if in doing so they preserve Program Order at the Element Level.
Opportunities where this is possible include an `OR` operation
or a MIN/MAX operation. For Floating Point however parallelisation is
not permitted, because different results would be obtained
if the reduction is not executed in strict Program-Sequential
Order.

In essence it becomes the programmer's responsibility to leverage the
pre-determined schedules to desired effect.

## Scalar result reduction and iteration

Scalar Reduction per se does not exist: instead it is implemented in SVP64
as a simple and natural relaxation of the usual restriction on the Vector
Looping which would terminate if the destination was marked as a Scalar.
Scalar Reduction by contrast *keeps issuing Vector Element Operations*
even though the destination register is marked as scalar.
Thus it is up to the programmer to be aware of this, observe some
conventions, and thus end up achieving the desired outcome of scalar
reduction.

It is also important to appreciate that there is no
actual imposition or restriction on how this mode is utilised: there
will therefore be several valuable uses (including Vector Iteration
and "Reverse-Gear")
and it is up to the programmer to make best use of the
(strictly deterministic) capability
provided.

In this mode, which is suited to operations involving carry or overflow,
the programmer must assign one register, by convention, to be the
"accumulator". Scalar reduction is thus characterised by:

* One of the sources is a Vector
* the destination is a scalar
* optionally but most usefully when one source scalar register is
  also the scalar destination (which may be informally termed
  the "accumulator")
* That the source register type is the same as the destination register
  type identified as the "accumulator". Scalar reduction on `cmp`,
  `setb` or `isel` makes no sense for example because of the mixture
  between CRs and GPRs.

*Note that issuing instructions in Scalar reduce mode such as `setb`
is neither `UNDEFINED` nor prohibited, despite them not making much
sense at first glance.
Scalar reduce is strictly defined behaviour, and the cost in
hardware terms of prohibition of seemingly non-sensical operations is
too great.
Therefore it is permitted and required to be executed successfully.
Implementors **MAY** choose to optimise such instructions in instances
where their use results in "extraneous execution", i.e. where it is clear
that the sequence of operations, comprising multiple overwrites to
a scalar destination **without** cumulative, iterative, or reductive
behaviour (no "accumulator"), may discard all but the last element
operation. Identification
of such is trivial to do for `setb` and `cmp`: the source register type is
a completely different register file from the destination.
Likewise Scalar reduction when the destination is a Vector
is as if the Reduction Mode was not requested.*

Typical applications include simple operations such as `ADD r3, r10.v,
r3` where, clearly, r3 is being used to accumulate the addition of all
elements of the vector starting at r10.

    # add RT, RA, RB but when RT==RA
    for i in range(VL):
        iregs[RA] += iregs[RB+i] # RT==RA

However, *unless* the operation is marked as "mapreduce" (`sv.add/mr`)
SV ordinarily
**terminates** at the first scalar operation. Only by marking the
operation as "mapreduce" will it continue to issue multiple sub-looped
(element) instructions in `Program Order`.

To perform the loop in reverse order, the `RG` (reverse gear) bit must
be set. This may be useful in situations where the results may be
different
(floating-point) if executed in a different order. Given that there is
no actual prohibition on Reduce Mode being applied when the destination
is a Vector, the "Reverse Gear" bit turns out to be a way to apply
Iterative
or Cumulative Vector operations in reverse. `sv.add/rg r3.v, r4.v, r4.v`
for example will start at the opposite end of the Vector and push
a cumulative series of overlapping add operations into the Execution
units of
the underlying hardware.
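
A sketch of that reverse-gear example in python (assuming VL=4 and no
predication):

    # sketch: sv.add/rg r3.v, r4.v, r4.v with VL=4.
    # elements are issued in reverse, each overlapping the next:
    #   add r6, r7, r7
    #   add r5, r6, r6    <- reads the r6 just written
    #   add r4, r5, r5
    #   add r3, r4, r4
    for i in reversed(range(4)):
        iregs[3 + i] = iregs[4 + i] + iregs[4 + i]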

Other examples include shift-mask operations where a Vector of inserts
into a single destination register is required (see [[sv/bitmanip]],
bmset),
as a way to construct
a value quickly from multiple arbitrary bit-ranges and bit-offsets.
Using the same register as both the source and destination, with Vectors
of different offsets, masks and values to be inserted, has multiple
applications including Video, cryptography and JIT compilation.

    # assume VL=4:
    # * Vector of shift-offsets contained in RC (r12.v)
    # * Vector of masks contained in RB (r8.v)
    # * Vector of values to be masked-in in RA (r4.v)
    # * Scalar destination RT (r0) to receive all mask-offset values
    sv.bmset/mr r0, r4.v, r8.v, r12.v

Due to the Deterministic Scheduling,
Subtract and Divide are still permitted to be executed in this mode,
although from an algorithmic perspective it is strongly discouraged.
It would be better to use addition followed by one final subtract,
or in the case of divide, to get better accuracy, to perform a multiply
cascade followed by a final divide.

Note that single-operand or three-operand scalar-dest reduce is perfectly
well permitted: the programmer may still declare one register, used as
both a Vector source and Scalar destination, to be utilised as
the "accumulator". In the case of `sv.fmadds` and `sv.maddhw` etc
this naturally fits well with the normal expected usage of these
operations.

If an interrupt or exception occurs in the middle of the scalar mapreduce,
the scalar destination register **MUST** be updated with the current
(intermediate) result, because this is how `Program Order` is
preserved (Vector Loops are to be considered to be just another way of
issuing instructions
in Program Order). In this way, after return from interrupt,
the scalar mapreduce may continue where it left off. This provides
"precise" exception behaviour.

Note that hardware is perfectly permitted to perform multi-issue
parallel optimisation of the scalar reduce operation: it's just that
as far as the user is concerned, all exceptions and interrupts **MUST**
be precise.

## Vector result reduce mode

Vector Reduce Mode issues a deterministic tree-reduction schedule to the
underlying micro-architecture. Like Scalar reduction, the "Scalar Base"
(Power ISA v3.0B) operation is leveraged, unmodified, to give the
*appearance* and *effect* of Reduction.

Vector-result reduction **requires**
the destination to be a Vector, which will be used to store
intermediary results.

Given that the tree-reduction schedule is deterministic,
Interrupts and exceptions
can therefore also be precise. The final result will be in the first
non-predicate-masked-out destination element, but due again to
the deterministic schedule programmers may find uses for the intermediate
results.

When Rc=1 a corresponding Vector of co-resultant CRs is also
created. No special action is taken: the result and its CR Field
are stored "as usual" exactly as all other SVP64 Rc=1 operations.

Note that the Schedule only makes sense on top of certain instructions:
X-Form with a Register Profile of `RT,RA,RB` is fine. Like Scalar
Reduction, nothing is prohibited:
the results of execution on an unsuitable instruction may simply
not make sense. Unlike in Scalar Reduction, many 3-input instructions
(madd, fmadd) in particular do not make sense, but `ternlogi`, if used
with care, would.

**Parallel-Reduction with Predication**

To avoid breaking the strict RISC-paradigm, keeping the Issue-Schedule
completely separate from the actual element-level (scalar) operations,
Move operations are **not** included in the Schedule. This means that
the Schedule leaves the final (scalar) result in the first-non-masked
element of the Vector used. With the predicate mask being dynamic
(but deterministic) this result could be anywhere.

If that result is needed to be moved to a (single) scalar register
then a follow-up `sv.mv/sm=predicate rt, ra.v` instruction will be
needed to get it, where the predicate is the exact same predicate used
in the prior Parallel-Reduction instruction. For *some* implementations
this may be a slow operation. It may be better to perform a pre-copy
of the values, compressing them (VREDUCE-style) into a contiguous block,
which will guarantee that the result goes into the very first element
of the destination vector.

**Usage conditions**

The simplest usage is to perform an overwrite, specifying all three
register operands the same.

    setvl VL=6
    sv.add/vr 8.v, 8.v, 8.v

The Reduction Schedule will issue the Parallel Tree Reduction spanning
registers 8 through 13, by adjusting the offsets to RT, RA and RB as
necessary (see "Parallel Reduction algorithm" in a later section).

A non-overwrite is possible as well but just as with the overwrite
version, only those destination elements necessary for storing
intermediary computations will be written to: the remaining elements
will **not** be overwritten and will **not** be zero'd.

    setvl VL=4
    sv.add/vr 0.v, 8.v, 8.v

## Sub-Vector Horizontal Reduction

Note that when SVM is clear and SUBVL!=1 the sub-elements are
*independent*, i.e. they are mapreduced per *sub-element* as a result.
Illustration with a vec2, assuming RA==RT,
e.g. `sv.add/mr/vec2 r4, r4, r16`:

    for i in range(0, VL):
        # RA==RT in the instruction; it does not have to be
        iregs[RT].x = op(iregs[RT].x, iregs[RB+i].x)
        iregs[RT].y = op(iregs[RT].y, iregs[RB+i].y)

Thus logically there is nothing special or unanticipated about
`SVM=0`: it is expected behaviour according to standard SVP64
Sub-Vector rules.

By contrast, when SVM is set and SUBVL!=1, a Horizontal
Subvector mode is enabled, which behaves very much more
like a traditional Vector Processor Reduction instruction.

Example for a vec2:

    for i in range(VL):
        iregs[RT+i] = op(iregs[RA+i].x, iregs[RA+i].y)

Example for a vec3:

    for i in range(VL):
        iregs[RT+i] = op(iregs[RA+i].x, iregs[RA+i].y)
        iregs[RT+i] = op(iregs[RT+i] , iregs[RA+i].z)

Example for a vec4:

    for i in range(VL):
        iregs[RT+i] = op(iregs[RA+i].x, iregs[RA+i].y)
        iregs[RT+i] = op(iregs[RT+i] , iregs[RA+i].z)
        iregs[RT+i] = op(iregs[RT+i] , iregs[RA+i].w)

In this mode, when Rc=1 the Vector of CRs is as normal: each result
element creates a corresponding CR element (for the final, reduced,
result).

Note that the destination (RT) is automatically used as an "Accumulator"
register, and consequently the Sub-Vector Loop is interruptible.
If RT is a Scalar then as usual the main VL Loop terminates at the
first predicated element (or the first element if unpredicated).

# Fail-on-first

Data-dependent fail-on-first has two distinct variants: one for LD/ST
(see [[sv/ldst]]),
the other for arithmetic operations (actually, CR-driven)
([[sv/normal]]) and CR operations ([[sv/cr_ops]]).
Note in each
case the assumption is that vector elements are required to appear to be
executed in sequential Program Order, element 0 being the first.

* LD/ST ffirst treats the first LD/ST in a vector (element 0) as an
  ordinary one. Exceptions occur "as normal". However for elements 1
  and above, if an exception would occur, then VL is **truncated** to the
  previous element.
* Data-driven (CR-driven) fail-on-first activates when Rc=1 or another
  CR-creating operation produces a result (including cmp). Similar to
  branch, an analysis of the CR is performed and if the test fails, the
  vector operation terminates and discards all element operations
  above the current one (and the current one if VLi is not set),
  and VL is truncated to either
  the *previous* element or the current one, depending on whether
  VLi (VL "inclusive") is set.

Thus the new VL comprises a contiguous vector of results,
all of which pass the testing criteria (equal to zero, less than zero).

The CR-based data-driven fail-on-first is new and not found in ARM
SVE or RVV. It is extremely useful for reducing instruction count,
however requires speculative execution involving modifications of VL
to get high performance implementations. An additional mode (RC1=1)
effectively turns what would otherwise be an arithmetic operation
into a type of `cmp`. The CR is stored (and the CR.eq bit tested
against the `inv` field).
If the CR.eq bit is equal to `inv` then the Vector is truncated and
the loop ends.
Note that when RC1=1 the result elements are never stored, only the CRs.

VLi is only available as an option when `Rc=0` (or for instructions
which do not have Rc). When set, the current element is always
also included in the count (the new length that VL will be set to).
This may be useful in combination with "inv" to truncate the Vector
to *exclude* elements that fail a test, or, in the case of
implementations
of strncpy, to include the terminating zero.
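
A simplified python sketch of the CR-driven variant (predication,
zeroing and elwidth omitted; `calc_cr_field` and `test_bit` are
illustrative names standing in for CR creation and the BO-style
selection of one CR bit, not normative ones):

    # sketch: CR-based data-dependent fail-first on an sv.add
    for i in range(VL):
        result = iregs[RA + i] + iregs[RB + i]
        cr = calc_cr_field(result)       # eq/lt/gt/so from the result
        failed = (test_bit(cr) == inv)   # does element i fail the test?
        if not failed or VLi:            # failing element kept if VLi
            iregs[RT + i] = result       # (with RC1=1 only cr is stored)
        if failed:
            VL = i + 1 if VLi else i     # truncate VL: the loop ends here
            break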

In CR-based data-driven fail-on-first there is only the option to select
and test one bit of each CR (just as with branch BO). For more complex
tests this may be insufficient. If that is the case, vectorised crops
(crand, cror) may be used, and ffirst applied to the crop instead of to
the arithmetic vector.

One extremely important aspect of ffirst is:

* LDST ffirst may never set VL equal to zero. This is because on the
  first
  element an exception must be raised "as normal".
* CR-based data-dependent ffirst on the other hand **can** set VL equal
  to zero. This is the only means in the entirety of SV that VL may be
  set
  to zero (with the exception of via the SVSTATE SPR). When VL is set
  zero due to the first element failing the CR bit-test, all subsequent
  vectorised operations are effectively `nops` which is
  *precisely the desired and intended behaviour*.

Another aspect is that for ffirst LD/STs, VL may be truncated arbitrarily
to a nonzero value for any implementation-specific reason. For example:
it is perfectly reasonable for implementations to alter VL when ffirst
LD or ST operations are initiated on a nonaligned boundary, such that
within a loop the subsequent iteration of that loop begins subsequent
ffirst LD/ST operations on an aligned boundary. Likewise, to reduce
workloads or balance resources.

CR-based data-dependent ffirst on the other hand MUST NOT truncate VL
arbitrarily to a length decided by the hardware: VL MUST only be
truncated based explicitly on whether a test fails.
This is because it is a precise test on which algorithms
will rely.

## Data-dependent fail-first on CR operations (crand etc)

Operations that actually produce or alter a CR Field as a result
do not also in turn have an Rc=1 mode. However it makes no
sense to try to test the 4 bits of a CR Field for being equal
or not equal to zero. Moreover, the result is already in the
form that is desired: it is a CR field. Therefore,
CR-based operations have their own SVP64 Mode, described
in [[sv/cr_ops]].

There are two primary different types of CR operations:

* Those which have a 3-bit operand field (referring to a CR Field)
* Those which have a 5-bit operand (referring to a bit within the
  whole 32-bit CR)

More details can be found in [[sv/cr_ops]].

# pred-result mode

Pred-result mode may not be applied on CR-based operations.

Although CR operations (mtcr, crand, cror) may be Vectorised and
predicated, pred-result mode applies only to operations that have
an Rc=1 mode, or for which an RC1 option makes sense.

Predicate-result merges common CR testing with predication, saving on
instruction count. In essence, a Condition Register Field test
is performed, and if it fails it is considered to have been
*as if* the destination predicate bit was zero. Given that
there are no CR-based operations that produce Rc=1 co-results,
there can be no pred-result mode for mtcr and other CR-based
instructions.

Arithmetic and Logical Pred-result, which does have Rc=1 or for which
RC1 Mode makes sense, is covered in [[sv/normal]].
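
A minimal sketch of the concept (illustrative names; zeroing and
elwidth overrides omitted):

    # sketch: predicate-result on an Rc=1 sv.add. a failing CR test
    # behaves as if the destination predicate bit had been zero.
    for i in range(VL):
        if predval & (1 << i):           # normal predication first
            result = iregs[RA + i] + iregs[RB + i]
            cr = calc_cr_field(result)
            if test_bit(cr) != inv:      # CR test passes:
                iregs[RT + i] = result   # store as normal
            # else: discard, exactly as if the predicate bit were 0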

# CR Operations

CRs are slightly more involved than INT or FP registers due to the
possibility for indexing individual bits (crops BA/BB/BT). Again however
the access pattern needs to be understandable in relation to v3.0B /
v3.1B
numbering, with a clear linear relationship and mapping existing when
SV is applied.

## CR EXTRA mapping table and algorithm <a name="cr_extra"></a>

Numbering relationships for CR fields are already complex due to being
in BE format (*the relationship is not clearly explained in the v3.0B
or v3.1 specification*). However with some care and consideration
the exact same mapping used for INT and FP regfiles may be applied,
just to the upper bits, as explained below. The notation
`CR{field number}` is used to indicate access to a particular
Condition Register Field (as opposed to the notation `CR[bit]`
which accesses one bit of the 32-bit Power ISA v3.0B
Condition Register).

`CR{n}` refers to `CR0` when `n=0` and consequently, for CR0-7, is
defined, in v3.0B pseudocode, as:

    CR{7-n} = CR[32+n*4:35+n*4]

For SVP64 the relationship for the sequential
numbering of elements is to the CR **fields** within
the CR Register, not to individual bits within the CR register.

In OpenPOWER v3.0/1, BT, BA and BB are all 5 bits (BF, by contrast, is
3 bits, selecting a CR Field directly). The top 3 bits (0:2)
select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits
*in* that CR (LT/GT/EQ/SO). The numbering was determined (after 4 months
of
analysis and research) to be as follows:

    CR_index = 7-(BA>>2)      # top 3 bits but BE
    bit_index = 3-(BA & 0b11) # low 2 bits but BE
    CR_reg = CR{CR_index}     # get the CR
    # finally get the bit from the CR.
    CR_bit = (CR_reg & (1<<bit_index)) != 0

When it comes to applying SV, it is the CR\_reg number to which SV
EXTRA2/3
applies, **not** the CR\_bit portion (bits 3-4):

    if extra3_mode:
        spec = EXTRA3
    else:
        spec = EXTRA2<<1 | 0b0
    if spec[0]:
        # vector constructs "BA[0:2] spec[1:2] 00 BA[3:4]"
        return ((BA >> 2)<<6) | # hi 3 bits shifted up
               (spec[1:2]<<4) | # to make room for these
               (BA & 0b11)      # CR_bit on the end
    else:
        # scalar constructs "00 spec[1:2] BA[0:4]"
        return (spec[1:2] << 5) | BA

Thus, for example, to access a given bit for a CR in SV mode, the v3.0B
algorithm to determine CR\_reg is modified to as follows:

    CR_index = 7-(BA>>2)      # top 3 bits but BE
    if spec[0]:
        # vector mode, 0-124 increments of 4
        CR_index = (CR_index<<4) | (spec[1:2] << 2)
    else:
        # scalar mode, 0-31 increments of 1
        CR_index = (spec[1:2]<<3) | CR_index
    # same as for v3.0/v3.1 from this point onwards
    bit_index = 3-(BA & 0b11) # low 2 bits but BE
    CR_reg = CR{CR_index}     # get the CR
    # finally get the bit from the CR.
    CR_bit = (CR_reg & (1<<bit_index)) != 0

Note here that the decoding pattern to determine CR\_bit does not change.
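
The remap may be sanity-checked with a short python function (a sketch;
`spec` here is the 3-bit EXTRA value with spec[0] as its most
significant bit):

    # sketch: compute the SV CR field index from BA and the EXTRA spec.
    def sv_cr_index(BA, spec):
        CR_index = 7 - (BA >> 2)               # top 3 bits, BE
        if spec & 0b100:                       # spec[0]: vector mode
            return (CR_index << 4) | ((spec & 0b11) << 2)
        return ((spec & 0b11) << 3) | CR_index # scalar mode

    # vector mode reaches CR fields 0-124 in increments of 4:
    assert sv_cr_index(0b00000, 0b100) == 112  # CR{7} -> CR112
    # scalar mode reaches CR fields 0-31 in increments of 1:
    assert sv_cr_index(0b00100, 0b011) == 30   # CR{6} -> CR30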

Note: high-performance implementations may read/write Vectors of CRs in
batches of aligned 32-bit chunks (CR0-7, CR8-15). This is to greatly
simplify internal design. If instructions are issued where CR Vectors
do not start on a 32-bit aligned boundary, performance may be affected.

## CR fields as inputs/outputs of vector operations

CRs (or, the arithmetic operations associated with them)
may be marked as Vectorised or Scalar. When Rc=1 in arithmetic
operations
that have no explicit EXTRA to cover the CR, the CR is Vectorised if
the destination is Vectorised. Likewise if the destination is scalar
then so is the CR.

When Vectorised, the CR inputs/outputs are sequentially read/written
to 4-bit CR fields. Vectorised Integer results, when Rc=1, will begin
writing to CR8 (TBD evaluate) and increase sequentially from there.
This is so that:

* implementations may rely on the Vector CRs being aligned to 8. This
  means that CRs may be read or written in aligned batches of 32 bits
  (8 CRs per batch), for high performance implementations.
* scalar Rc=1 operation (CR0, CR1) and callee-saved CRs (CR2-4) are not
  overwritten by vector Rc=1 operations except for very large VL
* CR-based predication, from CR32, is also not interfered with
  (except by large VL).

However when the SV result (destination) is marked as a scalar by the
EXTRA field the *standard* v3.0B behaviour applies: the accompanying
CR when Rc=1 is written to. This is CR0 for integer operations and CR1
for FP operations.

Note that yes, the CR Fields are genuinely Vectorised. Unlike in SIMD
VSX which
has a single CR (CR6) for a given SIMD result, SV Vectorised OpenPOWER
v3.0B scalar operations produce a **tuple** of element results: the
result of the operation as one part of that element *and a corresponding
CR element*. Greatly simplified pseudocode:

    for i in range(VL):
        # calculate the vector result of an add
        iregs[RT+i] = iregs[RA+i] + iregs[RB+i]
        # now calculate CR bits
        CRs{8+i}.eq = iregs[RT+i] == 0
        CRs{8+i}.gt = iregs[RT+i] > 0
        ... etc

If a "cumulated" CR based analysis of results is desired (a la VSX CR6)
then a followup instruction must be performed, setting "reduce" mode on
the Vector of CRs, using cr ops (crand, crnor) to do so. This provides
far
more flexibility in analysing vectors than standard Vector ISAs. Normal
Vector ISAs are typically restricted to "were all results nonzero" and
"were some results nonzero". The application of mapreduce to Vectorised
cr operations allows far more sophisticated analysis, particularly in
conjunction with the new crweird operations: see
[[sv/cr_int_predication]].

Note in particular that the use of a separate instruction in this way
ensures that high performance multi-issue OoO implementations do not
have the computation of the cumulative analysis CR as a bottleneck and
hindrance, regardless of the length of VL.

Additionally,
SVP64 [[sv/branches]] may be used, even when the branch itself is to
the following instruction. The combined side-effects of CTR reduction
and VL truncation provide several benefits.

(see [[discussion]]; some alternative schemes are described there.)

## Rc=1 when SUBVL!=1

Sub-vectors are effectively a form of Packed SIMD (length 2 to 4).
Only 1 bit of
predicate is allocated per subvector; likewise only one CR is allocated
per subvector.

This leaves a conundrum as to how to apply CR computation per subvector,
when normally Rc=1 is exclusively applied to scalar elements. A solution
is to perform a bitwise OR or AND of the subvector tests. Given that
OE is ignored in SVP64, this field may (when available) be used to
select OR or
AND behaviour.
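
A sketch of the idea for a vec2, under the stated assumption that OE
selects the combining operation (this is a proposal, not finalised
behaviour):

    # sketch: Rc=1 "eq" computation for one vec2 subvector element i.
    # the two per-subelement tests combine into a single CR bit.
    eq_x = (results[i].x == 0)
    eq_y = (results[i].y == 0)
    if OE:                           # AND: all subelements must pass
        CRs[i].eq = eq_x and eq_y
    else:                            # OR: any subelement passing suffices
        CRs[i].eq = eq_x or eq_y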

### Table of CR fields

CRn is the notation used by the OpenPower spec to refer to CR field #n,
so FP instructions with Rc=1 write to CR1 (n=1).

CRs are not stored in SPRs: they are registers in their own right.
Therefore context-switching the full set of CRs involves a Vectorised
mfcr or mtcr, using VL=8 to do so. This is exactly how
scalar OpenPOWER context-switches CRs: it is just that there are now
more of them.

The 64 SV CRs are arranged similarly to the way the 128 integer registers
are arranged. TODO a python program that auto-generates a CSV file
which can be included in a table, which is in a new page (so as not to
overwhelm this one). [[svp64/cr_names]]

# Register Profiles

**NOTE THIS TABLE SHOULD NO LONGER BE HAND EDITED** see
<https://bugs.libre-soc.org/show_bug.cgi?id=548> for details.

Instructions are broken down by Register Profiles as listed in the
following auto-generated page: [[opcode_regs_deduped]]. "Non-SV"
indicates that the operations with this Register Profile cannot be
Vectorised (mtspr, bc, dcbz, twi).

TODO generate table which will be here [[svp64/reg_profiles]]

# SV pseudocode illustration

## Single-predicated Instruction

Illustration of normal mode add operation: zeroing not included, elwidth
overrides not included. If there is no predicate, it is set to all 1s.

    function op_add(rd, rs1, rs2) # add not VADD!
      int i, id=0, irs1=0, irs2=0;
      predval = get_pred_val(FALSE, rd);
      for (i = 0; i < VL; i++)
        STATE.srcoffs = i # save context
        if (predval & 1<<i) # predication uses intregs
           ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2];
           if (!int_vec[rd].isvec) break;
        if (rd.isvec)  { id += 1; }
        if (rs1.isvec) { irs1 += 1; }
        if (rs2.isvec) { irs2 += 1; }
        if (id == VL or irs1 == VL or irs2 == VL)
        {
          # end VL hardware loop
          STATE.srcoffs = 0; # reset
          return;
        }

This has several modes:

* RT.v = RA.v RB.v
* RT.v = RA.v RB.s (and RA.s RB.v)
* RT.v = RA.s RB.s
* RT.s = RA.v RB.v
* RT.s = RA.v RB.s (and RA.s RB.v)
* RT.s = RA.s RB.s

All of these may be predicated. Vector-Vector is straightforward.
When one of the sources is a Vector and the other a Scalar, it is clear
that
each element of the Vector source should be added to the Scalar source,
each result placed into the Vector (or, if the destination is a scalar,
only the first nonpredicated result).

The one that is not obvious is RT=vector but both RA/RB=scalar.
Here this acts as a "splat scalar result", copying the same result into
all nonpredicated result elements. If a fixed destination scalar was
intended, then an all-Scalar operation should be used.

See <https://bugs.libre-soc.org/show_bug.cgi?id=552>

# Assembly Annotation

Assembly code annotation is required for SV to be able to successfully
mark instructions as "prefixed".

A reasonable (prototype) starting point:

    svp64 [field=value]*

Fields:

* ew=8/16/32 - element width
* sew=8/16/32 - source element width
* vec=2/3/4 - SUBVL
* mode=mr/satu/sats/crpred
* pred=1\<\<3/r3/~r3/r10/~r10/r30/~r30/lt/gt/le/ge/eq/ne

similar to the x86 "rex" prefix.

For the actual assembler:

    sv.asmcode/mode.vec{N}.ew=8,sw=16,m={pred},sm={pred} reg.v, src.s

Qualifiers:

* m={pred}: predicate mask mode
* sm={pred}: source-predicate mask mode (only allowed in
  Twin-predication)
* vec{N}: vec2 OR vec3 OR vec4 - sets SUBVL=2/3/4
* ew={N}: ew=8/16/32 - sets elwidth override
* sw={N}: sw=8/16/32 - sets source elwidth override
* ff={xx}: see fail-first mode
* pr={xx}: see predicate-result mode
* sat{x}: satu / sats - see saturation mode
* mr: see map-reduce mode
* mr.svm: see map-reduce with sub-vector mode
* crm: see map-reduce CR mode
* crm.svm: see map-reduce CR with sub-vector mode
* sz: predication with source-zeroing
* dz: predication with dest-zeroing

For modes:

* pred-result:
  - pm=lt/gt/le/ge/eq/ne/so/ns
  - RC1 mode
* fail-first
  - ff=lt/gt/le/ge/eq/ne/so/ns
  - RC1 mode
* saturation:
  - sats
  - satu
* map-reduce:
  - mr OR crm: "normal" map-reduce mode or CR-mode.
  - mr.svm OR crm.svm: when vec2/3/4 set, sub-vector mapreduce is
    enabled

# Parallel-reduction algorithm

The principle of SVP64 is that SVP64 is a fully-independent
Abstraction of hardware-looping in between issue and execute phases
that has no relation to the operation it issues.
Additional state cannot be saved on context-switching beyond that
of SVSTATE, making things slightly tricky.

Executable demo pseudocode, full version
[here](https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/test_preduce.py;hb=HEAD)

```
[[!inline raw="yes" pages="openpower/sv/preduce.py" ]]
```

This algorithm works by noting when data remains in-place rather than
being reduced, and referring to that alternative position on subsequent
layers of reduction. It is re-entrant. If however interrupted and
restored, some implementations may take longer to re-establish the
context.

Its application by default is that:

* RA, FRA or BFA is the first register as the first operand
  (ci index offset in the above pseudocode)
* RB, FRB or BFB is the second (co index offset)
* RT (result) also uses ci **if RA==RT**

For more complex applications a REMAP Schedule must be used.

*Programmer's note:
if passed a predicate mask with only one bit set, this algorithm
takes no action, similar to when a predicate mask is all zero.*

*Implementor's Note: many SIMD-based Parallel Reduction Algorithms are
implemented in hardware with MVs that ensure lane-crossing is minimised.
The mistake which would be catastrophic to SVP64 to make is to then
limit the Reduction Sequence for all implementors
based solely and exclusively on what one
specific internal microarchitecture does.
In SIMD ISAs the internal SIMD Architectural design is exposed and
imposed on the programmer. Cray-style Vector ISAs on the other hand
provide convenient,
compact and efficient encodings of abstract concepts.*
**It is the Implementor's responsibility to produce a design
that complies with the above algorithm,
utilising internal Micro-coding and other techniques to transparently
insert micro-architectural lane-crossing Move operations
if necessary or desired, to give the level of efficiency or performance
required.**

# Element-width overrides <a name="elwidth"> </a>

Element-width overrides are best illustrated with a packed structure
union in the c programming language. The following should be taken
literally, and assume always a little-endian layout:

    typedef union {
        uint8_t  b[];
        uint16_t s[];
        uint32_t i[];
        uint64_t l[];
        uint8_t  actual_bytes[8];
    } el_reg_t;

    el_reg_t int_regfile[128];

    get_polymorphed_reg(reg, bitwidth, offset):
        el_reg_t res;
        res.l = 0; // TODO: going to need sign-extending / zero-extending
        if bitwidth == 8:
            res.b = int_regfile[reg].b[offset]
        elif bitwidth == 16:
            res.s = int_regfile[reg].s[offset]
        elif bitwidth == 32:
            res.i = int_regfile[reg].i[offset]
        elif bitwidth == 64:
            res.l = int_regfile[reg].l[offset]
        return res

    set_polymorphed_reg(reg, bitwidth, offset, val):
        if (!reg.isvec):
            # not a vector: first element only, overwrites high bits
            int_regfile[reg].l[0] = val
        elif bitwidth == 8:
            int_regfile[reg].b[offset] = val
        elif bitwidth == 16:
            int_regfile[reg].s[offset] = val
        elif bitwidth == 32:
            int_regfile[reg].i[offset] = val
        elif bitwidth == 64:
            int_regfile[reg].l[offset] = val

In effect the GPR registers r0 to r127 (and corresponding FPRs fp0
to fp127) are reinterpreted to be "starting points" in a byte-addressable
memory. Vectors - which become just a virtual naming construct -
effectively
overlap.

It is extremely important for implementors to note that the only
circumstance
where upper portions of an underlying 64-bit register are zero'd out is
when the destination is a scalar. The ideal register file has byte-level
write-enable lines, just like most SRAMs, in order to avoid
READ-MODIFY-WRITE.

An example ADD operation with predication and element width overrides:

    for (i = 0; i < VL; i++)
      if (predval & 1<<i) # predication
         src1 = get_polymorphed_reg(RA, srcwid, irs1)
         src2 = get_polymorphed_reg(RB, srcwid, irs2)
         result = src1 + src2 # actual add here
         set_polymorphed_reg(RT, destwid, id, result)
         if (!RT.isvec) break
      if (RT.isvec) { id += 1; }
      if (RA.isvec) { irs1 += 1; }
      if (RB.isvec) { irs2 += 1; }

Thus it can be clearly seen that elements are packed by their
element width, and the packing starts from the source (or destination)
specified by the instruction.
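
For example, with a destination elwidth of 8 and RT=4 marked as a Vector
(an illustrative use of the pseudocode above):

    # sketch: ew=8 packing. elements 0..7 all land in the bytes of r4;
    # element 8 continues into the first byte of r5, because Vectors
    # overlap the underlying 64-bit registers.
    set_polymorphed_reg(4, 8, 0, 0x11)   # r4, byte 0
    set_polymorphed_reg(4, 8, 7, 0x88)   # r4, byte 7
    set_polymorphed_reg(4, 8, 8, 0x99)   # r5, byte 0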

# Twin (implicit) result operations

Some operations in the Power ISA already target two 64-bit scalar
registers: `lq` for example, and LD with update.
Some mathematical algorithms are more
efficient when there are two outputs rather than one, providing
feedback loops between elements (the most well-known being add with
carry). 64-bit multiply
for example actually internally produces a 128 bit result, which clearly
cannot be stored in a single 64 bit register. Some ISAs recommend
"macro op fusion": the practice of setting a convention whereby if
two commonly used instructions (mullo, mulhi) use the same ALU but
one selects the low part of an identical operation and the other
selects the high part, then optimised micro-architectures may
"fuse" those two instructions together, using Micro-coding techniques,
internally.

The practice and convention of macro-op fusion however is not compatible
with SVP64 Horizontal-First, because Horizontal Mode may only
be applied to a single instruction at a time, and SVP64 is based on
the principle of strict Program Order even at the element
level. Thus it becomes
necessary to add explicit, more complex single instructions with
more operands than would normally be seen in the average RISC ISA
(3-in, 2-out, in some cases). If it
was not for Power ISA already having LD/ST with update as well as
Condition Codes and `lq` this would be hard to justify.

With limited space in the `EXTRA` Field, and Power ISA opcodes
being only 32 bit, 5 operands is quite an ask. `lq` however sets
a precedent: `RTp` stands for "RT pair". In other words the result
is stored in RT and RT+1. For Scalar operations, following this
precedent is perfectly reasonable. In Scalar mode,
`madded` therefore stores the two halves of the 128-bit multiply
into RT and RT+1.

What, then, of `sv.madded`? If the destination is hard-coded to
RT and RT+1 the instruction is not useful when Vectorised because
the output will be overwritten on the next element. To solve this
is easy: define the destination registers as RT and RT+MAXVL
respectively. This makes it easy for compilers to statically allocate
registers even when VL changes dynamically.

Bearing in mind that both RT and RT+MAXVL are starting points for
Vectors,
and that element-width overrides still have to be taken
into consideration, the starting point for the implicit destination
is best illustrated in pseudocode:

    # demo of madded
    for (i = 0; i < VL; i++)
      if (predval & 1<<i) # predication
         src1 = get_polymorphed_reg(RA, srcwid, irs1)
         src2 = get_polymorphed_reg(RB, srcwid, irs2)
         src3 = get_polymorphed_reg(RC, srcwid, irs3)
         result = src1*src2 + src3
         destmask = (1<<destwid)-1
         # store two halves of result, both start from RT.
         set_polymorphed_reg(RT, destwid, id,       result&destmask)
         set_polymorphed_reg(RT, destwid, id+MAXVL, result>>destwid)
         if (!RT.isvec) break
      if (RT.isvec) { id += 1; }
      if (RA.isvec) { irs1 += 1; }
      if (RB.isvec) { irs2 += 1; }
      if (RC.isvec) { irs3 += 1; }

The significant part here is that the second half is stored
starting not from RT+MAXVL at all: it is the *element* index
that is offset by MAXVL, both halves actually starting from RT.
If VL is 3, MAXVL is 5, RT is 1, and dest elwidth is 32 then the elements
RT0 to RT2 are stored:

          0..31      32..63
    r0    unchanged  unchanged
    r1    RT0.lo     RT1.lo
    r2    RT2.lo     unchanged
    r3    unchanged  RT0.hi
    r4    RT1.hi     RT2.hi
    r5    unchanged  unchanged

Note that all of the LO halves start from r1, but that the HI halves
start from half-way into r3. The reason is that with MAXVL being
5 and elwidth being 32, this is the 5th element
offset (in 32 bit quantities) counting from r1.

*Programmer's note: accessing registers that have been placed
starting on a non-contiguous boundary (half-way along a scalar
register) can be inconvenient: REMAP can provide an offset but
it requires extra instructions to set up. A simple solution
is to ensure that MAXVL is rounded up such that the Vector
ends cleanly on a contiguous register boundary. MAXVL=6 in
the above example would achieve that.*

Additional DRAFT Scalar instructions in 3-in 2-out form
with an implicit 2nd destination:

* [[isa/svfixedarith]]
* [[isa/svfparith]]