[[!tag standards]]
# SVP64 Branch Conditional behaviour

**DRAFT STATUS**

Please note: although similar, SVP64 Branch instructions should be
considered completely separate and distinct from
standard scalar OpenPOWER-approved v3.0B branches.
**v3.0B branches are in no way impacted, altered,
changed or modified in any way, shape or form by
the SVP64 Vectorised Variants**.

It is also
extremely important to note that Branches are the
sole semi-exception in SVP64 to `Scalar Identity Behaviour`.
SVP64 Branches contain additional modes that are useful
for scalar operations (i.e. even when VL=1 or when
using single-bit predication).

Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=664>
* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html>
* [[openpower/isa/branch]]

# Rationale

Scalar v3.0B Branch Conditional operations (`bc`, `bctar` etc.) test a
Condition Register. For parallel processing, however, it is impossible
to perform multiple independent branches: the Program Counter
cannot branch to multiple destinations based on multiple conditions.
The best that can be done is
to test multiple Conditions and make a decision on a *single* branch,
based on analysis of a *Vector* of CR Fields
which have just been calculated from a *Vector* of results.

In 3D Shader
binaries, which are inherently parallelised and predicated, testing all or
some results and branching based on multiple tests is extremely common,
and a fundamental part of Shader Compilers. Example:
without such multi-condition
test-and-branch, if a predicate mask is all zeros a large batch of
instructions may be masked out to `nop`, and it would waste
CPU cycles to run them. 3D GPU ISAs can test for this scenario
and, with the appropriate predicate-analysis instruction,
jump over fully-masked-out operations, by spotting that
*all* Conditions are false.

Unless Branches are aware and capable of such analysis, additional
instructions would be required which perform Horizontal Cumulative
analysis of Vectorised Condition Register Fields, in order to
reduce the Vector of CR Fields down to one single yes or no
decision that a Scalar-only v3.0B Branch-Conditional could cope with.
Such instructions would be unavoidable and costly
by comparison to a single Vector-aware Branch.
Therefore, in order to be commercially competitive, `sv.bc` and
other Vector-aware Branch Conditional instructions are a high priority
for 3D GPU (and CUDA) workloads.

Given that Power ISA v3.0B is already quite powerful, particularly
the Condition Registers and their interaction with Branches, there
are opportunities to create extremely flexible and compact
Vectorised Branch behaviour. In addition, the side-effects (updating
of CTR, truncation of VL, described below) make it a useful instruction
even if the branch points to the next instruction (no actual branch).

# Overview

When considering an "array" of branch-tests, there are four
primarily-useful modes:
AND, OR, NAND and NOR of all Conditions.
NAND and NOR may be synthesised from AND and OR by
inverting `BO[1]`, which just leaves two modes:

* Branch takes place on the **first** CR Field test to succeed
  (a Great Big OR of all condition tests)
* Branch takes place only if **all** CR Field tests succeed
  (a Great Big AND of all condition tests)

Early-exit is enacted such that the Vectorised Branch does not
perform needless extra tests, which will help reduce reads on
the Condition Register file.

*Note: Early-exit is **MANDATORY** (required) behaviour.
Branches **MUST** exit at the first sequentially-encountered
failure point, for
exactly the same reasons for which it is mandatory in
programming languages doing early-exit: to avoid
damaging side-effects and to provide deterministic
behaviour. Speculative testing of Condition
Register Fields is permitted, as is speculative updating
of CTR, as long as, as usual in any Out-of-Order microarchitecture,
that speculative testing is cancelled should an early-exit occur.*
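
The two modes and their mandatory early-exit can be illustrated with a
minimal Python sketch. It is not part of the specification: the function
name and the idea of passing in precomputed per-element test results are
assumptions made purely for illustration. It reports how many tests were
actually evaluated before the decision was reached.

```python
def vector_branch_decision(tests, all_mode):
    """Return (branch_taken, tests_evaluated).

    tests:    per-element Condition Test results, in sequential order
    all_mode: True for Great Big AND, False for Great Big OR
    """
    evaluated = 0
    for passed in tests:
        evaluated += 1
        if all_mode and not passed:
            return False, evaluated   # AND: first failure decides, stop
        if not all_mode and passed:
            return True, evaluated    # OR: first success decides, stop
    # no early exit: AND of everything passed, or OR found nothing
    return all_mode, evaluated


# the element after the deciding one is never looked at
print(vector_branch_decision([True, True, False, True], all_mode=True))
print(vector_branch_decision([False, True, False, True], all_mode=False))
```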

Also note that when early-exit occurs in Horizontal-first Mode,
srcstep, dststep etc. are all reset, ready to begin looping from the
beginning for the next instruction. However for Vertical-first
Mode srcstep etc. are incremented "as usual" i.e. the early-exit
has no special impact. This can leave srcstep etc. in an unusual
state on exit from a loop and it is up to the programmer to
reset srcstep, dststep etc. to known-good values.

Additional useful behaviour involves two primary Modes (both of
which may be enabled and combined):

* **VLSET Mode**: identical to Data-Dependent Fail-First Mode
  for Arithmetic SVP64 operations, with more
  flexibility and a close interaction and integration into the
  underlying base Scalar v3.0B Branch instruction.
  Truncation of VL takes place around the early-exit point.
* **CTR-test Mode**: gives much more flexibility over when and why
  CTR is decremented, including options to decrement if a Condition
  test succeeds *or if it fails*.

With these side-effects, basic Boolean Logic Analysis advises that
it is important to provide a means
to enact each of them based on whether testing succeeds *or fails*. This
results in a not-insignificant number of additional Mode Augmentation bits,
accompanying VLSET and CTR-test Modes respectively.

Predicate skipping or zeroing may, as usual with SVP64, be controlled
by `sz`.
Where the predicate is masked out and
zeroing is enabled,
the same Boolean Logic Analysis dictates that
rather than testing only against zero, the option to test
against one is also prudent. This introduces a new
immediate field, `SNZ`, which works in conjunction with
`sz`.

Vectorised Branches can be used
in either SVP64 Horizontal-First or Vertical-First Mode. Essentially,
at an element level, the behaviour is identical in both Modes,
although the `ALL` bit is meaningless in Vertical-First Mode.

It is also important
to bear in mind that, fundamentally, Vectorised Branch-Conditional
is still extremely close to the Scalar v3.0B Branch-Conditional
instructions, and that the same v3.0B Scalar Branch-Conditional
instructions are still
*completely separate and independent*, being unaltered and
unaffected by their SVP64 variants in every conceivable way.

*Programming note: SVP64 instructions are 64 bits long
(8 bytes, not 4). This needs to be taken into consideration when computing
branch offsets: the offset is relative to the start of the instruction,
which **includes** the SVP64 Prefix.*
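
As a rough illustration of the programming note, the following Python
sketch computes the 14-bit `BD` field for a relative `sv.bc`. The helper
name and the example addresses are hypothetical. It assumes, as stated
above, that the offset is measured from the address of the SVP64 Prefix
word, and it relies on the v3.0B rule that `BD || 0b00` is sign-extended
to give a byte offset.

```python
SVP64_INSN_BYTES = 8   # 4-byte SVP64 Prefix followed by the 4-byte v3.0B word


def bd_field(prefix_addr, target_addr):
    """Encode the BD field of a relative sv.bc (hypothetical helper).

    prefix_addr: address of the SVP64 Prefix word of the branch
    target_addr: address of the branch destination
    """
    offset = target_addr - prefix_addr
    if offset % 4 != 0:
        raise ValueError("branch targets are word-aligned")
    if not -(1 << 15) <= offset < (1 << 15):
        raise ValueError("target out of range for a 14-bit BD field")
    return (offset >> 2) & 0x3FFF


# a branch to the next instruction must skip all 8 bytes, not 4
print(hex(bd_field(0x1000, 0x1000 + SVP64_INSN_BYTES)))
# a backwards branch of 16 bytes
print(hex(bd_field(0x1010, 0x1000)))
```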

# Format and fields

With element-width overrides being meaningless for Condition
Register Fields, bits 4 thru 7 of SVP64 RM may be used for additional
Mode bits.

SVP64 RM `MODE` (includes `ELWIDTH` and `ELWIDTH_SRC` bits) for Branch
Conditional:

| 4 | 5 | 6 | 7 | 19 | 20 | 21 | 22 23 | description |
| - | - | - | - | -- | -- | --- |---------|----------------- |
|ALL|SNZ| / | / | 0 | 0 | / | LRu sz | normal mode |
|ALL|SNZ| / |VSb| 0 | 1 | VLI | LRu sz | VLSET mode |
|ALL|SNZ|CTi| / | 1 | 0 | / | LRu sz | CTR-test mode |
|ALL|SNZ|CTi|VSb| 1 | 1 | VLI | LRu sz | CTR-test+VLSET mode |

Brief description of fields:

* **sz=1**: when predication is enabled and a predicate
  element bit is zero, `SNZ` is
  substituted in place of the CR bit selected by `BI`
  as the Condition tested.
  Contrast this with
  normal SVP64 `sz=1` behaviour, where *only* a zero is put in
  place of masked-out predicate bits.
* **sz=0**: skipping occurs as usual on
  masked-out elements, but unlike all
  other SVP64 behaviour, which entirely skips an element with
  no related side-effects at all, there are certain
  special circumstances where CTR
  may be decremented. See CTR-test Mode, below.
* **ALL**: when set, all branch conditional tests must pass in order for
  the branch to succeed. When clear, it is the first sequentially
  encountered successful test that causes the branch to succeed.
  This is identical behaviour to how programming languages perform
  early-exit on Boolean Logic chains.
* **VLI**: VLSET is identical to Data-dependent Fail-First mode.
  In VLSET mode, VL *may* (depending on `VSb`) be truncated.
  If VLI (Vector Length Inclusive) is clear,
  VL is truncated to *exclude* the current element, otherwise it is
  included. SVSTATE.MVL is not altered: only VL.
* **LRu**: Link Register Update. When set, Link Register will
  only be updated if the Branch Condition succeeds. This avoids
  destruction of LR during loops (particularly Vertical-First
  ones).
* **VSb**: in VLSET Mode, after testing,
  if VSb is set, VL is truncated if the test succeeds. If VSb is clear,
  VL is truncated if a test *fails*. Masked-out (skipped)
  bits are not considered
  part of testing.
* **CTi**: CTR inversion. CTR-test Mode normally decrements per element
  tested. CTR inversion decrements if a test *fails*. Only relevant
  in CTR-test Mode.

LRu and CTR-test modes are where SVP64 Branches subtly differ from
Scalar v3.0B Branches. `bclr` for example will always update LR, whereas
`sv.bclr/lru` will only update LR if the branch succeeds.

Of special interest is that when using ALL Mode (Great Big AND
of all Condition Tests), if `VL=0`,
which is rare but can occur in Data-Dependent Modes, the Branch
will always take place because there will be no failing Condition
Tests to prevent it. Likewise when not using ALL Mode (Great Big OR
of all Condition Tests) and `VL=0` the Branch is guaranteed not
to occur because there will be no *successful* Condition Tests
to make it happen.
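
One way to read the mode table in this section is as a small decoder,
sketched here in Python. The function name and the representation of the
RM bits as a dictionary keyed by bit number are assumptions made for
illustration only, as is the reading of the `22 23` column as `LRu` in
bit 22 and `sz` in bit 23.

```python
MODE_NAMES = {
    (0, 0): "normal",
    (0, 1): "VLSET",
    (1, 0): "CTR-test",
    (1, 1): "CTR-test+VLSET",
}


def decode_branch_rm(bits):
    """bits: dict mapping SVP64 RM bit number -> 0 or 1 (hypothetical form)."""
    ctr_test, vlset = bits[19], bits[20]
    fields = {
        "mode": MODE_NAMES[(ctr_test, vlset)],
        "ALL": bits[4],
        "SNZ": bits[5],
        "LRu": bits[22],
        "sz": bits[23],
    }
    if ctr_test:            # bit 6 only carries CTi in the CTR-test rows
        fields["CTi"] = bits[6]
    if vlset:               # bits 7 and 21 only carry VSb/VLI in VLSET rows
        fields["VSb"] = bits[7]
        fields["VLI"] = bits[21]
    return fields


example = {4: 1, 5: 0, 6: 1, 7: 0, 19: 1, 20: 0, 21: 0, 22: 1, 23: 0}
print(decode_branch_rm(example))
```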

# Vectorised CR Field numbering, and Scalar behaviour

It is important to keep in mind that just like all SVP64 instructions,
the `BI` field of the base v3.0B Branch Conditional instruction
may be extended by SVP64 EXTRA augmentation, as well as be marked
as either Scalar or Vector. It is also crucially important to keep in mind
that for CRs, SVP64 sequentially increments the CR *Field* numbers.
CR *Fields* are treated as elements, not bit-numbers of the CR *register*.

The `BI` operand of Branch Conditional operations is five bits. In scalar
v3.0B this selects one bit of the 32-bit CR,
which comprises eight CR Fields of 4 bits each. In SVP64 there are
sixteen 32-bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
`BI` select the bit from the CR Field (LT, GT, EQ or SO), and the top 3 bits
are extended to either scalar or vector and to select CR Fields 0..127
as specified in SVP64 [[sv/svp64/appendix]].
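
The split of `BI` into a CR Field number and a bit within that Field can
be sketched as follows. This Python fragment only covers the unaugmented
5-bit value: the EXTRA extension to CR Fields 0..127 is defined in the
appendix and is deliberately not modelled here. The function name is
hypothetical, and the bit names follow the scalar Power ISA ordering
within a CR Field.

```python
CR_BIT_NAMES = ("LT", "GT", "EQ", "SO")   # scalar Power ISA order in a Field


def split_bi(bi):
    """Split a 5-bit BI into (CR Field number 0..7, bit name)."""
    if not 0 <= bi < 32:
        raise ValueError("BI is a 5-bit field")
    field = bi >> 2            # top 3 bits: which CR Field
    bit = bi & 0b11            # 2 LSBs: which bit inside that Field
    return field, CR_BIT_NAMES[bit]


print(split_bi(10))   # the operand usually written 4*cr2+eq
```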

When the CR Field selected by SVP64-Augmented `BI` is marked as scalar,
the usual SVP64 rules apply:
the Vector loop ends at the first element tested
(the first CR *Field*), after taking
predication into consideration. Thus, also as usual, when a predicate mask is
given, and `BI` marked as scalar, and `sz` is zero, srcstep
skips forward to the first non-zero predicated element, and only that
one element is tested.

In other words, the fact that this is a Branch
Operation (instead of an arithmetic one) does not result, ultimately,
in significant changes as to
how SVP64 is fundamentally applied, except with respect to:

* the unique properties associated with conditionally
  changing the Program
  Counter (aka "a Branch"), resulting in early-out
  opportunities
* CTR-testing

Both are outlined below.

# Horizontal-First and Vertical-First Modes

In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
AND) results in early exit: no more updates to CTR occur (if requested);
no branch occurs, and LR is not updated (if requested). Likewise for
non-ALL mode (Great Big OR) on first success early exit also occurs,
however this time with the Branch proceeding. In both cases the testing
of the Vector of CRs should be done in linear sequential order (or in
REMAP re-sequenced order), such that tests that are sequentially beyond
the exit point are *not* carried out. (*Note: it is standard practice in
Programming languages to exit early from conditional tests, however
a little unusual to consider in an ISA that is designed for Parallel
Vector Processing. The reason is to have strictly-defined guaranteed
behaviour.*)

In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
behaviour. Given that only one element is being tested at a time
in Vertical-First Mode, a test designed to be done on multiple
bits is meaningless.

# Description and Modes

Predication in both INT and CR modes may be applied to `sv.bc` and other
SVP64 Branch Conditional operations, exactly as they may be applied to
other SVP64 operations. When `sz` is zero, any masked-out Branch-element
operations are not included in condition testing, exactly like all other
SVP64 operations, *including* side-effects such as potentially updating
LR or CTR, which will also be skipped. There is *one* exception here,
which is when
`BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
predicate mask bit is also zero:
under these special circumstances CTR will also decrement.

When `sz` is non-zero, this normally requests insertion of a zero
in place of the input data, when the relevant predicate mask bit is zero.
This would mean that a zero is inserted in place of `CR[BI+32]` for
testing against `BO`, which may not be desirable in all circumstances.
Therefore, an extra field, `SNZ`, is provided which, if set, will insert
a **one** in place of a masked-out element, instead of a zero.

(*Note: Both options are provided because it is useful to deliberately
cause the Branch-Conditional Vector testing to fail at a specific point,
controlled by the Predicate mask. This is particularly useful in `VLSET`
mode, which will truncate SVSTATE.VL at the point of the first failed
test.*)
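
The interaction of the predicate bit, `sz` and `SNZ` described above can
be sketched in a few lines of Python. The function name and the use of
`None` to mean "element skipped" are conventions invented for this
sketch, not anything defined by SVP64.

```python
def select_testbit(cr_bit, predicate_bit, sz, snz):
    """Return the bit to be tested against BO, or None if the element is skipped."""
    if predicate_bit:
        return cr_bit      # element is active: test the real CR bit
    if not sz:
        return None        # sz=0: masked-out element is skipped
    return snz             # sz=1: SNZ (zero or one) is substituted


cr_bits = [1, 0, 1, 1]
predicate = [1, 0, 1, 0]
for sz, snz in [(0, 0), (1, 0), (1, 1)]:
    tested = [select_testbit(c, p, sz, snz) for c, p in zip(cr_bits, predicate)]
    print(f"sz={sz} SNZ={snz}: {tested}")
```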

Normally, CTR mode will decrement once per Condition Test, with the
result that, under normal circumstances, CTR reduces by up to VL in
Horizontal-First Mode. Just as when v3.0B Branch-Conditional saves at
least one instruction on tight inner loops through auto-decrementation
of CTR, likewise it is also possible to save instruction count for
SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
in circumstances where there is conditional interaction between the
element computation and testing, and the continuation (or otherwise)
of a given loop. The potential combinations of interactions are why CTR
testing options have been added.

Also, the unconditional bit `BO[0]` is still relevant when Predication
is applied to the Branch because in `ALL` mode all nonmasked bits have
to be tested, and when `sz=0` skipping occurs.
Even when VLSET mode is not used, CTR
may still be decremented by the total number of nonmasked elements,
acting in effect as either a popcount or cntlz depending on which
mode bits are set.
In short, Vectorised Branch becomes an extremely powerful tool.

## CTR-test

Where a standard Scalar v3.0B branch unconditionally decrements
CTR when `BO[2]` is clear, CTR-test Mode introduces more flexibility
which allows CTR to be used for many more types of Vector loop
constructs.

CTR-test mode and CTi interaction is as follows. Note that
`BO[2]` is still required to be clear for CTR decrements to be
considered, exactly as is the case in Scalar Power ISA v3.0B.

* **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
  if `BO[2]` is zero. Masked-out elements when `sz=0` are
  skipped (i.e. CTR is *not* decremented when the predicate
  bit is zero and `sz=0`).
* **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
  if `BO[2]` is zero and a masked-out element is skipped
  (`sz=0` and predicate bit is zero). This one special case is the
  **opposite** of other combinations, as well as being
  completely different from normal SVP64 `sz=0` behaviour.
* **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
  if `BO[2]` is zero and the Condition Test succeeds.
  Masked-out elements when `sz=0` are skipped (including
  not decrementing CTR).
* **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
  if `BO[2]` is zero and the Condition Test *fails*.
  Masked-out elements when `sz=0` are skipped (including
  not decrementing CTR).

`CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the
only time in the entirety of SVP64 that has side-effects when
a predicate mask bit is clear. **All** other SVP64 operations
entirely skip an element when sz=0 and a predicate mask bit is zero.
It is also critical to emphasise that in this unusual mode,
no other side-effects occur: **only** CTR is decremented, i.e. the
rest of the Branch operation is skipped.
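
The four combinations can be written as a single per-element decision
function. The Python below is a sketch that follows the wording of the
bullet list in this section only. The function and argument names are
hypothetical, and reading the `CTR-test=0, CTi=1` case as "decrement
*only* on skipped elements" is an interpretation of "the opposite of
other combinations", not something stated explicitly.

```python
def ctr_decrements(bo2, ctr_test, cti, skipped, cond_ok):
    """Decide whether CTR decrements for one element.

    skipped: True when the predicate bit is zero and sz=0
    cond_ok: result of the Condition Test (ignored for skipped elements)
    """
    if bo2:
        return False                  # BO[2]=1: CTR is never touched
    if not ctr_test:
        # CTi=0: once per tested element; CTi=1: only on skipped elements
        return skipped if cti else not skipped
    if skipped:
        return False                  # CTR-test=1: skipped means no decrement
    return (not cond_ok) if cti else cond_ok


for ctr_test in (0, 1):
    for cti in (0, 1):
        row = [ctr_decrements(0, ctr_test, cti, skipped, cond_ok)
               for skipped, cond_ok in [(True, False), (False, True), (False, False)]]
        print(f"CTR-test={ctr_test} CTi={cti}: skipped/pass/fail -> {row}")
```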

# VLSET Mode

VLSET Mode truncates the Vector Length so that subsequent instructions
operate on a reduced Vector Length. This is similar to
Data-dependent Fail-First and LD/ST Fail-First, where for VLSET the
truncation occurs at the Branch decision-point.

Interestingly, due to the side-effects of `VLSET` mode
it is actually useful to use Branch Conditional even
to perform no actual branch operation, i.e. to point to the instruction
after the branch. Truncation of VL would thus conditionally occur yet control
flow alteration would not.

`VLSET` mode with Vertical-First is particularly unusual. Vertical-First
is designed to be used for explicit looping, where an explicit call to
`svstep` is required to move both srcstep and dststep on to
the next element, until VL (or other condition) is reached.
Vertical-First Looping is expected (required) to terminate if the end
of the Vector, VL, is reached. If however that loop is terminated early
because VL is truncated, VLSET with Vertical-First becomes meaningless.
Resolving this would require two branches: one Conditional, the other
branching unconditionally to create the loop, where the Conditional
one jumps over it.

Therefore, with `VSb`, the option to decide whether truncation should occur if the
branch succeeds *or* if the branch condition fails allows for the flexibility
required. This allows a Vertical-First Branch to *either* be used as
a branch-back (loop) *or* as part of a conditional exit or function
call from *inside* a loop, and for VLSET to be integrated into both
types of decision-making.

In the case of a Vertical-First branch-back (loop), with `VSb=0` the branch takes
place if success conditions are met, but on exit from that loop
(branch condition fails), VL will be truncated. This is extremely
useful.

`VLSET` mode with Horizontal-First when `VSb=0` is still
useful, because it can be used to truncate VL to the first predicated
(non-masked-out) element.

The truncation point for VL, when VLI is clear, must not include skipped
elements that preceded the current element being tested.
Example: `sz=0, VLI=0, predicate mask = 0b110010` and the Condition
Register failure point is at CR Field element 4.

* Testing at element 0 is skipped because its predicate bit is zero
* Testing at element 1 passed
* Testing elements 2 and 3 are skipped because their
  respective predicate mask bits are zero
* Testing element 4 fails therefore VL is truncated to **2**
  not 4 due to elements 2 and 3 being skipped.

If `sz=1` in the above example *then* VL would have been set to 4 because
in zeroing mode the zero'd elements are still effectively part of the
Vector (with their respective elements set to `SNZ`).

If `VLI=1` then VL would be set to 5 regardless of sz, due to being inclusive
of the element actually being tested.
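
The worked example can be reproduced with a short Python sketch. The
function name is hypothetical, element 0 is taken to be the least
significant bit of the predicate mask (as the example implies), and
returning zero when no element was tested before the failure point is an
assumption of this sketch.

```python
def truncated_vl(mask, fail_index, sz, vli):
    """VL after truncation at a failing element, following the example above."""
    if vli:
        return fail_index + 1          # inclusive of the failing element
    if sz:
        return fail_index              # zeroed elements still count
    # sz=0, VLI=0: back up past skipped elements to the last one tested
    tested = [i for i in range(fail_index) if (mask >> i) & 1]
    return tested[-1] + 1 if tested else 0   # the 0 case is an assumption


mask = 0b110010
print(truncated_vl(mask, fail_index=4, sz=0, vli=0))   # 2
print(truncated_vl(mask, fail_index=4, sz=1, vli=0))   # 4
print(truncated_vl(mask, fail_index=4, sz=0, vli=1))   # 5
```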

## VLSET and CTR-test combined

If both CTR-test and VLSET Modes are requested, it's important to
observe the correct order. What occurs depends on whether VLI
is enabled, because VLI affects the length, VL.

If VLI (VL truncate inclusive) is set:

1. compute the test including whether CTR triggers
2. (optionally) decrement CTR
3. (optionally) truncate VL (VSb inverts the decision)
4. decide (based on step 1) whether to terminate looping
   (including not executing step 5)
5. decide whether to branch.

If VLI is clear, then when a test fails that element
and any following it
should **not** be considered part of the Vector. Consequently:

1. compute the branch test including whether CTR triggers
2. if the test fails against VSb, truncate VL to the *previous*
   element, and terminate looping. No further steps executed.
3. (optionally) decrement CTR
4. decide whether to branch.

# Boolean Logic combinations

In a Scalar ISA, Branch-Conditional testing even of vector
results may be performed through inversion of tests. NOR of
all tests may be performed by inversion of the scalar condition
and branching *out* from the scalar loop around elements,
using scalar operations.

In a parallel (Vector) ISA it is the ISA itself which must perform
the prerequisite logic manipulation.
Thus for SVP64 there are an extraordinary number of necessary combinations
which provide completely different and useful behaviour.
Available options to combine:

* `BO[0]` to make an unconditional branch would seem irrelevant if
  it were not for predication and for side-effects (CTR Mode
  for example)
* Enabling CTR-test Mode and setting `BO[2]` can still result in the
  Branch
  taking place, not because the Condition Test itself failed, but
  because CTR reached zero **because**, as required by CTR-test mode,
  CTR was decremented as a **result** of Condition Tests failing.
* `BO[1]` to select whether the CR bit being tested is zero or nonzero
* `R30` and `~R30` and other predicate mask options including CR and
  inverted CR bit testing
* `sz` and `SNZ` to insert either zeros or ones in place of masked-out
  predicate bits
* `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
  `OR` of all tests, respectively.
* Predicate Mask bits, which combine in effect with the CR being
  tested.
* Inversion of Predicate Masks (`~r3` instead of `r3`, or using
  `NE` rather than `EQ`) which results in an additional
  level of possible ANDing, ORing etc. that would otherwise
  need explicit instructions.

The most obviously useful combination here is to set `BO[1]` to zero
so that every element test is inverted. `ALL` then becomes an AND of
inverted tests (every tested bit is zero: a Great-Big-NOR of the CR bits)
and `ANY` becomes an OR of inverted tests (at least one tested bit is
zero: a Great-Big-NAND of the CR bits).
Other Mode bits which perform behavioural inversion then
have to work round the fact that the Condition Testing is NOR or NAND.
Without the additional behavioural inversion bits
(`SNZ`, `VSb`, `CTi`) the alternative would be to have a second (unconditional)
branch directly after the first, which the first branch jumps over.
This contrivance is avoided by the behavioural inversion bits.
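
The effect of clearing `BO[1]` can be checked exhaustively for a short
vector. This Python sketch assumes `BO[0]=0` and no predication, uses
the per-element test `¬(testbit ^ BO[1])` from the v3.0B pseudocode, and
confirms by De Morgan that `ALL` then succeeds only when every tested CR
bit is zero, while `ANY` succeeds when at least one is zero. The helper
name is made up for the example.

```python
from itertools import product


def el_cond_ok(cr_bit, bo1):
    """Per-element Condition Test with BO[0]=0: not (cr_bit xor BO[1])."""
    return cr_bit == bo1


checked = 0
for bits in product((0, 1), repeat=3):
    all_inverted = all(el_cond_ok(b, 0) for b in bits)   # ALL mode, BO[1]=0
    any_inverted = any(el_cond_ok(b, 0) for b in bits)   # ANY mode, BO[1]=0
    assert all_inverted == (not any(bits))   # same as NOR of the raw bits
    assert any_inverted == (not all(bits))   # same as NAND of the raw bits
    checked += 1

print(f"{checked} bit patterns checked")
```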

# Pseudocode and examples

Please see [[svp64/appendix]] regarding CR bit ordering and for
the definition of `CR{n}`.

For comparative purposes this is a copy of the v3.0B `bc` pseudocode:

```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else NIA <-iea CIA + EXTS(BD || 0b00)
if LK then LR <-iea CIA + 4
```
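
For readers who prefer something executable, the v3.0B `bc` pseudocode
above can be transliterated into Python roughly as follows. This is an
illustrative model only: the function signature, passing `BO` as a tuple
of five bits (with `bo[0]` the most significant, matching the `BO[n]`
numbering above) and passing the already-selected CR bit are all choices
made for the sketch, and effective-address truncation (`<-iea`) is not
modelled.

```python
def exts(value, bits):
    """Sign-extend a 'bits'-wide unsigned value to a Python int."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def bc(bo, cr_bit, ctr, cia, bd, aa=False, lk=False, lr=0, is_64bit=True):
    """Model of scalar bc. Returns (nia, ctr, lr)."""
    ctr_mask = (1 << 64) - 1 if is_64bit else (1 << 32) - 1
    if not bo[2]:
        ctr = (ctr - 1) & ((1 << 64) - 1)
    ctr_ok = bo[2] or (((ctr & ctr_mask) != 0) != bool(bo[3]))
    cond_ok = bo[0] or (cr_bit == bo[1])
    nia = cia + 4                       # fall through by default
    if ctr_ok and cond_ok:
        disp = exts(bd << 2, 16)        # EXTS(BD || 0b00)
        nia = disp if aa else cia + disp
    if lk:
        lr = cia + 4                    # unconditional in v3.0B
    return nia, ctr, lr


# bdnz-style BO (decrement CTR, branch if CTR != 0), 8 bytes backwards
BDNZ = (1, 0, 0, 0, 0)
print(bc(BDNZ, cr_bit=0, ctr=3, cia=0x100, bd=0x3FFE))
print(bc(BDNZ, cr_bit=0, ctr=1, cia=0x100, bd=0x3FFE))
```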

Simplified pseudocode including LRu and CTR skipping, which illustrates
clearly that SVP64 Scalar Branches (VL=1) are **not** identical to
v3.0B Scalar Branches. The key areas where differences occur are
the inclusion of predication (which can still be used when VL=1), in
when and why CTR is decremented (CTR-test Mode) and whether LR is
updated (which is unconditional in v3.0B when LK=1, and conditional
in SVP64 when LRu=1).

```
if (mode_is_64bit) then M <- 0
else M <- 32
testbit = CR[BI+32]
if ¬predicate_bit then testbit = SVRMmode.SNZ
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(testbit ^ BO[1])
if ¬predicate_bit & ¬SVRMmode.sz then
    if ¬BO[2] & CTRtest & ¬CTi then
        CTR = CTR - 1
    stop # instruction finishes here
if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
# LRu=1: LR is only updated if the branch is taken
lr_ok <- ¬SVRMmode.LRu
if ctr_ok & cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else NIA <-iea CIA + EXTS(BD || 0b00)
    lr_ok <- 0b1
if LK & lr_ok then LR <-iea CIA + 4
```

Below is the pseudocode for SVP64 Branches, which is a little less
obvious but identical to the above. The lack of obviousness is down
to the early-exit opportunities.

Pseudocode for Horizontal-First Mode:

```
if (mode_is_64bit) then M <- 0
else M <- 32
# Great Big AND starts true, Great Big OR starts false
cond_ok = SVRMmode.ALL
for srcstep in range(VL):
    # select predicate bit or zero/one
    if predicate[srcstep]:
        # get SVP64 extended CR field 0..127
        SVCRf = SVP64EXTRA(BI>>2)
        CRbits = CR{SVCRf}
        testbit = CRbits[BI & 0b11]
        # testbit = CR[BI+32+srcstep*4]
    else if not SVRMmode.sz:
        # inverted CTR test skip mode
        if ¬BO[2] & CTRtest & ¬CTi then
            CTR = CTR - 1
        continue # skip to next element
    else
        testbit = SVRMmode.SNZ
    # actual element test here
    ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
    el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
    # check if CTR dec should occur
    ctrdec = ¬BO[2]
    if CTRtest & (el_cond_ok ^ CTi) then
        ctrdec = 0b0
    if ctrdec then CTR <- CTR - 1
    # merge in the test
    if SVRMmode.ALL:
        cond_ok &= (el_cond_ok & ctr_ok)
    else
        cond_ok |= (el_cond_ok & ctr_ok)
    # test for VL to be set (and exit)
    if VLSET and VSb = (el_cond_ok & ctr_ok) then
        if SVRMmode.VLI
            SVSTATE.VL = srcstep+1
        else
            SVSTATE.VL = srcstep
        break
    # early exit?
    if SVRMmode.ALL != (el_cond_ok & ctr_ok):
        break
    # SVP64 rules about Scalar registers still apply!
    if SVCRf.scalar:
        break
# loop finally done, now test if branch (and update LR)
# LRu=1: LR is only updated if the branch is taken
lr_ok <- ¬SVRMmode.LRu
if cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else NIA <-iea CIA + EXTS(BD || 0b00)
    lr_ok <- 0b1
if LK & lr_ok then LR <-iea CIA + 4
```

Pseudocode for Vertical-First Mode:

```
# get SVP64 extended CR field 0..127
SVCRf = SVP64EXTRA(BI>>2)
CRbits = CR{SVCRf}
# select predicate bit or zero/one
if predicate[srcstep]:
    if BRc = 1 then # CR0 vectorised
        CR{SVCRf+srcstep} = CRbits
    testbit = CRbits[BI & 0b11]
else if not SVRMmode.sz:
    # inverted CTR test skip mode
    if ¬BO[2] & CTRtest & ¬CTi then
        CTR = CTR - 1
    SVSTATE.srcstep = new_srcstep
    exit # no branch testing
else
    testbit = SVRMmode.SNZ
# actual element test here
cond_ok <- BO[0] | ¬(testbit ^ BO[1])
# test for VL to be set (and exit)
if VLSET and cond_ok = VSb then
    if SVRMmode.VLI
        SVSTATE.VL = new_srcstep+1
    else
        SVSTATE.VL = new_srcstep
```

# Example Shader code

```
// assume f() g() or h() modify a and/or b
while(a > 2) {
    if(b < 5)
        f();
    else
        g();
    h();
}
```

which compiles to something like:

```
vec<i32> a, b;
// ...
pred loop_pred = a > 2;
// loop continues while any of a elements greater than 2
while(loop_pred.any()) {
    // vector of predicate bits
    pred if_pred = loop_pred & (b < 5);
    // only call f() if at least 1 bit set
    if(if_pred.any()) {
        f(if_pred);
    }
label1:
    // loop mask ANDs with inverted if-test
    pred else_pred = loop_pred & ~if_pred;
    // only call g() if at least 1 bit set
    if(else_pred.any()) {
        g(else_pred);
    }
    h(loop_pred);
    // re-evaluate the loop condition (see looptest in the assembly)
    loop_pred = a > 2;
}
```
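
To make the transformation concrete, here is a small Python simulation
comparing a per-lane scalar loop with a predicated "vector" version of
the same shape as the code above. The bodies of `f`, `g` and `h` are
invented for the example (the original leaves them unspecified), lists
stand in for vector registers, and lists of booleans stand in for
predicate masks.

```python
def f(a, b): return a, b + 3      # hypothetical body
def g(a, b): return a - 1, b      # hypothetical body
def h(a, b): return a - 1, b      # hypothetical body


def scalar_loop(a, b):
    while a > 2:
        a, b = f(a, b) if b < 5 else g(a, b)
        a, b = h(a, b)
    return a, b


def apply_masked(fn, pred, a, b):
    """Run fn only on the lanes whose predicate bit is set."""
    for i, active in enumerate(pred):
        if active:
            a[i], b[i] = fn(a[i], b[i])


def vector_loop(a, b):
    a, b = list(a), list(b)
    loop_pred = [x > 2 for x in a]
    while any(loop_pred):                       # branch on ANY
        if_pred = [lp and y < 5 for lp, y in zip(loop_pred, b)]
        if any(if_pred):                        # skip f() when fully masked
            apply_masked(f, if_pred, a, b)
        else_pred = [lp and not ip for lp, ip in zip(loop_pred, if_pred)]
        if any(else_pred):                      # skip g() when fully masked
            apply_masked(g, else_pred, a, b)
        apply_masked(h, loop_pred, a, b)
        loop_pred = [x > 2 for x in a]
    return a, b


a0, b0 = [5, 1, 3, 8], [0, 9, 4, 7]
expected = [scalar_loop(x, y) for x, y in zip(a0, b0)]
va, vb = vector_loop(a0, b0)
print(list(zip(va, vb)) == expected)
```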

which will end up as:

```
    # start from while loop test point
    b looptest
while_loop:
    sv.cmpi CR80.v, b.v, 5 # vector compare b into CR80 Vector
    sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
    # only calculate loop_pred & pred_b because needed in f()
    sv.crand CR80.v.SO, CR60.v.GT, CR80.v.LT # if = loop & pred_b
    f(CR80.v.SO)
skip_f:
    # illustrate inversion of pred_b. invert r30, test ALL
    # rather than SOME, but masked-out zero test would FAIL,
    # therefore masked-out instead is tested against 1 not 0
    sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
    # else = loop & ~pred_b, need this because used in g()
    sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.v.LT
    g(CR80.v.SO)
skip_g:
    # conditionally call h(r30) if any loop pred set
    sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
looptest:
    sv.cmpi CR60.v, a.v, 2 # vector compare a into CR60 vector
    sv.crweird r30, CR60.GT # transfer GT vector to r30
    sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
end:
```