[[!tag standards]]
# SVP64 Branch Conditional behaviour

**DRAFT STATUS**

Please note: SVP64 Branch instructions should be
considered completely separate and distinct from
standard scalar OpenPOWER-approved v3.0B branches.
**v3.0B branches are in no way impacted, altered,
changed or modified in any way, shape or form by
the SVP64 Vectorised Variants**. It is
extremely important to note that this is the one
sole semi-exception in SVP64 to `Scalar Identity Behaviour`.
SVP64 Branches contain additional modes that are useful
for scalar operations (i.e. even when VL=1).

Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=664>
* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html>
* [[openpower/isa/branch]]

# Rationale

Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test a
Condition Register. However for parallel processing it is simply impossible
to perform multiple independent branches: the Program Counter simply
cannot branch to multiple destinations based on multiple conditions.
The best that can be done is
to test multiple Conditions and make a decision of a *single* branch,
based on analysis of a *Vector* of CR Fields
which have just been calculated from a *Vector* of results.

In 3D Shader
binaries, which are inherently parallelised and predicated, testing all or
some results and branching based on multiple tests is extremely common,
and a fundamental part of Shader Compilers. Example:
without such multi-condition
test-and-branch, if a predicate mask is all zeros a large batch of
instructions may be masked out to `nop`, and it would waste
CPU cycles to run them. 3D GPU ISAs can test for this scenario
and, with the appropriate predicate-analysis instruction,
jump over fully-masked-out operations, by spotting that
*all* Conditions are false.

Unless Branches are aware and capable of such analysis, additional
instructions would be required which perform Horizontal Cumulative
analysis of Vectorised Condition Register Fields, in order to
reduce the Vector of CR Fields down to one single yes or no
decision that a Scalar-only v3.0B Branch-Conditional could cope with.
Such instructions would be unavoidable, required, and costly
by comparison to a single Vector-aware Branch.
Therefore, in order to be commercially competitive, `sv.bc` and
other Vector-aware Branch Conditional instructions are a high priority
for 3D GPU (and CUDA) workloads.

Given that Power ISA v3.0B is already quite powerful, particularly
the Condition Registers and their interaction with Branches, there
are opportunities to create extremely flexible and compact
Vectorised Branch behaviour. In addition, the side-effects (updating
of CTR, truncation of VL, described below) make it a useful instruction
even if the branch points to the next instruction (no actual branch).

# Overview

When considering an "array" of branch-tests, there are four useful modes:
AND, OR, NAND and NOR of all Conditions.
NAND and NOR may be synthesised from AND and OR by
inverting `BO[1]` which just leaves two modes:

* Branch takes place on the **first** CR Field test to succeed
  (a Great Big OR of all condition tests)
* Branch takes place only if **all** CR field tests succeed:
  a Great Big AND of all condition tests

Early-exit is enacted such that the Vectorised Branch does not
perform needless extra tests, which will help reduce reads on
the Condition Register file.

*Note: Early-exit is **MANDATORY** (required) behaviour.
Branches **MUST** exit at the first failure point, for
exactly the same reasons for which it is mandatory in
programming languages doing early-exit: to avoid
damaging side-effects. Speculative testing of Condition
Register Fields is permitted, as is speculative updating
of CTR, as long as, as usual in any Out-of-Order microarchitecture,
that speculative testing is cancelled should an early-exit occur.*

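The two modes and their mandatory early-exit can be modelled in a few lines of Python. This is a minimal sketch, not part of the specification: the function name `vector_branch` and the idea of passing in already-computed per-element test results are assumptions made purely for illustration.

```python
from typing import Iterable, Tuple


def vector_branch(el_cond_ok: Iterable[bool], all_mode: bool) -> Tuple[bool, int]:
    """Return (branch_taken, tests_performed) for one vectorised branch.

    el_cond_ok: per-element condition test results, in element order
    all_mode:   True for Great Big AND, False for Great Big OR
    """
    tests_performed = 0
    for ok in el_cond_ok:
        tests_performed += 1
        if all_mode and not ok:
            # Great Big AND: the first failure decides, later tests are skipped
            return False, tests_performed
        if not all_mode and ok:
            # Great Big OR: the first success decides, later tests are skipped
            return True, tests_performed
    # No early exit: AND saw no failure, OR saw no success
    return all_mode, tests_performed


print(vector_branch([True, False, True, True], all_mode=True))
print(vector_branch([False, False, True, True], all_mode=False))
```

The second return value shows how many Condition Register Fields actually had to be read, which is the saving that early-exit is intended to provide.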
Additional useful behaviour involves two primary Modes (both of
which may be enabled and combined):

* **VLSET Mode**: identical to Data-Dependent Fail-First Mode
  for Arithmetic SVP64 operations, with more
  flexibility and a close interaction and integration into the
  underlying base Scalar v3.0B Branch instruction.
  Truncation of VL takes place around the early-exit point.
* **CTR-test Mode**: gives much more flexibility over when and why
  CTR is decremented, including options to decrement if a Condition
  test succeeds *or if it fails*.

With these side-effects, basic Boolean Logic Analysis advises that
it is important to provide a means
to enact them each based on whether testing succeeds *or fails*. This
results in a not-insignificant number of additional Mode Augmentation bits,
accompanying VLSET and CTR-test Modes respectively.

Predicate skipping or zeroing may, as usual with SVP64, be controlled
by `sz`.
Where the predicate is masked out and
zeroing is enabled, then in such circumstances
the same Boolean Logic Analysis dictates that
rather than testing only against zero, the option to test
against one is also prudent. This introduces a new
immediate field, `SNZ`, which works in conjunction with
`sz`.

Vectorised Branches can be used
in either SVP64 Horizontal-First or Vertical-First Mode. Essentially,
at an element level, the behaviour is identical in both Modes,
although the `ALL` bit is meaningless in Vertical-First Mode.

It is also important
to bear in mind that, fundamentally, Vectorised Branch-Conditional
is still extremely close to the Scalar v3.0B Branch-Conditional
instructions, and that the same v3.0B Scalar Branch-Conditional
instructions are still
*completely separate and independent*, being unaltered and
unaffected by their SVP64 variants in every conceivable way.

*Programming note: SVP64 instructions are 64 bits long
(8 bytes, not 4). This needs to be taken into consideration when computing
branch offsets: the offset is relative to the start of the instruction,
which **includes** the SVP64 Prefix.*

# Format and fields

With element-width overrides being meaningless for Condition
Register Fields, bits 4 through 7 of SVP64 RM may be used for additional
Mode bits.

SVP64 RM `MODE` (includes `ELWIDTH` and `ELWIDTH_SRC` bits) for Branch
Conditional:

| 4 | 5 | 6 | 7 | 19 | 20 | 21  | 22 23  | description         |
| - | - | - | - | -- | -- | --- |--------|---------------------|
|ALL|SNZ| / | / | 0  | 0  | /   | LRu sz | normal mode         |
|ALL|SNZ| / |VSb| 0  | 1  | VLI | LRu sz | VLSET mode          |
|ALL|SNZ|CTi| / | 1  | 0  | /   | LRu sz | CTR-test mode       |
|ALL|SNZ|CTi|VSb| 1  | 1  | VLI | LRu sz | CTR-test+VLSET mode |

Brief description of fields:

* **sz=1** If predication is enabled, `sz=1`, and a predicate
  element bit is zero, `SNZ` will
  be substituted in place of the CR bit selected by `BI`,
  as the Condition tested.
  Contrast this with
  normal SVP64 `sz=1` behaviour, where *only* a zero is put in
  place of masked-out predicate bits.
* **sz=0** When `sz=0` skipping occurs as usual on
  masked-out elements, but unlike all
  other SVP64 behaviour which entirely skips an element with
  no related side-effects at all, there are certain
  special circumstances where CTR
  may be decremented. See CTR-test Mode, below.
* **ALL** When set, all branch conditional tests must pass in order for
  the branch to succeed. When clear, it is the first sequentially
  encountered successful test that causes the branch to succeed.
  This is identical behaviour to how programming languages perform
  early-exit on Boolean Logic chains.
* **VLI** VLSET is identical to Data-dependent Fail-First mode.
  In VLSET mode, VL *may* (depending on `VSb`) be truncated.
  If VLI (Vector Length Inclusive) is clear,
  VL is truncated to *exclude* the current element, otherwise it is
  included. SVSTATE.MVL is not altered: only VL.
* **LRu** Link Register Update. When set, Link Register will
  only be updated if the Branch Condition succeeds. This avoids
  destruction of LR during loops (particularly Vertical-First
  ones).
* **VSb** In VLSET Mode, after testing,
  if VSb is set, VL is truncated if the branch succeeds. If VSb is clear,
  VL is truncated if the branch did **not** take place.
* **CTi** CTR inversion. CTR-test Mode normally decrements per element
  tested. CTR inversion decrements if a test *fails*. Only relevant
  in CTR-test Mode.

LRu and CTR-test modes are where SVP64 Branches subtly differ from
Scalar v3.0B Branches. `bclr` with LK=1, for example, will always update LR,
whereas `sv.bclr/lru` will only update LR if the branch succeeds.

Of special interest is that when using ALL Mode (Great Big AND
of all Condition Tests), if `VL=0`,
which is rare but can occur in Data-Dependent Modes, the Branch
will always take place because there will be no failing Condition
Tests to prevent it. Likewise when not using ALL Mode (Great Big OR
of all Condition Tests) and `VL=0` the Branch is guaranteed not
to occur because there will be no *successful* Condition Tests
to make it happen.

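The `VL=0` corner case follows directly from the identity elements of AND and OR. As a small illustration (the helper name `branch_taken` is hypothetical and not taken from the specification), Python's built-in `all` and `any` behave the same way on an empty sequence:

```python
def branch_taken(el_cond_ok, all_mode):
    """Reduce per-element test results to a single branch decision."""
    # all([]) is True and any([]) is False, matching the VL=0 cases:
    # an empty Great Big AND succeeds, an empty Great Big OR fails.
    return all(el_cond_ok) if all_mode else any(el_cond_ok)


no_elements = []  # models VL=0: nothing is tested
print(branch_taken(no_elements, all_mode=True))
print(branch_taken(no_elements, all_mode=False))
```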
# Vectorised CR Field numbering, and Scalar behaviour

It is important to keep in mind that just like all SVP64 instructions,
the `BI` field of the base v3.0B Branch Conditional instruction
may be extended by SVP64 EXTRA augmentation, as well as be marked
as either Scalar or Vector. It is also crucially important to keep in mind
that for CRs, SVP64 sequentially increments the CR *Field* numbers.
CR *Fields* are treated as elements, not bit-numbers of the CR *register*.

The `BI` operand of Branch Conditional operations is five bits: in scalar
v3.0B it selects one bit of the 32-bit CR,
comprising eight CR Fields of 4 bits each. In SVP64 there are
16 32-bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
`BI` select the bit from the CR Field (LT GT EQ SO), and the top 3 bits
are extended to either scalar or vector and to select CR Fields 0..127
as specified in SVP64 [[sv/svp64/appendix]].

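The split of the unaugmented 5-bit `BI` operand can be sketched as follows. This is only an illustration of the scalar decomposition: `split_bi` is a hypothetical helper, and the SVP64 EXTRA extension of the upper three bits to Fields 0..127 is defined in the appendix and deliberately not modelled here.

```python
# Bit order within one 4-bit CR Field in the Power ISA
CR_FIELD_BITS = ("LT", "GT", "EQ", "SO")


def split_bi(bi):
    """Split a scalar 5-bit BI into (CR Field number, bit name)."""
    if not 0 <= bi <= 31:
        raise ValueError("BI is a 5-bit operand")
    cr_field = bi >> 2                    # top 3 bits: CR Field 0..7
    bit_name = CR_FIELD_BITS[bi & 0b11]   # 2 LSBs: bit within the Field
    return cr_field, bit_name


# 16 CRs of 32 bits, each holding eight 4-bit Fields
TOTAL_CR_FIELDS = 16 * (32 // 4)

print(split_bi(2))         # BI=2 selects the EQ bit of CR Field 0
print(TOTAL_CR_FIELDS)
```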
When the CR Field selected by SVP64-Augmented `BI` is marked as scalar,
the usual SVP64 rules apply:
the Vector loop ends at the first element tested
(the first CR *Field*), after taking
predication into consideration. Thus, also as usual, when a predicate mask is
given, `BI` is marked as scalar, and `sz` is zero, srcstep
skips forward to the first non-zero predicated element, and only that
one element is tested.

In other words, the fact that this is a Branch
Operation (instead of an arithmetic one) does not result, ultimately,
in significant changes as to
how SVP64 is fundamentally applied, except with respect to:

* the unique properties associated with conditionally
  changing the Program
  Counter (aka "a Branch"), resulting in early-out
  opportunities
* CTR-testing

Both are outlined below.

# Horizontal-First and Vertical-First Modes

In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
AND) results in early exit: no more updates to CTR occur (if requested);
no branch occurs, and LR is not updated (if requested). Likewise for
non-ALL mode (Great Big OR) on first success early exit also occurs,
however this time with the Branch proceeding. In both cases the testing
of the Vector of CRs should be done in linear sequential order (or in
REMAP re-sequenced order), such that tests that are sequentially beyond
the exit point are *not* carried out. (*Note: it is standard practice in
Programming languages to exit early from conditional tests, however it is
a little unusual to consider in an ISA that is designed for Parallel
Vector Processing. The reason is to have strictly-defined guaranteed
behaviour.*)

In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
behaviour. Given that only one element is being tested at a time
in Vertical-First Mode, a test designed to be done on multiple
bits is meaningless.

# Description and Modes

Predication in both INT and CR modes may be applied to `sv.bc` and other
SVP64 Branch Conditional operations, exactly as they may be applied to
other SVP64 operations. When `sz` is zero, any masked-out Branch-element
operations are not included in condition testing, exactly like all other
SVP64 operations, *including* side-effects such as potentially updating
LR or CTR, which will also be skipped. There is *one* exception here,
which is when
`BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
predicate mask bit is also zero:
under these special circumstances CTR will also decrement.

When `sz` is non-zero, this normally requests insertion of a zero
in place of the input data, when the relevant predicate mask bit is zero.
This would mean that a zero is inserted in place of `CR[BI+32]` for
testing against `BO`, which may not be desirable in all circumstances.
Therefore, an extra field, `SNZ`, is provided which, if set, will insert
a **one** in place of a masked-out element, instead of a zero.

(*Note: Both options are provided because it is useful to deliberately
cause the Branch-Conditional Vector testing to fail at a specific point,
controlled by the Predicate mask. This is particularly useful in `VLSET`
mode, which will truncate SVSTATE.VL at the point of the first failed
test.*)

Normally, CTR mode decrements once per Condition Test, so that
under normal circumstances CTR reduces by up to VL in Horizontal-First
Mode. Just as when v3.0B Branch-Conditional saves at
least one instruction on tight inner loops through auto-decrementation
of CTR, likewise it is also possible to save instruction count for
SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
in circumstances where there is conditional interaction between the
element computation and testing, and the continuation (or otherwise)
of a given loop. The potential combinations of interactions are why CTR
testing options have been added.

Also, the unconditional bit `BO[0]` is still relevant when Predication
is applied to the Branch because in `ALL` mode all nonmasked bits have
to be tested, and when `sz=0` skipping occurs.
Even when VLSET mode is not used, CTR
may still be decremented by the total number of nonmasked elements,
acting in effect as either a popcount or cntlz depending on which
mode bits are set.
In short, Vectorised Branch becomes an extremely powerful tool.

## CTR-test

Where a standard Scalar v3.0B branch unconditionally decrements
CTR when `BO[2]` is clear, CTR-test Mode introduces more flexibility
which allows CTR to be used for many more types of Vector loop
constructs.

CTR-test mode and CTi interact as follows. Note that
`BO[2]` is still required to be clear for CTR decrements to be
considered, exactly as is the case in Scalar Power ISA v3.0B.

* **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
  if `BO[2]` is zero. Masked-out elements when `sz=0` are
  skipped (i.e. CTR is *not* decremented when the predicate
  bit is zero and `sz=0`).
* **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
  if `BO[2]` is zero and a masked-out element is skipped
  (`sz=0` and predicate bit is zero). This one special case is the
  **opposite** of other combinations, as well as being
  completely different from normal SVP64 `sz=0` behaviour.
* **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
  if `BO[2]` is zero and the Condition Test succeeds.
  Masked-out elements when `sz=0` are skipped (including
  not decrementing CTR).
* **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
  if `BO[2]` is zero and the Condition Test *fails*.
  Masked-out elements when `sz=0` are skipped (including
  not decrementing CTR).

`CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the
only time in the entirety of SVP64 that has side-effects when
a predicate mask bit is clear. **All** other SVP64 operations
entirely skip an element when `sz=0` and a predicate mask bit is zero.
It is also critical to emphasise that in this unusual mode,
no other side-effects occur: **only** CTR is decremented, i.e. the
rest of the Branch operation is skipped.

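The four combinations listed above can be collected into a single decision function. This is a hedged sketch rather than normative behaviour: the name `ctr_decrements` and its parameters are invented for illustration, and it assumes that an unmasked element with `CTR-test=0` decrements CTR regardless of `CTi`, which the list does not state explicitly.

```python
def ctr_decrements(bo2, ctr_test, cti, cond_ok=False, skipped=False):
    """Decide whether CTR decrements for one element.

    bo2:      BO[2] bit; CTR is only ever touched when it is clear
    ctr_test: CTR-test Mode enabled
    cti:      CTR inversion bit
    cond_ok:  result of the element's Condition Test (unused if skipped)
    skipped:  element is masked out (predicate bit zero) with sz=0
    """
    if bo2:
        return False
    if skipped:
        # The single special case: CTR-test=0, CTi=1 decrements on a skip
        return bool((not ctr_test) and cti)
    if not ctr_test:
        # Assumption: plain per-element decrement, independent of CTi
        return True
    # CTR-test Mode: decrement on success (CTi=0) or on failure (CTi=1)
    return bool(cond_ok) != bool(cti)


print(ctr_decrements(bo2=0, ctr_test=0, cti=1, skipped=True))
print(ctr_decrements(bo2=0, ctr_test=1, cti=1, cond_ok=False))
```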
# VLSET Mode

VLSET Mode truncates the Vector Length so that subsequent instructions
operate on a reduced Vector Length. This is similar to
Data-dependent Fail-First and LD/ST Fail-First, where for VLSET the
truncation occurs at the Branch decision-point.

Interestingly, due to the side-effects of `VLSET` mode
it is actually useful to use Branch Conditional even
to perform no actual branch operation, i.e. to point to the instruction
after the branch. Truncation of VL would thus conditionally occur yet control
flow alteration would not.

`VLSET` mode with Vertical-First is particularly unusual. Vertical-First
is designed to be used for explicit looping, where an explicit call to
`svstep` is required to move both srcstep and dststep on to
the next element, until VL (or other condition) is reached.
Vertical-First Looping is expected (required) to terminate if the end
of the Vector, VL, is reached. If however that loop is terminated early
because VL is truncated, VLSET with Vertical-First becomes meaningless.
Resolving this would require two branches: one Conditional, the other
branching unconditionally to create the loop, where the Conditional
one jumps over it.

Therefore, with `VSb`, the option to decide whether truncation should occur if the
branch succeeds *or* if the branch condition fails allows for the flexibility
required. This allows a Vertical-First Branch to *either* be used as
a branch-back (loop) *or* as part of a conditional exit or function
call from *inside* a loop, and for VLSET to be integrated into both
types of decision-making.

In the case of a Vertical-First branch-back (loop), with `VSb=0` the branch takes
place if success conditions are met, but on exit from that loop
(branch condition fails), VL will be truncated. This is extremely
useful.

`VLSET` mode with Horizontal-First when `VSb=0` is still
useful, because it can be used to truncate VL to the first predicated
(non-masked-out) element.

The truncation point for VL, when VLI is clear, must not include skipped
elements that preceded the current element being tested.
Example: `sz=0, VLI=0, predicate mask = 0b110010` and the Condition
Register failure point is at CR Field element 4.

* Testing at element 0 is skipped because its predicate bit is zero
* Testing at element 1 passed
* Testing at elements 2 and 3 is skipped because their
  respective predicate mask bits are zero
* Testing at element 4 fails, therefore VL is truncated to **2**,
  not 4, due to elements 2 and 3 being skipped.

If `sz=1` in the above example *then* VL would have been set to 4 because
in zeroing mode the zero'd elements are still effectively part of the
Vector (with their respective elements set to `SNZ`).

If `VLI=1` then VL would be set to 5 regardless of sz, due to being inclusive
of the element actually being tested.

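The worked example can be reproduced with a short model. This is a sketch inferred purely from the three cases in the example above, not a normative definition: `truncated_vl` and its parameters are hypothetical names, and the rule used for `sz=0, VLI=0` (back up over the skipped elements immediately preceding the truncation point) is an assumption that happens to match the numbers given.

```python
def truncated_vl(mask, trunc_at, sz, vli):
    """Return the new VL when truncation is triggered at element trunc_at.

    mask:     predicate mask, bit N set means element N is active
    trunc_at: index of the element whose test triggered truncation
    sz:       zeroing enabled (masked-out elements are tested as SNZ)
    vli:      VL-inclusive, keep the triggering element in the Vector
    """
    if vli:
        return trunc_at + 1
    if sz:
        # Zeroed elements still count as part of the Vector
        return trunc_at
    vl = trunc_at
    # Assumption: drop skipped elements directly before the trigger point
    while vl > 0 and not (mask >> (vl - 1)) & 1:
        vl -= 1
    return vl


print(truncated_vl(0b110010, 4, sz=False, vli=False))  # the sz=0, VLI=0 case
print(truncated_vl(0b110010, 4, sz=True, vli=False))   # the sz=1 case
print(truncated_vl(0b110010, 4, sz=False, vli=True))   # the VLI=1 case
```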
## VLSET and CTR-test combined

If both CTR-test and VLSET Modes are requested, it is important to
observe the correct order. What occurs depends on whether VLI
is enabled, because VLI affects the length, VL.

If VLI (VL truncate inclusive) is set:

1. compute the test including whether CTR triggers
2. (optionally) decrement CTR
3. (optionally) truncate VL (VSb inverts the decision)
4. decide (based on step 1) whether to terminate looping
   (including not executing step 5)
5. decide whether to branch.

If VLI is clear, then when a test fails that element
and any following it
should **not** be considered part of the Vector. Consequently:

1. compute the branch test including whether CTR triggers
2. if the test fails against VSb, truncate VL to the *previous*
   element, and terminate looping. No further steps executed.
3. (optionally) decrement CTR
4. decide whether to branch.

# Boolean Logic combinations

There are an extraordinary number of different combinations which
provide completely different and useful behaviour.
Available options to combine:

* `BO[0]` to make an unconditional branch would seem irrelevant if
  it were not for predication and for side-effects (CTR Mode
  for example)
* Enabling CTR-test Mode and setting `BO[2]` can still result in the
  Branch
  taking place, not because the Condition Test itself failed, but
  because CTR reached zero **because**, as required by CTR-test mode,
  CTR was decremented as a **result** of Condition Tests failing.
* `BO[1]` to select whether the CR bit being tested is zero or nonzero
* `R30` and `~R30` and other predicate mask options including CR and
  inverted CR bit testing
* `sz` and `SNZ` to insert either zeros or ones in place of masked-out
  predicate bits
* `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
  `OR` of all tests, respectively.
* Predicate Mask bits, which combine in effect with the CR being
  tested.
* Inversion of Predicate Masks (`~r3` instead of `r3`, or using
  `NE` rather than `EQ`) which results in an additional
  level of possible ANDing, ORing etc. that would otherwise
  need explicit instructions.

The most obviously useful combinations here are to set `BO[1]` to zero
so that every element test is inverted. By De Morgan's laws, `ALL` then
acts as a Great-Big-NOR of the CR bits being tested (every bit is zero)
and `ANY` as a Great-Big-NAND (at least one bit is zero).
Other Mode bits which perform behavioural inversion then
have to work round the fact that the Condition Testing is NOR or NAND.
The alternative to having additional behavioural inversion
(`SNZ`, `VSb`, `CTi`) would be to have a second (unconditional)
branch directly after the first, which the first branch jumps over.
This contrived construct is avoided by the behavioural inversion bits.

# Pseudocode and examples

For comparative purposes this is a copy of the v3.0B bc pseudocode,
noting that M and AA have not been added to the SVP64 versions
for simplicity of illustration. ctr_ok does not appear in the SVP64
versions because of the way that CTRtest Mode interacts.

```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else NIA <-iea CIA + EXTS(BD || 0b00)
if LK then LR <-iea CIA + 4
```

Simplified pseudocode including LRu and CTR skipping, which illustrates
clearly that SVP64 Scalar Branches (VL=1) are **not** identical to
v3.0B Scalar Branches. The key areas where differences occur are in
when and why CTR is decremented (CTRtest Mode) and whether LR is
updated (which is unconditional in v3.0B when LK=1, and conditional
in SVP64 when LRu=1).

```
if (mode_is_64bit) then M <- 0
else M <- 32
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
ctrdec <- ¬BO[2]
# CTR-test Mode: decrement only on success (CTi=0) or only on
# failure (CTi=1), so suppress the decrement when cond_ok equals CTi
if CTRtest & (cond_ok = CTi) then
    ctrdec <- 0b0
if ctrdec then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
# LRu=0: LR update is unconditional (as v3.0B); LRu=1: only on success
lr_ok <- ¬SVRMmode.LRu
if ctr_ok & cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else NIA <-iea CIA + EXTS(BD || 0b00)
    lr_ok <- 0b1
if LK & lr_ok then LR <-iea CIA + 4
```

Below is the pseudocode for SVP64 Branches, which is a little less
obvious but identical to the above. The lack of obviousness is down
to the early-exit opportunities.

Pseudocode for Horizontal-First Mode:

```
# Great Big AND starts at 1, Great Big OR starts at 0
cond_ok = SVRMmode.ALL
for srcstep in range(VL):
    # select predicate bit or zero/one
    if predicate[srcstep]:
        # get SVP64 extended CR field 0..127
        SVCRf = SVP64EXTRA(BI>>2)
        CRbits = CR{SVCRf}
        testbit = CRbits[BI & 0b11]
        # testbit = CR[BI+32+srcstep*4]
    else if not SVRMmode.sz:
        # inverted CTR test skip mode (CTR-test=0, CTi=1)
        if ¬BO[2] & ¬CTRtest & CTi then
            CTR = CTR - 1
        continue
    else
        testbit = SVRMmode.SNZ
    # actual element test here
    el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
    # merge in the test
    if SVRMmode.ALL:
        cond_ok &= el_cond_ok
    else
        cond_ok |= el_cond_ok
    # test for VL to be set (and exit)
    if VLSET and VSb = el_cond_ok then
        if SVRMmode.VLI
            SVSTATE.VL = srcstep+1
        else
            SVSTATE.VL = srcstep
        break
    # early exit?
    if SVRMmode.ALL:
        if ~el_cond_ok:
            break
    else
        if el_cond_ok:
            break
    if SVCRf.scalar:
        break
# loop finally done, now test if branch (and update LR)
# LRu=0: LR update is unconditional; LRu=1: only if the branch succeeds
lr_ok <- ¬SVRMmode.LRu
if cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else NIA <-iea CIA + EXTS(BD || 0b00)
    lr_ok <- 0b1
if LK & lr_ok then LR <-iea CIA + 4
```

Pseudocode for Vertical-First Mode:

```
# get SVP64 extended CR field 0..127
SVCRf = SVP64EXTRA(BI>>2)
CRbits = CR{SVCRf}
# select predicate bit or zero/one
if predicate[srcstep]:
    if BRc = 1 then # CR0 vectorised
        CR{SVCRf+srcstep} = CRbits
    testbit = CRbits[BI & 0b11]
else if not SVRMmode.sz:
    # inverted CTR test skip mode (CTR-test=0, CTi=1)
    if ¬BO[2] & ¬CTRtest & CTi then
        CTR = CTR - 1
    SVSTATE.srcstep = new_srcstep
    exit # no branch testing
else
    testbit = SVRMmode.SNZ
# actual element test here
cond_ok <- BO[0] | ¬(testbit ^ BO[1])
# test for VL to be set (and exit)
if VLSET and cond_ok = VSb then
    if SVRMmode.VLI
        SVSTATE.VL = new_srcstep+1
    else
        SVSTATE.VL = new_srcstep
```

# Example Shader code

```
while(a > 2) {
    if(b < 5)
        f();
    else
        g();
    h();
}
```

which compiles to something like:

```
vec<i32> a, b;
// ...
pred loop_pred = a > 2;
while(loop_pred.any()) {
    pred if_pred = loop_pred & (b < 5);
    if(if_pred.any()) {
        f(if_pred);
    }
label1:
    pred else_pred = loop_pred & ~if_pred;
    if(else_pred.any()) {
        g(else_pred);
    }
    h(loop_pred);
}
```

which will end up as:

```
    sv.cmpi CR60.v, a.v, 2     # vector compare a into CR60 vector
    sv.crweird r30, CR60.GT    # transfer GT vector to r30
while_loop:
    sv.cmpi CR80.v, b.v, 5     # vector compare b into CR80 vector
    sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
    # only calculate loop_pred & pred_b because needed in f()
    sv.crand CR80.v.SO, CR60.v.GT, CR80.v.LT # if = loop & pred_b
    f(CR80.v.SO)
skip_f:
    # illustrate inversion of pred_b. invert r30, test ALL
    # rather than SOME, but masked-out zero test would FAIL,
    # therefore masked-out instead is tested against 1 not 0
    sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
    # else = loop & ~pred_b, need this because used in g()
    sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.v.LT
    g(CR80.v.SO)
skip_g:
    # conditionally call h(r30) if any loop pred set
    sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
    sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
```
646 sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
647 ```