[[!tag standards]]
# SVP64 Branch Conditional behaviour

**DRAFT STATUS**

Please note: although similar, SVP64 Branch instructions should be
considered completely separate and distinct from
standard scalar OpenPOWER-approved v3.0B branches.
**v3.0B branches are in no way impacted, altered,
changed or modified in any way, shape or form by
the SVP64 Vectorised Variants**. It is
extremely important to note that Branches are the
sole semi-exception in SVP64 to `Scalar Identity Behaviour`.
SVP64 Branches contain additional modes that are useful
for scalar operations (i.e. even when VL=1 or when
using single-bit predication).

Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=664>
* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html>
* [[openpower/isa/branch]]

# Rationale

Scalar v3.0B Branch Conditional operations, `bc`, `bctar` etc. test a
Condition Register. However for parallel processing it is impossible
to perform multiple independent branches: the Program Counter
cannot branch to multiple destinations based on multiple conditions.
The best that can be done is
to test multiple Conditions and make a *single* branch decision,
based on analysis of a *Vector* of CR Fields
which have just been calculated from a *Vector* of results.

In 3D Shader
binaries, which are inherently parallelised and predicated, testing all or
some results and branching based on multiple tests is extremely common,
and a fundamental part of Shader Compilers. Example:
without such multi-condition
test-and-branch, if a predicate mask is all zeros a large batch of
instructions may be masked out to `nop`, and it would waste
CPU cycles to run them. 3D GPU ISAs can test for this scenario
and, with the appropriate predicate-analysis instruction,
jump over fully-masked-out operations, by spotting that
*all* Conditions are false.

Unless Branches are aware and capable of such analysis, additional
instructions would be required which perform Horizontal Cumulative
analysis of Vectorised Condition Register Fields, in order to
reduce the Vector of CR Fields down to one single yes or no
decision that a Scalar-only v3.0B Branch-Conditional could cope with.
Such instructions would be unavoidable, required, and costly
by comparison to a single Vector-aware Branch.
Therefore, in order to be commercially competitive, `sv.bc` and
other Vector-aware Branch Conditional instructions are a high priority
for 3D GPU (and CUDA) workloads.

Given that Power ISA v3.0B is already quite powerful, particularly
the Condition Registers and their interaction with Branches, there
are opportunities to create extremely flexible and compact
Vectorised Branch behaviour. In addition, the side-effects (updating
of CTR, truncation of VL, described below) make it a useful instruction
even if the branch points to the next instruction (no actual branch).

# Overview

When considering an "array" of branch-tests, there are four
primarily-useful modes:
AND, OR, NAND and NOR of all Conditions.
NAND and NOR may be synthesised from AND and OR by
inverting `BO[1]` which just leaves two modes:

* Branch takes place on the **first** CR Field test to succeed
  (a Great Big OR of all condition tests)
* Branch takes place only if **all** CR field tests succeed:
  a Great Big AND of all condition tests

Early-exit is enacted such that the Vectorised Branch does not
perform needless extra tests, which will help reduce reads on
the Condition Register file.

*Note: Early-exit is **MANDATORY** (required) behaviour.
Branches **MUST** exit at the first deciding test (the first failure
for a Great Big AND, the first success for a Great Big OR), for
exactly the same reasons for which it is mandatory in
programming languages doing early-exit: to avoid
damaging side-effects. Speculative testing of Condition
Register Fields is permitted, as is speculative updating
of CTR, as long as, as usual in any Out-of-Order microarchitecture,
that speculative testing is cancelled should an early-exit occur.*

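The two modes and their mandatory early-exit can be sketched in Python.
This is only an illustrative model, not part of the specification: the
function name `vector_branch_decision` and its arguments are hypothetical,
and the per-element Condition Test results are assumed to have already
been reduced to booleans. The second return value shows how many tests
were actually performed before the decision was reached.

```python
def vector_branch_decision(tests, all_mode):
    """Return (branch_taken, elements_tested) for per-element results.

    tests:    sequence of booleans, one per element, in element order
    all_mode: True for a Great Big AND, False for a Great Big OR
    """
    tested = 0
    for ok in tests:
        tested += 1
        if all_mode and not ok:
            return False, tested   # first failure decides a Great Big AND
        if not all_mode and ok:
            return True, tested    # first success decides a Great Big OR
    # nothing decided early: every AND test passed, or every OR test failed
    return all_mode, tested


# the third element is never examined in either call
print(vector_branch_decision([True, False, True], all_mode=True))
print(vector_branch_decision([False, True, True], all_mode=False))
```
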
Additional useful behaviour involves two primary Modes (both of
which may be enabled and combined):

* **VLSET Mode**: identical to Data-Dependent Fail-First Mode
  for Arithmetic SVP64 operations, with more
  flexibility and a close interaction and integration into the
  underlying base Scalar v3.0B Branch instruction.
  Truncation of VL takes place around the early-exit point.
* **CTR-test Mode**: gives much more flexibility over when and why
  CTR is decremented, including options to decrement if a Condition
  test succeeds *or if it fails*.

With these side-effects, basic Boolean Logic Analysis advises that
it is important to provide a means
to enact them each based on whether testing succeeds *or fails*. This
results in a not-insignificant number of additional Mode Augmentation bits,
accompanying VLSET and CTR-test Modes respectively.

Predicate skipping or zeroing may, as usual with SVP64, be controlled
by `sz`.
Where the predicate is masked out and
zeroing is enabled, then in such circumstances
the same Boolean Logic Analysis dictates that
rather than testing only against zero, the option to test
against one is also prudent. This introduces a new
immediate field, `SNZ`, which works in conjunction with
`sz`.

Vectorised Branches can be used
in either SVP64 Horizontal-First or Vertical-First Mode. Essentially,
at an element level, the behaviour is identical in both Modes,
although the `ALL` bit is meaningless in Vertical-First Mode.

It is also important
to bear in mind that, fundamentally, Vectorised Branch-Conditional
is still extremely close to the Scalar v3.0B Branch-Conditional
instructions, and that the same v3.0B Scalar Branch-Conditional
instructions are still
*completely separate and independent*, being unaltered and
unaffected by their SVP64 variants in every conceivable way.

*Programming note: One important point is that SVP64 instructions are 64 bit
(8 bytes not 4). This needs to be taken into consideration when computing
branch offsets: the offset is relative to the start of the instruction,
which **includes** the SVP64 Prefix.*

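The programming note can be illustrated with a small offset calculation.
This is a sketch under stated assumptions: the helper `bd_field` is
hypothetical, the branch is relative (`AA=0`), and `branch_addr` is taken
to be the address of the SVP64 Prefix, i.e. the start of the 8-byte
instruction. The 14-bit `BD` field and the `EXTS(BD || 0b00)` displacement
are those of the v3.0B `bc` encoding. Branching "to the next instruction"
from an `sv.bc` therefore needs a displacement of 8 bytes, not 4.

```python
SV_INSN_BYTES = 8  # 4-byte SVP64 prefix plus 4-byte v3.0B suffix


def bd_field(branch_addr, target_addr):
    """14-bit BD value for a relative branch located at branch_addr.

    branch_addr is assumed to be the address of the SVP64 prefix.
    """
    offset = target_addr - branch_addr
    if offset % 4 != 0:
        raise ValueError("branch targets must be word aligned")
    if not -(1 << 15) <= offset < (1 << 15):
        raise ValueError("displacement does not fit in EXTS(BD || 0b00)")
    return (offset >> 2) & 0x3FFF  # two's complement, low two bits dropped


here = 0x1000
print(bd_field(here, here + SV_INSN_BYTES))   # branch to the next instruction
print(hex(bd_field(here, here - 8)))          # branch back by 8 bytes
```
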
# Format and fields

With element-width overrides being meaningless for Condition
Register Fields, bits 4 thru 7 of SVP64 RM may be used for additional
Mode bits.

SVP64 RM `MODE` (includes `ELWIDTH` and `ELWIDTH_SRC` bits) for Branch
Conditional:

| 4 | 5 | 6 | 7 | 19 | 20 | 21  | 22 23  | description         |
| - | - | - | - | -- | -- | --- |--------|-------------------- |
|ALL|SNZ| / | / | 0  | 0  | /   | LRu sz | normal mode         |
|ALL|SNZ| / |VSb| 0  | 1  | VLI | LRu sz | VLSET mode          |
|ALL|SNZ|CTi| / | 1  | 0  | /   | LRu sz | CTR-test mode       |
|ALL|SNZ|CTi|VSb| 1  | 1  | VLI | LRu sz | CTR-test+VLSET mode |

Brief description of fields:

* **sz=1** if predication is enabled and `sz=1` and a predicate
  element bit is zero, `SNZ` will
  be substituted in place of the CR bit selected by `BI`,
  as the Condition tested.
  Contrast this with
  normal SVP64 `sz=1` behaviour, where *only* a zero is put in
  place of masked-out predicate bits.
* **sz=0** When `sz=0` skipping occurs as usual on
  masked-out elements, but unlike all
  other SVP64 behaviour which entirely skips an element with
  no related side-effects at all, there are certain
  special circumstances where CTR
  may be decremented. See CTR-test Mode, below.
* **ALL** when set, all branch conditional tests must pass in order for
  the branch to succeed. When clear, it is the first sequentially
  encountered successful test that causes the branch to succeed.
  This is identical behaviour to how programming languages perform
  early-exit on Boolean Logic chains.
* **VLI** VLSET is identical to Data-dependent Fail-First mode.
  In VLSET mode, VL *may* (depending on `VSb`) be truncated.
  If VLI (Vector Length Inclusive) is clear,
  VL is truncated to *exclude* the current element, otherwise it is
  included. SVSTATE.MVL is not altered: only VL.
* **LRu**: Link Register Update. When set, Link Register will
  only be updated if the Branch Condition succeeds. This avoids
  destruction of LR during loops (particularly Vertical-First
  ones).
* **VSb** In VLSET Mode, after testing,
  if VSb is set, VL is truncated if the test succeeds. If VSb is clear,
  VL is truncated if a test *fails*. Masked-out (skipped)
  bits are not considered
  part of testing.
* **CTi** CTR inversion. CTR-test Mode normally decrements per element
  tested. CTR inversion decrements if a test *fails*. Only relevant
  in CTR-test Mode.

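As a rough illustration of the table, the mode fields can be pulled out
of an RM value as follows. Several things here are assumptions rather
than statements of the specification: the RM field is taken to be 24 bits
wide with MSB0 numbering (bit 0 is the most significant bit), bit 19 is
read as the CTR-test enable and bit 20 as the VLSET enable (inferred from
the rows of the table), and the helper names `rm_bit` and
`decode_branch_mode` are hypothetical. Fields shown as `/` in the table
are returned as `None`.

```python
RM_WIDTH = 24  # assumption: 24-bit RM, MSB0 bit numbering


def rm_bit(rm, n):
    """Extract bit n of RM, where bit 0 is the most significant bit."""
    return (rm >> (RM_WIDTH - 1 - n)) & 1


def decode_branch_mode(rm):
    """Split an RM value into the Branch Conditional mode fields."""
    ctr_test = rm_bit(rm, 19)  # 1 in both CTR-test rows of the table
    vlset = rm_bit(rm, 20)     # 1 in both VLSET rows of the table
    return {
        "ALL": rm_bit(rm, 4),
        "SNZ": rm_bit(rm, 5),
        "CTi": rm_bit(rm, 6) if ctr_test else None,
        "VSb": rm_bit(rm, 7) if vlset else None,
        "CTR-test": ctr_test,
        "VLSET": vlset,
        "VLI": rm_bit(rm, 21) if vlset else None,
        "LRu": rm_bit(rm, 22),
        "sz": rm_bit(rm, 23),
    }


# build an RM value with ALL, VSb, VLSET, VLI and sz set
rm = sum(1 << (RM_WIDTH - 1 - n) for n in (4, 7, 20, 21, 23))
print(decode_branch_mode(rm))
```
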
LRu and CTR-test modes are where SVP64 Branches subtly differ from
Scalar v3.0B Branches. With `LK=1`, `bclr` for example will always update
LR, whereas `sv.bclr/lru` will only update LR if the branch succeeds.

Of special interest is that when using ALL Mode (Great Big AND
of all Condition Tests), if `VL=0`,
which is rare but can occur in Data-Dependent Modes, the Branch
will always take place because there will be no failing Condition
Tests to prevent it. Likewise when not using ALL Mode (Great Big OR
of all Condition Tests) and `VL=0` the Branch is guaranteed not
to occur because there will be no *successful* Condition Tests
to make it happen.

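The `VL=0` behaviour matches the usual convention for empty AND and OR
reductions, which Python's built-in `all` and `any` also follow. A
minimal illustration (the helper name `branch_taken` is hypothetical and
ignores every side-effect):

```python
def branch_taken(cond_results, all_mode):
    """Reduce per-element test results to one branch decision."""
    return all(cond_results) if all_mode else any(cond_results)


# VL=0: there are no elements, so there are no results to reduce
print(branch_taken([], all_mode=True))    # AND over nothing
print(branch_taken([], all_mode=False))   # OR over nothing
```
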
# Vectorised CR Field numbering, and Scalar behaviour

It is important to keep in mind that just like all SVP64 instructions,
the `BI` field of the base v3.0B Branch Conditional instruction
may be extended by SVP64 EXTRA augmentation, as well as be marked
as either Scalar or Vector. It is also crucially important to keep in mind
that for CRs, SVP64 sequentially increments the CR *Field* numbers.
CR *Fields* are treated as elements, not bit-numbers of the CR *register*.

The `BI` operand of Branch Conditional operations is five bits: in scalar
v3.0B this would select one bit of the 32-bit CR,
comprising eight CR Fields of 4 bits each. In SVP64 there are
sixteen 32-bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
`BI` select the bit from the CR Field (EQ LT GT SO), and the top 3 bits
are extended to either scalar or vector and to select CR Fields 0..127
as specified in SVP64 [[sv/svp64/appendix]].

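A minimal sketch of the `BI` split and of element-based CR Field
numbering. The function names are hypothetical, and SVP64 EXTRA
augmentation (which widens the 3-bit field number to the range 0..127) is
deliberately not modelled: only the unaugmented split and the "one CR
Field per element" rule are shown.

```python
def split_bi(bi):
    """Split the 5-bit BI operand into (CR Field number, bit within Field)."""
    if not 0 <= bi < 32:
        raise ValueError("BI is a 5-bit operand")
    return bi >> 2, bi & 0b11


def element_cr_field(base_field, srcstep, is_vector):
    """CR Field tested for element srcstep: Fields, not bits, are elements."""
    return base_field + srcstep if is_vector else base_field


field, bit = split_bi(0b01110)
print(field, bit)                                   # Field 3, bit 2 of the Field
print(element_cr_field(field, 5, is_vector=True))   # element 5 of the Vector
print(element_cr_field(field, 5, is_vector=False))  # scalar: always the same Field
```
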
When the CR Field selected by SVP64-Augmented `BI` is marked as scalar,
then the usual SVP64 rules apply:
the Vector loop ends at the first element tested
(the first CR *Field*), after taking
predication into consideration. Thus, also as usual, when a predicate mask is
given, and `BI` marked as scalar, and `sz` is zero, srcstep
skips forward to the first non-zero predicated element, and only that
one element is tested.

In other words, the fact that this is a Branch
Operation (instead of an arithmetic one) does not result, ultimately,
in significant changes as to
how SVP64 is fundamentally applied, except with respect to:

* the unique properties associated with conditionally
  changing the Program
  Counter (aka "a Branch"), resulting in early-out
  opportunities
* CTR-testing

Both are outlined below.

# Horizontal-First and Vertical-First Modes

In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
AND) results in early exit: no more updates to CTR occur (if requested);
no branch occurs, and LR is not updated (if requested). Likewise for
non-ALL mode (Great Big OR) on first success early exit also occurs,
however this time with the Branch proceeding. In both cases the testing
of the Vector of CRs should be done in linear sequential order (or in
REMAP re-sequenced order): such that tests that are sequentially beyond
the exit point are *not* carried out. (*Note: it is standard practice in
Programming languages to exit early from conditional tests, however
a little unusual to consider in an ISA that is designed for Parallel
Vector Processing. The reason is to have strictly-defined guaranteed
behaviour*)

In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
behaviour. Given that only one element is being tested at a time
in Vertical-First Mode, a test designed to be done on multiple
bits is meaningless.

# Description and Modes

Predication in both INT and CR modes may be applied to `sv.bc` and other
SVP64 Branch Conditional operations, exactly as they may be applied to
other SVP64 operations. When `sz` is zero, any masked-out Branch-element
operations are not included in condition testing, exactly like all other
SVP64 operations, *including* side-effects such as potentially updating
LR or CTR, which will also be skipped. There is *one* exception here,
which is when
`BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
predicate mask bit is also zero:
under these special circumstances CTR will also decrement.

When `sz` is non-zero, this normally requests insertion of a zero
in place of the input data, when the relevant predicate mask bit is zero.
This would mean that a zero is inserted in place of `CR[BI+32]` for
testing against `BO`, which may not be desirable in all circumstances.
Therefore, an extra field, `SNZ`, is provided which, if set, will insert
a **one** in place of a masked-out element, instead of a zero.

(*Note: Both options are provided because it is useful to deliberately
cause the Branch-Conditional Vector testing to fail at a specific point,
controlled by the Predicate mask. This is particularly useful in `VLSET`
mode, which will truncate SVSTATE.VL at the point of the first failed
test.*)

Normally, CTR mode will decrement once per Condition Test, with the
result that, under normal circumstances, CTR reduces by up to VL in
Horizontal-First Mode. Just as when v3.0B Branch-Conditional saves at
least one instruction on tight inner loops through auto-decrementation
of CTR, likewise it is also possible to save instruction count for
SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
in circumstances where there is conditional interaction between the
element computation and testing, and the continuation (or otherwise)
of a given loop. The potential combinations of interactions are why CTR
testing options have been added.

Also, the unconditional bit `BO[0]` is still relevant when Predication
is applied to the Branch because in `ALL` mode all nonmasked bits have
to be tested, and when `sz=0` skipping occurs.
Even when VLSET mode is not used, CTR
may still be decremented by the total number of nonmasked elements,
acting in effect as either a popcount or cntlz depending on which
mode bits are set.
In short, Vectorised Branch becomes an extremely powerful tool.

## CTR-test

Where a standard Scalar v3.0B branch unconditionally decrements
CTR when `BO[2]` is clear, CTR-test Mode introduces more flexibility
which allows CTR to be used for many more types of Vector loop
constructs.

CTR-test mode and CTi interaction is as follows: note that
`BO[2]` is still required to be clear for CTR decrements to be
considered, exactly as is the case in Scalar Power ISA v3.0B.

* **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
  if `BO[2]` is zero. Masked-out elements when `sz=0` are
  skipped (i.e. CTR is *not* decremented when the predicate
  bit is zero and `sz=0`).
* **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
  if `BO[2]` is zero and a masked-out element is skipped
  (`sz=0` and predicate bit is zero). This one special case is the
  **opposite** of other combinations, as well as being
  completely different from normal SVP64 `sz=0` behaviour.
* **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
  if `BO[2]` is zero and the Condition Test succeeds.
  Masked-out elements when `sz=0` are skipped (including
  not decrementing CTR).
* **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
  if `BO[2]` is zero and the Condition Test *fails*.
  Masked-out elements when `sz=0` are skipped (including
  not decrementing CTR).

`CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the
only time in the entirety of SVP64 that has side-effects when
a predicate mask bit is clear. **All** other SVP64 operations
entirely skip an element when sz=0 and a predicate mask bit is zero.
It is also critical to emphasise that in this unusual mode,
no other side-effects occur: **only** CTR is decremented, i.e. the
rest of the Branch operation is skipped.

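The four bullet points above can be written as a single per-element
decision function. This sketch follows the prose of the bullet list
literally. The function name and argument names are hypothetical, and
`cond_ok` stands for the result of the element's Condition Test (with
`SNZ` already substituted when `sz=1` and the element is masked out).

```python
def ctr_decrements(bo2, ctr_test, cti, masked_out, sz, cond_ok):
    """Does CTR decrement for this element, per the bullet list above?"""
    if bo2:
        return False                  # BO[2] set: CTR is never touched
    skipped = masked_out and not sz   # sz=0 and predicate bit is zero
    if not ctr_test:
        if cti:
            return skipped            # the one special case: only on skips
        return not skipped            # normal: every element actually tested
    if skipped:
        return False                  # CTR-test=1: skipped elements do nothing
    return (not cond_ok) if cti else cond_ok


# CTR-test=0, CTi=1, sz=0 with a masked-out element: CTR still decrements
print(ctr_decrements(bo2=0, ctr_test=0, cti=1, masked_out=True, sz=0,
                     cond_ok=False))
```
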
# VLSET Mode

VLSET Mode truncates the Vector Length so that subsequent instructions
operate on a reduced Vector Length. This is similar to
Data-dependent Fail-First and LD/ST Fail-First, where for VLSET the
truncation occurs at the Branch decision-point.

Interestingly, due to the side-effects of `VLSET` mode
it is actually useful to use Branch Conditional even
to perform no actual branch operation, i.e. to point to the instruction
after the branch. Truncation of VL would thus conditionally occur yet control
flow alteration would not.

`VLSET` mode with Vertical-First is particularly unusual. Vertical-First
is designed to be used for explicit looping, where an explicit call to
`svstep` is required to move both srcstep and dststep on to
the next element, until VL (or other condition) is reached.
Vertical-First Looping is expected (required) to terminate if the end
of the Vector, VL, is reached. If however that loop is terminated early
because VL is truncated, VLSET with Vertical-First becomes meaningless.
Resolving this would require two branches: one Conditional, the other
branching unconditionally to create the loop, where the Conditional
one jumps over it.

Therefore, with `VSb`, the option to decide whether truncation should occur if the
branch succeeds *or* if the branch condition fails allows for the flexibility
required. This allows a Vertical-First Branch to *either* be used as
a branch-back (loop) *or* as part of a conditional exit or function
call from *inside* a loop, and for VLSET to be integrated into both
types of decision-making.

In the case of a Vertical-First branch-back (loop), with `VSb=0` the branch takes
place if success conditions are met, but on exit from that loop
(branch condition fails), VL will be truncated. This is extremely
useful.

`VLSET` mode with Horizontal-First when `VSb=0` is still
useful, because it can be used to truncate VL to the first predicated
(non-masked-out) element.

The truncation point for VL, when VLI is clear, must not include skipped
elements that preceded the current element being tested.
Example: `sz=0, VLI=0, predicate mask = 0b110010` and the Condition
Register failure point is at CR Field element 4.

* Testing at element 0 is skipped because its predicate bit is zero
* Testing at element 1 passed
* Testing of elements 2 and 3 is skipped because their
  respective predicate mask bits are zero
* Testing element 4 fails therefore VL is truncated to **2**
  not 4 due to elements 2 and 3 being skipped.

If `sz=1` in the above example *then* VL would have been set to 4 because
in zeroing mode the zero'd elements are still effectively part of the
Vector (with their respective elements set to `SNZ`).

If `VLI=1` then VL would be set to 5 regardless of sz, due to being inclusive
of the element actually being tested.

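The worked example above can be checked with a few lines of Python. The
helper `truncated_vl` is hypothetical and models only the truncation rule
as described in the prose of this section. It assumes that element `n` is
controlled by bit `n` of the predicate mask counting from the
least-significant bit, which is the reading that makes the example's
mask `0b110010` agree with its bullet points.

```python
def truncated_vl(fail_at, predicate_mask, sz, vli):
    """VL after truncation when the deciding test is at element fail_at."""
    if vli:
        return fail_at + 1            # inclusive of the element being tested
    vl = fail_at                      # exclusive of the element being tested
    if not sz:
        # with sz=0, skipped elements directly before the failure point
        # are not part of the Vector either
        while vl > 0 and not (predicate_mask >> (vl - 1)) & 1:
            vl -= 1
    return vl


mask = 0b110010
print(truncated_vl(4, mask, sz=0, vli=0))   # 2: elements 2 and 3 were skipped
print(truncated_vl(4, mask, sz=1, vli=0))   # 4: zeroed elements still count
print(truncated_vl(4, mask, sz=0, vli=1))   # 5: inclusive of element 4
```
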
## VLSET and CTR-test combined

If both CTR-test and VLSET Modes are requested, it's important to
observe the correct order. What occurs depends on whether VLI
is enabled, because VLI affects the length, VL.

If VLI (VL truncate inclusive) is set:

1. compute the test including whether CTR triggers
2. (optionally) decrement CTR
3. (optionally) truncate VL (VSb inverts the decision)
4. decide (based on step 1) whether to terminate looping
   (including not executing step 5)
5. decide whether to branch.

If VLI is clear, then when a test fails that element
and any following it
should **not** be considered part of the Vector. Consequently:

1. compute the branch test including whether CTR triggers
2. if the test fails against VSb, truncate VL to the *previous*
   element, and terminate looping. No further steps executed.
3. (optionally) decrement CTR
4. decide whether to branch.

# Boolean Logic combinations

In a Scalar ISA, Branch-Conditional testing even of vector
results may be performed through inversion of tests. NOR of
all tests may be performed by inversion of the scalar condition
and branching *out* from the scalar loop around elements,
using scalar operations.

In a parallel (Vector) ISA it is the ISA itself which must perform
the prerequisite logic manipulation.
Thus for SVP64 there are an extraordinary number of necessary combinations
which provide completely different and useful behaviour.
Available options to combine:

* `BO[0]` to make an unconditional branch would seem irrelevant if
  it were not for predication and for side-effects (CTR Mode
  for example)
* Enabling CTR-test Mode with `BO[2]` clear can still result in the
  Branch
  taking place, not because the Condition Test itself failed, but
  because CTR reached zero **because**, as required by CTR-test mode,
  CTR was decremented as a **result** of Condition Tests failing.
* `BO[1]` to select whether the CR bit being tested is zero or nonzero
* `R30` and `~R30` and other predicate mask options including CR and
  inverted CR bit testing
* `sz` and `SNZ` to insert either zeros or ones in place of masked-out
  predicate bits
* `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
  `OR` of all tests, respectively.
* Predicate Mask bits, which combine in effect with the CR being
  tested.
* Inversion of Predicate Masks (`~r3` instead of `r3`, or using
  `NE` rather than `EQ`) which results in an additional
  level of possible ANDing, ORing etc. that would otherwise
  need explicit instructions.

The most obviously useful combinations here are to set `BO[1]` to zero
in order to turn `ALL` into Great-Big-NAND and `ANY` into
Great-Big-NOR. Other Mode bits which perform behavioural inversion then
have to work round the fact that the Condition Testing is NOR or NAND.
Without the additional behavioural inversion bits
(`SNZ`, `VSb`, `CTi`) the alternative would be to have a second
(unconditional) branch directly after the first, which the first branch
jumps over.
This contrivance is avoided by the behavioural inversion bits.

# Pseudocode and examples

Please see [[svp64/appendix]] regarding CR bit ordering and for
the definition of `CR{n}`.

For comparative purposes this is a copy of the v3.0B `bc` pseudocode:

```
if (mode_is_64bit) then M <- 0
else                    M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else       NIA <-iea CIA + EXTS(BD || 0b00)
if LK then LR <-iea CIA + 4
```

Simplified pseudocode including LRu and CTR skipping, which illustrates
clearly that SVP64 Scalar Branches (VL=1) are **not** identical to
v3.0B Scalar Branches. The key areas where differences occur are
the inclusion of predication (which can still be used when VL=1), in
when and why CTR is decremented (CTRtest Mode) and whether LR is
updated (which is unconditional in v3.0B when LK=1, and conditional
in SVP64 when LRu=1).

```
if (mode_is_64bit) then M <- 0
else                    M <- 32
testbit = CR[BI+32]
if ¬predicate_bit then testbit = SVRMmode.SNZ
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(testbit ^ BO[1])
if ¬predicate_bit & ¬SVRMmode.sz then
    if ¬BO[2] & CTRtest & ¬CTi then
        CTR = CTR - 1
    stop # instruction finishes here
if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
# LRu=1: LR is only updated if the branch is taken
lr_ok <- ¬SVRMmode.LRu
if ctr_ok & cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else       NIA <-iea CIA + EXTS(BD || 0b00)
    lr_ok <- 0b1
if LK & lr_ok then LR <-iea CIA + 4
```

Below is the pseudocode for SVP64 Branches, which is a little less
obvious but identical to the above. The lack of obviousness is down
to the early-exit opportunities.

Pseudocode for Horizontal-First Mode:

```
if (mode_is_64bit) then M <- 0
else                    M <- 32
# start from the identity value: 1 for AND (ALL), 0 for OR
cond_ok = SVRMmode.ALL
for srcstep in range(VL):
    # select predicate bit or zero/one
    if predicate[srcstep]:
        # get SVP64 extended CR field 0..127
        SVCRf = SVP64EXTRA(BI>>2)
        CRbits = CR{SVCRf}
        testbit = CRbits[BI & 0b11]
        # testbit = CR[BI+32+srcstep*4]
    else if not SVRMmode.sz:
        # inverted CTR test skip mode
        if ¬BO[2] & CTRtest & ¬CTi then
            CTR = CTR - 1
        continue # skip to next element
    else
        testbit = SVRMmode.SNZ
    # actual element test here
    ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
    el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
    # check if CTR dec should occur
    ctrdec = ¬BO[2]
    if CTRtest & (el_cond_ok ^ CTi) then
        ctrdec = 0b0
    if ctrdec then CTR <- CTR - 1
    # merge in the test
    if SVRMmode.ALL:
        cond_ok &= (el_cond_ok & ctr_ok)
    else
        cond_ok |= (el_cond_ok & ctr_ok)
    # test for VL to be set (and exit)
    if VLSET and VSb = (el_cond_ok & ctr_ok) then
        if SVRMmode.VLI
            SVSTATE.VL = srcstep+1
        else
            SVSTATE.VL = srcstep
        break
    # early exit?
    if SVRMmode.ALL != (el_cond_ok & ctr_ok):
        break
    # SVP64 rules about Scalar registers still apply!
    if SVCRf.scalar:
        break
# loop finally done, now test if branch (and update LR)
# LRu=1: LR is only updated if the branch is taken
lr_ok <- ¬SVRMmode.LRu
if cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else       NIA <-iea CIA + EXTS(BD || 0b00)
    lr_ok <- 0b1
if LK & lr_ok then LR <-iea CIA + 4
```

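For experimentation, the Horizontal-First loop can be transliterated into
runnable Python. This is a sketch, not a reference model, and it rests on
several assumptions: only 64-bit mode is modelled (so `CTR[M:63]` is the
whole of CTR), the CR bit selected for each element is supplied directly
in the hypothetical list `testbits`, `bo` is a 5-entry list indexed like
`BO[0]`..`BO[4]`, and the update of NIA and LR is left out. The CTR-test
conditions are copied from the pseudocode exactly as written. One
deliberate choice: the accumulator starts at the value of `ALL`, so that
`VL=0` branches in ALL mode and does not branch otherwise, as the prose
describes.

```python
def sv_bc_horizontal(vl, predicate, testbits, bo, ctr, *, all_mode=False,
                     snz=0, sz=False, ctr_test=False, cti=False,
                     vlset=False, vsb=False, vli=False, scalar_bi=False):
    """Return (branch_taken, new_ctr, new_vl) for one Horizontal-First branch."""
    cond_ok = bool(all_mode)  # identity value: True for AND, False for OR
    for srcstep in range(vl):
        if predicate[srcstep]:
            testbit = testbits[srcstep]
        elif not sz:
            # masked-out element is skipped; special CTR case as in pseudocode
            if not bo[2] and ctr_test and not cti:
                ctr = (ctr - 1) % 2**64
            continue
        else:
            testbit = snz  # zeroing: substitute SNZ for the CR bit
        # actual element test
        ctr_ok = bool(bo[2]) or ((ctr != 0) != bool(bo[3]))
        el_cond_ok = bool(bo[0]) or not (testbit ^ bo[1])
        # check whether the CTR decrement should occur
        ctrdec = not bo[2]
        if ctr_test and (el_cond_ok != bool(cti)):
            ctrdec = False
        if ctrdec:
            ctr = (ctr - 1) % 2**64
        # merge in the test
        el_ok = el_cond_ok and ctr_ok
        cond_ok = (cond_ok and el_ok) if all_mode else (cond_ok or el_ok)
        # VLSET truncation (and exit)
        if vlset and bool(vsb) == el_ok:
            vl = srcstep + 1 if vli else srcstep
            break
        # mandatory early exit
        if bool(all_mode) != el_ok:
            break
        # a scalar BI tests only one element
        if scalar_bi:
            break
    return cond_ok, ctr, vl


# BO = 0b01100: branch if the CR bit is 1, CTR not involved
BO_TRUE = [0, 1, 1, 0, 0]
print(sv_bc_horizontal(4, [1, 1, 1, 1], [0, 0, 1, 1], BO_TRUE, ctr=0))
print(sv_bc_horizontal(4, [1, 1, 1, 1], [1, 1, 0, 1], BO_TRUE, ctr=0,
                       all_mode=True, vlset=True))
```
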
Pseudocode for Vertical-First Mode:

```
# get SVP64 extended CR field 0..127
SVCRf = SVP64EXTRA(BI>>2)
CRbits = CR{SVCRf}
# select predicate bit or zero/one
if predicate[srcstep]:
    if BRc = 1 then # CR0 vectorised
        CR{SVCRf+srcstep} = CRbits
    testbit = CRbits[BI & 0b11]
else if not SVRMmode.sz:
    # inverted CTR test skip mode
    if ¬BO[2] & CTRtest & ¬CTi then
        CTR = CTR - 1
    SVSTATE.srcstep = new_srcstep
    exit # no branch testing
else
    testbit = SVRMmode.SNZ
# actual element test here
cond_ok <- BO[0] | ¬(testbit ^ BO[1])
# test for VL to be set (and exit)
if VLSET and cond_ok = VSb then
    if SVRMmode.VLI
        SVSTATE.VL = new_srcstep+1
    else
        SVSTATE.VL = new_srcstep
```

# Example Shader code

```
while(a > 2) {
    if(b < 5)
        f();
    else
        g();
    h();
}
```

which compiles to something like:

```
vec<i32> a, b;
// ...
pred loop_pred = a > 2;
while(loop_pred.any()) {
    pred if_pred = loop_pred & (b < 5);
    if(if_pred.any()) {
        f(if_pred);
    }
label1:
    pred else_pred = loop_pred & ~if_pred;
    if(else_pred.any()) {
        g(else_pred);
    }
    h(loop_pred);
}
```

which will end up as:

```
   sv.cmpi CR60.v, a.v, 2   # vector compare a into CR60 vector
   sv.crweird r30, CR60.GT  # transfer GT vector to r30
while_loop:
   sv.cmpi CR80.v, b.v, 5   # vector compare b into CR80 Vector
   sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
   # only calculate loop_pred & pred_b because needed in f()
   sv.crand CR80.v.SO, CR60.v.GT, CR80.v.LT # if = loop & pred_b
   f(CR80.v.SO)
skip_f:
   # illustrate inversion of pred_b. invert r30, test ALL
   # rather than SOME, but masked-out zero test would FAIL,
   # therefore masked-out instead is tested against 1 not 0
   sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
   # else = loop & ~pred_b, need this because used in g()
   sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.v.LT
   g(CR80.v.SO)
skip_g:
   # conditionally call h(r30) if any loop pred set
   sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
   sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
```