# SVP64 Branch Conditional behaviour

Please note: although similar, SVP64 Branch instructions should be
considered completely separate and distinct from standard scalar
OpenPOWER-approved v3.0B branches. **v3.0B branches are in no way
impacted, altered, changed or modified in any way, shape or form by the
SVP64 Vectorized Variants**.

It is also extremely important to note that Branches are the sole
pseudo-exception in SVP64 to `Scalar Identity Behaviour`. SVP64 Branches
contain additional modes that are useful for scalar operations (i.e. even
when VL=1 or when using single-bit predication).

<!-- hide -->
Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=664>
* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004678.html>
* Branch Divergence <https://jbush001.github.io/2014/12/07/branch-divergence-in-parallel-kernels.html>
* [[openpower/isa/branch]]
* [[sv/cr_int_predication]]
* [TODO](https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=fa99590eeb61e63b2d2ea81f303b9b4320e3bbe1)
<!-- show -->

## Rationale

Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test
a Condition Register. However for parallel processing it is simply
impossible to perform multiple independent branches: the Program
Counter simply cannot branch to multiple destinations based on multiple
conditions. The best that can be done is to test multiple Conditions
and make a decision of a *single* branch, based on analysis of a *Vector*
of CR Fields which have just been calculated from a *Vector* of results.

In 3D Shader binaries, which are inherently parallelised and predicated,
testing all or some results and branching based on multiple tests is
extremely common, and a fundamental part of Shader Compilers. Example:
without such multi-condition test-and-branch, if a predicate mask is
all zeros a large batch of instructions may be masked out to `nop`,
and it would waste CPU cycles to run them. 3D GPU ISAs can test for
this scenario and, with the appropriate predicate-analysis instruction,
jump over fully-masked-out operations, by spotting that *all* Conditions
are false.

Unless Branches are aware and capable of such analysis, additional
instructions would be required which perform Horizontal Cumulative
analysis of Vectorized Condition Register Fields, in order to reduce
the Vector of CR Fields down to one single yes or no decision that a
Scalar-only v3.0B Branch-Conditional could cope with. Such instructions
would be unavoidable, required, and costly by comparison to a single
Vector-aware Branch. Therefore, in order to be commercially competitive,
`sv.bc` and other Vector-aware Branch Conditional instructions are a
high priority for 3D GPU (and OpenCL-style) workloads.

Given that Power ISA v3.0B is already quite powerful, particularly
the Condition Registers and their interaction with Branches, there are
opportunities to create extremely flexible and compact Vectorized Branch
behaviour. In addition, the side-effects (updating of CTR, truncation
of VL, described below) make it a useful instruction even if the branch
points to the next instruction (no actual branch).

## Overview

When considering an "array" of branch-tests, there are four
primarily-useful modes: AND, OR, NAND and NOR of all Conditions.
NAND and NOR may be synthesised from AND and OR by inverting `BO[1]`
which just leaves two modes:

* Branch takes place on the **first** CR Field test to succeed
  (a Great Big OR of all condition tests). Exit occurs
  on the first **successful** test.
* Branch takes place only if **all** CR field tests succeed:
  a Great Big AND of all condition tests. Exit occurs
  on the first **failed** test.

Early-exit is enacted such that the Vectorized Branch does not
perform needless extra tests, which will help reduce reads on
the Condition Register file.

*Note: Early-exit is **MANDATORY** (required) behaviour. Branches
**MUST** exit at the first sequentially-encountered failure point,
for exactly the same reasons for which it is mandatory in programming
languages doing early-exit: to avoid damaging side-effects and to provide
deterministic behaviour. Speculative testing of Condition Register
Fields is permitted, as is speculative calculation of CTR, as long as,
as usual in any Out-of-Order microarchitecture, that speculative testing
is cancelled should an early-exit occur. i.e. the speculation must be
"precise": Program Order must be preserved.*

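The two modes and the mandatory early-exit can be modelled in a few lines
of Python. This is an illustrative sketch only: `vector_branch_decision`
and its arguments are hypothetical names, predication, CTR and VLSET are
ignored, and each list entry stands for the outcome of one per-element
Condition Test.

```python
def vector_branch_decision(test_results, all_mode):
    """Return (branch_taken, tests_performed).

    test_results: per-element Condition Test outcomes, in element order.
    all_mode:     True for Great Big AND, False for Great Big OR.
    Hypothetical helper for illustration, not part of the specification.
    """
    tests_performed = 0
    for passed in test_results:
        tests_performed += 1
        if all_mode and not passed:
            # Great Big AND: the first failure decides, no branch
            return False, tests_performed
        if not all_mode and passed:
            # Great Big OR: the first success decides, branch taken
            return True, tests_performed
    # no early exit: AND saw no failure, OR saw no success
    return all_mode, tests_performed


print(vector_branch_decision([False, True, True, True], all_mode=False))  # (True, 2)
print(vector_branch_decision([True, False, True, True], all_mode=True))   # (False, 2)
```

In both calls only two of the four elements are examined, which is the
reduction in Condition Register reads that early-exit is intended to give.
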
Also note that when early-exit occurs in Horizontal-first Mode, srcstep,
dststep etc. are all reset, ready to begin looping from the beginning
for the next instruction. However for Vertical-first Mode srcstep
etc. are incremented "as usual" i.e. an early-exit has no special impact,
regardless of whether the branch occurred or not. This can leave srcstep
etc. in what may be considered an unusual state on exit from a loop and
it is up to the programmer to reset srcstep, dststep etc. to known-good
values *(easily achieved with `setvl`)*.

Additional useful behaviour involves two primary Modes (both of which
may be enabled and combined):

* **VLSET Mode**: identical to Data-Dependent Fail-First Mode
  for Arithmetic SVP64 operations, with more
  flexibility and a close interaction and integration into the
  underlying base Scalar v3.0B Branch instruction.
  Truncation of VL takes place around the early-exit point.
* **CTR-test Mode**: gives much more flexibility over when and why
  CTR is decremented, including options to decrement if a Condition
  test succeeds *or if it fails*.

With these side-effects, basic Boolean Logic Analysis advises that it
is important to provide a means to enact them each based on whether
testing succeeds *or fails*. This results in a not-insignificant number
of additional Mode Augmentation bits, accompanying VLSET and CTR-test
Modes respectively.

Predicate skipping or zeroing may, as usual with SVP64, be controlled by
`sz`. Where the predicate is masked out and zeroing is enabled, then in
such circumstances the same Boolean Logic Analysis dictates that rather
than testing only against zero, the option to test against one is also
prudent. This introduces a new immediate field, `SNZ`, which works in
conjunction with `sz`.

Vectorized Branches can be used in either SVP64 Horizontal-First or
Vertical-First Mode. Essentially, at an element level, the behaviour
is identical in both Modes, although the `ALL` bit is meaningless in
Vertical-First Mode.

It is also important to bear in mind that, fundamentally, Vectorized
Branch-Conditional is still extremely close to the Scalar v3.0B
Branch-Conditional instructions, and that the same v3.0B Scalar
Branch-Conditional instructions are still *completely separate and
independent*, being unaltered and unaffected by their SVP64 variants in
every conceivable way.

*Programming note: One important point is that SVP64 instructions are
64 bit (8 bytes, not 4). This needs to be taken into consideration
when computing branch offsets: the offset is relative to the start of
the instruction, which **includes** the SVP64 Prefix.*

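As a rough illustration of that offset arithmetic, the sketch below lays
out a short instruction sequence and computes the byte offset between two
instruction start addresses. The helper `branch_offset`, the size
constants and the three-instruction program are assumptions made for this
example; encoding the result into the `BD` field is not shown.

```python
SCALAR_SIZE = 4  # plain v3.0B instruction, in bytes
SVP64_SIZE = 8   # SVP64 prefix plus the base instruction, in bytes


def branch_offset(sizes, branch_index, target_index):
    """Byte offset from the start of instruction `branch_index`
    to the start of instruction `target_index`.

    sizes: size in bytes of each instruction, in program order.
    Hypothetical helper for illustration only.
    """
    addresses = [0]
    for size in sizes:
        addresses.append(addresses[-1] + size)
    return addresses[target_index] - addresses[branch_index]


# an SVP64 instruction, a scalar instruction, then an SVP64 branch
program = [SVP64_SIZE, SCALAR_SIZE, SVP64_SIZE]
print(branch_offset(program, branch_index=2, target_index=0))  # -12
print(branch_offset(program, branch_index=2, target_index=3))  # 8
```

The second call shows the "branch to the next instruction" case: the
offset is 8, not 4, because the branch itself carries a prefix.
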
## Format and fields

With element-width overrides being meaningless for Condition Register
Fields, bits 4 through 7 of SVP64 RM may be used for additional Mode bits.

SVP64 RM `MODE` (includes repurposing `ELWIDTH` bits 4-5, and
`ELWIDTH_SRC` bits 6-7 for *alternate* uses) for Branch Conditional:

| 4 | 5 | 6 | 7 | 17 | 18 | 19 | 20 | 21  | 22 23  | description         |
| - | - | - | - | -- | -- | -- | -- | --- | ------ | ------------------- |
|ALL|SNZ| / | / | SL |SLu | 0  | 0  | /   | LRu sz | simple mode         |
|ALL|SNZ| / |VSb| SL |SLu | 0  | 1  | VLI | LRu sz | VLSET mode          |
|ALL|SNZ|CTi| / | SL |SLu | 1  | 0  | /   | LRu sz | CTR-test mode       |
|ALL|SNZ|CTi|VSb| SL |SLu | 1  | 1  | VLI | LRu sz | CTR-test+VLSET mode |

Brief description of fields:

* **sz=1** if predication is enabled and `sz=1` and a predicate
  element bit is zero, `SNZ` will be substituted in place of the CR
  bit selected by `BI`, as the Condition tested. Contrast this with
  normal SVP64 `sz=1` behaviour, where *only* a zero is put in
  place of masked-out predicate bits.
* **sz=0** When `sz=0` skipping occurs as usual on masked-out
  elements, but unlike all other SVP64 behaviour which entirely
  skips an element with no related side-effects at all, there are
  certain special circumstances where CTR may be decremented.
  See CTR-test Mode, below.
* **ALL** when set, all branch conditional tests must pass in order for
  the branch to succeed. When clear, it is the first sequentially
  encountered successful test that causes the branch to succeed.
  This is identical behaviour to how programming languages perform
  early-exit on Boolean Logic chains.
* **VLI** VLSET is identical to Data-dependent Fail-First mode.
  In VLSET mode, VL *may* (depending on `VSb`) be truncated.
  If VLI (Vector Length Inclusive) is clear,
  VL is truncated to *exclude* the current element, otherwise it is
  included. SVSTATE.MVL is not altered: only VL.
* **SL** identical to `LR` except applicable to SVSTATE. If `SL`
  is set, SVSTATE is transferred to SVLR (conditionally on
  whether `SLu` is set).
* **SLu**: SVSTATE Link Update, like `LRu` except applies to SVSTATE.
* **LRu**: Link Register Update, used in conjunction with LK=1
  to make LR update conditional.
* **VSb** In VLSET Mode, after testing,
  if VSb is set, VL is truncated if the test succeeds. If VSb is clear,
  VL is truncated if a test *fails*. Masked-out (skipped)
  bits are not considered part of testing when `sz=0`.
* **CTi** CTR inversion. CTR-test Mode normally decrements per element
  tested. CTR inversion decrements if a test *fails*. Only relevant
  in CTR-test Mode.

LRu and CTR-test modes are where SVP64 Branches subtly differ from
Scalar v3.0B Branches. `sv.bcl` for example will always update LR,
whereas `sv.bcl/lru` makes the LR update conditional on the outcome of
the branch (see the Link Register Update section, below).

Of special interest is that when using ALL Mode (Great Big AND of all
Condition Tests), if `VL=0`, which is rare but can occur in Data-Dependent
Modes, the Branch will always take place because there will be no failing
Condition Tests to prevent it. Likewise when not using ALL Mode (Great
Big OR of all Condition Tests) and `VL=0` the Branch is guaranteed not
to occur because there will be no *successful* Condition Tests to make
it happen.

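The same convention appears in Python's built-in `all()` and `any()`,
which return the identity value of AND and OR respectively when given an
empty sequence. The snippet below is only an analogy for the `VL=0` case,
with an empty list standing in for "no Condition Tests were performed".

```python
# Per-element Condition Test outcomes when VL=0: nothing to test.
no_tests = []

branch_taken_all_mode = all(no_tests)  # AND of nothing: no test failed
branch_taken_any_mode = any(no_tests)  # OR of nothing: no test succeeded

print(branch_taken_all_mode, branch_taken_any_mode)  # True False
```
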
## Vectorized CR Field numbering, and Scalar behaviour

It is important to keep in mind that just like all SVP64 instructions,
the `BI` field of the base v3.0B Branch Conditional instruction may be
extended by SVP64 EXTRA augmentation, as well as be marked as either
Scalar or Vector. It is also crucially important to keep in mind that for
CRs, SVP64 sequentially increments the CR *Field* numbers. CR *Fields*
are treated as elements, not bit-numbers of the CR *register*.

The `BI` operand of Branch Conditional operations is five bits: in scalar
v3.0B this would select one bit of the 32-bit CR, comprising eight CR
Fields of 4 bits each. In SVP64 there are 16 32-bit CRs, containing
128 4-bit CR Fields. Therefore, the 2 LSBs of `BI` select the bit from
the CR Field (EQ LT GT SO), and the top 3 bits are extended to either
scalar or vector and to select CR Fields 0..127 as specified in SVP64
[[sv/svp64/appendix]].

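The split of the scalar five-bit `BI` operand can be sketched as follows,
mirroring the `BI>>2` and `BI & 0b11` expressions used in the pseudocode
later in this document. The function name `split_bi` is hypothetical, and
the SVP64 EXTRA extension of the field number to 0..127 is deliberately
not modelled here.

```python
def split_bi(bi):
    """Split a 5-bit scalar v3.0B BI operand into
    (cr_field_number, bit_within_field)."""
    if not 0 <= bi < 32:
        raise ValueError("BI is a 5-bit field")
    cr_field = bi >> 2          # top 3 bits: which of the 8 CR Fields
    bit_in_field = bi & 0b11    # 2 LSBs: which of the 4 bits in that Field
    return cr_field, bit_in_field


print(split_bi(14))  # (3, 2)
```
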
When the CR Field selected by SVP64-Augmented `BI` is marked as scalar,
then the usual SVP64 rules apply: the Vector loop ends at the first
element tested (the first CR *Field*), after taking predication into
consideration. Thus, also as usual, when a predicate mask is given, and
`BI` marked as scalar, and `sz` is zero, srcstep skips forward to the
first non-zero predicated element, and only that one element is tested.

In other words, the fact that this is a Branch Operation (instead of an
arithmetic one) does not result, ultimately, in significant changes as
to how SVP64 is fundamentally applied, except with respect to:

* the unique properties associated with conditionally
  changing the Program Counter (aka "a Branch"), resulting in early-out
  opportunities
* CTR-testing

Both are outlined below, in later sections.

## Horizontal-First and Vertical-First Modes

In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
AND) results in early exit: no more updates to CTR occur (if requested);
no branch occurs, and LR is not updated (if requested). Likewise for
non-ALL mode (Great Big OR) on first success early exit also occurs,
however this time with the Branch proceeding. In both cases the testing
of the Vector of CRs should be done in linear sequential order (or in
REMAP re-sequenced order): such that tests that are sequentially beyond
the exit point are *not* carried out. (*Note: it is standard practice
in Programming languages to exit early from conditional tests, however a
little unusual to consider in an ISA that is designed for Parallel Vector
Processing. The reason is to have strictly-defined guaranteed behaviour.*)

In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
behaviour. Given that only one element is being tested at a time in
Vertical-First Mode, a test designed to be done on multiple bits is
meaningless.

## Description and Modes

Predication in both INT and CR modes may be applied to `sv.bc` and other
SVP64 Branch Conditional operations, exactly as they may be applied to
other SVP64 operations. When `sz` is zero, any masked-out Branch-element
operations are not included in condition testing, exactly like all other
SVP64 operations, *including* side-effects such as potentially updating
LR or CTR, which will also be skipped. There is *one* exception here,
which is when `BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
predicate mask bit is also zero: under these special circumstances CTR
will also decrement.

When `sz` is non-zero, this normally requests insertion of a zero in
place of the input data, when the relevant predicate mask bit is zero.
This would mean that a zero is inserted in place of `CR[BI+32]` for
testing against `BO`, which may not be desirable in all circumstances.
Therefore, an extra field is provided `SNZ`, which, if set, will insert
a **one** in place of a masked-out element, instead of a zero.

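The effect of `sz` and `SNZ` on what actually gets tested can be sketched
in Python. This is a simplified model under stated assumptions: the
function name `bits_to_test` is invented for this example, each list entry
is the single CR bit selected by `BI` for one element, and CTR side-effects
on skipped elements are ignored.

```python
def bits_to_test(cr_bits, predicate, sz, snz):
    """Return the list of bits that condition testing would see.

    cr_bits:   the BI-selected CR bit of each element (0 or 1)
    predicate: the predicate mask bit of each element (0 or 1)
    sz, snz:   the sz and SNZ mode bits
    """
    tested = []
    for cr_bit, pred_bit in zip(cr_bits, predicate):
        if pred_bit:
            tested.append(cr_bit)    # element enabled: test the real CR bit
        elif sz:
            tested.append(int(snz))  # masked out, zeroing: substitute SNZ
        # masked out with sz=0: the element is skipped entirely
    return tested


cr_bits = [1, 0, 1, 1]
predicate = [1, 0, 1, 0]
print(bits_to_test(cr_bits, predicate, sz=False, snz=False))  # [1, 1]
print(bits_to_test(cr_bits, predicate, sz=True, snz=False))   # [1, 0, 1, 0]
print(bits_to_test(cr_bits, predicate, sz=True, snz=True))    # [1, 1, 1, 1]
```
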
(*Note: Both options are provided because it is useful to deliberately
cause the Branch-Conditional Vector testing to fail at a specific point,
controlled by the Predicate mask. This is particularly useful in `VLSET`
mode, which will truncate SVSTATE.VL at the point of the first failed
test.*)

Normally, CTR mode will decrement once per Condition Test, so that under
normal circumstances CTR reduces by up to VL in Horizontal-First
Mode. Just as when v3.0B Branch-Conditional saves at least one instruction
on tight inner loops through auto-decrementation of CTR, likewise it
is also possible to save instruction count for SVP64 loops in both
Vertical-First and Horizontal-First Mode, particularly in circumstances
where there is conditional interaction between the element computation
and testing, and the continuation (or otherwise) of a given loop. The
potential combinations of interactions are why CTR testing options have
been added.

Also, the unconditional bit `BO[0]` is still relevant when Predication
is applied to the Branch because in `ALL` mode all nonmasked bits have
to be tested, and when `sz=0` skipping occurs. Even when VLSET mode is
not used, CTR may still be decremented by the total number of nonmasked
elements, acting in effect as either a popcount or cntlz depending
on which mode bits are set. In short, Vectorized Branch becomes an
extremely powerful tool.

**Micro-Architectural Implementation Note**: *when implemented on top
of a Multi-Issue Out-of-Order Engine it is possible to pass a copy of
the predicate and the prerequisite CR Fields to all Branch Units, as
well as the current value of CTR at the time of multi-issue, and for
each Branch Unit to compute how many times CTR would be subtracted,
in a fully-deterministic and parallel fashion. A SIMD-based Branch
Unit, receiving and processing multiple CR Fields covered by multiple
predicate bits, would do the exact same thing. Obviously, however, if
CTR is modified within any given loop (mtctr) the behaviour of CTR is
no longer deterministic.*

### Link Register Update

For a Scalar Branch, unconditional updating of the Link Register LR
is useful and practical. However, if a loop of CR Fields is tested,
unconditional updating of LR becomes problematic.

For example when using `bclr` with `LRu=1,LK=0` in Horizontal-First Mode,
LR's value will be unconditionally overwritten after the first element,
such that for execution (testing) of the second element, LR has the value
`CIA+8`. This is covered in the `bclrl` example, in a later section.

The addition of a LRu bit modifies behaviour in conjunction with LK,
as follows:

* `sv.bc` When LRu=0,LK=0, Link Register is not updated
* `sv.bcl` When LRu=0,LK=1, Link Register is updated unconditionally
* `sv.bcl/lru` When LRu=1,LK=1, Link Register will
  only be updated if the Branch Condition fails.
* `sv.bc/lru` When LRu=1,LK=0, Link Register will only be updated if
  the Branch Condition succeeds.

This avoids destruction of LR during loops (particularly Vertical-First
ones).

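The four combinations can be tabulated with a short Python sketch that
transcribes the `lr_ok` handling from the pseudocode later in this
document. The function name `lr_updated` is an invention for this
example; `branch_taken` stands for the final branch decision.

```python
def lr_updated(lk, lru, branch_taken):
    """Return True if LR would be written, following the lr_ok logic:
    start from LK, then invert if LRu is set and the branch is taken."""
    lr_ok = lk
    if branch_taken and lru:
        lr_ok = not lr_ok
    return lr_ok


for lk in (False, True):
    for lru in (False, True):
        print(f"LK={int(lk)} LRu={int(lru)}",
              "taken:", lr_updated(lk, lru, True),
              "not taken:", lr_updated(lk, lru, False))
```

With `LK=1, LRu=1` the sketch reports an update only when the branch is
not taken, and with `LK=0, LRu=1` only when it is taken, matching the
list above.
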
**SVLR and SVSTATE**

For precisely the reasons why `LK=1` was added originally to the Power
ISA, with SVSTATE being a peer of the Program Counter it becomes necessary
to also add an SVLR (SVSTATE Link Register) and corresponding control bits
`SL` and `SLu`.

### CTR-test

Where a standard Scalar v3.0B branch unconditionally decrements CTR when
`BO[2]` is clear, CTR-test Mode introduces more flexibility which allows
CTR to be used for many more types of Vector loop constructs.

CTR-test mode and CTi interaction is as follows: note that `BO[2]`
is still required to be clear for CTR decrements to be considered,
exactly as is the case in Scalar Power ISA v3.0B.

* **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
  if `BO[2]` is zero. Masked-out elements when `sz=0` are
  skipped (i.e. CTR is *not* decremented when the predicate
  bit is zero and `sz=0`).
* **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
  if `BO[2]` is zero and a masked-out element is skipped
  (`sz=0` and predicate bit is zero). This one special case is the
  **opposite** of other combinations, as well as being
  completely different from normal SVP64 `sz=0` behaviour.
* **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
  if `BO[2]` is zero and the Condition Test succeeds.
  Masked-out elements when `sz=0` are skipped (including
  not decrementing CTR).
* **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
  if `BO[2]` is zero and the Condition Test *fails*.
  Masked-out elements when `sz=0` are skipped (including
  not decrementing CTR).

`CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the only
time in the entirety of SVP64 that has side-effects when a predicate mask
bit is clear. **All** other SVP64 operations entirely skip an element
when sz=0 and a predicate mask bit is zero. It is also critical to
emphasise that in this unusual mode, no other side-effects occur: **only**
CTR is decremented, i.e. the rest of the Branch operation is skipped.

### VLSET Mode

VLSET Mode truncates the Vector Length so that subsequent instructions
operate on a reduced Vector Length. This is similar to Data-dependent
Fail-First and LD/ST Fail-First, where for VLSET the truncation occurs
at the Branch decision-point.

Interestingly, due to the side-effects of `VLSET` mode it is actually
useful to use Branch Conditional even to perform no actual branch
operation, i.e. to point to the instruction after the branch. Truncation of
VL would thus conditionally occur yet control flow alteration would not.

`VLSET` mode with Vertical-First is particularly unusual. Vertical-First
is designed to be used for explicit looping, where an explicit call to
`svstep` is required to move both srcstep and dststep on to the next
element, until VL (or other condition) is reached. Vertical-First Looping
is expected (required) to terminate if the end of the Vector, VL, is
reached. If however that loop is terminated early because VL is truncated,
VLSET with Vertical-First becomes meaningless. Resolving this would
require two branches: one Conditional, the other branching unconditionally
to create the loop, where the Conditional one jumps over it.

Therefore, with `VSb`, the option to decide whether truncation should
occur if the branch succeeds *or* if the branch condition fails allows
for the flexibility required. This allows a Vertical-First Branch to
*either* be used as a branch-back (loop) *or* as part of a conditional
exit or function call from *inside* a loop, and for VLSET to be integrated
into both types of decision-making.

In the case of a Vertical-First branch-back (loop), with `VSb=0` the
branch takes place if success conditions are met, but on exit from that
loop (branch condition fails), VL will be truncated. This is extremely
useful.

`VLSET` mode with Horizontal-First when `VSb=0` is still useful, because
it can be used to truncate VL to the first predicated (non-masked-out)
element.

The truncation point for VL, when VLI is clear, must not include skipped
elements that preceded the current element being tested. Example:
`sz=0, VLI=0, predicate mask = 0b110010` and the Condition Register
failure point is at CR Field element 4.

* Testing at element 0 is skipped because its predicate bit is zero
* Testing at element 1 passed
* Testing elements 2 and 3 are skipped because their
  respective predicate mask bits are zero
* Testing element 4 fails therefore VL is truncated to **2**
  not 4 due to elements 2 and 3 being skipped.

If `sz=1` in the above example *then* VL would have been set to 4 because
in zeroing mode the zero'd elements are still effectively part of
the Vector (with their respective elements set to `SNZ`).

If `VLI=1` then VL would be set to 5 regardless of sz, due to being
inclusive of the element actually being tested.

### VLSET and CTR-test combined

If both CTR-test and VLSET Modes are requested, it is important to
observe the correct order. What occurs depends on whether VLI is enabled,
because VLI affects the length, VL.

If VLI (VL truncate inclusive) is set:

1. compute the test including whether CTR triggers
2. (optionally) decrement CTR
3. (optionally) truncate VL (VSb inverts the decision)
4. decide (based on step 1) whether to terminate looping
   (including not executing step 5)
5. decide whether to branch.

If VLI is clear, then when a test fails that element and any following
it should **not** be considered part of the Vector. Consequently:

1. compute the branch test including whether CTR triggers
2. if the test fails against VSb, truncate VL to the *previous*
   element, and terminate looping. No further steps executed.
3. (optionally) decrement CTR
4. decide whether to branch.

## Boolean Logic combinations

In a Scalar ISA, Branch-Conditional testing even of vector results may be
performed through inversion of tests. NOR of all tests may be performed
by inversion of the scalar condition and branching *out* from the scalar
loop around elements, using scalar operations.

In a parallel (Vector) ISA it is the ISA itself which must perform
the prerequisite logic manipulation. Thus for SVP64 there are an
extraordinary number of necessary combinations which provide completely
different and useful behaviour. Available options to combine:

* `BO[0]` to make an unconditional branch would seem irrelevant if
  it were not for predication and for side-effects (CTR Mode
  for example)
* Enabling CTR-test Mode and setting `BO[2]` can still result in the
  Branch taking place, not because the Condition Test itself failed, but
  because CTR reached zero **because**, as required by CTR-test mode,
  CTR was decremented as a **result** of Condition Tests failing.
* `BO[1]` to select whether the CR bit being tested is zero or nonzero
* `R30` and `~R30` and other predicate mask options including CR and
  inverted CR bit testing
* `sz` and `SNZ` to insert either zeros or ones in place of masked-out
  predicate bits
* `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
  `OR` of all tests, respectively.
* Predicate Mask bits, which combine in effect with the CR being
  tested.
* Inversion of Predicate Masks (`~r3` instead of `r3`, or using
  `NE` rather than `EQ`) which results in an additional
  level of possible ANDing, ORing etc. that would otherwise
  need explicit instructions.

The most obviously useful combinations here are to set `BO[1]` to zero
in order to turn `ALL` into Great-Big-NOR and `ANY` into Great-Big-NAND
(by De Morgan's laws, an AND of inverted tests is a NOR of the original
CR bits, and an OR of inverted tests is a NAND).
Other Mode bits which perform behavioural inversion then have to work
round the fact that the Condition Testing is NOR or NAND. Without the
additional behavioural inversion bits (`SNZ`, `VSb`, `CTi`), the
alternative would be to have a second (unconditional) branch directly
after the first, which the first branch jumps over. This contrivance is
avoided by the behavioural inversion bits.

## Pseudocode and examples

Please see [[svp64/appendix]] regarding CR bit ordering and for
the definition of `CR{n}`

For comparative purposes this is a copy of the v3.0B `bc` pseudocode

```
if (mode_is_64bit) then M <- 0
else                    M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else       NIA <-iea CIA + EXTS(BD || 0b00)
if LK then LR <-iea CIA + 4
```

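For readers who prefer executable notation, the v3.0B pseudocode above
can be transcribed into Python. This is a sketch under stated
assumptions, not a reference model: 64-bit mode is assumed (so `M=0`),
`BO` is passed as a list of bits indexed as in the pseudocode, the
function name `scalar_bc` is made up for illustration, and the target
address calculation and LR update are left out.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def scalar_bc(bo, cr_bit, ctr):
    """Return (branch_taken, new_ctr) for a scalar v3.0B bc.

    bo:     list of 5 bits, bo[0]..bo[4], numbered as in the pseudocode
    cr_bit: the CR bit selected by BI (0 or 1)
    ctr:    current 64-bit CTR value
    """
    if not bo[2]:
        ctr = (ctr - 1) & MASK64                  # if ¬BO[2] then CTR <- CTR - 1
    ctr_ok = bool(bo[2]) or ((ctr != 0) != bool(bo[3]))
    cond_ok = bool(bo[0]) or (bool(cr_bit) == bool(bo[1]))
    return ctr_ok and cond_ok, ctr


# decrement CTR, branch while CTR is non-zero, ignore the CR bit
print(scalar_bc([1, 0, 0, 0, 0], cr_bit=0, ctr=3))  # (True, 2)
print(scalar_bc([1, 0, 0, 0, 0], cr_bit=0, ctr=1))  # (False, 0)
# leave CTR alone, branch if the CR bit is set
print(scalar_bc([0, 1, 1, 0, 0], cr_bit=1, ctr=5))  # (True, 5)
```
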
Simplified pseudocode including LRu and CTR skipping, which illustrates
clearly that SVP64 Scalar Branches (VL=1) are **not** identical to
v3.0B Scalar Branches. The key areas where differences occur are the
inclusion of predication (which can still be used when VL=1), in when and
why CTR is decremented (CTRtest Mode) and whether LR is updated (which
is unconditional in v3.0B when LK=1, and conditional in SVP64 when LRu=1).

Inline comments highlight the fact that the Scalar Branch behaviour and
pseudocode is still clearly visible and embedded within the Vectorized
variant:

```
if (mode_is_64bit) then M <- 0
else                    M <- 32
# the bit of CR to test, if the predicate bit is zero,
# is overridden
testbit = CR[BI+32]
if ¬predicate_bit then testbit = SVRMmode.SNZ
# otherwise apart from the override ctr_ok and cond_ok
# are exactly the same
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(testbit ^ BO[1])
if ¬predicate_bit & ¬SVRMmode.sz then
    # this is entirely new: CTR-test mode still decrements CTR
    # even when predicate-bits are zero
    if ¬BO[2] & CTRtest & ¬CTi then
        CTR = CTR - 1
    # instruction finishes here
else
    # usual BO[2] CTR-mode now under CTR-test mode as well
    if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
    # new VLset mode, conditional test truncates VL
    if VLSET and VSb = (cond_ok & ctr_ok) then
        if SVRMmode.VLI then SVSTATE.VL = srcstep+1
        else                 SVSTATE.VL = srcstep
    # usual LR is now conditional, but also joined by SVLR
    lr_ok <- LK
    svlr_ok <- SVRMmode.SL
    if ctr_ok & cond_ok then
        if AA then NIA <-iea EXTS(BD || 0b00)
        else       NIA <-iea CIA + EXTS(BD || 0b00)
        if SVRMmode.LRu then lr_ok <- ¬lr_ok
        if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
    if lr_ok   then LR <-iea CIA + 4
    if svlr_ok then SVLR <- SVSTATE
```

Below is the pseudocode for SVP64 Branches, which is a little less
obvious but identical to the above. The lack of obviousness is down to
the early-exit opportunities.

Effective pseudocode for Horizontal-First Mode:

```
if (mode_is_64bit) then M <- 0
else                    M <- 32
# start from the identity value: 1 for ALL (AND), 0 for ANY (OR),
# so that VL=0 branches in ALL mode and does not branch otherwise
cond_ok = SVRMmode.ALL
for srcstep in range(VL):
    # select predicate bit or zero/one
    if predicate[srcstep]:
        # get SVP64 extended CR field 0..127
        SVCRf = SVP64EXTRA(BI>>2)
        CRbits = CR{SVCRf}
        testbit = CRbits[BI & 0b11]
        # testbit = CR[BI+32+srcstep*4]
    else if not SVRMmode.sz:
        # inverted CTR test skip mode
        if ¬BO[2] & CTRtest & ¬CTi then
            CTR = CTR - 1
        continue # skip to next element
    else
        testbit = SVRMmode.SNZ
    # actual element test here
    ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
    el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
    # check if CTR dec should occur
    ctrdec = ¬BO[2]
    if CTRtest & (el_cond_ok ^ CTi) then
        ctrdec = 0b0
    if ctrdec then CTR <- CTR - 1
    # merge in the test
    if SVRMmode.ALL:
        cond_ok &= (el_cond_ok & ctr_ok)
    else
        cond_ok |= (el_cond_ok & ctr_ok)
    # test for VL to be set (and exit)
    if VLSET and VSb = (el_cond_ok & ctr_ok) then
        if SVRMmode.VLI then SVSTATE.VL = srcstep+1
        else                 SVSTATE.VL = srcstep
        break
    # early exit?
    if SVRMmode.ALL != (el_cond_ok & ctr_ok):
        break
    # SVP64 rules about Scalar registers still apply!
    if SVCRf.scalar:
        break
# loop finally done, now test if branch (and update LR)
lr_ok <- LK
svlr_ok <- SVRMmode.SL
if cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else       NIA <-iea CIA + EXTS(BD || 0b00)
    if SVRMmode.LRu then lr_ok <- ¬lr_ok
    if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
if lr_ok then LR <-iea CIA + 4
if svlr_ok then SVLR <- SVSTATE
```

635 Pseudocode for Vertical-First Mode:
636
637 ```
638 # get SVP64 extended CR field 0..127
639 SVCRf = SVP64EXTRA(BI>>2)
640 CRbits = CR{SVCRf}
641 # select predicate bit or zero/one
642 if predicate[srcstep]:
643 if BRc = 1 then # CR0 vectorized
644 CR{SVCRf+srcstep} = CRbits
645 testbit = CRbits[BI & 0b11]
646 else if not SVRMmode.sz:
647 # inverted CTR test skip mode
648 if ¬BO[2] & CTRtest & ¬CTI then
649 CTR = CTR - 1
650 SVSTATE.srcstep = new_srcstep
651 exit # no branch testing
652 else
653 testbit = SVRMmode.SNZ
654 # actual element test here
655 cond_ok <- BO[0] | ¬(testbit ^ BO[1])
656 # test for VL to be set (and exit)
657 if VLSET and cond_ok = VSb then
658 if SVRMmode.VLI
659 SVSTATE.VL = new_srcstep+1
660 else
661 SVSTATE.VL = new_srcstep
662 ```

### Example Shader code

```
// assume f() g() or h() modify a and/or b
while(a > 2) {
    if(b < 5)
        f();
    else
        g();
    h();
}
```

which compiles to something like:

```
vec<i32> a, b;
// ...
pred loop_pred = a > 2;
// loop continues while any of a elements greater than 2
while(loop_pred.any()) {
    // vector of predicate bits
    pred if_pred = loop_pred & (b < 5);
    // only call f() if at least 1 bit set
    if(if_pred.any()) {
        f(if_pred);
    }
label1:
    // loop mask ANDs with inverted if-test
    pred else_pred = loop_pred & ~if_pred;
    // only call g() if at least 1 bit set
    if(else_pred.any()) {
        g(else_pred);
    }
    h(loop_pred);
}
```
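
The predicated form can be checked against the scalar loop with a small Python model. This is only a sketch: the bodies chosen for `f`, `g` and `h` are hypothetical (the source says only that they modify `a` and/or `b`), and the loop predicate is recomputed at the bottom of the loop, which is an assumption about what the elided parts do.

```python
def f(a, pred):
    # hypothetical body: decrement a by 1 in active lanes
    for i, p in enumerate(pred):
        if p:
            a[i] -= 1


def g(a, pred):
    # hypothetical body: decrement a by 2 in active lanes
    for i, p in enumerate(pred):
        if p:
            a[i] -= 2


def h(b, pred):
    # hypothetical body: increment b by 1 in active lanes
    for i, p in enumerate(pred):
        if p:
            b[i] += 1


def scalar_reference(a, b):
    """Run the original while loop independently on each lane."""
    a, b = list(a), list(b)
    for i in range(len(a)):
        one = [j == i for j in range(len(a))]  # single-lane mask
        while a[i] > 2:
            if b[i] < 5:
                f(a, one)
            else:
                g(a, one)
            h(b, one)
    return a, b


def predicated(a, b):
    """Run all lanes together, using predicate masks as in the compiled form."""
    a, b = list(a), list(b)
    loop_pred = [x > 2 for x in a]
    while any(loop_pred):
        if_pred = [lp and y < 5 for lp, y in zip(loop_pred, b)]
        if any(if_pred):
            f(a, if_pred)
        else_pred = [lp and not ip for lp, ip in zip(loop_pred, if_pred)]
        if any(else_pred):
            g(a, else_pred)
        h(b, loop_pred)
        # ASSUMPTION: the loop test is re-evaluated here on every iteration
        loop_pred = [x > 2 for x in a]
    return a, b


a0, b0 = [5, 1, 8, 3], [4, 0, 6, 2]
print(predicated(a0, b0) == scalar_reference(a0, b0))
```

Each `any()` call corresponds to one point where a single branch decision is made from a whole vector of conditions.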

which will end up as:

```
    # start from while loop test point
    b looptest
while_loop:
    sv.cmpi CR80.v, b.v, 5     # vector compare b into CR80 Vector
    sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
    # only calculate loop_pred & pred_b because needed in f()
    sv.crand CR80.v.SO, CR60.v.GT, CR80.v.LT # if = loop & pred_b
    f(CR80.v.SO)
skip_f:
    # illustrate inversion of pred_b. invert r30, test ALL
    # rather than SOME, but masked-out zero test would FAIL,
    # therefore masked-out instead is tested against 1 not 0
    sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
    # else = loop & ~pred_b, need this because used in g()
    sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.v.LT
    g(CR80.v.SO)
skip_g:
    # conditionally call h(r30) if any loop pred set
    sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
looptest:
    sv.cmpi CR60.v, a.v, 2     # vector compare a into CR60 vector
    sv.crweird r30, CR60.GT    # transfer GT vector to r30
    sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
end:
```
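
The comment before `skip_g` explains why masked-out elements must be tested as 1 when ALL mode is used. A short Python sketch of that point follows; the bit patterns and the helper name `substitute_masked` are hypothetical, and the sketch models only the substitution of masked-out test bits, not the instructions themselves.

```python
def substitute_masked(testbits, mask, snz):
    """Replace the test bit of each masked-out element by the SNZ value."""
    return [bit if m else snz for bit, m in zip(testbits, mask)]


lt_bits = [1, 0, 1, 1]  # hypothetical per-element CR LT bits
mask = [1, 0, 1, 1]     # hypothetical predicate: element 1 is masked out

# ANY-style test: zero-filled masked-out elements cannot cause a false match
any_zero_fill = any(substitute_masked(lt_bits, mask, 0))
# ALL-style test: a zero-filled masked-out element makes the test fail
all_zero_fill = all(substitute_masked(lt_bits, mask, 0))
# ALL-style test with SNZ=1: masked-out elements no longer affect the result
all_one_fill = all(substitute_masked(lt_bits, mask, 1))

print(any_zero_fill, all_zero_fill, all_one_fill)
```

Every active element passes in this data, yet the zero-filled ALL test still fails, which is the situation the comment describes.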

### LRu example

This example shows why LRu is useful in a loop. Consider the following
C code:

```
for (int i = 0; i < 8; i++) {
    if (x < y) break;
}
```

Under these circumstances exiting from the loop is no longer based on
CTR alone: it has become conditional on a CR result as well. Thus it is
desirable that NIA *and* LR only be modified if the conditions are met.

v3.0 pseudocode for `bclrl`:

```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
if LK then LR <-iea CIA + 4
```
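
As an aid to reading the BO tests, the scalar pseudocode above can be transcribed into Python. This is a sketch under stated assumptions: 64-bit mode only (M = 0), the CR bit passed in directly as `cr_bit`, and NIA defaulting to CIA + 4 when the branch is not taken. The names `bo_bit` and `bclr` are hypothetical helpers, not part of any existing simulator.

```python
MASK64 = (1 << 64) - 1


def bo_bit(bo, i):
    """Bit i of the 5-bit BO field in MSB0 numbering (BO[0] is the MSB)."""
    return (bo >> (4 - i)) & 1


def bclr(bo, cr_bit, ctr, lr, cia, lk):
    """Return (nia, ctr, lr) after one scalar bclr/bclrl, 64-bit mode."""
    if not bo_bit(bo, 2):
        ctr = (ctr - 1) & MASK64
    ctr_ok = bo_bit(bo, 2) | ((ctr != 0) ^ bo_bit(bo, 3))
    cond_ok = bo_bit(bo, 0) | (1 - (cr_bit ^ bo_bit(bo, 1)))
    # ASSUMPTION: fall through to the next instruction unless the branch is taken
    nia = (cia + 4) & MASK64
    if ctr_ok and cond_ok:
        nia = lr & ~0b11 & MASK64  # LR[0:61] || 0b00
    # LR is overwritten whenever LK=1, whether or not the branch was taken
    if lk:
        lr = (cia + 4) & MASK64
    return nia, ctr, lr


# branch always (BO=0b10100) with LK=1: NIA comes from the old LR
print(bclr(0b10100, 0, 7, 0x1003, 0x2000, 1))
# decrement CTR, branch if CTR != 0 (BO=0b10000): CTR reaches 0, not taken
print(bclr(0b10000, 0, 1, 0x1000, 0x2000, 0))
```

Note that NIA is computed from the old value of LR before LR is overwritten, exactly as the ordering of the last two pseudocode lines requires.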

The latter part for SVP64 `bclrl` becomes:

```
for i in 0 to VL-1:
    ...
    ...
    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
    lr_ok <- LK
    if ctr_ok & cond_ok then
        NIA <-iea LR[0:61] || 0b00
        if SVRMmode.LRu then lr_ok <- ¬lr_ok
    if lr_ok then LR <-iea CIA + 4
    # if NIA modified exit loop
```

The reason follows from this being a Vector loop: unconditionally
overwriting LR when LK=1 makes `sv.bclrl` ineffective, because the
intention going into the loop is that the branch should be to the copy
of LR set at the *start* of the loop, not halfway through it. However,
if LR changes only when the branch is taken, then it becomes a useful
instruction.
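
The interaction between LK and LRu can be tabulated by transcribing only the `lr_ok` lines of the SVP64 `bclrl` pseudocode into Python. The function name `lr_updated` is hypothetical, and the table is simply what those pseudocode lines compute, not a statement of the intended encoding.

```python
def lr_updated(lk, lru, taken):
    """True if LR is written, following the lr_ok lines of the pseudocode.

    lk    -- the LK bit of the instruction
    lru   -- SVRMmode.LRu
    taken -- whether (ctr_ok & cond_ok) held, i.e. NIA was modified
    """
    lr_ok = bool(lk)
    if taken and lru:
        lr_ok = not lr_ok
    return lr_ok


# print one row per combination of LK, LRu and branch outcome
for lk in (0, 1):
    for lru in (0, 1):
        for taken in (False, True):
            print(lk, lru, taken, lr_updated(lk, lru, taken))
```

With LRu=0 the result depends only on LK, which is the unconditional behaviour of the scalar instruction.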

The following pseudocode should **not** be implemented because it
violates the fundamental principle of SVP64, which is that SVP64 looping
is a thin wrapper around Scalar Instructions. The pseudocode below is
closer to an actual Vector ISA Branch and as such is not at all
appropriate:

```
for i in 0 to VL-1:
    ...
    ...
    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
    if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
# only at the end of looping is LK checked.
# this completely violates the design principle of SVP64
# and would actually need to be a separate (scalar)
# instruction "set LR to CIA+4 but retrospectively"
# which is clearly impossible
if LK then LR <-iea CIA + 4
```

--------

\newpage{}

[[!tag standards]]