[[!tag standards]]
# SVP64 Branch Conditional behaviour

**DRAFT STATUS**

Please note: although similar, SVP64 Branch instructions should be
considered completely separate and distinct from
standard scalar OpenPOWER-approved v3.0B branches.
**v3.0B branches are in no way impacted, altered,
changed or modified in any way, shape or form by
the SVP64 Vectorised Variants**.

It is also
extremely important to note that Branches are the
sole semi-exception in SVP64 to `Scalar Identity Behaviour`.
SVP64 Branches contain additional modes that are useful
for scalar operations (i.e. even when VL=1 or when
using single-bit predication).

Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=664>
* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004678.html>
* [[openpower/isa/branch]]
* [[sv/cr_int_predication]]

# Rationale

Scalar v3.0B Branch Conditional operations (`bc`, `bctar` etc.) test a
Condition Register. For parallel processing, however, it is impossible
to perform multiple independent branches: the Program Counter
cannot branch to multiple destinations based on multiple conditions.
The best that can be done is
to test multiple Conditions and make a decision of a *single* branch,
based on analysis of a *Vector* of CR Fields
which have just been calculated from a *Vector* of results.

In 3D Shader
binaries, which are inherently parallelised and predicated, testing all or
some results and branching based on multiple tests is extremely common,
and a fundamental part of Shader Compilers. Example:
without such multi-condition
test-and-branch, if a predicate mask is all zeros a large batch of
instructions may be masked out to `nop`, and it would waste
CPU cycles to run them. 3D GPU ISAs can test for this scenario
and, with the appropriate predicate-analysis instruction,
jump over fully-masked-out operations, by spotting that
*all* Conditions are false.

Unless Branches are aware and capable of such analysis, additional
instructions would be required which perform Horizontal Cumulative
analysis of Vectorised Condition Register Fields, in order to
reduce the Vector of CR Fields down to one single yes or no
decision that a Scalar-only v3.0B Branch-Conditional could cope with.
Such instructions would be unavoidable, required, and costly
by comparison to a single Vector-aware Branch.
Therefore, in order to be commercially competitive, `sv.bc` and
other Vector-aware Branch Conditional instructions are a high priority
for 3D GPU (and OpenCL-style) workloads.

Given that Power ISA v3.0B is already quite powerful, particularly
the Condition Registers and their interaction with Branches, there
are opportunities to create extremely flexible and compact
Vectorised Branch behaviour. In addition, the side-effects (updating
of CTR, truncation of VL, described below) make it a useful instruction
even if the branch points to the next instruction (no actual branch).

# Overview

When considering an "array" of branch-tests, there are four
primarily-useful modes:
AND, OR, NAND and NOR of all Conditions.
NAND and NOR may be synthesised from AND and OR by
inverting `BO[1]` which just leaves two modes:

* Branch takes place on the **first** CR Field test to succeed
  (a Great Big OR of all condition tests). Exit occurs
  on the first **successful** test.
* Branch takes place only if **all** CR field tests succeed:
  a Great Big AND of all condition tests. Exit occurs
  on the first **failed** test.

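The two modes above behave much like short-circuiting `any()` and `all()`.
The following is only a minimal Python sketch, assuming each element's
Condition Test has already been reduced to a boolean; the function name
`vector_branch_decision` is hypothetical and not part of the specification.
It also reports how many elements were tested, to make the exit point
visible.

```python
def vector_branch_decision(tests, all_mode):
    """Return (branch_taken, elements_tested) for per-element test results.

    tests:    iterable of booleans, one per element, in sequential order
    all_mode: True for the Great Big AND, False for the Great Big OR
    """
    tested = 0
    for ok in tests:
        tested += 1
        if all_mode and not ok:
            return False, tested  # AND: exit on the first failed test
        if not all_mode and ok:
            return True, tested   # OR: exit on the first successful test
    # no early exit: AND succeeded on every element, OR never succeeded
    return all_mode, tested


# OR mode stops at element 1; AND mode stops at the first failure
print(vector_branch_decision([False, True, True], all_mode=False))
print(vector_branch_decision([True, False, True], all_mode=True))
```
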
Early-exit is enacted such that the Vectorised Branch does not
perform needless extra tests, which will help reduce reads on
the Condition Register file.

*Note: Early-exit is **MANDATORY** (required) behaviour.
Branches **MUST** exit at the first sequentially-encountered
failure point, for
exactly the same reasons for which it is mandatory in
programming languages doing early-exit: to avoid
damaging side-effects and to provide deterministic
behaviour. Speculative testing of Condition
Register Fields is permitted, as is speculative calculation
of CTR, as long as, as usual in any Out-of-Order microarchitecture,
that speculative testing is cancelled should an early-exit occur,
i.e. the speculation must be "precise": Program Order must be preserved.*

Also note that when early-exit occurs in Horizontal-first Mode,
srcstep, dststep etc. are all reset, ready to begin looping from the
beginning for the next instruction. However for Vertical-first
Mode srcstep etc. are incremented "as usual" i.e. an early-exit
has no special impact, regardless of whether the branch
occurred or not. This can leave srcstep etc. in what may be
considered an unusual
state on exit from a loop and it is up to the programmer to
reset srcstep, dststep etc. to known-good values
*(easily achieved with `setvl`)*.

Additional useful behaviour involves two primary Modes (both of
which may be enabled and combined):

* **VLSET Mode**: identical to Data-Dependent Fail-First Mode
  for Arithmetic SVP64 operations, with more
  flexibility and a close interaction and integration into the
  underlying base Scalar v3.0B Branch instruction.
  Truncation of VL takes place around the early-exit point.
* **CTR-test Mode**: gives much more flexibility over when and why
  CTR is decremented, including options to decrement if a Condition
  test succeeds *or if it fails*.

With these side-effects, basic Boolean Logic Analysis advises that
it is important to provide a means
to enact them each based on whether testing succeeds *or fails*. This
results in a not-insignificant number of additional Mode Augmentation bits,
accompanying VLSET and CTR-test Modes respectively.

Predicate skipping or zeroing may, as usual with SVP64, be controlled
by `sz`.
Where the predicate is masked out and
zeroing is enabled, then in such circumstances
the same Boolean Logic Analysis dictates that
rather than testing only against zero, the option to test
against one is also prudent. This introduces a new
immediate field, `SNZ`, which works in conjunction with
`sz`.

Vectorised Branches can be used
in either SVP64 Horizontal-First or Vertical-First Mode. Essentially,
at an element level, the behaviour is identical in both Modes,
although the `ALL` bit is meaningless in Vertical-First Mode.

It is also important
to bear in mind that, fundamentally, Vectorised Branch-Conditional
is still extremely close to the Scalar v3.0B Branch-Conditional
instructions, and that the same v3.0B Scalar Branch-Conditional
instructions are still
*completely separate and independent*, being unaltered and
unaffected by their SVP64 variants in every conceivable way.

*Programming note: SVP64 instructions are 64 bits
(8 bytes, not 4). This must be taken into consideration when computing
branch offsets: the offset is relative to the start of the instruction,
which **includes** the SVP64 Prefix.*

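As an illustration of the programming note above, the following Python
sketch computes byte displacements for a prefixed branch. It assumes the
usual layout of a 4-byte SVP64 Prefix followed by a 4-byte suffix; the
helper names are hypothetical and chosen only for this example.

```python
SVP64_INSN_BYTES = 8  # assumption: 4-byte prefix + 4-byte suffix


def branch_displacement(branch_addr, target_addr):
    """Byte displacement measured from the start of the SVP64 Prefix."""
    return target_addr - branch_addr


def fallthrough_addr(branch_addr):
    """Address of the instruction following a prefixed branch."""
    return branch_addr + SVP64_INSN_BYTES


# a branch "to the next instruction" needs a displacement of 8, not 4
print(branch_displacement(0x1000, fallthrough_addr(0x1000)))
```
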
# Format and fields

With element-width overrides being meaningless for Condition
Register Fields, bits 4 thru 7 of SVP64 RM may be used for additional
Mode bits.

SVP64 RM `MODE` (includes repurposing `ELWIDTH` bits 4:5,
and `ELWIDTH_SRC` bits 6:7, for *alternate* uses) for Branch
Conditional:

| 4 | 5 | 6 | 7 | 17 | 18 | 19 | 20 | 21 | 22 23 | description |
| - | - | - | - | -- | -- | -- | -- | --- |--------|----------------- |
|ALL|SNZ| / | / | SL |SLu | 0 | 0 | / | LRu sz | simple mode |
|ALL|SNZ| / |VSb| SL |SLu | 0 | 1 | VLI | LRu sz | VLSET mode |
|ALL|SNZ|CTi| / | SL |SLu | 1 | 0 | / | LRu sz | CTR-test mode |
|ALL|SNZ|CTi|VSb| SL |SLu | 1 | 1 | VLI | LRu sz | CTR-test+VLSET mode |

Brief description of fields:

* **sz=1** if predication is enabled and `sz=1` and a predicate
  element bit is zero, `SNZ` will
  be substituted in place of the CR bit selected by `BI`,
  as the Condition tested.
  Contrast this with
  normal SVP64 `sz=1` behaviour, where *only* a zero is put in
  place of masked-out predicate bits.
* **sz=0** When `sz=0` skipping occurs as usual on
  masked-out elements, but unlike all
  other SVP64 behaviour which entirely skips an element with
  no related side-effects at all, there are certain
  special circumstances where CTR
  may be decremented. See CTR-test Mode, below.
* **ALL** when set, all branch conditional tests must pass in order for
  the branch to succeed. When clear, it is the first sequentially
  encountered successful test that causes the branch to succeed.
  This is identical behaviour to how programming languages perform
  early-exit on Boolean Logic chains.
* **VLI** VLSET is identical to Data-dependent Fail-First mode.
  In VLSET mode, VL *may* (depending on `VSb`) be truncated.
  If VLI (Vector Length Inclusive) is clear,
  VL is truncated to *exclude* the current element, otherwise it is
  included. SVSTATE.MVL is not altered: only VL.
* **SL** identical to `LK` except applicable to SVSTATE. If `SL`
  is set, SVSTATE is transferred to SVLR (conditionally on
  whether `SLu` is set).
* **SLu**: SVSTATE Link Update, like `LRu` except applies to SVSTATE.
* **LRu**: Link Register Update, used in conjunction with LK=1
  to make LR update conditional.
* **VSb** In VLSET Mode, after testing,
  if VSb is set, VL is truncated if the test succeeds. If VSb is clear,
  VL is truncated if a test *fails*. Masked-out (skipped)
  bits are not considered
  part of testing when `sz=0`.
* **CTi** CTR inversion. CTR-test Mode normally decrements per element
  tested. CTR inversion decrements if a test *fails*. Only relevant
  in CTR-test Mode.

LRu and CTR-test modes are where SVP64 Branches subtly differ from
Scalar v3.0B Branches. `sv.bcl` for example will always update LR, whereas
`sv.bc/lru` will only update LR if the branch succeeds.

Of special interest is that when using ALL Mode (Great Big AND
of all Condition Tests), if `VL=0`,
which is rare but can occur in Data-Dependent Modes, the Branch
will always take place because there will be no failing Condition
Tests to prevent it. Likewise when not using ALL Mode (Great Big OR
of all Condition Tests) and `VL=0` the Branch is guaranteed not
to occur because there will be no *successful* Condition Tests
to make it happen.

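The `VL=0` behaviour matches the usual convention for empty AND and OR
reductions, which Python's built-in `all()` and `any()` also follow.
A small illustration only (this is not SVP64 code; the variable names are
made up for the example):

```python
no_tests = []  # VL=0: there are no Condition Tests to perform

branch_in_all_mode = all(no_tests)  # empty AND: nothing failed
branch_in_any_mode = any(no_tests)  # empty OR: nothing succeeded

print(branch_in_all_mode, branch_in_any_mode)
```
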
# Vectorised CR Field numbering, and Scalar behaviour

It is important to keep in mind that just like all SVP64 instructions,
the `BI` field of the base v3.0B Branch Conditional instruction
may be extended by SVP64 EXTRA augmentation, as well as be marked
as either Scalar or Vector. It is also crucially important to keep in mind
that for CRs, SVP64 sequentially increments the CR *Field* numbers.
CR *Fields* are treated as elements, not bit-numbers of the CR *register*.

The `BI` operand of Branch Conditional operations is five bits: in scalar
v3.0B this would select one bit of the 32-bit CR,
comprising eight CR Fields of 4 bits each. In SVP64 there are
16 32-bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
`BI` select the bit from the CR Field (LT GT EQ SO), and the top 3 bits
are extended to either scalar or vector and to select CR Fields 0..127
as specified in SVP64 [[sv/svp64/appendix]].

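The split of `BI` into a field part and a bit part can be sketched in
Python as below. This covers only the un-augmented 5-bit operand: the
EXTRA extension to CR Fields 0..127 is defined in the appendix and is not
modelled here. The name `split_bi` is hypothetical, and the bit names
assume the standard Power ISA ordering within a CR Field.

```python
# assumption: standard Power ISA bit order within a 4-bit CR Field
CR_BIT_NAMES = ("LT", "GT", "EQ", "SO")


def split_bi(bi):
    """Split a 5-bit BI into (base CR Field number, bit name)."""
    if not 0 <= bi < 32:
        raise ValueError("BI is a 5-bit operand")
    field = bi >> 2                     # top 3 bits: CR Field 0..7
    bit = CR_BIT_NAMES[bi & 0b11]       # 2 LSBs: bit within the Field
    return field, bit


print(split_bi(2))   # CR Field 0, EQ
print(split_bi(5))   # CR Field 1, GT
```
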
When the CR Field selected by SVP64-Augmented `BI` is marked as scalar,
then the usual SVP64 rules apply:
the Vector loop ends at the first element tested
(the first CR *Field*), after taking
predication into consideration. Thus, also as usual, when a predicate mask is
given, and `BI` marked as scalar, and `sz` is zero, srcstep
skips forward to the first non-zero predicated element, and only that
one element is tested.

In other words, the fact that this is a Branch
Operation (instead of an arithmetic one) does not result, ultimately,
in significant changes as to
how SVP64 is fundamentally applied, except with respect to:

* the unique properties associated with conditionally
  changing the Program
  Counter (aka "a Branch"), resulting in early-out
  opportunities
* CTR-testing

Both are outlined below, in later sections.

# Horizontal-First and Vertical-First Modes

In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
AND) results in early exit: no more updates to CTR occur (if requested);
no branch occurs, and LR is not updated (if requested). Likewise for
non-ALL mode (Great Big OR) on first success early exit also occurs,
however this time with the Branch proceeding. In both cases the testing
of the Vector of CRs should be done in linear sequential order (or in
REMAP re-sequenced order): such that tests that are sequentially beyond
the exit point are *not* carried out. (*Note: it is standard practice in
Programming languages to exit early from conditional tests, however
a little unusual to consider in an ISA that is designed for Parallel
Vector Processing. The reason is to have strictly-defined guaranteed
behaviour*)

In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
behaviour. Given that only one element is being tested at a time
in Vertical-First Mode, a test designed to be done on multiple
bits is meaningless.

# Description and Modes

Predication in both INT and CR modes may be applied to `sv.bc` and other
SVP64 Branch Conditional operations, exactly as they may be applied to
other SVP64 operations. When `sz` is zero, any masked-out Branch-element
operations are not included in condition testing, exactly like all other
SVP64 operations, *including* side-effects such as potentially updating
LR or CTR, which will also be skipped. There is *one* exception here,
which is when
`BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
predicate mask bit is also zero:
under these special circumstances CTR will also decrement.

When `sz` is non-zero, this normally requests insertion of a zero
in place of the input data, when the relevant predicate mask bit is zero.
This would mean that a zero is inserted in place of `CR[BI+32]` for
testing against `BO`, which may not be desirable in all circumstances.
Therefore, an extra field, `SNZ`, is provided which, if set, will insert
a **one** in place of a masked-out element, instead of a zero.

(*Note: Both options are provided because it is useful to deliberately
cause the Branch-Conditional Vector testing to fail at a specific point,
controlled by the Predicate mask. This is particularly useful in `VLSET`
mode, which will truncate SVSTATE.VL at the point of the first failed
test.*)

Normally, CTR mode will decrement once per Condition Test, so that
under normal circumstances CTR reduces by up to VL in Horizontal-First
Mode. Just as when v3.0B Branch-Conditional saves at
least one instruction on tight inner loops through auto-decrementation
of CTR, likewise it is also possible to save instruction count for
SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
in circumstances where there is conditional interaction between the
element computation and testing, and the continuation (or otherwise)
of a given loop. The potential combinations of interactions are why CTR
testing options have been added.

Also, the unconditional bit `BO[0]` is still relevant when Predication
is applied to the Branch because in `ALL` mode all nonmasked bits have
to be tested, and when `sz=0` skipping occurs.
Even when VLSET mode is not used, CTR
may still be decremented by the total number of nonmasked elements,
acting in effect as either a popcount or cntlz depending on which
mode bits are set.
In short, Vectorised Branch becomes an extremely powerful tool.

**Micro-Architectural Implementation Note**: *when implemented on
top of a Multi-Issue Out-of-Order Engine it is possible to pass
a copy of the predicate and the prerequisite CR Fields to all
Branch Units, as well as the current value of CTR at the time of
multi-issue, and for each Branch Unit to compute how many times
CTR would be subtracted, in a fully-deterministic and parallel
fashion. A SIMD-based Branch Unit, receiving and processing
multiple CR Fields covered by multiple predicate bits, would
do the exact same thing. Obviously, however, if CTR is modified
within any given loop (mtctr) the behaviour of CTR is no longer
deterministic.*

## Link Register Update

For a Scalar Branch, unconditional updating of the Link Register
LR is useful and practical. However, if a loop of CR Fields is
tested, unconditional updating of LR becomes problematic.

For example when using `sv.bclrl` with `LRu=0,LK=1` in Horizontal-First Mode,
LR's value will be unconditionally overwritten after the first element,
such that for execution (testing) of the second element, LR
has the value `CIA+8`. This is covered in the `bclrl` example, in
a later section.

The addition of a LRu bit modifies behaviour in conjunction
with LK, as follows:

* `sv.bc` When LRu=0,LK=0, Link Register is not updated
* `sv.bcl` When LRu=0,LK=1, Link Register is updated unconditionally
* `sv.bcl/lru` When LRu=1,LK=1, Link Register will
  only be updated if the Branch Condition fails.
* `sv.bc/lru` When LRu=1,LK=0, Link Register will only be updated if
  the Branch Condition succeeds.

This avoids
destruction of LR during loops (particularly Vertical-First
ones).

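The four LK/LRu combinations listed above can be written as a small
truth function. This Python sketch mirrors the `lr_ok` logic used in the
pseudocode later on this page (start from LK, invert when the branch is
taken and LRu is set); the function name `lr_updated` is hypothetical.

```python
def lr_updated(lk, lru, branch_taken):
    """Return True if LR is written, following the lr_ok pseudocode."""
    lr_ok = bool(lk)
    if branch_taken and lru:
        lr_ok = not lr_ok  # LRu inverts the decision on a taken branch
    return lr_ok


# print the full table: (LK, LRu, taken) -> LR updated?
for lk in (0, 1):
    for lru in (0, 1):
        for taken in (False, True):
            print(lk, lru, taken, lr_updated(lk, lru, taken))
```
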
**SVLR and SVSTATE**

For precisely the reasons why `LK=1` was added originally to the Power
ISA, with SVSTATE being a peer of the Program Counter it becomes
necessary to also add an SVLR (SVSTATE Link Register)
and corresponding control bits `SL` and `SLu`.

## CTR-test

Where a standard Scalar v3.0B branch unconditionally decrements
CTR when `BO[2]` is clear, CTR-test Mode introduces more flexibility
which allows CTR to be used for many more types of Vector loop
constructs.

CTR-test mode and CTi interaction is as follows: note that
`BO[2]` is still required to be clear for CTR decrements to be
considered, exactly as is the case in Scalar Power ISA v3.0B.

* **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
  if `BO[2]` is zero. Masked-out elements when `sz=0` are
  skipped (i.e. CTR is *not* decremented when the predicate
  bit is zero and `sz=0`).
* **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
  if `BO[2]` is zero and a masked-out element is skipped
  (`sz=0` and predicate bit is zero). This one special case is the
  **opposite** of other combinations, as well as being
  completely different from normal SVP64 `sz=0` behaviour.
* **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
  if `BO[2]` is zero and the Condition Test succeeds.
  Masked-out elements when `sz=0` are skipped (including
  not decrementing CTR).
* **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
  if `BO[2]` is zero and the Condition Test *fails*.
  Masked-out elements when `sz=0` are skipped (including
  not decrementing CTR).

`CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the
only time in the entirety of SVP64 that has side-effects when
a predicate mask bit is clear. **All** other SVP64 operations
entirely skip an element when sz=0 and a predicate mask bit is zero.
It is also critical to emphasise that in this unusual mode,
no other side-effects occur: **only** CTR is decremented, i.e. the
rest of the Branch operation is skipped.

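The four CTR-test/CTi combinations can be sketched as a per-element
decision function. This Python sketch follows the bullet list above as
written (it is one reading of this draft text, not a normative
definition, and it makes no attempt to reproduce the pseudocode later on
this page). In particular it assumes that with `CTR-test=0, CTi=1` CTR
decrements *only* for skipped elements. The function and parameter names
are hypothetical.

```python
def ctr_decrements(bo2, ctr_test, cti, masked_out, sz, cond_ok):
    """Return True if CTR is decremented for this element."""
    if bo2:
        return False                 # BO[2] set: CTR is never touched
    skipped = masked_out and not sz  # sz=0 and predicate bit is zero
    if not ctr_test and cti:
        return skipped               # the one special case: skip decrements
    if skipped:
        return False                 # normal sz=0 skip: no side-effects
    if not ctr_test:
        return True                  # plain per-element decrement
    # CTR-test=1: CTi=0 decrements on success, CTi=1 on failure
    return cond_ok != cti


print(ctr_decrements(False, False, True, masked_out=True, sz=False,
                     cond_ok=False))
```
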
## VLSET Mode

VLSET Mode truncates the Vector Length so that subsequent instructions
operate on a reduced Vector Length. This is similar to
Data-dependent Fail-First and LD/ST Fail-First, where for VLSET the
truncation occurs at the Branch decision-point.

Interestingly, due to the side-effects of `VLSET` mode
it is actually useful to use Branch Conditional even
to perform no actual branch operation, i.e. to point to the instruction
after the branch. Truncation of VL would thus conditionally occur yet control
flow alteration would not.

`VLSET` mode with Vertical-First is particularly unusual. Vertical-First
is designed to be used for explicit looping, where an explicit call to
`svstep` is required to move both srcstep and dststep on to
the next element, until VL (or other condition) is reached.
Vertical-First Looping is expected (required) to terminate if the end
of the Vector, VL, is reached. If however that loop is terminated early
because VL is truncated, VLSET with Vertical-First becomes meaningless.
Resolving this would require two branches: one Conditional, the other
branching unconditionally to create the loop, where the Conditional
one jumps over it.

Therefore, with `VSb`, the option to decide whether truncation should occur if the
branch succeeds *or* if the branch condition fails allows for the flexibility
required. This allows a Vertical-First Branch to *either* be used as
a branch-back (loop) *or* as part of a conditional exit or function
call from *inside* a loop, and for VLSET to be integrated into both
types of decision-making.

In the case of a Vertical-First branch-back (loop), with `VSb=0` the branch takes
place if success conditions are met, but on exit from that loop
(branch condition fails), VL will be truncated. This is extremely
useful.

`VLSET` mode with Horizontal-First when `VSb=0` is still
useful, because it can be used to truncate VL to the first predicated
(non-masked-out) element.

The truncation point for VL, when VLI is clear, must not include skipped
elements that preceded the current element being tested.
Example: `sz=0, VLI=0, predicate mask = 0b110010` and the Condition
Register failure point is at CR Field element 4.

* Testing at element 0 is skipped because its predicate bit is zero
* Testing at element 1 passed
* Testing elements 2 and 3 are skipped because their
  respective predicate mask bits are zero
* Testing element 4 fails therefore VL is truncated to **2**
  not 4 due to elements 2 and 3 being skipped.

If `sz=1` in the above example *then* VL would have been set to 4 because
in zeroing mode the zero'd elements are still effectively part of the
Vector (with their respective elements set to `SNZ`).

If `VLI=1` then VL would be set to 5 regardless of sz, due to being inclusive
of the element actually being tested.

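The worked example above can be reproduced with a short Python sketch.
It encodes one interpretation of the rule (with `sz=0` and `VLI=0`, back
the truncation point up over skipped elements immediately preceding the
failing element) and assumes predicate bit 0 is element 0. The name
`truncated_vl` is hypothetical.

```python
def truncated_vl(pred_mask, fail_step, sz, vli):
    """Return the new VL when the test at element fail_step triggers VLSET."""
    if vli:
        return fail_step + 1   # inclusive of the element being tested
    if sz:
        return fail_step       # zeroed elements still count as elements
    # sz=0: do not include skipped elements just before the failing one
    vl = fail_step
    while vl > 0 and not (pred_mask >> (vl - 1)) & 1:
        vl -= 1
    return vl


mask = 0b110010  # elements 1, 4 and 5 are active
print(truncated_vl(mask, 4, sz=False, vli=False))
print(truncated_vl(mask, 4, sz=True, vli=False))
print(truncated_vl(mask, 4, sz=False, vli=True))
```
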
## VLSET and CTR-test combined

If both CTR-test and VLSET Modes are requested, it is important to
observe the correct order. What occurs depends on whether VLI
is enabled, because VLI affects the length, VL.

If VLI (VL truncate inclusive) is set:

1. compute the test including whether CTR triggers
2. (optionally) decrement CTR
3. (optionally) truncate VL (VSb inverts the decision)
4. decide (based on step 1) whether to terminate looping
   (including not executing step 5)
5. decide whether to branch.

If VLI is clear, then when a test fails that element
and any following it
should **not** be considered part of the Vector. Consequently:

1. compute the branch test including whether CTR triggers
2. if the test fails against VSb, truncate VL to the *previous*
   element, and terminate looping. No further steps executed.
3. (optionally) decrement CTR
4. decide whether to branch.

# Boolean Logic combinations

In a Scalar ISA, Branch-Conditional testing even of vector
results may be performed through inversion of tests. NOR of
all tests may be performed by inversion of the scalar condition
and branching *out* from the scalar loop around elements,
using scalar operations.

In a parallel (Vector) ISA it is the ISA itself which must perform
the prerequisite logic manipulation.
Thus for SVP64 there are an extraordinary number of necessary combinations
which provide completely different and useful behaviour.
Available options to combine:

* `BO[0]` to make an unconditional branch would seem irrelevant if
  it were not for predication and for side-effects (CTR Mode
  for example)
* Enabling CTR-test Mode and setting `BO[2]` can still result in the
  Branch
  taking place, not because the Condition Test itself failed, but
  because CTR reached zero **because**, as required by CTR-test mode,
  CTR was decremented as a **result** of Condition Tests failing.
* `BO[1]` to select whether the CR bit being tested is zero or nonzero
* `R30` and `~R30` and other predicate mask options including CR and
  inverted CR bit testing
* `sz` and `SNZ` to insert either zeros or ones in place of masked-out
  predicate bits
* `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
  `OR` of all tests, respectively.
* Predicate Mask bits, which combine in effect with the CR being
  tested.
* Inversion of Predicate Masks (`~r3` instead of `r3`, or using
  `NE` rather than `EQ`) which results in an additional
  level of possible ANDing, ORing etc. that would otherwise
  need explicit instructions.

The most obviously useful combinations here are to set `BO[1]` to zero
in order to invert each test, which turns `ALL` into a Great-Big-NOR and
`ANY` into a Great-Big-NAND of the CR bits being tested.
Other Mode bits which perform behavioural inversion then
have to work round the fact that the Condition Testing is NOR or NAND.
Without the additional behavioural inversion bits
(`SNZ`, `VSb`, `CTi`), the alternative would be to have a second (unconditional)
branch directly after the first, which the first branch jumps over.
This contrivance is avoided by the behavioural inversion bits.

# Pseudocode and examples

Please see [[svp64/appendix]] regarding CR bit ordering and for
the definition of `CR{n}`

For comparative purposes this is a copy of the v3.0B `bc` pseudocode

```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else NIA <-iea CIA + EXTS(BD || 0b00)
if LK then LR <-iea CIA + 4
```

Simplified pseudocode including LRu and CTR skipping, which illustrates
clearly that SVP64 Scalar Branches (VL=1) are **not** identical to
v3.0B Scalar Branches. The key areas where differences occur are
the inclusion of predication (which can still be used when VL=1), in
when and why CTR is decremented (CTRtest Mode) and whether LR is
updated (which is unconditional in v3.0B when LK=1, and conditional
in SVP64 when LRu=1).

```
if (mode_is_64bit) then M <- 0
else M <- 32
testbit = CR[BI+32]
if ¬predicate_bit then testbit = SVRMmode.SNZ
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(testbit ^ BO[1])
if ¬predicate_bit & ¬SVRMmode.sz then
    if ¬BO[2] & CTRtest & ¬CTi then
        CTR = CTR - 1
    # instruction finishes here
else
    if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
    if VLSET and VSb = (cond_ok & ctr_ok) then
        if SVRMmode.VLI then SVSTATE.VL = srcstep+1
        else SVSTATE.VL = srcstep
    lr_ok <- LK
    svlr_ok <- SVRMmode.SL
    if ctr_ok & cond_ok then
        if AA then NIA <-iea EXTS(BD || 0b00)
        else NIA <-iea CIA + EXTS(BD || 0b00)
        if SVRMmode.LRu then lr_ok <- ¬lr_ok
        if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
    if lr_ok then LR <-iea CIA + 4
    if svlr_ok then SVLR <- SVSTATE
```

Below is the pseudocode for SVP64 Branches, which is a little less
obvious but identical to the above. The lack of obviousness is down
to the early-exit opportunities.

Effective pseudocode for Horizontal-First Mode:

```
if (mode_is_64bit) then M <- 0
else M <- 32
cond_ok = not SVRMmode.ALL
for srcstep in range(VL):
    # select predicate bit or zero/one
    if predicate[srcstep]:
        # get SVP64 extended CR field 0..127
        SVCRf = SVP64EXTRA(BI>>2)
        CRbits = CR{SVCRf}
        testbit = CRbits[BI & 0b11]
        # testbit = CR[BI+32+srcstep*4]
    else if not SVRMmode.sz:
        # inverted CTR test skip mode
        if ¬BO[2] & CTRtest & ¬CTi then
            CTR = CTR - 1
        continue # skip to next element
    else
        testbit = SVRMmode.SNZ
    # actual element test here
    ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
    el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
    # check if CTR dec should occur
    ctrdec = ¬BO[2]
    if CTRtest & (el_cond_ok ^ CTi) then
        ctrdec = 0b0
    if ctrdec then CTR <- CTR - 1
    # merge in the test
    if SVRMmode.ALL:
        cond_ok &= (el_cond_ok & ctr_ok)
    else
        cond_ok |= (el_cond_ok & ctr_ok)
    # test for VL to be set (and exit)
    if VLSET and VSb = (el_cond_ok & ctr_ok) then
        if SVRMmode.VLI then SVSTATE.VL = srcstep+1
        else SVSTATE.VL = srcstep
        break
    # early exit?
    if SVRMmode.ALL != (el_cond_ok & ctr_ok):
        break
    # SVP64 rules about Scalar registers still apply!
    if SVCRf.scalar:
        break
# loop finally done, now test if branch (and update LR)
lr_ok <- LK
svlr_ok <- SVRMmode.SL
if cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else NIA <-iea CIA + EXTS(BD || 0b00)
    if SVRMmode.LRu then lr_ok <- ¬lr_ok
    if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
if lr_ok then LR <-iea CIA + 4
if svlr_ok then SVLR <- SVSTATE
```

Pseudocode for Vertical-First Mode:

```
# get SVP64 extended CR field 0..127
SVCRf = SVP64EXTRA(BI>>2)
CRbits = CR{SVCRf}
# select predicate bit or zero/one
if predicate[srcstep]:
    if BRc = 1 then # CR0 vectorised
        CR{SVCRf+srcstep} = CRbits
    testbit = CRbits[BI & 0b11]
else if not SVRMmode.sz:
    # inverted CTR test skip mode
    if ¬BO[2] & CTRtest & ¬CTi then
        CTR = CTR - 1
    SVSTATE.srcstep = new_srcstep
    exit # no branch testing
else
    testbit = SVRMmode.SNZ
# actual element test here
cond_ok <- BO[0] | ¬(testbit ^ BO[1])
# test for VL to be set (and exit)
if VLSET and cond_ok = VSb then
    if SVRMmode.VLI
        SVSTATE.VL = new_srcstep+1
    else
        SVSTATE.VL = new_srcstep
```

# Example Shader code

```
// assume f() g() or h() modify a and/or b
while(a > 2) {
    if(b < 5)
        f();
    else
        g();
    h();
}
```

which compiles to something like:

```
vec<i32> a, b;
// ...
pred loop_pred = a > 2;
// loop continues while any of a elements greater than 2
while(loop_pred.any()) {
    // vector of predicate bits
    pred if_pred = loop_pred & (b < 5);
    // only call f() if at least 1 bit set
    if(if_pred.any()) {
        f(if_pred);
    }
    label1:
    // loop mask ANDs with inverted if-test
    pred else_pred = loop_pred & ~if_pred;
    // only call g() if at least 1 bit set
    if(else_pred.any()) {
        g(else_pred);
    }
    h(loop_pred);
}
```

which will end up as:

```
    # start from while loop test point
    b looptest
while_loop:
    sv.cmpi CR80.v, b.v, 5 # vector compare b into CR80 Vector
    sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
    # only calculate loop_pred & pred_b because needed in f()
    sv.crand CR80.v.SO, CR60.v.GT, CR80.v.LT # if = loop & pred_b
    f(CR80.v.SO)
skip_f:
    # illustrate inversion of pred_b. invert r30, test ALL
    # rather than SOME, but masked-out zero test would FAIL,
    # therefore masked-out instead is tested against 1 not 0
    sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
    # else = loop & ~pred_b, need this because used in g()
    sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.v.LT
    g(CR80.v.SO)
skip_g:
    # conditionally call h(r30) if any loop pred set
    sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
looptest:
    sv.cmpi CR60.v, a.v, 2 # vector compare a into CR60 vector
    sv.crweird r30, CR60.GT # transfer GT vector to r30
    sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
end:
```
# TODO LRu example

Show why LRu would be useful in a loop. Imagine the following
C code:

```
for (int i = 0; i < 8; i++) {
    if (x < y) break;
}
```

Under these circumstances exiting from the loop is not only
based on CTR: it has become conditional on a CR result.
Thus it is desirable that NIA *and* LR only be modified
if the conditions are met.

v3.0 pseudocode for `bclrl`:

```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
if LK then LR <-iea CIA + 4
```

the latter part for SVP64 `bclrl` becomes:

```
for i in 0 to VL-1:
    ...
    ...
    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
    lr_ok <- LK
    if ctr_ok & cond_ok then
        NIA <-iea LR[0:61] || 0b00
        if SVRMmode.LRu then lr_ok <- ¬lr_ok
    if lr_ok then LR <-iea CIA + 4
    # if NIA modified exit loop
```

The reason why should be clear from this being a Vector loop:
unconditional destruction of LR when LK=1 makes `sv.bclrl`
ineffective, because the intention going into the loop is
that the branch should be to the copy of LR set at the *start*
of the loop, not half way through it.
However if the change to LR only occurs if
the branch is taken then it becomes a useful instruction.

The following pseudocode should **not** be implemented because
it violates the fundamental principle of SVP64 which is that
SVP64 looping is a thin wrapper around Scalar Instructions.
The pseudocode below is more an actual Vector ISA Branch and
as such is not at all appropriate:

```
for i in 0 to VL-1:
    ...
    ...
    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
    if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
# only at the end of looping is LK checked.
# this completely violates the design principle of SVP64
# and would actually need to be a separate (scalar)
# instruction "set LR to CIA+4 but retrospectively"
# which is clearly impossible
if LK then LR <-iea CIA + 4
```