# SVP64 Branch Conditional behaviour

Please note: although similar, SVP64 Branch instructions should be
considered completely separate and distinct from standard scalar
OpenPOWER-approved v3.0B branches. **v3.0B branches are in no way
impacted, altered, changed or modified in any way, shape or form by the
SVP64 Vectorized Variants**.

It is also extremely important to note that Branches are the sole
pseudo-exception in SVP64 to `Scalar Identity Behaviour`. SVP64 Branches
contain additional modes that are useful for scalar operations (i.e. even
when VL=1 or when using single-bit predication).

<!-- hide -->
Links

* <https://bugs.libre-soc.org/show_bug.cgi?id=664>
* <https://bugs.libre-soc.org/show_bug.cgi?id=1215> -
  fix error where ending on scalar BI source.
* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html>
* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004678.html>
* Branch Divergence <https://jbush001.github.io/2014/12/07/branch-divergence-in-parallel-kernels.html>
* [[openpower/isa/branch]]
* [[sv/cr_int_predication]]
* [TODO](https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=fa99590eeb61e63b2d2ea81f303b9b4320e3bbe1)
<!-- show -->

## Rationale

Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test
a Condition Register. However for parallel processing it is simply
impossible to perform multiple independent branches: the Program
Counter cannot branch to multiple destinations based on multiple
conditions. The best that can be done is to test multiple Conditions
and make a *single* branch decision, based on analysis of a *Vector*
of CR Fields which have just been calculated from a *Vector* of results.

In 3D Shader binaries, which are inherently parallelised and predicated,
testing all or some results and branching based on multiple tests is
extremely common, and a fundamental part of Shader Compilers. Example:
without such multi-condition test-and-branch, if a predicate mask is
all zeros a large batch of instructions may be masked out to `nop`,
and it would waste CPU cycles to run them. 3D GPU ISAs can test for
this scenario and, with the appropriate predicate-analysis instruction,
jump over fully-masked-out operations, by spotting that *all* Conditions
are false.

Unless Branches are aware and capable of such analysis, additional
instructions would be required which perform Horizontal Cumulative
analysis of Vectorized Condition Register Fields, in order to reduce
the Vector of CR Fields down to one single yes or no decision that a
Scalar-only v3.0B Branch-Conditional could cope with. Such instructions
would be unavoidable, required, and costly by comparison to a single
Vector-aware Branch. Therefore, in order to be commercially competitive,
`sv.bc` and other Vector-aware Branch Conditional instructions are a
high priority for 3D GPU (and OpenCL-style) workloads.

Given that Power ISA v3.0B is already quite powerful, particularly
the Condition Registers and their interaction with Branches, there are
opportunities to create extremely flexible and compact Vectorized Branch
behaviour. In addition, the side-effects (updating of CTR, truncation
of VL, described below) make it a useful instruction even if the branch
points to the next instruction (no actual branch).

## Overview

When considering an "array" of branch-tests, there are four
primarily-useful modes: AND, OR, NAND and NOR of all Conditions.
NAND and NOR may be synthesised from AND and OR by inverting `BO[1]`
which just leaves two modes:

* Branch takes place on the **first** CR Field test to succeed
  (a Great Big OR of all condition tests). Exit occurs
  on the first **successful** test.
* Branch takes place only if **all** CR field tests succeed:
  a Great Big AND of all condition tests. Exit occurs
  on the first **failed** test.

Early-exit is enacted such that the Vectorized Branch does not
perform needless extra tests, which will help reduce reads on
the Condition Register file.

*Note: Early-exit is **MANDATORY** (required) behaviour. Branches
**MUST** exit at the first sequentially-encountered failure point,
for exactly the same reasons for which it is mandatory in programming
languages doing early-exit: to avoid damaging side-effects and to provide
deterministic behaviour. Speculative testing of Condition Register
Fields is permitted, as is speculative calculation of CTR, as long as,
as usual in any Out-of-Order microarchitecture, that speculative testing
is cancelled should an early-exit occur, i.e. the speculation must be
"precise": Program Order must be preserved.*

Also note that when early-exit occurs in Horizontal-first Mode, srcstep,
dststep etc. are all reset, ready to begin looping from the beginning
for the next instruction. However for Vertical-first Mode srcstep
etc. are incremented "as usual" i.e. an early-exit has no special impact,
regardless of whether the branch occurred or not. This can leave srcstep
etc. in what may be considered an unusual state on exit from a loop and
it is up to the programmer to reset srcstep, dststep etc. to known-good
values *(easily achieved with `setvl`)*.

Additional useful behaviour involves two primary Modes (both of which
may be enabled and combined):

* **VLSET Mode**: identical to Data-Dependent Fail-First Mode
  for Arithmetic SVP64 operations, with more
  flexibility and a close interaction and integration into the
  underlying base Scalar v3.0B Branch instruction.
  Truncation of VL takes place around the early-exit point.
* **CTR-test Mode**: gives much more flexibility over when and why
  CTR is decremented, including options to decrement if a Condition
  test succeeds *or if it fails*.

With these side-effects, basic Boolean Logic Analysis advises that it
is important to provide a means to enact them each based on whether
testing succeeds *or fails*. This results in a not-insignificant number
of additional Mode Augmentation bits, accompanying VLSET and CTR-test
Modes respectively.

Predicate skipping or zeroing may, as usual with SVP64, be controlled by
`sz`. Where the predicate is masked out and zeroing is enabled, the
same Boolean Logic Analysis dictates that rather
than testing only against zero, the option to test against one is also
prudent. This introduces a new immediate field, `SNZ`, which works in
conjunction with `sz`.

Vectorized Branches can be used in either SVP64 Horizontal-First or
Vertical-First Mode. Essentially, at an element level, the behaviour
is identical in both Modes, although the `ALL` bit is meaningless in
Vertical-First Mode.

It is also important to bear in mind that, fundamentally, Vectorized
Branch-Conditional is still extremely close to the Scalar v3.0B
Branch-Conditional instructions, and that the same v3.0B Scalar
Branch-Conditional instructions are still *completely separate and
independent*, being unaltered and unaffected by their SVP64 variants in
every conceivable way.

*Programming note: One important point is that SVP64 instructions are
64-bit (8 bytes, not 4). This needs to be taken into consideration
when computing branch offsets: the offset is relative to the start of
the instruction, which **includes** the SVP64 Prefix.*

*Programming note: SV Branch-conditional instructions have no destination
register, only a source (`BI`). Therefore the looping will occur even on
Scalar BI (`sv.bc/all 16, 0, location`). If this is not desirable behaviour
and only a single scalar test is required,
use a single-bit unary predicate mask such as `sm=1<<r3`.*

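The first programming note above can be illustrated with a small calculation. This is only a sketch: the helper name `svp64_branch_bd` is hypothetical, it assumes a relative branch (`AA=0`) and the 14-bit `BD` field of v3.0B `bc`, and it takes the address of the SVP64 Prefix as the reference point, as the note describes.

```python
def svp64_branch_bd(prefix_addr: int, target_addr: int) -> int:
    """Return the 14-bit BD field for a relative (AA=0) branch.

    Hypothetical helper: prefix_addr is the address of the SVP64 Prefix,
    i.e. the start of the whole 8-byte instruction.
    """
    disp = target_addr - prefix_addr      # measured from the prefix
    if disp % 4 != 0:
        raise ValueError("branch targets must be word-aligned")
    bd = disp >> 2                        # low two bits are the implied 0b00
    if not -(1 << 13) <= bd < (1 << 13):
        raise ValueError("displacement does not fit in BD")
    return bd & 0x3FFF                    # two's complement in 14 bits


# The instruction following a prefixed branch is 8 bytes further on, so
# "branch to the next instruction" needs a displacement of 8, not 4.
assert svp64_branch_bd(0x1000, 0x1008) == 2
```

The point of the sketch is only the reference address and the 8-byte instruction length; everything else is ordinary `bc` displacement arithmetic.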
## Format and fields

With element-width overrides being meaningless for Condition Register
Fields, bits 4 through 7 of SVP64 RM may be used for additional Mode bits.

SVP64 RM `MODE` (includes repurposing `ELWIDTH` bits 4-5, and
`ELWIDTH_SRC` bits 6-7 for *alternate* uses) for Branch Conditional:

| 4 | 5 | 6 | 7 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | description |
| - | - | - | - | -- | -- | -- | -- | --- | --- | -- | ------------------- |
|ALL|SNZ| / | / | SL |SLu | 0 | 0 | / | LRu | sz | simple mode |
|ALL|SNZ| / |VSb| SL |SLu | 0 | 1 | VLI | LRu | sz | VLSET mode |
|ALL|SNZ|CTi| / | SL |SLu | 1 | 0 | / | LRu | sz | CTR-test mode |
|ALL|SNZ|CTi|VSb| SL |SLu | 1 | 1 | VLI | LRu | sz | CTR-test+VLSET mode |

Brief description of fields:

* **sz=1** if predication is enabled and `sz=1` and a predicate
  element bit is zero, `SNZ` will
  be substituted in place of the CR bit selected by `BI`,
  as the Condition tested.
  Contrast this with
  normal SVP64 `sz=1` behaviour, where *only* a zero is put in
  place of masked-out predicate bits.
* **sz=0** When `sz=0` skipping occurs as usual on
  masked-out elements, but unlike all
  other SVP64 behaviour which entirely skips an element with
  no related side-effects at all, there are certain
  special circumstances where CTR
  may be decremented. See CTR-test Mode, below.
* **ALL** when set, all branch conditional tests must pass in order for
  the branch to succeed. When clear, it is the first sequentially
  encountered successful test that causes the branch to succeed.
  This is identical behaviour to how programming languages perform
  early-exit on Boolean Logic chains.
* **VLI** VLSET is identical to Data-dependent Fail-First mode.
  In VLSET mode, VL *may* (depending on `VSb`) be truncated.
  If VLI (Vector Length Inclusive) is clear,
  VL is truncated to *exclude* the current element, otherwise it is
  included. SVSTATE.MVL is not altered: only VL.
* **SL** analogous to `LK` except applicable to SVSTATE. If `SL`
  is set, SVSTATE is transferred to SVLR (conditionally on
  whether `SLu` is set).
* **SLu**: SVSTATE Link Update, like `LRu` except applies to SVSTATE.
* **LRu**: Link Register Update, used in conjunction with LK=1
  to make LR update conditional.
* **VSb** In VLSET Mode, after testing,
  if VSb is set, VL is truncated if the test succeeds. If VSb is clear,
  VL is truncated if a test *fails*. Masked-out (skipped)
  bits are not considered
  part of testing when `sz=0`.
* **CTi** CTR inversion. CTR-test Mode normally decrements per element
  tested. CTR inversion decrements if a test *fails*. Only relevant
  in CTR-test Mode.

LRu and CTR-test modes are where SVP64 Branches subtly differ from
Scalar v3.0B Branches. `sv.bcl` for example will always update LR,
whereas `sv.bcl/lru` makes the LR update conditional on the branch
outcome, as set out under Link Register Update below.

Of special interest is that when using ALL Mode (Great Big AND of all
Condition Tests), if `VL=0`, which is rare but can occur in Data-Dependent
Modes, the Branch will always take place because there will be no failing
Condition Tests to prevent it. Likewise when not using ALL Mode (Great
Big OR of all Condition Tests) and `VL=0` the Branch is guaranteed not
to occur because there will be no *successful* Condition Tests to make
it happen.

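As a reading aid for the table above, the four modes can be told apart from RM bits 19 and 20 alone. The sketch below is illustrative only: the function name `branch_mode` is hypothetical, and the two bits are passed as separate integers so that nothing is assumed about how the RM field is packed into a word.

```python
def branch_mode(bit19: int, bit20: int) -> str:
    """Name the Branch Conditional mode selected by RM bits 19 and 20.

    Per the table: bit 19 enables CTR-test, bit 20 enables VLSET.
    """
    names = {
        (0, 0): "simple",
        (0, 1): "VLSET",
        (1, 0): "CTR-test",
        (1, 1): "CTR-test+VLSET",
    }
    return names[(bit19, bit20)]


for bits in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(bits, branch_mode(*bits))
```

Bits 6, 7 and 21 (`CTi`, `VSb`, `VLI`) only carry meaning in the rows where the table names them; they are marked `/` elsewhere.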
## Vectorized CR Field numbering, and Scalar behaviour

It is important to keep in mind that just like all SVP64 instructions,
the `BI` field of the base v3.0B Branch Conditional instruction may be
extended by SVP64 EXTRA augmentation, as well as be marked as either
Scalar or Vector. It is also crucially important to keep in mind that for
CRs, SVP64 sequentially increments the CR *Field* numbers. CR *Fields*
are treated as elements, not bit-numbers of the CR *register*.

The `BI` operand of Branch Conditional operations is five bits. In scalar
v3.0B this selects one bit of the 32-bit CR, comprising eight CR
Fields of 4 bits each. In SVP64 there are sixteen 32-bit CRs, containing
128 4-bit CR Fields. Therefore, the 2 LSBs of `BI` select the bit from
the CR Field (EQ LT GT SO), and the top 3 bits are extended to either
scalar or vector and to select CR Fields 0..127 as specified in SVP64
[[sv/svp64/appendix]].

When the CR Field selected by SVP64-Augmented `BI` is marked as scalar,
the usual SVP64 rules apply: the Vector loop ends at the first
element tested (the first CR *Field*), after taking predication into
consideration. Thus, also as usual, when a predicate mask is given, and
`BI` is marked as scalar, and `sz` is zero, srcstep skips forward to the
first non-zero predicated element, and only that one element is tested.

In other words, the fact that this is a Branch Operation (instead of an
arithmetic one) does not result, ultimately, in significant changes as
to how SVP64 is fundamentally applied, except with respect to:

* the unique properties associated with conditionally
  changing the Program Counter (aka "a Branch"), resulting in early-out
  opportunities
* CTR-testing

Both are outlined below, in later sections.

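The split of the 5-bit `BI` operand described in this section can be written down directly. This is a minimal sketch with a hypothetical helper name, `split_bi`; it covers only the unaugmented 5-bit value and deliberately does not model the EXTRA extension to CR Fields 0..127 or the scalar/vector marking, which are defined in the appendix.

```python
def split_bi(bi: int) -> tuple[int, int]:
    """Split a 5-bit BI into (CR Field number, bit index within that Field)."""
    if not 0 <= bi < 32:
        raise ValueError("BI is a 5-bit field")
    cr_field = bi >> 2        # top 3 bits: which CR Field (before EXTRA)
    bit_in_field = bi & 0b11  # 2 LSBs: which of the Field's four bits
    return cr_field, bit_in_field


print(split_bi(14))
```

In Vector mode it is the CR Field number that is stepped per element, while the bit index within the Field stays the same for every element tested.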
## Horizontal-First and Vertical-First Modes

In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
AND) results in early exit: no more updates to CTR occur (if requested);
no branch occurs, and LR is not updated (if requested). Likewise for
non-ALL mode (Great Big OR) on first success early exit also occurs,
however this time with the Branch proceeding. In both cases the testing
of the Vector of CRs should be done in linear sequential order (or in
REMAP re-sequenced order): such that tests that are sequentially beyond
the exit point are *not* carried out. (*Note: it is standard practice
in Programming languages to exit early from conditional tests, however a
little unusual to consider in an ISA that is designed for Parallel Vector
Processing. The reason is to have strictly-defined guaranteed behaviour.*)

In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
behaviour. Given that only one element is being tested at a time in
Vertical-First Mode, a test designed to be done on multiple bits is
meaningless.

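The Horizontal-First early-exit rule can be sketched as a short reduction. This is a deliberately reduced model, not the full instruction: the name `vector_branch_taken` is hypothetical, each element's test is supplied as an already-computed boolean, and predication, CTR and VLSET are left out.

```python
def vector_branch_taken(tests, all_mode):
    """Return (branch_taken, elements_tested) for a sequence of booleans.

    all_mode=True  : Great Big AND, exit on the first failed test.
    all_mode=False : Great Big OR, exit on the first successful test.
    """
    taken = all_mode          # identity value: True for AND, False for OR
    tested = 0
    for ok in tests:
        tested += 1
        taken = ok
        if ok != all_mode:    # deciding element found: stop testing here
            break
    return taken, tested


print(vector_branch_taken([True, True, False, True], all_mode=True))
print(vector_branch_taken([False, False, True, False], all_mode=False))
print(vector_branch_taken([], all_mode=True))
print(vector_branch_taken([], all_mode=False))
```

The two empty-vector calls correspond to the `VL=0` case: with no tests to fail, ALL mode branches; with no tests to succeed, non-ALL mode does not.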
## Description and Modes

Predication in both INT and CR modes may be applied to `sv.bc` and other
SVP64 Branch Conditional operations, exactly as they may be applied to
other SVP64 operations. When `sz` is zero, any masked-out Branch-element
operations are not included in condition testing, exactly like all other
SVP64 operations, *including* side-effects such as potentially updating
LR or CTR, which will also be skipped. There is *one* exception here,
which is when `BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
predicate mask bit is also zero: under these special circumstances CTR
will also decrement.

When `sz` is non-zero, this normally requests insertion of a zero in
place of the input data, when the relevant predicate mask bit is zero.
This would mean that a zero is inserted in place of `CR[BI+32]` for
testing against `BO`, which may not be desirable in all circumstances.
Therefore, an extra field, `SNZ`, is provided which, if set, will insert
a **one** in place of a masked-out element, instead of a zero.

(*Note: Both options are provided because it is useful to deliberately
cause the Branch-Conditional Vector testing to fail at a specific point,
controlled by the Predicate mask. This is particularly useful in `VLSET`
mode, which will truncate SVSTATE.VL at the point of the first failed
test.*)

Normally, CTR mode will decrement once per Condition Test, so that under
normal circumstances CTR reduces by up to VL in Horizontal-First
Mode. Just as when v3.0B Branch-Conditional saves at least one instruction
on tight inner loops through auto-decrementation of CTR, likewise it
is also possible to save instruction count for SVP64 loops in both
Vertical-First and Horizontal-First Mode, particularly in circumstances
where there is conditional interaction between the element computation
and testing, and the continuation (or otherwise) of a given loop. The
number of potential combinations of interactions is why CTR testing
options have been added.

Also, the unconditional bit `BO[0]` is still relevant when Predication
is applied to the Branch because in `ALL` mode all nonmasked bits have
to be tested, and when `sz=0` skipping occurs. Even when VLSET mode is
not used, CTR may still be decremented by the total number of nonmasked
elements, acting in effect as either a popcount or cntlz depending
on which mode bits are set. In short, Vectorized Branch becomes an
extremely powerful tool.

**Micro-Architectural Implementation Note**: *when implemented on top
of a Multi-Issue Out-of-Order Engine it is possible to pass a copy of
the predicate and the prerequisite CR Fields to all Branch Units, as
well as the current value of CTR at the time of multi-issue, and for
each Branch Unit to compute how many times CTR would be subtracted,
in a fully-deterministic and parallel fashion. A SIMD-based Branch
Unit, receiving and processing multiple CR Fields covered by multiple
predicate bits, would do the exact same thing. Obviously, however, if
CTR is modified within any given loop (mtctr) the behaviour of CTR is
no longer deterministic.*

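The `sz` and `SNZ` substitution described earlier in this section comes down to a three-way choice per element. The following is a sketch only, with a hypothetical helper name `select_testbit`; it returns `None` to stand for "element skipped" and models nothing else (no CTR, no `BO` evaluation).

```python
def select_testbit(cr_bit: int, pred_bit: int, sz: int, snz: int):
    """Return the bit tested for one element, or None if it is skipped."""
    if pred_bit:
        return cr_bit   # element is active: test the real CR bit
    if not sz:
        return None     # masked out, sz=0: element is skipped
    return snz          # masked out, sz=1: SNZ stands in for the CR bit


# Masked-out element with sz=1: SNZ chooses whether a 0 or a 1 is tested.
print(select_testbit(cr_bit=1, pred_bit=0, sz=1, snz=0))
print(select_testbit(cr_bit=0, pred_bit=0, sz=1, snz=1))
```

Choosing `SNZ` so that the substituted bit fails the `BO[1]` comparison is what lets a predicate mask force the Vector test to stop at a chosen element.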
### Link Register Update

For a Scalar Branch, unconditional updating of the Link Register LR
is useful and practical. However, if a loop of CR Fields is tested,
unconditional updating of LR becomes problematic.

For example when using `bclr` with `LRu=1,LK=0` in Horizontal-First Mode,
LR's value will be unconditionally overwritten after the first element,
such that for execution (testing) of the second element, LR has the value
`CIA+8`. This is covered in the `bclrl` example, in a later section.

The addition of an LRu bit modifies behaviour in conjunction with LK,
as follows:

* `sv.bc` When LRu=0,LK=0, Link Register is not updated
* `sv.bcl` When LRu=0,LK=1, Link Register is updated unconditionally
* `sv.bcl/lru` When LRu=1,LK=1, Link Register will
  only be updated if the Branch Condition fails.
* `sv.bc/lru` When LRu=1,LK=0, Link Register will only be updated if
  the Branch Condition succeeds.

This avoids destruction of LR during loops (particularly Vertical-First
ones).

**SVLR and SVSTATE**

For precisely the reasons why `LK=1` was added originally to the Power
ISA, with SVSTATE being a peer of the Program Counter it becomes necessary
to also add an SVLR (SVSTATE Link Register) and corresponding control bits
`SL` and `SLu`.

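The four `LK`/`LRu` combinations listed above, and their `SL`/`SLu` counterparts, reduce to one rule: the link bit gives the default, and the update bit flips that default only when the branch is taken. The sketch below mirrors the `lr_ok` / `svlr_ok` lines of the pseudocode given later on this page; the function name `link_updates` is hypothetical.

```python
def link_updates(lk, lru, sl, slu, branch_taken):
    """Return (update_LR, update_SVLR) as booleans."""
    lr_ok, svlr_ok = bool(lk), bool(sl)
    if branch_taken:
        if lru:
            lr_ok = not lr_ok      # LRu inverts the LK default on a taken branch
        if slu:
            svlr_ok = not svlr_ok  # SLu does the same for SVLR
    return lr_ok, svlr_ok


# sv.bcl/lru (LK=1, LRu=1): LR written only when the branch is NOT taken
print(link_updates(lk=1, lru=1, sl=0, slu=0, branch_taken=False))
# sv.bc/lru (LK=0, LRu=1): LR written only when the branch IS taken
print(link_updates(lk=0, lru=1, sl=0, slu=0, branch_taken=True))
```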
### CTR-test

Where a standard Scalar v3.0B branch unconditionally decrements CTR when
`BO[2]` is clear, CTR-test Mode introduces more flexibility which allows
CTR to be used for many more types of Vector loop constructs.

CTR-test mode and CTi interaction is as follows: note that `BO[2]`
is still required to be clear for CTR decrements to be considered,
exactly as is the case in Scalar Power ISA v3.0B.

* **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
  if `BO[2]` is zero. Masked-out elements when `sz=0` are
  skipped (i.e. CTR is *not* decremented when the predicate
  bit is zero and `sz=0`).
* **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
  if `BO[2]` is zero and a masked-out element is skipped
  (`sz=0` and predicate bit is zero). This one special case is the
  **opposite** of other combinations, as well as being
  completely different from normal SVP64 `sz=0` behaviour.
* **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
  if `BO[2]` is zero and the Condition Test succeeds.
  Masked-out elements when `sz=0` are skipped (including
  not decrementing CTR).
* **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
  if `BO[2]` is zero and the Condition Test *fails*.
  Masked-out elements when `sz=0` are skipped (including
  not decrementing CTR).

`CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the only
time in the entirety of SVP64 that has side-effects when a predicate mask
bit is clear. **All** other SVP64 operations entirely skip an element
when sz=0 and a predicate mask bit is zero. It is also critical to
emphasise that in this unusual mode, no other side-effects occur: **only**
CTR is decremented, i.e. the rest of the Branch operation is skipped.

### VLSET Mode

VLSET Mode truncates the Vector Length so that subsequent instructions
operate on a reduced Vector Length. This is similar to Data-dependent
Fail-First and LD/ST Fail-First, except that for VLSET the truncation
occurs at the Branch decision-point.

Interestingly, due to the side-effects of `VLSET` mode it is actually
useful to use Branch Conditional even to perform no actual branch
operation, i.e. to point to the instruction after the branch. Truncation of
VL would thus conditionally occur yet control flow alteration would not.

`VLSET` mode with Vertical-First is particularly unusual. Vertical-First
is designed to be used for explicit looping, where an explicit call to
`svstep` is required to move both srcstep and dststep on to the next
element, until VL (or other condition) is reached. Vertical-First Looping
is expected (required) to terminate if the end of the Vector, VL, is
reached. If however that loop is terminated early because VL is truncated,
VLSET with Vertical-First becomes meaningless. Resolving this would
require two branches: one Conditional, the other branching unconditionally
to create the loop, where the Conditional one jumps over it.

Therefore, with `VSb`, the option to decide whether truncation should
occur if the branch succeeds *or* if the branch condition fails allows
for the flexibility required. This allows a Vertical-First Branch to
*either* be used as a branch-back (loop) *or* as part of a conditional
exit or function call from *inside* a loop, and for VLSET to be integrated
into both types of decision-making.

In the case of a Vertical-First branch-back (loop), with `VSb=0` the
branch takes place if success conditions are met, but on exit from that
loop (branch condition fails), VL will be truncated. This is extremely
useful.

`VLSET` mode with Horizontal-First when `VSb=0` is still useful, because
it can be used to truncate VL to the first predicated (non-masked-out)
element.

The truncation point for VL, when `VLI` is clear, must not include skipped
elements that preceded the current element being tested. Example:
`sz=0, VLI=0, predicate mask = 0b110010` and the Condition Register
failure point is at CR Field element 4.

* Testing at element 0 is skipped because its predicate bit is zero
* Testing at element 1 passed
* Testing of elements 2 and 3 is skipped because their
  respective predicate mask bits are zero
* Testing element 4 fails therefore VL is truncated to **2**
  not 4 due to elements 2 and 3 being skipped.

If `sz=1` in the above example *then* VL would have been set to 4 because
in zeroing mode the zero'd elements are still effectively part of
the Vector (with their respective elements set to `SNZ`).

If `VLI=1` then VL would be set to 5 regardless of sz, due to being
inclusive of the element actually being tested.

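The worked example above can be reproduced with a few lines of code. This sketch is modelled on that prose example only: the name `truncated_vl` is hypothetical, bit `i` of the mask is assumed to correspond to element `i` (least significant bit is element 0, as in the example), and the failing element index is supplied by the caller rather than derived from CR Fields.

```python
def truncated_vl(fail_index: int, pred_mask: int, sz: bool, vli: bool) -> int:
    """Return the new VL when the test at element fail_index triggers VLSET."""
    if vli:
        return fail_index + 1     # inclusive of the element being tested
    vl = fail_index               # exclusive of the element being tested
    if not sz:
        # Skipped (masked-out) elements directly before the failure point
        # are not part of the truncated Vector either.
        while vl > 0 and not ((pred_mask >> (vl - 1)) & 1):
            vl -= 1
    return vl


mask = 0b110010
print(truncated_vl(4, mask, sz=False, vli=False))  # the sz=0, VLI=0 case
print(truncated_vl(4, mask, sz=True, vli=False))   # the sz=1 case
print(truncated_vl(4, mask, sz=False, vli=True))   # the VLI=1 case
```

The three calls correspond, in order, to the three outcomes stated in the text: 2, 4 and 5.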
### VLSET and CTR-test combined

If both CTR-test and VLSET Modes are requested, it is important to
observe the correct order. What occurs depends on whether `VLI` is enabled,
because `VLI` affects the length, VL.

If `VLI` (VL truncate inclusive) is set:

1. compute the test including whether CTR triggers
2. (optionally) decrement CTR
3. (optionally) truncate VL (VSb inverts the decision)
4. decide (based on step 1) whether to terminate looping
   (including not executing step 5)
5. decide whether to branch.

If `VLI` is clear, then when a test fails that element and any following
it should **not** be considered part of the Vector. Consequently:

1. compute the branch test including whether CTR triggers
2. if the test fails against VSb, truncate VL to the *previous*
   element, and terminate looping. No further steps executed.
3. (optionally) decrement CTR
4. decide whether to branch.

## Boolean Logic combinations

In a Scalar ISA, Branch-Conditional testing even of vector results may be
performed through inversion of tests. NOR of all tests may be performed
by inversion of the scalar condition and branching *out* from the scalar
loop around elements, using scalar operations.

In a parallel (Vector) ISA it is the ISA itself which must perform
the prerequisite logic manipulation. Thus for SVP64 there are an
extraordinary number of necessary combinations which provide completely
different and useful behaviour. Available options to combine:

* `BO[0]` to make an unconditional branch would seem irrelevant if
  it were not for predication and for side-effects (CTR Mode
  for example)
* Enabling CTR-test Mode and setting `BO[2]` can still result in the
  Branch taking place, not because the Condition Test itself failed, but
  because CTR reached zero **because**, as required by CTR-test mode,
  CTR was decremented as a **result** of Condition Tests failing.
* `BO[1]` to select whether the CR bit being tested is zero or nonzero
* `R30` and `~R30` and other predicate mask options including CR and
  inverted CR bit testing
* `sz` and `SNZ` to insert either zeros or ones in place of masked-out
  predicate bits
* `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
  `OR` of all tests, respectively.
* Predicate Mask bits, which combine in effect with the CR being
  tested.
* Inversion of Predicate Masks (`~r3` instead of `r3`, or using
  `NE` rather than `EQ`) which results in an additional
  level of possible ANDing, ORing etc. that would otherwise
  need explicit instructions.

The most obviously useful combinations here are to set `BO[1]` to zero
in order to turn `ALL` into Great-Big-NAND and `ANY` into Great-Big-NOR.
Other Mode bits which perform behavioural inversion then have to work
round the fact that the Condition Testing is NOR or NAND. Without the
additional behavioural inversion bits (`SNZ`, `VSb`, `CTi`) the
alternative would be to have a second (unconditional) branch directly
after the first, which the first branch jumps over. This contrivance is
avoided by the behavioural inversion bits.

## Pseudocode and examples

Please see [[svp64/appendix]] regarding CR bit ordering and for
the definition of `CR{n}`.

For comparative purposes this is a copy of the v3.0B `bc` pseudocode:

```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then
  if AA then NIA <-iea EXTS(BD || 0b00)
  else       NIA <-iea CIA + EXTS(BD || 0b00)
if LK then LR <-iea CIA + 4
```

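For readers who prefer something executable, the decision part of the scalar pseudocode above can be modelled in a few lines. This is a sketch under stated assumptions: the name `bc_scalar` is hypothetical, `BO` is passed as a list of five bits `[BO0, BO1, BO2, BO3, BO4]` to sidestep bit-numbering conventions, 64-bit mode (`M=0`) is assumed, and the `NIA`/`LR` updates are omitted.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def bc_scalar(bo, cr_bit, ctr):
    """Model ctr_ok/cond_ok of v3.0B bc. Returns (branch_taken, new_ctr)."""
    if not bo[2]:
        ctr = (ctr - 1) & MASK64                   # CTR <- CTR - 1
    ctr_ok = bo[2] | (int(ctr != 0) ^ bo[3])       # BO[2] | ((CTR != 0) ^ BO[3])
    cond_ok = bo[0] | (1 - (cr_bit ^ bo[1]))       # BO[0] | ¬(CR bit ^ BO[1])
    return bool(ctr_ok & cond_ok), ctr


# BO = 0b10000: ignore the CR bit, decrement CTR, branch while CTR != 0
print(bc_scalar([1, 0, 0, 0, 0], cr_bit=0, ctr=2))
print(bc_scalar([1, 0, 0, 0, 0], cr_bit=0, ctr=1))
# BO = 0b01100: leave CTR alone, branch if the CR bit is 1
print(bc_scalar([0, 1, 1, 0, 0], cr_bit=1, ctr=5))
```

The Vectorized pseudocode that follows keeps exactly these two expressions and wraps predication, CTR-test, VLSET and the link-update bits around them.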
Simplified pseudocode including LRu and CTR skipping, which illustrates
clearly that SVP64 Scalar Branches (VL=1) are **not** identical to
v3.0B Scalar Branches. The key areas where differences occur are the
inclusion of predication (which can still be used when VL=1), when and
why CTR is decremented (CTR-test Mode) and whether LR is updated (which
is unconditional in v3.0B when LK=1, and conditional in SVP64 when LRu=1).

Inline comments highlight the fact that the Scalar Branch behaviour and
pseudocode is still clearly visible and embedded within the Vectorized
variant:

```
if (mode_is_64bit) then M <- 0
else M <- 32
# the bit of CR to test, if the predicate bit is zero,
# is overridden
testbit = CR[BI+32]
if ¬predicate_bit then testbit = SVRMmode.SNZ
# otherwise apart from the override ctr_ok and cond_ok
# are exactly the same
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(testbit ^ BO[1])
if ¬predicate_bit & ¬SVRMmode.sz then
  # this is entirely new: CTR-test mode still decrements CTR
  # even when predicate-bits are zero
  if ¬BO[2] & CTRtest & ¬CTi then
    CTR = CTR - 1
  # instruction finishes here
else
  # usual BO[2] CTR-mode now under CTR-test mode as well
  if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
  # new VLset mode, conditional test truncates VL
  if VLSET and VSb = (cond_ok & ctr_ok) then
    if SVRMmode.VLI then SVSTATE.VL = srcstep+1
    else                 SVSTATE.VL = srcstep
  # usual LR is now conditional, but also joined by SVLR
  lr_ok <- LK
  svlr_ok <- SVRMmode.SL
  if ctr_ok & cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else       NIA <-iea CIA + EXTS(BD || 0b00)
    if SVRMmode.LRu then lr_ok <- ¬lr_ok
    if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
  if lr_ok then LR <-iea CIA + 4
  if svlr_ok then SVLR <- SVSTATE
```

Below is the pseudocode for SVP64 Branches, which is a little less
obvious but identical to the above. The lack of obviousness is down to
the early-exit opportunities.

Effective pseudocode for Horizontal-First Mode:

```
if (mode_is_64bit) then M <- 0
else M <- 32
# start from the identity value: 1 for AND (ALL), 0 for OR,
# so that VL=0 branches in ALL mode and does not otherwise
cond_ok = SVRMmode.ALL
for srcstep in range(VL):
    # select predicate bit or zero/one
    if predicate[srcstep]:
        # get SVP64 extended CR field 0..127
        SVCRf = SVP64EXTRA(BI>>2)
        CRbits = CR{SVCRf}
        testbit = CRbits[BI & 0b11]
        # testbit = CR[BI+32+srcstep*4]
    else if not SVRMmode.sz:
        # inverted CTR test skip mode
        if ¬BO[2] & CTRtest & ¬CTi then
            CTR = CTR - 1
        continue # skip to next element
    else
        testbit = SVRMmode.SNZ
    # actual element test here
    ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
    el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
    # check if CTR dec should occur
    ctrdec = ¬BO[2]
    if CTRtest & (el_cond_ok ^ CTi) then
        ctrdec = 0b0
    if ctrdec then CTR <- CTR - 1
    # merge in the test
    if SVRMmode.ALL:
        cond_ok &= (el_cond_ok & ctr_ok)
    else
        cond_ok |= (el_cond_ok & ctr_ok)
    # test for VL to be set (and exit)
    if VLSET and VSb = (el_cond_ok & ctr_ok) then
        if SVRMmode.VLI then SVSTATE.VL = srcstep+1
        else                 SVSTATE.VL = srcstep
        break
    # early exit?
    if SVRMmode.ALL != (el_cond_ok & ctr_ok):
        break
    # SVP64 rules about Scalar registers still apply!
    if SVCRf.scalar:
        break
# loop finally done, now test if branch (and update LR)
lr_ok <- LK
svlr_ok <- SVRMmode.SL
if cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else       NIA <-iea CIA + EXTS(BD || 0b00)
    if SVRMmode.LRu then lr_ok <- ¬lr_ok
    if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
if lr_ok then LR <-iea CIA + 4
if svlr_ok then SVLR <- SVSTATE
```

643 Pseudocode for Vertical-First Mode:
644
645 ```
646 # get SVP64 extended CR field 0..127
647 SVCRf = SVP64EXTRA(BI>>2)
648 CRbits = CR{SVCRf}
649 # select predicate bit or zero/one
650 if predicate[srcstep]:
651 if BRc = 1 then # CR0 vectorized
652 CR{SVCRf+srcstep} = CRbits
653 testbit = CRbits[BI & 0b11]
654 else if not SVRMmode.sz:
655 # inverted CTR test skip mode
656 if ¬BO[2] & CTRtest & ¬CTI then
657 CTR = CTR - 1
658 SVSTATE.srcstep = new_srcstep
659 exit # no branch testing
660 else
661 testbit = SVRMmode.SNZ
662 # actual element test here
663 cond_ok <- BO[0] | ¬(testbit ^ BO[1])
664 # test for VL to be set (and exit)
665 if VLSET and cond_ok = VSb then
666 if SVRMmode.VLI
667 SVSTATE.VL = new_srcstep+1
668 else
669 SVSTATE.VL = new_srcstep
670 ```
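Both listings may truncate VL when VLSET is enabled. For the Horizontal-First loop, the interaction between the VSb match, VLI and the ALL/ANY early exit can be sketched as below. This is a hedged illustration: `truncated_vl` and its parameters are hypothetical names, `el_results[i]` stands for `el_cond_ok & ctr_ok` of element `i`, and predication is left out.

```python
def truncated_vl(el_results, vl, vsb, vli, all_mode):
    """Return the VL left behind by a Horizontal-First VLSET branch.

    Simplified model: no predication, no CTR side effects.
    """
    for srcstep in range(vl):
        ok = el_results[srcstep]
        # VLSET test comes first: truncate at the first element
        # whose result equals VSb
        if ok == vsb:
            # VLI includes the matching element in the new VL
            return srcstep + 1 if vli else srcstep
        # ALL/ANY early exit: the loop ends before any later match
        if all_mode != ok:
            break
    return vl  # no truncation happened
```

For example, with results `[True, True, False, True]` in ALL mode and `vsb=False`, VL becomes 2 without VLI and 3 with VLI.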

### Example Shader code

```
// assume f() g() or h() modify a and/or b
while(a > 2) {
    if(b < 5)
        f();
    else
        g();
    h();
}
```

which compiles to something like:

```
vec<i32> a, b;
// ...
pred loop_pred = a > 2;
// loop continues while any element of a is greater than 2
while(loop_pred.any()) {
    // vector of predicate bits
    pred if_pred = loop_pred & (b < 5);
    // only call f() if at least 1 bit set
    if(if_pred.any()) {
        f(if_pred);
    }
    label1:
    // loop mask ANDs with inverted if-test
    pred else_pred = loop_pred & ~if_pred;
    // only call g() if at least 1 bit set
    if(else_pred.any()) {
        g(else_pred);
    }
    h(loop_pred);
}
```
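As a sanity check of the predicated form, the following Python sketch runs the masked loop over a few lanes and compares it against running the scalar loop once per lane. The bodies chosen for `f()`, `g()` and `h()` are invented purely for illustration (the source only says they modify `a` and/or `b`), and the sketch recomputes `loop_pred` at the end of each iteration, which the listing above leaves implicit.

```python
def scalar_lane(a, b):
    """The original scalar loop, for one lane."""
    while a > 2:
        if b < 5:
            b += 2      # hypothetical f()
        else:
            b -= 1      # hypothetical g()
        a -= 1          # hypothetical h(), guarantees termination
    return a, b

def predicated(a, b):
    """The same loop over all lanes at once, driven by predicate masks."""
    a, b = list(a), list(b)
    loop_pred = [x > 2 for x in a]
    while any(loop_pred):
        if_pred = [lp and y < 5 for lp, y in zip(loop_pred, b)]
        if any(if_pred):                      # skip f() when no bit set
            b = [y + 2 if p else y for y, p in zip(b, if_pred)]
        else_pred = [lp and not ip for lp, ip in zip(loop_pred, if_pred)]
        if any(else_pred):                    # skip g() when no bit set
            b = [y - 1 if p else y for y, p in zip(b, else_pred)]
        a = [x - 1 if p else x for x, p in zip(a, loop_pred)]
        loop_pred = [x > 2 for x in a]        # re-test the loop condition
    return a, b

a_out, b_out = predicated([5, 1, 3], [0, 9, 6])
lanes = [scalar_lane(x, y) for x, y in zip([5, 1, 3], [0, 9, 6])]
```

Note that `else_pred` is derived from `if_pred` as computed before `f()` ran, so a lane never executes both branches in the same iteration.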

which will end up as:

```
    # start from while loop test point
    b looptest
while_loop:
    sv.cmpi CR80.v, b.v, 5 # vector compare b into CR80 Vector
    sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
    # only calculate loop_pred & pred_b because needed in f()
    sv.crand CR80.v.SO, CR60.v.GT, CR80.v.LT # if = loop & pred_b
    f(CR80.v.SO)
skip_f:
    # illustrate inversion of pred_b. invert r30, test ALL
    # rather than SOME, but masked-out zero test would FAIL,
    # therefore masked-out instead is tested against 1 not 0
    sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
    # else = loop & ~pred_b, need this because used in g()
    sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.v.LT
    g(CR80.v.SO)
skip_g:
    # conditionally call h(r30) if any loop pred set
    sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
looptest:
    sv.cmpi CR60.v, a.v, 2 # vector compare a into CR60 vector
    sv.crweird r30, CR60.GT # transfer GT vector to r30
    sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
end:
```

### LRu example

This example shows why LRu is useful in a loop. Imagine the following
C code:

```
for (int i = 0; i < 8; i++) {
    if (x < y) break;
}
```

Under these circumstances, exiting from the loop is no longer based on
CTR alone: it has also become conditional on a CR result. It is therefore
desirable that NIA *and* LR only be modified if the conditions are met.

v3.0 pseudocode for `bclrl`:

```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
if LK then LR <-iea CIA + 4
```

The latter part for SVP64 `bclrl` becomes:

```
for i in 0 to VL-1:
    ...
    ...
    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
    lr_ok <- LK
    if ctr_ok & cond_ok then
        NIA <-iea LR[0:61] || 0b00
        if SVRMmode.LRu then lr_ok <- ¬lr_ok
    if lr_ok then LR <-iea CIA + 4
    # if NIA modified exit loop
```

The reason should be clear from this being a Vector loop:
unconditional destruction of LR when LK=1 makes `sv.bclrl` ineffective,
because the intention going into the loop is that the branch should be to
the copy of LR set at the *start* of the loop, not halfway through it.
However, if the change to LR only occurs when the branch is taken, it
becomes a useful instruction.
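The `lr_ok` decision in the SVP64 `bclrl` listing can be tabulated with a few lines of Python. This is a sketch that reads the pseudocode literally; `lr_written` is a hypothetical helper name, and `branch_taken` stands for `ctr_ok & cond_ok`.

```python
from itertools import product

def lr_written(lk, lru, branch_taken):
    """Return True when LR is overwritten with CIA+4 for this element."""
    lr_ok = lk
    # LRu only has an effect when the branch is actually taken
    if branch_taken and lru:
        lr_ok = not lr_ok
    return lr_ok

# full truth table, keyed by (LK, LRu, branch_taken)
TABLE = {
    (lk, lru, taken): lr_written(lk, lru, taken)
    for lk, lru, taken in product((False, True), repeat=3)
}
```

With LRu clear the result is simply LK, whether or not the branch is taken, which matches the scalar v3.0B behaviour shown earlier.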

The following pseudocode should **not** be implemented because it
violates the fundamental principle of SVP64, which is that SVP64 looping
is a thin wrapper around Scalar Instructions. The pseudocode below is
closer to an actual Vector ISA Branch and as such is not at all appropriate:

```
for i in 0 to VL-1:
    ...
    ...
    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
    if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
# only at the end of looping is LK checked.
# this completely violates the design principle of SVP64
# and would actually need to be a separate (scalar)
# instruction "set LR to CIA+4 but retrospectively"
# which is clearly impossible
if LK then LR <-iea CIA + 4
```

--------

\newpage{}

[[!tag standards]]