CPU cycles not only to run them but also to load the predicate
mask repeatedly for each one. 3D GPU ISAs can test for this scenario
and jump over the fully-masked-out operations, by spotting that
-all Conditions are zero.
+*all* Conditions are false. Or, conversely, they only call the function if at least
+one Condition is set.
Therefore, in order to be commercially competitive, `sv.bc` and
other Vector-aware Branch Conditional instructions are a high priority
for 3D GPU workloads.
NAND and NOR may be synthesised by
inverting `BO[2]` which just leaves two modes:
-* Branch takes place on the first CR test to succeed
+* Branch takes place on the first CR Field test to succeed
(a Great Big OR of all condition tests)
-* Branch takes place only if **all** CR tests succeed:
+* Branch takes place only if **all** CR field tests succeed:
a Great Big AND of all condition tests
(including those where the predicate is masked out
and the corresponding CR Field is considered to be
set to `SNZ`)
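
The two modes can be sketched in Python as a reading aid. This is a minimal model, not the specification's pseudocode: the function name `branch_taken` and its parameters are hypothetical, and it assumes that masked-out elements are skipped when `sz` is zero and are replaced by the `SNZ` value when `sz` is one.

```python
def branch_taken(tests, all_mode, predicate=None, sz=0, snz=0):
    """Model of the ANY ("Great Big OR") and ALL ("Great Big AND") modes.

    tests     -- per-element CR Field test results (True = test succeeded)
    all_mode  -- True for ALL mode, False for ANY mode
    predicate -- integer bitmask, bit i enables element i; None = unpredicated
    sz, snz   -- assumed behaviour: with sz=1 a masked-out element
                 contributes the value of snz instead of its own test
    """
    results = []
    for i, test in enumerate(tests):
        if predicate is None or (predicate >> i) & 1:
            results.append(bool(test))   # element enabled: use its real test
        elif sz:
            results.append(bool(snz))    # masked out, sz=1: substitute SNZ
        # masked out, sz=0: assumed to be skipped entirely
    # The case where every element is skipped is not addressed by this sketch.
    return all(results) if all_mode else any(results)


# ALL mode fails on the masked-out middle element only when SNZ=0 is substituted
print(branch_taken([True, False, True], True, predicate=0b101, sz=1, snz=0))
print(branch_taken([True, False, True], True, predicate=0b101, sz=1, snz=1))
```

The model evaluates every element before deciding. An implementation of ANY mode could stop at the first successful test and reach the same decision.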
When the CR Fields selected by SVP64-Augmented `BI` is marked as scalar,
-then as usual the loop ends at the first element tested, after taking
+then, as per the usual SVP64 rules,
+the loop ends at the first element tested, after taking
predication into consideration. Thus, also as usual, when a predicate mask is
given, and `BI` marked as scalar, and `sz` is zero, srcstep
skips forward to the first non-zero predicated element, and only that
| - | - | - | - | -- | -- | --- |---------|----------------- |
|ALL|LRu| / | / | 0 | 0 | / | SNZ sz | normal mode |
|ALL|LRu| / |VSb| 0 | 1 | VLI | SNZ sz | VLSET mode |
-|ALL|LRu| / | / | 1 | 0 | / | SNZ sz | CTR mode |
-|ALL|LRu| / |VSb| 1 | 1 | VLI | SNZ sz | CTR+VLSET mode |
+|ALL|LRu|Csk| / | 1 | 0 | / | SNZ sz | CTR mode |
+|ALL|LRu|Csk|VSb| 1 | 1 | VLI | SNZ sz | CTR+VLSET mode |
Fields:
* **VSb** is most relevant for Vertical-First VLSET Mode. After testing,
if VSb is set, VL is truncated if the branch succeeds. If VSb is clear,
VL is truncated if the branch did **not** take place.
+* **Csk** CTR skipping. CTR Mode normally subtracts VL from CTR.
+ Csk refines which elements are counted in that subtraction.
-CTR mode will subtract VL from CTR rather than just decrement
+Normally, CTR mode will subtract VL from CTR rather than just decrement
CTR by one. Just as when v3.0B Branch-Conditional saves at
least one instruction on tight inner loops through auto-decrementation
of CTR, likewise it is also possible to save instruction count for
at a time, standard single (v3.0B) CTR decrementing should
correspondingly be used instead.
-If CTR+VLSET Modes are requested, the amount that CTR is decremented
+If both CTR and VLSET Modes are requested, the amount that CTR is decremented
by is the value of VL *after* truncation (should that occur).
+CTR Skipping (Csk) combines with sz and predication to give a number of options, which need explaining:
+
+* **Standard SVP64 CTR Mode**: Csk=0, sz=0, no predicate specified.
+ VL will be subtracted from CTR (as already explained above).
+* **Predicated CTR Mode**: Csk=1, predicate is specified.
+ Masked-out elements are *not included* in the
+ count subtracted from CTR. If VL=3 but the predicate mask
+ is 0b101 and all CR Field Conditions are tested then CTR
+ will be reduced by two, *not* three (because only 2 predicate
+ mask bits are enabled). This also applies when sz=1.
+* **Non-predicated CTR Skip Mode**: Csk=1, sz=0, no
+ predicate specified.
+ Only those elements which pass the Condition Test (in
+ either ALL or ANY mode) are counted in the amount subtracted from CTR.
+* **Non-predicated CTR Skip inverted**: Csk=1, sz=1,
+ no predicate specified.
+ Only those elements which **fail** the Condition
+ Test are counted in the amount subtracted from CTR.
+
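The four cases listed above can be checked with a small Python model. It is a sketch under stated assumptions: the function name `ctr_decrement` and its parameters are hypothetical, every element up to VL is assumed to be tested (no early exit, no VLSET truncation), and any combination the list does not describe is rejected and left undefined.

```python
def ctr_decrement(vl, csk, sz, predicate=None, tests=None):
    """Amount subtracted from CTR, for the four cases listed above.

    vl        -- Vector Length (assumed: all vl elements are tested)
    csk, sz   -- mode bits, 0 or 1
    predicate -- integer bitmask (bit i enables element i), or None
    tests     -- per-element Condition Test results, needed only for the
                 non-predicated skip modes
    """
    if csk == 0:
        # Standard SVP64 CTR Mode: the list only covers sz=0, no predicate.
        if sz != 0 or predicate is not None:
            raise ValueError("combination not described by the list")
        return vl
    if predicate is not None:
        # Predicated CTR Mode: count enabled mask bits, for sz=0 and sz=1.
        mask = predicate & ((1 << vl) - 1)
        return bin(mask).count("1")
    if tests is None or len(tests) != vl:
        raise ValueError("need one Condition Test result per element")
    passed = sum(1 for t in tests if t)
    # sz=0 counts passing elements; sz=1 (inverted) counts failing ones.
    return passed if sz == 0 else vl - passed


# The worked example from the text: VL=3, predicate mask 0b101
print(ctr_decrement(3, csk=1, sz=0, predicate=0b101))  # 2
```

Passing the same `tests` list with `sz=0` and then `sz=1` gives two counts that always sum to VL, which is a quick sanity check on the inverted mode.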
Note that, interestingly, due to the side-effects of `VLSET` mode
it is actually useful to use Branch Conditional even
to perform no actual branch operation, i.e to point to the instruction