CPU cycles not only to run them but also to load the predicate
mask repeatedly for each one. 3D GPU ISAs can test for this scenario
and jump over the fully-masked-out operations, by spotting that
-all Conditions are zero.
+*all* Conditions are false. Or, conversely, they only call the function if at least
+one Condition is set.
Therefore, in order to be commercially competitive, `sv.bc` and
other Vector-aware Branch Conditional instructions are a high priority
for 3D GPU workloads.
The `BI` field of Branch Conditional operations is five bits, in scalar
-v3.0B this would select one bit of the 32 bit CR. In SVP64 there are
+v3.0B this would select one bit of the 32 bit CR,
+comprising eight CR Fields of 4 bits each. In SVP64 there are
16 32 bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
`BI` select the bit from the CR Field (EQ LT GT SO), and the top 3 bits
are extended to either scalar or vector and to select CR Fields 0..127
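As a minimal sketch of the scalar part of this split, the five bits of `BI` can be separated as shown here. The function name `split_bi` is hypothetical, and the SVP64 extension of the upper three bits to the range 0..127 is deliberately not modelled.

```python
def split_bi(bi):
    """Split a 5-bit BI value into (field_select, bit_select).

    Illustrative only: field_select is the raw 3-bit value *before*
    any SVP64 scalar/vector extension is applied.
    """
    if not 0 <= bi < 32:
        raise ValueError("BI is a 5-bit field")
    field_select = bi >> 2   # top 3 bits: which CR Field
    bit_select = bi & 0b11   # 2 LSBs: which of the 4 bits in that CR Field
    return field_select, bit_select


print(split_bi(0b10110))  # (5, 2)
```

The `bi & 0b11` expression is the same bit selection that appears as `BI & 0b11` in the pseudocode.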
NAND and NOR may be synthesised by
inverting `BO[2]` which just leaves two modes:
-* Branch takes place on the first CR test to succeed
+* Branch takes place on the first CR Field test to succeed
(a Great Big OR of all condition tests)
-* Branch takes place only if **all** CR tests succeed:
+* Branch takes place only if **all** CR Field tests succeed:
a Great Big AND of all condition tests
(including those where the predicate is masked out
and the corresponding CR Field is considered to be
set to `SNZ`)
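The two modes can be sketched as a reduction over per-element tests. This is a simplified model and not the specification's pseudocode: the names `element_test` and `vector_branch_condition` are hypothetical, early termination of the loop is not modelled, and the result for an empty set of tests (every element masked out with `sz=0`) is simply whatever Python's `any` and `all` return, which is an assumption.

```python
def element_test(bit, bo0, bo1):
    # v3.0B condition test: cond_ok <- BO[0] | not (CR bit ^ BO[1])
    return bool(bo0) or (bit == bo1)


def vector_branch_condition(bits, mask, bo0, bo1, all_mode, sz, snz):
    """bits: selected CR Field bit per element; mask: predicate bit per element."""
    results = []
    for bit, enabled in zip(bits, mask):
        if enabled:
            results.append(element_test(bit, bo0, bo1))
        elif sz:
            # masked-out element with sz set: SNZ stands in for the CR bit
            results.append(element_test(snz, bo0, bo1))
        # masked-out element with sz clear: not included in testing
    # ALL mode is a Great Big AND, otherwise a Great Big OR
    return all(results) if all_mode else any(results)


bits, mask = [1, 0, 1], [1, 0, 1]
print(vector_branch_condition(bits, mask, 0, 1, all_mode=True, sz=0, snz=0))  # True
print(vector_branch_condition(bits, mask, 0, 1, all_mode=True, sz=1, snz=0))  # False
print(vector_branch_condition(bits, mask, 0, 1, all_mode=True, sz=1, snz=1))  # True
```

In the second call the masked-out middle element contributes `SNZ=0`, which fails the test and therefore defeats the AND.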
-When the CR Fields selected by SVP64 Augmented `BI` is marked as scalar,
-then as usual the loop ends at the first element tested, after taking
-predication into consideration. Thus, as usual, when `sz` is zero, srcstep
+When the CR Field selected by SVP64-Augmented `BI` is marked as scalar,
+then, as the usual SVP64 rules apply,
+the loop ends at the first element tested, after taking
+predication into consideration. Thus, also as usual, when a predicate mask is
+given, `BI` is marked as scalar, and `sz` is zero, srcstep
skips forward to the first non-zero predicated element, and only that
one element is tested.
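A small sketch of that skipping behaviour, under stated assumptions: the predicate is held as an integer bitmask with bit 0 corresponding to element 0, the name `first_tested_element` is hypothetical, and returning `None` when no predicate bit is set within `VL` is a placeholder rather than specified behaviour.

```python
def first_tested_element(mask, vl):
    """Return the srcstep of the single element tested for scalar BI, sz=0."""
    for srcstep in range(vl):
        if (mask >> srcstep) & 1:
            return srcstep   # first non-zero predicated element
    return None              # assumption: nothing to test


print(first_tested_element(0b101000, vl=8))  # 3
```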
Vector Processing. The reason is to have strictly-defined guaranteed
behaviour*)
-In Vertical-First Mode, the `ALL` bit still applies, but to the elements
-that are executed up to the Hint length, in parallel batches. Contrast
-this with Horizontal-First Mode which tests elements from
-`0..VL-1`, Vertical-First tests elements `srcstep..MIN(srcstep+VFHint,VL-1)` See
-[[sv/setvl]] for the definition of Vertical-First Hint.
+In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
+behaviour. Given that only one element is being tested at a time
+in Vertical-First Mode, a test designed to be done on multiple
+bits is meaningless.
Predication in both INT and CR modes may be applied to `sv.bc` and other
SVP64 Branch Conditional operations, exactly as they may be applied to
other SVP64 operations. When `sz` is zero, any masked-out Branch-element
operations are not included in condition testing, exactly like all other
-SVP64 operations. This *includes* side-effects such as decrementing of
-CTR, which is also skipped on masked-out CR Field elements, when `sz`
-is zero.
-
-However when `sz` is non-zero, this normally requests insertion of a zero
+SVP64 operations, *including* side-effects such as potentially updating
+LR or CTR, which will also be skipped. There is *one* exception here,
+which is when
+`BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
+predicate mask bit is also zero:
+under these special circumstances CTR will also decrement.
+
+When `sz` is non-zero, this normally requests insertion of a zero
in place of the input data, when the relevant predicate mask bit is zero.
This would mean that a zero is inserted in place of `CR[BI+32]` for
testing against `BO`, which may not be desirable in all circumstances.
SVP64 RM `MODE` (includes `ELWIDTH` and `ELWIDTH_SRC` bits) for Branch
Conditional:
-| 4 | 5 | 6 | 7 | 19 | 20 | 21 | 22 23 | description |
-| - | - | - | - | -- | -- | --- |---------|-------------------- |
-|ALL|LRu| / | / | 0 | 0 | / | SNZ sz | normal mode |
-|ALL|LRu| / |VSb| 0 | 1 | VLI | SNZ sz | VLSET mode |
-|ALL|LRu|CVh| / | 1 | 0 | / | SNZ sz | CTR mode |
-|ALL|LRu|CVh|VSb| 1 | 1 | VLI | SNZ sz | CTR+VLSET mode |
+| 4 | 5 | 6 | 7 | 19 | 20 | 21 | 22 23 | description |
+| - | - | - | - | -- | -- | --- |---------|----------------- |
+|ALL|LRu| / | / | 0 | 0 | / | SNZ sz | normal mode |
+|ALL|LRu| / |VSb| 0 | 1 | VLI | SNZ sz | VLSET mode |
+|ALL|LRu|CTi| / | 1 | 0 | / | SNZ sz | CTR-test mode |
+|ALL|LRu|CTi|VSb| 1 | 1 | VLI | SNZ sz | CTR-test+VLSET mode |
Fields:
* **VSb** is most relevant for Vertical-First VLSET Mode. After testing,
if VSb is set, VL is truncated if the branch succeeds. If VSb is clear,
VL is truncated if the branch did **not** take place.
+* **CTi** CTR inversion. CTR Mode normally decrements per element
+ tested. CTR inversion decrements if a test *fails*.
-CTR mode will subtract VL (or VLHint) from CTR rather than just decrement
-CTR by one. Just as when v3.0B Branch-Conditional saves at
+Normally, CTR mode will decrement CTR once per Condition Test, so that
+under normal circumstances CTR reduces by up to VL in Horizontal-First
+Mode. Just as when v3.0B Branch-Conditional saves at
least one instruction on tight inner loops through auto-decrementation
of CTR, likewise it is also possible to save instruction count for
-SVP64 loops in both Vertical-First and Horizontal-First Mode.
-
-Note that, interestingly, due to the useful side-effects of `VLSET` mode
+SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
+in circumstances where there is conditional interaction between the
+element computation and testing, and the continuation (or otherwise)
+of a given loop. The potential combinations of interactions are why CTR
+testing options have been added.
+
+If both CTR-test and VLSET Modes are requested, then because the CTR
+decrement is on a per-element basis, the total amount that CTR is decremented
+by will end up being VL *after* truncation (should that occur). In
+other words, the order is (as can be seen in pseudocode, below):
+
+1. compute the test
+2. (optionally) decrement CTR
+3. (optionally) truncate VL
+4. decide (based on step 1) whether to terminate looping
+ (including not executing step 5)
+5. decide whether to branch.
+
+CTR-test mode and CTi interaction is as follows: note that
+`BO[2]` is still required to be clear for decrements to be
+considered.
+
+* **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
+ if `BO[2]` is zero. Masked-out elements when `sz=0` are
+ skipped.
+* **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
+ if `BO[2]` is zero and a masked-out element is skipped
+ (`sz=0` and predicate bit is zero). This one special case is the
+ **opposite** of other combinations.
+* **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
+ if `BO[2]` is zero and the Condition Test succeeds.
+ Masked-out elements when `sz=0` are skipped.
+* **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
+ if `BO[2]` is zero and the Condition Test *fails*.
+ Masked-out elements when `sz=0` are skipped.
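The four combinations above can be written as a single per-element decision function. This sketch follows the bullet list as written. The function name `ctr_decrements` and its parameters are hypothetical, and the second bullet is read literally, so that with `CTR-test=0, CTi=1` only skipped masked-out elements decrement CTR; that reading is an assumption.

```python
def ctr_decrements(bo2, ctr_test, cti, masked_out, cond_ok=False):
    """Decide whether CTR decrements for one element.

    masked_out: True when sz=0 and the element's predicate bit is zero.
    cond_ok:    result of the Condition Test (ignored when masked_out).
    """
    if bo2:
        return False                      # BO[2] set: no decrement considered
    if masked_out:
        # only the CTR-test=0, CTi=1 special case decrements on a skipped element
        return (not ctr_test) and bool(cti)
    if ctr_test:
        # CTi=0: decrement on success; CTi=1: decrement on failure
        return (not cond_ok) if cti else bool(cond_ok)
    # CTR-test=0: plain per-element decrement unless CTi selects the special case
    return not cti


print(ctr_decrements(False, True, True, masked_out=False, cond_ok=False))  # True
```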
+
+Note that, interestingly, due to the side-effects of `VLSET` mode
it is actually useful to use Branch Conditional even
to perform no actual branch operation, i.e to point to the instruction
-after the branch.
-If VLSET mode was requested with REMAP, VL will have been set to the
-length of one of the loop endpoints, as specified by the bit from
-the Branch `BI` field.
+after the branch. Truncation of VL would thus conditionally occur yet control
+flow alteration would not.
Also, the unconditional bit `BO[0]` is still relevant when Predication
is applied to the Branch because in `ALL` mode all nonmasked bits have
is used for explicit looping, where the looping is to terminate if the end
of the Vector, VL, is reached. If however that loop is terminated early
because VL is truncated, VLSET with Vertical-First becomes meaningless.
-Therefore, the option to decide whether truncation should occur if the
+Therefore, with `VSb`, the option to decide whether truncation should occur if the
branch succeeds *or* if the branch condition fails allows for flexibility
required.
mode, the Vector CR Field is to be overwritten or not: in some cases it
is useful to know but in others all that is needed is the branch itself.
+*Programming note: One important point is that SVP64 instructions are 64 bit
+(8 bytes not 4). This needs to be taken into consideration when computing
+branch offsets: the offset is relative to the start of the instruction,
+which includes the SVP64 Prefix.*
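To illustrate the programming note, here is a hedged sketch of computing the 14-bit `BD` field for a relative branch, with the offset measured from the address of the SVP64 Prefix. The helper name `bd_field` and the constant `SVP64_INSN_BYTES` are hypothetical. The 14-bit width of `BD` and the `BD || 0b00` concatenation are taken from the v3.0B Branch Conditional definition.

```python
SVP64_INSN_BYTES = 8  # 4-byte prefix plus 4-byte suffix


def bd_field(insn_addr, target_addr):
    """Return the 14-bit BD encoding; insn_addr is the address of the prefix."""
    offset = target_addr - insn_addr
    if offset % 4:
        raise ValueError("branch target must be word-aligned")
    bd = offset >> 2                       # EXTS(BD || 0b00) restores the offset
    if not -(1 << 13) <= bd < (1 << 13):
        raise ValueError("offset does not fit in 14-bit BD")
    return bd & 0x3FFF                     # two's complement, 14 bits


# Branching to the instruction immediately after an sv.bc at 0x1000:
print(bd_field(0x1000, 0x1000 + SVP64_INSN_BYTES))  # 2
```

Using 4 instead of 8 for the instruction length here would point the branch at the suffix word of the same instruction rather than at the following instruction.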
+
Pseudocode for Horizontal-First Mode:
```
testbit = CRbits[BI & 0b11]
# testbit = CR[BI+32+srcstep*4]
else if not SVRMmode.sz:
+ # inverted CTR test skip mode
+ if ¬BO[2] & CTRtest & ¬CTi then
+ CTR = CTR - 1
continue
else
testbit = SVRMmode.SNZ
CR{SVCRf+srcstep} = CRbits
testbit = CRbits[BI & 0b11]
else if not SVRMmode.sz:
+ # inverted CTR test skip mode
+ if ¬BO[2] & CTRtest & ¬CTi then
+ CTR = CTR - 1
SVSTATE.srcstep = new_srcstep
exit # no branch testing
else
SVSTATE.VL = new_srcstep
```
-v3.0B branch pseudocode including LRu
+v3.0B branch pseudocode including LRu and CTR skipping
```
if (mode_is_64bit) then M <- 0
else M <- 32
-if ¬BO[2] then CTR <- CTR - 1
-ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
+ctrdec <- ¬BO[2]
+if CTRtest & (cond_ok ^ CTi) then
+ ctrdec <- 0b0
+if ctrdec then CTR <- CTR - 1
+ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
lr_ok <- SVRMmode.LRu
if ctr_ok & cond_ok then
if AA then NIA <-iea EXTS(BD || 0b00)