# SVP64 Branch Conditional behaviour
+**DRAFT STATUS**
+
+Please note: SVP64 Branch instructions should be
+considered completely separate and distinct from
+standard scalar OpenPOWER-approved v3.0B branches.
+**v3.0B branches are in no way impacted, altered,
+changed or modified in any way, shape or form by
+the SVP64 Vectorised Variants**.
+
Links
* <https://bugs.libre-soc.org/show_bug.cgi?id=664>
* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html>
* [[openpower/isa/branch]]
-Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test a Condition Register.
-When doing so in a Vector Context, it is quite reasonable and logical to test a *Vector* of
-CR Fields. In 3D Shader binaries, which are inherently parallelised
-and predicated, testing all or some results and branching based on
-multiple tests is extremely common.
-Therefore, `sv.bc` and other Vector-aware Branch Conditional instructions are worth
-including.
-
-The `BI` field of Branch Conditional operations is five bits,
-in scalar v3.0B this would select one bit of the 32 bit CR.
-In SVP64 there are 16 32 bit CRs, containing 128 4-bit CR Fields.
-Therefore, the 2 LSBs of `BI` select the bit from the CR Field, and the
-top 3 bits are extended to either scalar or vector and to
-select CR Fields 0..127 as specified
-in SVP64 [[sv/svp64/appendix]]
-
-When considering an "array" of branches, there are two useful modes:
-
-* Branch takes place on the first CR test to succeed
+Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test a
+Condition Register. For parallel processing, however, it is impossible
+to perform multiple independent branches: the Program Counter
+cannot branch to multiple destinations based on multiple conditions.
+The best that can be done is
+to test multiple Conditions and decide on a *single* branch,
+based on analysis of a *Vector* of CR Fields
+which have just been calculated from a *Vector* of results.
+
+In 3D Shader
+binaries, which are inherently parallelised and predicated, testing all or
+some results and branching based on multiple tests is extremely common,
+and a fundamental part of Shader Compilers. Example:
+without such multi-condition
+test-and-branch, if a predicate mask is all zeros a large batch of
+instructions may be masked out to `nop`, and it would waste
+CPU cycles not only to run them but also to load the predicate
+mask repeatedly for each one. 3D GPU ISAs can test for this scenario
+and jump over the fully-masked-out operations, by spotting that
+*all* Conditions are false. Or, conversely, they only call the function if at least
+one Condition is set.
+Therefore, in order to be commercially competitive, `sv.bc` and
+other Vector-aware Branch Conditional instructions are a high priority
+for 3D GPU workloads.
+
+The `BI` field of Branch Conditional operations is five bits. In scalar
+v3.0B this selects one bit of the 32 bit CR,
+comprising eight CR Fields of 4 bits each. In SVP64 there are
+16 32 bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
+`BI` select the bit from the CR Field (LT GT EQ SO), and the top 3 bits
+are extended to either scalar or vector and to select CR Fields 0..127
+as specified in SVP64 [[sv/svp64/appendix]].
+
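A minimal Python sketch of the split described above: the two LSBs of `BI` pick the bit inside a CR Field, and the top three bits form the field selector that SVP64 then extends. The name `split_bi` and the `CR_BIT_NAMES` table are illustrative assumptions, not part of the specification; the extension of the selector to CR Fields 0..127 is defined in the appendix and is not modelled here.

```python
# Illustrative only: these names do not appear in the specification.
CR_BIT_NAMES = ("LT", "GT", "EQ", "SO")  # bit order within one 4-bit CR Field


def split_bi(bi):
    """Split a 5-bit BI into (3-bit field selector, 2-bit bit index)."""
    if not 0 <= bi < 32:
        raise ValueError("BI is a 5-bit field")
    return bi >> 2, bi & 0b11


# Example: BI = 0b01110 selects bit index 2 (EQ) of field selector 3.
field_sel, bit_idx = split_bi(0b01110)
print(field_sel, CR_BIT_NAMES[bit_idx])
```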
+When considering an "array" of branch-tests, there are four useful modes:
+AND, OR, NAND and NOR of all Conditions.
+NAND and NOR may be synthesised by
+inverting `BO[1]`, which just leaves two modes:
+
+* Branch takes place on the first CR Field test to succeed
(a Great Big OR of all condition tests)
-* Branch takes place only if **all** CR tests succeed:
+* Branch takes place only if **all** CR field tests succeed:
a Great Big AND of all condition tests
(including those where the predicate is masked out
and the corresponding CR Field is considered to be
set to `SNZ`)
-In Vertical-First Mode, the `ALL` bit should
-not be used. If set, behaviour is `UNDEFINED`.
-(*The reason is that Vertical-First hints may permit
-multiple elements up to hint length to be executed
-in parallel, however the number is entirely up to
-implementors. Attempting to test an arbitrary
-indeterminate number of Conditional tests is impossible
-to define, and efforts to enforce such defined behaviour
-interfere with Vertical-First mode parallel
-opportunistic behaviour.*)
-
-In `svstep` mode,
-the whole CR Field, part of which is
-selected by `BI` (top 3 bits) is updated based on
-incrementing srcstep and dststep, and performing the
-same tests as [[sv/svstep]], following which the Branch
-Conditional instruction proceeds as normal (reading
-and testing the CR bit just updated, if the relevant
-`BO` bit is set). Note that the SVSTATE fields
-are still updated, and the CR field still updated,
-even if the `BO` bits do not require CR testing.
-
-Predication in both INT and CR modes may be applied to
-`sv.bc` and other SVP64 Branch Conditional operations,
-exactly as they may be applied to other SVP64 operations.
-When `sz` is zero, any masked-out Branch-element operations
-are not executed, exactly like all other SVP64
-operations.
-
-However when `sz` is non-zero, this normally requests insertion
-of a zero in place of the input data, when the relevant predicate
-mask bit is zero. This would mean that a zero is inserted in
-place of `CR[BI+32]` for testing against `BO`, which may not
-be desirable in all circumstances. Therefore, an extra field
-is provided `SNZ`, which, if set, will insert a **one** in
-place of a masked-out element instead of a zero.
-
-(*Note: Both options are provided because it is useful to
-deliberately cause the Branch-Conditional Vector testing
-to fail at a specific point, controlled by the Predicate
-mask. This is particularly useful in `VLSET` mode, which
-will truncate SVSTATE.VL at the point of the first failed
+When the CR Field selected by SVP64-Augmented `BI` is marked as scalar,
+then, as the usual SVP64 rules apply,
+the loop ends at the first element tested, after taking
+predication into consideration. Thus, also as usual, when a predicate mask is
+given, and `BI` marked as scalar, and `sz` is zero, srcstep
+skips forward to the first non-zero predicated element, and only that
+one element is tested.
+
+In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
+AND) results in early exit: no more updates to CTR occur (if requested);
+no branch occurs, and LR is not updated (if requested). Likewise for
+non-ALL mode (Great Big OR) on first success early exit also occurs,
+however this time with the Branch proceeding. In both cases the testing
+of the Vector of CRs should be done in linear sequential order (or in
+REMAP re-sequenced order): such that tests that are sequentially beyond
+the exit point are *not* carried out. (*Note: it is standard practice in
+Programming languages to exit early from conditional tests, however
+a little unusual to consider in an ISA that is designed for Parallel
+Vector Processing. The reason is to have strictly-defined guaranteed
+behaviour*)
+
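The early-exit rule can be sketched in Python as follows. This is only an illustration of the ordering guarantee, assuming the per-element test results are already available as a list of booleans; `reduce_tests` is a hypothetical name, and CTR, LR, predication and REMAP are all left out.

```python
def reduce_tests(results, all_mode):
    """Combine per-element results strictly in order, stopping early.

    Returns (cond_ok, n_tested), where n_tested counts how many
    elements were actually examined before the exit point.
    """
    cond_ok = all_mode  # identity value: True for AND, False for OR
    n_tested = 0
    for el_ok in results:
        n_tested += 1
        if all_mode:
            cond_ok = cond_ok and el_ok
            if not el_ok:
                break  # first failure decides ALL mode: no branch
        else:
            cond_ok = cond_ok or el_ok
            if el_ok:
                break  # first success decides non-ALL mode: branch
    return cond_ok, n_tested


print(reduce_tests([True, False, True], all_mode=True))
print(reduce_tests([False, True, True], all_mode=False))
```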
+In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
+behaviour. Given that only one element is being tested at a time
+in Vertical-First Mode, a test designed to be done on multiple
+bits is meaningless.
+
+Predication in both INT and CR modes may be applied to `sv.bc` and other
+SVP64 Branch Conditional operations, exactly as they may be applied to
+other SVP64 operations. When `sz` is zero, any masked-out Branch-element
+operations are not included in condition testing, exactly like all other
+SVP64 operations, *including* side-effects such as potentially updating
+LR or CTR, which will also be skipped. There is *one* exception here,
+which is when
+`BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
+predicate mask bit is also zero:
+under these special circumstances CTR will also decrement.
+
+When `sz` is non-zero, this normally requests insertion of a zero
+in place of the input data, when the relevant predicate mask bit is zero.
+This would mean that a zero is inserted in place of `CR[BI+32]` for
+testing against `BO`, which may not be desirable in all circumstances.
+Therefore, an extra field is provided `SNZ`, which, if set, will insert
+a **one** in place of a masked-out element, instead of a zero.
+
+(*Note: Both options are provided because it is useful to deliberately
+cause the Branch-Conditional Vector testing to fail at a specific point,
+controlled by the Predicate mask. This is particularly useful in `VLSET`
+mode, which will truncate SVSTATE.VL at the point of the first failed
test.*)
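
A small Python sketch of the `sz` / `SNZ` choice, under the assumption that a skipped element can be represented by `None`. The function name `select_testbit` and its arguments are hypothetical and only mirror the prose above.

```python
def select_testbit(cr_bit, pred_bit, sz, snz):
    """Return the bit to test against BO, or None if the element is skipped."""
    if pred_bit:
        return cr_bit  # element is active: test the real CR bit
    if not sz:
        return None    # masked out with sz=0: element is not tested at all
    return snz         # masked out with sz=1: substitute SNZ (0 or 1)


# Masked-out element with sz=1: SNZ decides what the test sees.
print(select_testbit(cr_bit=1, pred_bit=0, sz=1, snz=0))
print(select_testbit(cr_bit=0, pred_bit=0, sz=1, snz=1))
```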
-SVP64 RM `MODE` for Branch Conditional:
+SVP64 RM `MODE` (includes `ELWIDTH` and `ELWIDTH_SRC` bits) for Branch
+Conditional:
-| 0-1 | 2 | 3 4 | description |
-| --- | --- |---------|-------------------------- |
-| 00 | SNZ | ALL sz | normal mode |
-| 01 | VLI | ALL sz | VLSET mode |
-| 10 | SNZ | ALL sz | svstep mode |
-| 11 | VLI | ALL sz | svstep VLSET mode, in Horizontal-First |
-| 11 | VLI | SNZ sz | svstep VLSET mode, in Vertical-First |
+| 4 | 5 | 6 | 7 | 19 | 20 | 21 | 22 23 | description |
+| - | - | - | - | -- | -- | --- |---------|----------------- |
+|ALL|LRu| / | / | 0 | 0 | / | SNZ sz | normal mode |
+|ALL|LRu| / |VSb| 0 | 1 | VLI | SNZ sz | VLSET mode |
+|ALL|LRu|CTi| / | 1 | 0 | / | SNZ sz | CTR-test mode |
+|ALL|LRu|CTi|VSb| 1 | 1 | VLI | SNZ sz | CTR-test+VLSET mode |
Fields:
-* **sz** if predication is enabled will put 4 copies of `SNZ` in place of the src CR Field when the predicate bit is zero. otherwise the element is ignored or skipped, depending on context.
+* **sz** if predication is enabled will put 4 copies of `SNZ` in place of
+ the src CR Field when the predicate bit is zero. Otherwise the element
+ is ignored or skipped, depending on context.
* **ALL** when set, all branch conditional tests must pass in order for
-the branch to succeed.
-* **VLI** In VLSET mode, VL is set equal (truncated) to the first branch
-which succeeds. If VLI (Vector Length Inclusive) is clear, VL is truncated
-to *exclude* the current element, otherwise it is included. SVSTATE.MVL is not changed.
-
-svstep mode will run an increment of SVSTATE srcstep and dststep
-(which is still useful in Horizontal First Mode). Unlike `svstep.` however
-which updates only CR0 with the testing of REMAP loop progress,
-the CR Field is taken from the branch `BI` field, and updated
-prior to proceeding to each element branch conditional testing.
-
-Note that, interestingly, due to the useful side-effects of `VLSET` mode
-and `svstep` mode it is actually useful to use Branch Conditional even
-to perform no actual branch operation, i.e to point to the instruction
-after the branch.
+ the branch to succeed. When clear, it is the first sequentially
+ encountered successful test that causes the branch to succeed.
+* **VLI** VLSET is identical to Data-dependent Fail-First mode.
+ In VLSET mode, VL is set equal (truncated) to the first point
+ where, assuming Conditions are tested sequentially, the branch succeeds
+ *or fails* depending on whether VSb is set.
+ If VLI (Vector Length Inclusive) is clear,
+ VL is truncated to *exclude* the current element, otherwise it is
+ included. SVSTATE.MVL is not changed: only VL.
+* **LRu**: Link Register Update. When set, Link Register will
+ only be updated if the Branch Condition succeeds. This avoids
+ destruction of LR during loops (particularly Vertical-First
+ ones).
+* **VSb** is most relevant for Vertical-First VLSET Mode. After testing,
+ if VSb is set, VL is truncated if the branch succeeds. If VSb is clear,
+ VL is truncated if the branch did **not** take place.
+* **CTi** CTR inversion. CTR Mode normally decrements per element
+ tested. CTR inversion decrements if a test *fails*.
+
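As a rough illustration of how `VLI` and `VSb` interact in VLSET mode, the Python sketch below scans per-element results in order and returns the truncated VL. The function name `vlset_truncate` and its arguments are hypothetical, and the ALL / non-ALL early exit is deliberately left out to keep the example short.

```python
def vlset_truncate(results, vl, vsb, vli):
    """Return the new VL after a VLSET scan over per-element results.

    Truncation happens at the first element whose test result equals VSb.
    With VLI set the current element is included, otherwise excluded.
    """
    for srcstep in range(vl):
        if results[srcstep] == vsb:
            return srcstep + 1 if vli else srcstep
    return vl  # no element triggered truncation: VL is unchanged


results = [True, True, False, True]
print(vlset_truncate(results, 4, vsb=False, vli=False))
print(vlset_truncate(results, 4, vsb=False, vli=True))
```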
+Normally, CTR mode will decrement once per Condition Test, so that
+under normal circumstances CTR reduces by up to VL in Horizontal-First
+Mode. Just as v3.0B Branch-Conditional saves at
+least one instruction on tight inner loops through auto-decrementation
+of CTR, likewise it is also possible to save instruction count for
+SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
+in circumstances where there is conditional interaction between the
+element computation and testing, and the continuation (or otherwise)
+of a given loop. The potential combinations of interactions are why CTR
+testing options have been added.
+
+If both CTR-test and VLSET Modes are requested, then because the CTR
+decrement is on a per element basis, the total amount that CTR is decremented
+by will end up being VL *after* truncation (should that occur). In
+other words, the order is (as can be seen in pseudocode, below):
+
+1. compute the test
+2. (optionally) decrement CTR
+3. (optionally) truncate VL
+4. decide (based on step 1) whether to terminate looping
+ (including not executing step 5)
+5. decide whether to branch.
-In particular, svstep mode is still useful for Horizontal-First Mode
-particularly in combination with REMAP. All "loop end" conditions
-will be tested on a per-element basis and placed into a Vector of
-CRs starting from the point specified by the Branch `BI` field.
-This Vector of CR Fields may then be subsequently used as a Predicate
-Mask, and, furthermore, if VLSET mode was requested, VL will have
-been set to the length of one of the loop endpoints, again as specified
-by the bit from the Branch `BI` field.
+CTR-test mode and CTi interaction is as follows: note that
+`BO[2]` is still required to be clear for decrements to be
+considered.
+
+* **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
+ if `BO[2]` is zero. Masked-out elements when `sz=0` are
+ skipped.
+* **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
+ if `BO[2]` is zero and a masked-out element is skipped
+ (`sz=0` and predicate bit is zero). This one special case is the
+ **opposite** of other combinations.
+* **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
+ if `BO[2]` is zero and the Condition Test succeeds.
+ Masked-out elements when `sz=0` are skipped.
+* **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
+ if `BO[2]` is zero and the Condition Test *fails*.
+ Masked-out elements when `sz=0` are skipped.
+
+Note that, interestingly, due to the side-effects of `VLSET` mode
+it is actually useful to use Branch Conditional even
+to perform no actual branch operation, i.e. to point to the instruction
+after the branch. Truncation of VL would thus conditionally occur yet control
+flow alteration would not.
Also, the unconditional bit `BO[0]` is still relevant when Predication
is applied to the Branch because in `ALL` mode all nonmasked bits have
may still be decremented by the total number of nonmasked elements.
In short, Vectorised Branch becomes an extremely powerful tool.
+`VLSET` mode with Vertical-First is particularly unusual. Vertical-First
+is used for explicit looping, where the looping is to terminate if the end
+of the Vector, VL, is reached. If however that loop is terminated early
+because VL is truncated, VLSET with Vertical-First becomes meaningless.
+Therefore, with `VSb`, the option to decide whether truncation should occur if the
+branch succeeds *or* if the branch condition fails provides the flexibility
+required.
+
+`VLSET` mode with Horizontal-First when `VSb` is clear is still
+useful, because it can be used to truncate VL to the first predicated
+(non-masked-out) element.
+
Available options to combine:
* `BO[0]` to make an unconditional branch would seem irrelevant if
* `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
`OR` of all tests, respectively.
-Pseudocode for Horizontal-First Mode:
+In addition to the above, it is necessary to select whether, in `svstep`
+mode, the Vector CR Field is to be overwritten or not: in some cases it
+is useful to know but in others all that is needed is the branch itself.
-```
-
- cond_ok = not SVRMmode.ALL
- for srcstep in range(VL):
- new_srcstep, CRbits = SVSTATE_NEXT(srcstep)
- # select predicate bit or zero/one
- if predicate[srcstep]:
- # get SVP64 extended CR field 0..127
- SVCRf = SVP64EXTRA(BI>>2)
- CR{SVCRf+srcstep} = CRbits
- testbit = CRbits[BI & 0b11]
- # testbit = CR[BI+32+srcstep*4]
- else if not SVRMmode.sz:
- continue
- else
- testbit = SVRMmode.SNZ
- # actual element test here
- el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
- # merge in the test
- if SVRMmode.ALL:
- cond_ok &= el_cond_ok
- else
- cond_ok |= el_cond_ok
- # test for VL to be set (and exit)
- if ~el_cond_ok and VLSET
- if SVRMmode.VLI
- SVSTATE.VL = srcstep+1
- else
- SVSTATE.VL = srcstep
- break
- # early exit?
- if SVRMmode.ALL:
- if ~el_cond_ok:
- break
- else
- if el_cond_ok:
- break
-```
+*Programming note: One important point is that SVP64 instructions are 64 bit
+(8 bytes, not 4). This needs to be taken into consideration when computing
+branch offsets: the offset is relative to the start of the instruction,
+which includes the SVP64 Prefix.*
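
To make the programming note concrete, here is a hedged Python sketch of the offset calculation. The helper `branch_offset` and the example addresses are assumptions for illustration; the only facts taken from the note are that an SVP64 instruction occupies 8 bytes and that the offset is measured from the start of the instruction, prefix included.

```python
SVP64_INSN_BYTES = 8  # 4-byte SVP64 prefix + 4-byte suffix


def branch_offset(insn_start, target):
    """Byte offset of a relative branch, measured from the start of the
    branch instruction (for SVP64 this is the address of the prefix)."""
    offset = target - insn_start
    if offset % 4:
        raise ValueError("branch targets are word-aligned")
    return offset


# A sv.bc whose prefix sits at 0x1000 and which targets the instruction
# immediately following it needs an offset of 8, not 4.
print(branch_offset(0x1000, 0x1000 + SVP64_INSN_BYTES))
```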
-Pseudocode for Vertical-First Mode:
+Pseudocode for Horizontal-First Mode:
```
- new_srcstep, CRbits = SVSTATE_NEXT(srcstep)
+cond_ok = SVRMmode.ALL # start true for AND (ALL), false for OR
+for srcstep in range(VL):
# select predicate bit or zero/one
if predicate[srcstep]:
# get SVP64 extended CR field 0..127
SVCRf = SVP64EXTRA(BI>>2)
- CR{SVCRf+srcstep} = CRbits
+ CRbits = CR{SVCRf}
testbit = CRbits[BI & 0b11]
+ # testbit = CR[BI+32+srcstep*4]
else if not SVRMmode.sz:
- SVSTATE.srcstep = new_srcstep
- exit # no branch testing
+ # inverted CTR test skip mode
+ if ¬BO[2] & CTRtest & ¬CTi then
+ CTR = CTR - 1
+ continue
else
testbit = SVRMmode.SNZ
# actual element test here
- cond_ok <- BO[0] | ¬(testbit ^ BO[1])
+ el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
+ # merge in the test
+ if SVRMmode.ALL:
+ cond_ok &= el_cond_ok
+ else
+ cond_ok |= el_cond_ok
# test for VL to be set (and exit)
- if ~cond_ok and VLSET
+ if VLSET and VSb = el_cond_ok then
if SVRMmode.VLI
- SVSTATE.VL = new_srcstep+1
+ SVSTATE.VL = srcstep+1
else
- SVSTATE.VL = new_srcstep
+ SVSTATE.VL = srcstep
+ break
+ # early exit?
+ if SVRMmode.ALL:
+ if ~el_cond_ok:
+ break
+ else
+ if el_cond_ok:
+ break
+ if SVCRf.scalar:
+ break
+```
+
+Pseudocode for Vertical-First Mode:
+
+```
+# get SVP64 extended CR field 0..127
+SVCRf = SVP64EXTRA(BI>>2)
+CRbits = CR{SVCRf}
+# select predicate bit or zero/one
+if predicate[srcstep]:
+ if BRc = 1 then # CR0 vectorised
+ CR{SVCRf+srcstep} = CRbits
+ testbit = CRbits[BI & 0b11]
+else if not SVRMmode.sz:
+ # inverted CTR test skip mode
+ if ¬BO[2] & CTRtest & ¬CTi then
+ CTR = CTR - 1
SVSTATE.srcstep = new_srcstep
+ exit # no branch testing
+else
+ testbit = SVRMmode.SNZ
+# actual element test here
+cond_ok <- BO[0] | ¬(testbit ^ BO[1])
+# test for VL to be set (and exit)
+if VLSET and cond_ok = VSb then
+ if SVRMmode.VLI
+ SVSTATE.VL = new_srcstep+1
+ else
+ SVSTATE.VL = new_srcstep
+```
+
+v3.0B branch pseudocode including LRu and CTR skipping:
+
+```
+if (mode_is_64bit) then M <- 0
+else M <- 32
+cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
+ctrdec = ¬BO[2]
+if CTRtest & (cond_ok ^ CTi) then
+ ctrdec = 0b0
+if ctrdec then CTR <- CTR - 1
+ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
+lr_ok <- SVRMmode.LRu
+if ctr_ok & cond_ok then
+ if AA then NIA <-iea EXTS(BD || 0b00)
+ else NIA <-iea CIA + EXTS(BD || 0b00)
+ lr_ok <- 0b1
+if LK & lr_ok then LR <-iea CIA + 4
+```
+
+# Example Shader code
+
+```
+while(a > 2) {
+ if(b < 5)
+ f();
+ else
+ g();
+ h();
+}
+```
+
+which compiles to something like:
+
+```
+vec<i32> a, b;
+// ...
+pred loop_pred = a > 2;
+while(loop_pred.any()) {
+ pred if_pred = loop_pred & (b < 5);
+ if(if_pred.any()) {
+ f(if_pred);
+ }
+label1:
+ pred else_pred = loop_pred & ~if_pred;
+ if(else_pred.any()) {
+ g(else_pred);
+ }
+ h(loop_pred);
+}
+```
+
+which will end up as:
+
+```
+ sv.cmpi CR60.v, a.v, 2 # vector compare a into CR60 vector
+ sv.crweird r30, CR60.GT # transfer GT vector to r30
+while_loop:
+ sv.cmpi CR80.v, b.v, 5 # vector compare b into CR80 Vector
+ sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
+ # only calculate loop_pred & pred_b because needed in f()
+ sv.crand CR80.v.SO, CR60.v.GT, CR80.v.LT # if = loop & pred_b
+ f(CR80.v.SO)
+skip_f:
+ # illustrate inversion of pred_b. invert r30, test ALL
+ # rather than SOME, but masked-out zero test would FAIL,
+ # therefore masked-out instead is tested against 1 not 0
+ sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
+ # else = loop & ~pred_b, need this because used in g()
+ sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.v.LT
+ g(CR80.v.SO)
+skip_g:
+ # conditionally call h(r30) if any loop pred set
+ sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
+ sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
```