* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html>
* [[openpower/isa/branch]]
+# Rationale
+
Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test a
Condition Register. However for parallel processing it is simply impossible
to perform multiple independent branches: the Program Counter simply
mask repeatedly for each one. 3D GPU ISAs can test for this scenario
and jump over the fully-masked-out operations, by spotting that
*all* Conditions are false. Or, conversely, they only call the function if at least
-one Condition) is set.
+one Condition is set.
Therefore, in order to be commercially competitive, `sv.bc` and
other Vector-aware Branch Conditional instructions are a high priority
for 3D GPU workloads.
-The `BI` field of Branch Conditional operations is five bits, in scalar
-v3.0B this would select one bit of the 32 bit CR,
-comprising eight CR Fields of 4 bits each. In SVP64 there are
-16 32 bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
-`BI` select the bit from the CR Field (EQ LT GT SO), and the top 3 bits
-are extended to either scalar or vector and to select CR Fields 0..127
-as specified in SVP64 [[sv/svp64/appendix]].
+Given that Power ISA v3.0B is already quite powerful, particularly
+the Condition Registers and their interaction with Branches, there
+are opportunities to create an extremely flexible and compact
+Vectorised Branch behaviour. In addition, the side-effects (updating
+of CTR, truncation of VL, described below) make it a useful instruction
+even if the branch points to the next instruction (no actual branch).
+
+# Overview
When considering an "array" of branch-tests, there are four useful modes:
AND, OR, NAND and NOR of all Conditions.
and the corresponding CR Field is considered to be
set to `SNZ`)
+Early-exit is enacted such that the Vectorised Branch does not
+perform needless extra tests, which will help reduce reads on
+the Condition Register file.
+
+Additional useful behaviour involves two primary Modes (both of
+which may be enabled and combined):
+
+* **VLSET Mode**: identical to Data-Dependent Fail-First Mode
+ for Arithmetic SVP64 operations, with more
+ flexibility and a close interaction and integration into the
+ underlying base Scalar v3.0B Branch instruction.
+* **CTR-test Mode**: gives much more flexibility over when and why
+ CTR is decremented, including options to decrement if a Condition
+ test succeeds *or if it fails*.
+
+With these side-effects, basic Boolean Logic Analysis advises that
+it is important to provide a means
+to enact them each based on whether testing succeeds *or fails*. This
+results in a not-insignificant number of additional Mode Augmentation bits,
+accompanying VLSET and CTR-test Modes respectively.
+
+It is also important to note that Vectorised Branches can be used
+in either SVP64 Horizontal-First or Vertical-First Mode. Essentially
+the behaviour is identical in both Modes.
+
+Fundamentally, Vectorised Branch-Conditional remains extremely
+close to the Scalar v3.0B Branch-Conditional instructions. The
+v3.0B Scalar Branch-Conditional instructions themselves are
+*completely separate and independent*: they are unaltered and
+unaffected by their SVP64 variants in every conceivable way.
+
+# Format and fields
+
+SVP64 RM `MODE` (includes `ELWIDTH` and `ELWIDTH_SRC` bits) for Branch
+Conditional:
+
+| 4 | 5 | 6 | 7 | 19 | 20 | 21 | 22 23 | description |
+| - | - | - | - | -- | -- | --- |---------|----------------- |
+|ALL|LRu| / | / | 0 | 0 | / | SNZ sz | normal mode |
+|ALL|LRu| / |VSb| 0 | 1 | VLI | SNZ sz | VLSET mode |
+|ALL|LRu|CTi| / | 1 | 0 | / | SNZ sz | CTR-test mode |
+|ALL|LRu|CTi|VSb| 1 | 1 | VLI | SNZ sz | CTR-test+VLSET mode |
+
+Brief description of fields:
+
+* **sz**: if predication is enabled and `sz` is set, 4 copies of `SNZ`
+ are put in place of the src CR Field when the predicate bit is zero.
+ Otherwise the element is ignored or skipped, depending on context.
+* **ALL** when set, all branch conditional tests must pass in order for
+ the branch to succeed. When clear, it is the first sequentially
+ encountered successful test that causes the branch to succeed.
+ This is identical behaviour to how programming languages perform
+ early-exit on Boolean Logic chains.
+* **VLI** VLSET is identical to Data-dependent Fail-First mode.
+ In VLSET mode, VL is set equal (truncated) to the first point
+ where, assuming Conditions are tested sequentially, the branch succeeds
+ *or fails* depending on whether VSb is set.
+ If VLI (Vector Length Inclusive) is clear,
+ VL is truncated to *exclude* the current element, otherwise it is
+ included. SVSTATE.MVL is not changed: only VL.
+* **LRu**: Link Register Update. When set, Link Register will
+ only be updated if the Branch Condition succeeds. This avoids
+ destruction of LR during loops (particularly Vertical-First
+ ones).
+* **VSb** is most relevant for Vertical-First VLSET Mode. After testing,
+ if VSb is set, VL is truncated if the branch succeeds. If VSb is clear,
+ VL is truncated if the branch did **not** take place.
+* **CTi** CTR inversion. CTR-test Mode normally decrements per element
+ tested. CTR inversion decrements if a test *fails*. Only relevant
+ in CTR-test Mode.
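
The **ALL** and **VLI** descriptions above can be illustrated with a small Python sketch. It is not taken from the specification: the function names `branch_reduce` and `truncated_vl` are hypothetical, element indices are assumed to be zero-based, and the result for an empty vector is an assumption. It models only the early-exit reduction and the inclusive/exclusive truncation point, not `VSb`, predication or CTR.

```python
def branch_reduce(results, all_mode):
    """Reduce per-element test results with early exit.

    results  : iterable of booleans, one per tested element (hypothetical input)
    all_mode : True for ALL (every test must pass), False for first-success
    Returns (branch_ok, elements_tested).
    """
    tested = 0
    for ok in results:
        tested += 1
        if all_mode and not ok:
            return False, tested   # first failure ends an ALL reduction
        if not all_mode and ok:
            return True, tested    # first success ends a non-ALL reduction
    # No early exit: ALL passed everything, non-ALL found no success.
    # (The empty-vector outcome here is an assumption of this sketch.)
    return all_mode, tested


def truncated_vl(step, vli):
    """VL after truncation at zero-based element index `step`.

    With VLI clear the current element is excluded, otherwise included.
    """
    return step + 1 if vli else step
```

For example, `branch_reduce([True, False, True], True)` stops after two elements, and truncating at that element (index 1) gives a VL of 1 with VLI clear or 2 with VLI set.
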
+
+# Vectorised CR Field numbering, and Scalar behaviour
+
+It is important to keep in mind that just like all SVP64 instructions,
+the `BI` field of the base v3.0B Branch Conditional instruction
+may be extended by SVP64 EXTRA augmentation, as well as be marked
+as either Scalar or Vector.
+
+The `BI` field of Branch Conditional operations is five bits. In scalar
+v3.0B this selects one bit of the 32-bit CR,
+comprising eight CR Fields of 4 bits each. In SVP64 there are
+16 32 bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
+`BI` select the bit from the CR Field (EQ LT GT SO), and the top 3 bits
+are extended to either scalar or vector and to select CR Fields 0..127
+as specified in SVP64 [[sv/svp64/appendix]].
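
As a minimal sketch of the scalar split described above (the helper name `split_bi` is hypothetical, and the SVP64 EXTRA extension of the upper bits is deliberately left out because it is defined in the appendix, not here):

```python
CR_FIELD_BITS = 4                      # each CR Field is 4 bits wide
SCALAR_CR_FIELDS = 32 // CR_FIELD_BITS # eight fields in the scalar 32-bit CR
SVP64_CR_FIELDS = 16 * 32 // CR_FIELD_BITS  # 16 CRs of 32 bits each


def split_bi(bi):
    """Split an unaugmented 5-bit BI into (cr_field, bit_within_field).

    The 2 LSBs pick the bit inside the CR Field; the top 3 bits pick the
    field. SVP64 augmentation of the top 3 bits is not modelled.
    """
    if not 0 <= bi < 32:
        raise ValueError("BI is a 5-bit field")
    return bi >> 2, bi & 0b11
```

Here `split_bi(0b01110)` yields field 3, bit 2, and `SVP64_CR_FIELDS` evaluates to 128, matching the count given in the text.
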
+
When the CR Field selected by SVP64-Augmented `BI` is marked as scalar,
-then as the usual SVP64 rules apply,
+then as the usual SVP64 rules apply:
the loop ends at the first element tested, after taking
predication into consideration. Thus, also as usual, when a predicate mask is
given, and `BI` marked as scalar, and `sz` is zero, srcstep
skips forward to the first non-zero predicated element, and only that
one element is tested.
+In other words, the fact that this is a Branch
+Operation (instead of an arithmetic one) does not result, ultimately,
+in significant changes as to
+how SVP64 is fundamentally applied, except with respect to early-out
+opportunities and CTR-testing, which are outlined below.
+
+# Description and Modes
+
In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
AND) results in early exit: no more updates to CTR occur (if requested);
no branch occurs, and LR is not updated (if requested). Likewise for
SVP64 Branch Conditional operations, exactly as they may be applied to
other SVP64 operations. When `sz` is zero, any masked-out Branch-element
operations are not included in condition testing, exactly like all other
-SVP64 operations. However whilst side-effects such as updating
-LR may be skipped when `sz` is zero, side-effects such as decrementing of
-CTR are under much more explicit control.
+SVP64 operations, *including* side-effects such as potentially updating
+LR or CTR, which will also be skipped. There is *one* exception here,
+which is when
+`BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
+predicate mask bit is also zero:
+under these special circumstances CTR will also decrement.
When `sz` is non-zero, this normally requests insertion of a zero
in place of the input data, when the relevant predicate mask bit is zero.
mode, which will truncate SVSTATE.VL at the point of the first failed
test.*)
-SVP64 RM `MODE` (includes `ELWIDTH` and `ELWIDTH_SRC` bits) for Branch
-Conditional:
-
-| 4 | 5 | 6 | 7 | 19 | 20 | 21 | 22 23 | description |
-| - | - | - | - | -- | -- | --- |---------|----------------- |
-|ALL|LRu| / | / | 0 | 0 | / | SNZ sz | normal mode |
-|ALL|LRu| / |VSb| 0 | 1 | VLI | SNZ sz | VLSET mode |
-|ALL|LRu|CTi| / | 1 | 0 | / | SNZ sz | CTR skip mode |
-|ALL|LRu|CTi|VSb| 1 | 1 | VLI | SNZ sz | CTR skip+VLSET mode |
-
-Fields:
-
-* **sz** if predication is enabled will put 4 copies of `SNZ` in place of
- the src CR Field when the predicate bit is zero. otherwise the element
- is ignored or skipped, depending on context.
-* **ALL** when set, all branch conditional tests must pass in order for
- the branch to succeed. When clear, it is the first sequentially
- encountered successful test that causes the branch to succeed.
-* **VLI** VLSET is identical to Data-dependent Fail-First mode.
- In VLSET mode, VL is set equal (truncated) to the first point
- where, assuming Conditions are tested sequentially, the branch succeeds
- *or fails* depending if VSb is set.
- If VLI (Vector Length Inclusive) is clear,
- VL is truncated to *exclude* the current element, otherwise it is
- included. SVSTATE.MVL is not changed: only VL.
-* **LRu**: Link Register Update. When set, Link Register will
- only be updated if the Branch Condition succeeds. This avoids
- destruction of LR during loops (particularly Vertical-First
- ones).
-* **VSb** is most relevant for Vertical-First VLSET Mode. After testing,
- if VSb is set, VL is truncated if the branch succeeds. If VSb is clear,
- VL is truncated if the branch did **not** take place.
-* **CTi** CTR inversion. CTR Mode normally decrements per element
- tested. CTR inversion decrements if a test *fails*.
Normally, CTR mode will decrement once per Condition Test, resulting
-under normal circumstances that CTR reduces by up to VL.
-Just as when v3.0B Branch-Conditional saves at
+under normal circumstances that CTR reduces by up to VL in Horizontal-First
+Mode. Just as when v3.0B Branch-Conditional saves at
least one instruction on tight inner loops through auto-decrementation
of CTR, likewise it is also possible to save instruction count for
-SVP64 loops in both Vertical-First and Horizontal-First Mode.
-
-If both CTR+VLSET Modes are requested, then because the CTR decrement is
-per element tested, the total amount that CTR is decremented
-by will end up being VL *after* truncation (should that occur).
-
-Enabling CTR Skipping (Csk) has a number of options, which need explaining:
-
-* **Standard SVP64 CTR Mode** Skip=0, CTi=0, sz=0, no predicate specified.
- The number of elements tested end up being subtracted from CTR
- (as already explained above)
-* **Predicated CTR Mode** Csk=1, predicate is specified.
- Regardless of whether the Condition Test passes or fails,
- masked-out elements are *not included* in the
- count subtracted from CTR. If VL=3 but the predicate mask
- is 0b101 and all CR Field Conditions are tested then CTR
- will be reduced by two, *not* three (because only 2 predicate
- mask bits are enabled). This includes when sz=1.
-* **Non-predicated CTR Skip Mode**, Csk=1, sz=0, no
- predicate specified.
- Only the number of elements which pass the Condition Test (in
- both ALL or ANY mode) will be subtracted from CTR
-* **Non-predicated CTR Skip inverted**, Csk=1, sz=1,
- no predicate specified.
- Only the number of elements which **fail** the Condition
- test will be subtracted from CTR
+SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
+in circumstances where there is conditional interaction between the
+element computation and testing, and the continuation (or otherwise)
+of a given loop. The potential combinations of interactions are why CTR
+testing options have been added.
+
+If both CTR-test and VLSET Modes are requested, then because the CTR
+decrement is on a per-element basis, the total amount that CTR is decremented
+by will end up being VL *after* truncation (should that occur). In
+other words, the order is strictly (as can be seen in pseudocode, below):
+
+1. compute the test
+2. (optionally) decrement CTR
+3. (optionally) truncate VL
+4. decide (based on step 1) whether to terminate looping
+ (including not executing step 5)
+5. decide whether to branch.
+
+CTR-test mode and CTi interaction is as follows: note that
+`BO[2]` is still required to be clear for decrements to be
+considered.
+
+* **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
+ if `BO[2]` is zero. Masked-out elements when `sz=0` are
+ skipped.
+* **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
+ if `BO[2]` is zero and a masked-out element is skipped
+ (`sz=0` and predicate bit is zero). This one special case is the
+ **opposite** of other combinations.
+* **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
+ if `BO[2]` is zero and the Condition Test succeeds.
+ Masked-out elements when `sz=0` are skipped.
+* **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
+ if `BO[2]` is zero and the Condition Test *fails*.
+ Masked-out elements when `sz=0` are skipped.
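
The four combinations above can be written as a single per-element decision. The following Python sketch is one reading of the bullet list only; it is not derived from the pseudocode, and the function name `ctr_decrements` and the `skipped` parameter (true when the element is masked out with `sz=0`) are hypothetical.

```python
def ctr_decrements(bo2, ctr_test, cti, cond_ok, skipped):
    """Return True if CTR is decremented for this element.

    bo2      : the BO[2] bit (decrements are only considered when it is 0)
    ctr_test : CTR-test Mode enabled
    cti      : CTR inversion bit
    cond_ok  : result of the Condition Test for this element
    skipped  : element is masked out and sz=0 (assumed meaning of "skipped")
    """
    if bo2:
        return False
    if not ctr_test:
        # CTi=1 without CTR-test is the one inverted special case:
        # only skipped elements decrement. With CTi=0, only
        # non-skipped elements decrement.
        return bool(skipped) if cti else not skipped
    if skipped:
        return False
    # CTR-test Mode: decrement on success, or on failure when inverted.
    return (not cond_ok) if cti else bool(cond_ok)
```

For instance, with `bo2=0, ctr_test=1, cti=1` a failing test on an unmasked element decrements CTR, while a passing one does not.
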
Note that, interestingly, due to the side-effects of `VLSET` mode
it is actually useful to use Branch Conditional even
branch offsets: the offset is relative to the start of the instruction,
which includes the SVP64 Prefix*
+# Pseudocode and examples
+
Pseudocode for Horizontal-First Mode:
```
testbit = CRbits[BI & 0b11]
# testbit = CR[BI+32+srcstep*4]
else if not SVRMmode.sz:
+ # inverted CTR test skip mode
+ if ¬BO[2] & CTRtest & ¬CTi then
+ CTR = CTR - 1
continue
else
testbit = SVRMmode.SNZ
CR{SVCRf+srcstep} = CRbits
testbit = CRbits[BI & 0b11]
else if not SVRMmode.sz:
+ # inverted CTR test skip mode
+ if ¬BO[2] & CTRtest & ¬CTi then
+ CTR = CTR - 1
SVSTATE.srcstep = new_srcstep
exit # no branch testing
else
else M <- 32
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
ctrdec = ¬BO[2]
-if CSk & (cond_ok ^ CTi) then
+if CTRtest & (cond_ok ^ CTi) then
ctrdec = 0b0
if ctrdec then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])