+[[!tag standards]]
# SVP64 Branch Conditional behaviour
**DRAFT STATUS**
-Please note: SVP64 Branch instructions should be
+Please note: although similar, SVP64 Branch instructions should be
considered completely separate and distinct from
standard scalar OpenPOWER-approved v3.0B branches.
**v3.0B branches are in no way impacted, altered,
changed or modified in any way, shape or form by
-the SVP64 Vectorised Variants**.
+the SVP64 Vectorised Variants**. It is
+extremely important to note that Branches are the
+sole semi-exception in SVP64 to `Scalar Identity Behaviour`.
+SVP64 Branches contain additional modes that are useful
+for scalar operations (i.e. even when VL=1 or when
+using single-bit predication).
Links
* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html>
* [[openpower/isa/branch]]
+# Rationale
+
Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test a
Condition Register. However for parallel processing it is simply impossible
to perform multiple independent branches: the Program Counter simply
without such multi-condition
test-and-branch, if a predicate mask is all zeros a large batch of
instructions may be masked out to `nop`, and it would waste
-CPU cycles not only to run them but also to load the predicate
-mask repeatedly for each one. 3D GPU ISAs can test for this scenario
-and jump over the fully-masked-out operations, by spotting that
-*all* Conditions are false. Or, conversely, they only call the function if at least
-one Condition) is set.
+CPU cycles to run them. 3D GPU ISAs can test for this scenario
+and, with the appropriate predicate-analysis instruction,
+jump over fully-masked-out operations, by spotting that
+*all* Conditions are false.
+
+Unless Branches are aware and capable of such analysis, additional
+instructions would be required which perform Horizontal Cumulative
+analysis of Vectorised Condition Register Fields, in order to
+reduce the Vector of CR Fields down to one single yes or no
+decision that a Scalar-only v3.0B Branch-Conditional could cope with.
+Such instructions would be unavoidable, required, and costly
+by comparison to a single Vector-aware Branch.
Therefore, in order to be commercially competitive, `sv.bc` and
other Vector-aware Branch Conditional instructions are a high priority
-for 3D GPU workloads.
+for 3D GPU (and CUDA) workloads.
-The `BI` field of Branch Conditional operations is five bits, in scalar
-v3.0B this would select one bit of the 32 bit CR,
-comprising eight CR Fields of 4 bits each. In SVP64 there are
-16 32 bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
-`BI` select the bit from the CR Field (EQ LT GT SO), and the top 3 bits
-are extended to either scalar or vector and to select CR Fields 0..127
-as specified in SVP64 [[sv/svp64/appendix]].
+Given that Power ISA v3.0B is already quite powerful, particularly
+the Condition Registers and their interaction with Branches, there
+are opportunities to create extremely flexible and compact
+Vectorised Branch behaviour. In addition, the side-effects (updating
+of CTR, truncation of VL, described below) make it a useful instruction
+even if the branch points to the next instruction (no actual branch).
+
+# Overview
-When considering an "array" of branch-tests, there are four useful modes:
+When considering an "array" of branch-tests, there are four
+primarily-useful modes:
AND, OR, NAND and NOR of all Conditions.
-NAND and NOR may be synthesised by
-inverting `BO[2]` which just leaves two modes:
+NAND and NOR may be synthesised from AND and OR by
+inverting `BO[1]` which just leaves two modes:
-* Branch takes place on the first CR Field test to succeed
+* Branch takes place on the **first** CR Field test to succeed
(a Great Big OR of all condition tests)
* Branch takes place only if **all** CR field tests succeed:
a Great Big AND of all condition tests
- (including those where the predicate is masked out
- and the corresponding CR Field is considered to be
- set to `SNZ`)
+
+Early-exit is enacted such that the Vectorised Branch does not
+perform needless extra tests, which will help reduce reads on
+the Condition Register file.
+
+*Note: Early-exit is **MANDATORY** (required) behaviour.
+Branches **MUST** exit at the first failure point, for
+exactly the same reasons for which it is mandatory in
+programming languages doing early-exit: to avoid
+damaging side-effects. Speculative testing of Condition
+Register Fields is permitted, as is speculative updating
+of CTR, as long as, as usual in any Out-of-Order microarchitecture,
+that speculative testing is cancelled should an early-exit occur.*
+
+Additional useful behaviour involves two primary Modes (both of
+which may be enabled and combined):
+
+* **VLSET Mode**: identical to Data-Dependent Fail-First Mode
+ for Arithmetic SVP64 operations, with more
+ flexibility and a close interaction and integration into the
+ underlying base Scalar v3.0B Branch instruction.
+ Truncation of VL takes place around the early-exit point.
+* **CTR-test Mode**: gives much more flexibility over when and why
+ CTR is decremented, including options to decrement if a Condition
+ test succeeds *or if it fails*.
+
+With these side-effects, basic Boolean Logic Analysis advises that
+it is important to provide a means
+to enact them each based on whether testing succeeds *or fails*. This
+results in a not-insignificant number of additional Mode Augmentation bits,
+accompanying VLSET and CTR-test Modes respectively.
+
+Predicate skipping or zeroing may, as usual with SVP64, be controlled
+by `sz`.
+Where the predicate is masked out and
+zeroing is enabled, then in such circumstances
+the same Boolean Logic Analysis dictates that
+rather than testing only against zero, the option to test
+against one is also prudent. This introduces a new
+immediate field, `SNZ`, which works in conjunction with
+`sz`.
+
+
+Vectorised Branches can be used
+in either SVP64 Horizontal-First or Vertical-First Mode. Essentially,
+at an element level, the behaviour is identical in both Modes,
+although the `ALL` bit is meaningless in Vertical-First Mode.
+
+It is also important
+to bear in mind that, fundamentally, Vectorised Branch-Conditional
+is still extremely close to the Scalar v3.0B Branch-Conditional
+instructions, and that the same v3.0B Scalar Branch-Conditional
+instructions are still
+*completely separate and independent*, being unaltered and
+unaffected by their SVP64 variants in every conceivable way.
+
+*Programming note: One important point is that SVP64 instructions are 64 bit.
+(8 bytes not 4). This needs to be taken into consideration when computing
+branch offsets: the offset is relative to the start of the instruction,
+which **includes** the SVP64 Prefix*
+
+# Format and fields
+
+With element-width overrides being meaningless for Condition
+Register Fields, bits 4 thru 7 of SVP64 RM may be used for additional
+Mode bits.
+
+SVP64 RM `MODE` (includes `ELWIDTH` and `ELWIDTH_SRC` bits) for Branch
+Conditional:
+
+| 4 | 5 | 6 | 7 | 19 | 20 | 21 | 22 23 | description |
+| - | - | - | - | -- | -- | --- |---------|----------------- |
+|ALL|SNZ| / | / | 0 | 0 | / | LRu sz | normal mode |
+|ALL|SNZ| / |VSb| 0 | 1 | VLI | LRu sz | VLSET mode |
+|ALL|SNZ|CTi| / | 1 | 0 | / | LRu sz | CTR-test mode |
+|ALL|SNZ|CTi|VSb| 1 | 1 | VLI | LRu sz | CTR-test+VLSET mode |
+
+Brief description of fields:
+
+* **sz=1** if predication is enabled and `sz=1` and a predicate
+ element bit is zero, `SNZ` will
+ be substituted in place of the CR bit selected by `BI`,
+ as the Condition tested.
+ Contrast this with
+ normal SVP64 `sz=1` behaviour, where *only* a zero is put in
+ place of masked-out predicate bits.
+* **sz=0** When `sz=0` skipping occurs as usual on
+ masked-out elements, but unlike all
+ other SVP64 behaviour which entirely skips an element with
+ no related side-effects at all, there are certain
+ special circumstances where CTR
+ may be decremented. See CTR-test Mode, below.
+* **ALL** when set, all branch conditional tests must pass in order for
+ the branch to succeed. When clear, it is the first sequentially
+ encountered successful test that causes the branch to succeed.
+ This is identical behaviour to how programming languages perform
+ early-exit on Boolean Logic chains.
+* **VLI** VLSET is identical to Data-dependent Fail-First mode.
+ In VLSET mode, VL *may* (depending on `VSb`) be truncated.
+ If VLI (Vector Length Inclusive) is clear,
+ VL is truncated to *exclude* the current element, otherwise it is
+ included. SVSTATE.MVL is not altered: only VL.
+* **LRu**: Link Register Update. When set, Link Register will
+ only be updated if the Branch Condition succeeds. This avoids
+ destruction of LR during loops (particularly Vertical-First
+ ones).
+* **VSb** In VLSET Mode, after testing,
+ if VSb is set, VL is truncated if the test succeeds. If VSb is clear,
+ VL is truncated if a test *fails*. Masked-out (skipped)
+ bits are not considered
+ part of testing.
+* **CTi** CTR inversion. CTR-test Mode normally decrements per element
+ tested. CTR inversion decrements if a test *fails*. Only relevant
+ in CTR-test Mode.
+
+LRu and CTR-test modes are where SVP64 Branches subtly differ from
+Scalar v3.0B Branches. `bclr` for example will always update LR, whereas
+`sv.bclr/lru` will only update LR if the branch succeeds.
+
+Of special interest is that when using ALL Mode (Great Big AND
+of all Condition Tests), if `VL=0`,
+which is rare but can occur in Data-Dependent Modes, the Branch
+will always take place because there will be no failing Condition
+Tests to prevent it. Likewise when not using ALL Mode (Great Big OR
+of all Condition Tests) and `VL=0` the Branch is guaranteed not
+to occur because there will be no *successful* Condition Tests
+to make it happen.
+
+# Vectorised CR Field numbering, and Scalar behaviour
+
+It is important to keep in mind that just like all SVP64 instructions,
+the `BI` field of the base v3.0B Branch Conditional instruction
+may be extended by SVP64 EXTRA augmentation, as well as be marked
+as either Scalar or Vector. It is also crucially important to keep in mind
+that for CRs, SVP64 sequentially increments the CR *Field* numbers.
+CR *Fields* are treated as elements, not bit-numbers of the CR *register*.
+
+The `BI` operand of Branch Conditional operations is five bits, in scalar
+v3.0B this would select one bit of the 32 bit CR,
+comprising eight CR Fields of 4 bits each. In SVP64 there are
+16 32 bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
+`BI` select the bit from the CR Field (EQ LT GT SO), and the top 3 bits
+are extended to either scalar or vector and to select CR Fields 0..127
+as specified in SVP64 [[sv/svp64/appendix]].
When the CR Fields selected by SVP64-Augmented `BI` is marked as scalar,
-then as the usual SVP64 rules apply,
-the loop ends at the first element tested, after taking
+then as the usual SVP64 rules apply:
+the Vector loop ends at the first element tested
+(the first CR *Field*), after taking
predication into consideration. Thus, also as usual, when a predicate mask is
given, and `BI` marked as scalar, and `sz` is zero, srcstep
skips forward to the first non-zero predicated element, and only that
one element is tested.
+In other words, the fact that this is a Branch
+Operation (instead of an arithmetic one) does not result, ultimately,
+in significant changes as to
+how SVP64 is fundamentally applied, except with respect to:
+
+* the unique properties associated with conditionally
+ changing the Program
+Counter (aka "a Branch"), resulting in early-out
+opportunities
+* CTR-testing
+
+Both are outlined below.
+
+# Horizontal-First and Vertical-First Modes
+
In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
AND) results in early exit: no more updates to CTR occur (if requested);
no branch occurs, and LR is not updated (if requested). Likewise for
in Vertical-First Mode, a test designed to be done on multiple
bits is meaningless.
+# Description and Modes
+
Predication in both INT and CR modes may be applied to `sv.bc` and other
SVP64 Branch Conditional operations, exactly as they may be applied to
other SVP64 operations. When `sz` is zero, any masked-out Branch-element
operations are not included in condition testing, exactly like all other
-SVP64 operations. This *includes* side-effects such as decrementing of
-CTR, which is also skipped on masked-out CR Field elements, when `sz`
-is zero.
-
-However when `sz` is non-zero, this normally requests insertion of a zero
+SVP64 operations, *including* side-effects such as potentially updating
+LR or CTR, which will also be skipped. There is *one* exception here,
+which is when
+`BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
+predicate mask bit is also zero:
+under these special circumstances CTR will also decrement.
+
+When `sz` is non-zero, this normally requests insertion of a zero
in place of the input data, when the relevant predicate mask bit is zero.
This would mean that a zero is inserted in place of `CR[BI+32]` for
testing against `BO`, which may not be desirable in all circumstances.
mode, which will truncate SVSTATE.VL at the point of the first failed
test.*)
-SVP64 RM `MODE` (includes `ELWIDTH` and `ELWIDTH_SRC` bits) for Branch
-Conditional:
-
-| 4 | 5 | 6 | 7 | 19 | 20 | 21 | 22 23 | description |
-| - | - | - | - | -- | -- | --- |---------|----------------- |
-|ALL|LRu| / | / | 0 | 0 | / | SNZ sz | normal mode |
-|ALL|LRu| / |VSb| 0 | 1 | VLI | SNZ sz | VLSET mode |
-|ALL|LRu|Csk| / | 1 | 0 | / | SNZ sz | CTR mode |
-|ALL|LRu|Csk|VSb| 1 | 1 | VLI | SNZ sz | CTR+VLSET mode |
-
-Fields:
-
-* **sz** if predication is enabled will put 4 copies of `SNZ` in place of
- the src CR Field when the predicate bit is zero. otherwise the element
- is ignored or skipped, depending on context.
-* **ALL** when set, all branch conditional tests must pass in order for
- the branch to succeed. When clear, it is the first sequentially
- encountered successful test that causes the branch to succeed.
-* **VLI** VLSET is identical to Data-dependent Fail-First mode.
- In VLSET mode, VL is set equal (truncated) to the first point
- where, assuming Conditions are tested sequentially, the branch succeeds
- *or fails* depending if VSb is set.
- If VLI (Vector Length Inclusive) is clear,
- VL is truncated to *exclude* the current element, otherwise it is
- included. SVSTATE.MVL is not changed: only VL.
-* **LRu**: Link Register Update. When set, Link Register will
- only be updated if the Branch Condition succeeds. This avoids
- destruction of LR during loops (particularly Vertical-First
- ones).
-* **VSb** is most relevant for Vertical-First VLSET Mode. After testing,
- if VSb is set, VL is truncated if the branch succeeds. If VSb is clear,
- VL is truncated if the branch did **not** take place.
-* **Csk** CTR skipping. CTR Mode normally subtracts VL from CTR.
- Csk refines that further
-
-Normally, CTR mode will subtract VL from CTR rather than just decrement
-CTR by one. Just as when v3.0B Branch-Conditional saves at
+Normally, CTR mode will decrement once per Condition Test, resulting
+under normal circumstances that CTR reduces by up to VL in Horizontal-First
+Mode. Just as when v3.0B Branch-Conditional saves at
least one instruction on tight inner loops through auto-decrementation
of CTR, likewise it is also possible to save instruction count for
-SVP64 loops in both Vertical-First and Horizontal-First Mode.
-Setting CTR Mode in Vertical-First results in `UNDEFINED`
-behaviour. Given that Vertical-First steps through one element
-at a time, standard single (v3.0B) CTR decrementing should
-correspondingly be used instead.
-
-If both CTR+VLSET Modes are requested, the amount that CTR is decremented
-by is the value of VL *after* truncation (should that occur).
-
-Enabling CTR Skipping (Csk) has a number of options, which need explaining:
-
-* **Standard SVP64 CTR Mode** Csk=0, sz=0, no predicate specified.
- VL will be subtracted from CTR (as already explained above)
-* **Predicated CTR Mode** Csk=1, predicate is specified.
- Masked-out elements are *not included* in the
- count subtracted from CTR. If VL=3 but the predicate mask
- is 0b101 and all CR Field Conditions are tested then CTR
- will be reduced by two, *not* three (because only 2 predicate
- mask bits are enabled). This includes when sz=1.
-* **Non-predicated CTR Skip Mode**, Csk=1, sz=0, no
- predicate specified.
- Only those elements which pass the Condition Test (in
- both ALL or ANY mode) will be subtracted from CTR
-* **Non-predicated CTR Skip inverted**, Csk=1, sz=1,
- no predicate specified.
- Only those elements which **fail** the Condition
- test will be subtracted from CTR
-
-Note that, interestingly, due to the side-effects of `VLSET` mode
-it is actually useful to use Branch Conditional even
-to perform no actual branch operation, i.e to point to the instruction
-after the branch. Truncation of VL would thus conditionally occur yet control
-flow alteration would not.
+SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
+in circumstances where there is conditional interaction between the
+element computation and testing, and the continuation (or otherwise)
+of a given loop. The potential combinations of interactions is why CTR
+testing options have been added.
Also, the unconditional bit `BO[0]` is still relevant when Predication
is applied to the Branch because in `ALL` mode all nonmasked bits have
-to be tested. Even when svstep mode or VLSET mode are not used, CTR
-may still be decremented by the total number of nonmasked elements.
+to be tested, and when `sz=0` skipping occurs.
+Even when VLSET mode is not used, CTR
+may still be decremented by the total number of nonmasked elements,
+acting in effect as either a popcount or cntlz depending on which
+mode bits are set.
In short, Vectorised Branch becomes an extremely powerful tool.
+## CTR-test
+
+Where a standard Scalar v3.0B branch unconditionally decrements
+CTR when `BO[2]` is clear, CTR-test Mode introduces more flexibility
+which allows CTR to be used for many more types of Vector loops
+constructs.
+
+CTR-test mode and CTi interaction is as follows: note that
+`BO[2]` is still required to be clear for CTR decrements to be
+considered, exactly as is the case in Scalar Power ISA v3.0B
+
+* **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
+ if `BO[2]` is zero. Masked-out elements when `sz=0` are
+ skipped (i.e. CTR is *not* decremented when the predicate
+ bit is zero and `sz=0`).
+* **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
+ if `BO[2]` is zero and a masked-out element is skipped
+ (`sz=0` and predicate bit is zero). This one special case is the
+ **opposite** of other combinations, as well as being
+ completely different from normal SVP64 `sz=0` behaviour)
+* **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
+ if `BO[2]` is zero and the Condition Test succeeds.
+ Masked-out elements when `sz=0` are skipped (including
+ not decrementing CTR)
+* **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
+ if `BO[2]` is zero and the Condition Test *fails*.
+ Masked-out elements when `sz=0` are skipped (including
+ not decrementing CTR)
+
+`CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the
+only time in the entirety of SVP64 that has side-effects when
+a predicate mask bit is clear. **All** other SVP64 operations
+entirely skip an element when sz=0 and a predicate mask bit is zero.
+It is also critical to emphasise that in this unusual mode,
+no other side-effects occur: **only** CTR is decremented, i.e. the
+rest of the Branch operation iss skipped.
+
+# VLSET Mode
+
+VLSET Mode truncates the Vector Length so that subsequent instructions
+operate on a reduced Vector Length. This is similar to
+Data-dependent Fail-First and LD/ST Fail-First, where for VLSET the
+truncation occurs at the Branch decision-point.
+
+Interestingly, due to the side-effects of `VLSET` mode
+it is actually useful to use Branch Conditional even
+to perform no actual branch operation, i.e to point to the instruction
+after the branch. Truncation of VL would thus conditionally occur yet control
+flow alteration would not.
+
`VLSET` mode with Vertical-First is particularly unusual. Vertical-First
-is used for explicit looping, where the looping is to terminate if the end
+is designed to be used for explicit looping, where an explicit call to
+`svstep` is required to move both srcstep and dststep on to
+the next element, until VL (or other condition) is reached.
+Vertical-First Looping is expected (required) to terminate if the end
of the Vector, VL, is reached. If however that loop is terminated early
because VL is truncated, VLSET with Vertical-First becomes meaningless.
-Therefore, with `VSb`, the option to decide whether truncation should occur if the
-branch succeeds *or* if the branch condition fails allows for flexibility
-required.
+Resolving this would require two branches: one Conditional, the other
+branching unconditionally to create the loop, where the Conditional
+one jumps over it.
-`VLSET` mode with Horizontal-First when `VSb` is clear is still
+Therefore, with `VSb`, the option to decide whether truncation should occur if the
+branch succeeds *or* if the branch condition fails allows for the flexibility
+required. This allows a Vertical-First Branch to *either* be used as
+a branch-back (loop) *or* as part of a conditional exit or function
+call from *inside* a loop, and for VLSET to be integrated into both
+types of decision-making.
+
+In the case of a Vertical-First branch-back (loop), with `VSb=0` the branch takes
+place if success conditions are met, but on exit from that loop
+(branch condition fails), VL will be truncated. This is extremely
+useful.
+
+`VLSET` mode with Horizontal-First when `VSb=0` is still
useful, because it can be used to truncate VL to the first predicated
(non-masked-out) element.
+The truncation point for VL, when VLi is clear, must not include skipped
+elements that preceded the current element being tested.
+Example: `sz=0, VLi=0, predicate mask = 0b110010` and the Condition
+Register failure point is at CR Field element 4.
+
+* Testing at element 0 is skipped because its predicate bit is zero
+* Testing at element 1 passed
+* Testing elements 2 and 3 are skipped because their
+ respective predicate mask bits are zero
+* Testing element 4 fails therefore VL is truncated to **2**
+ not 4 due to elements 2 and 3 being skipped.
+
+If `sz=1` in the above example *then* VL would have been set to 4 because
+in non-zeroing mode the zero'd elements are still effectively part of the
+Vector (with their respective elements set to `SNZ`)
+
+If `VLI=1` then VL would be set to 5 regardless of sz, due to being inclusive
+of the element actually being tested.
+
+## VLSET and CTR-test combined
+
+If both CTR-test and VLSET Modes are requested, it's important to
+observe the correct order. What occurs depends on whether VLi
+is enabled, because VLi affects the length, VL.
+
+If VLi (VL truncate inclusive) is set:
+
+1. compute the test including whether CTR triggers
+2. (optionally) decrement CTR
+3. (optionally) truncate VL (VSb inverts the decision)
+4. decide (based on step 1) whether to terminate looping
+ (including not executing step 5)
+5. decide whether to branch.
+
+If VLi is clear, then when a test fails that element
+and any following it
+should **not** be considered part of the Vector. Consequently:
+
+1. compute the branch test including whether CTR triggers
+2. if the test fails against VSb, truncate VL to the *previous*
+ element, and terminate looping. No further steps executed.
+3. (optionally) decrement CTR
+4. decide whether to branch.
+
+# Boolean Logic combinations
+
+In a Scalar ISA, Branch-Conditional testing even of vector
+results may be performed through inversion of tests. NOR of
+all tests may be performed by inversion of the scalar condition
+and branching *out* from the scalar loop around elements,
+using scalar operations.
+
+In a parallel (Vector) ISA it is the ISA itself which must perform
+the prerequisite logic manipulation.
+Thus for SVP64 there are an extraordinary number of nesessary combinations
+which provide completely different and useful behaviour.
Available options to combine:
* `BO[0]` to make an unconditional branch would seem irrelevant if
- it were not for predication and for side-effects.
+ it were not for predication and for side-effects (CTR Mode
+ for example)
+* Enabling CTR-test Mode and setting `BO[2]` can still result in the
+ Branch
+ taking place, not because the Condition Test itself failed, but
+ because CTR reached zero **because**, as required by CTR-test mode,
+ CTR was decremented as a **result** of Condition Tests failing.
* `BO[1]` to select whether the CR bit being tested is zero or nonzero
* `R30` and `~R30` and other predicate mask options including CR and
inverted CR bit testing
predicate bits
* `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
`OR` of all tests, respectively.
+* Predicate Mask bits, which combine in effect with the CR being
+ tested.
+* Inversion of Predicate Masks (`~r3` instead of `r3`, or using
+ `NE` rather than `EQ`) which results in an additional
+ level of possible ANDing, ORing etc. that would otherwise
+ need explicit instructions.
-In addition to the above, it is necessary to select whether, in `svstep`
-mode, the Vector CR Field is to be overwritten or not: in some cases it
-is useful to know but in others all that is needed is the branch itself.
+The most obviously useful combinations here are to set `BO[1]` to zero
+in order to turn `ALL` into Great-Big-NAND and `ANY` into
+Great-Big-NOR. Other Mode bits which perform behavioural inversion then
+have to work round the fact that the Condition Testing is NOR or NAND.
+The alternative to not having additional behavioural inversion
+(`SNZ`, `VSb`, `CTi`) would be to have a second (unconditional)
+branch directly after the first, which the first branch jumps over.
+This contrivance is avoided by the behavioural inversion bits.
-*Programming note: One important point is that SVP64 instructions are 64 bit.
-(8 bytes not 4). This needs to be taken into consideration when computing
-branch offsets: the offset is relative to the start of the instruction,
-which includes the SVP64 Prefix*
+# Pseudocode and examples
+
+Please see [[svp64/appendix]] regarding CR bit ordering and for
+the definition of `CR{n}`
+
+For comparative purposes this is a copy of the v3.0B `bc` pseudocode
+
+```
+if (mode_is_64bit) then M <- 0
+else M <- 32
+if ¬BO[2] then CTR <- CTR - 1
+ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
+cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
+if ctr_ok & cond_ok then
+ if AA then NIA <-iea EXTS(BD || 0b00)
+ else NIA <-iea CIA + EXTS(BD || 0b00)
+if LK then LR <-iea CIA + 4
+```
+
+Simplified pseudocode including LRu and CTR skipping, which illustrates
+clearly that SVP64 Scalar Branches (VL=1) are **not** identical to
+v3.0B Scalar Branches. The key areas where differences occur are
+the inclusion of predication (which can still be used when VL=1), in
+when and why CTR is decremented (CTRtest Mode) and whether LR is
+updated (which is unconditional in v3.0B when LK=1, and conditional
+in SVP64 when LRu=1).
+
+```
+if (mode_is_64bit) then M <- 0
+else M <- 32
+testbit = CR[BI+32]
+if ¬predicate_bit then testbit = SVRMmode.SNZ
+ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
+cond_ok <- BO[0] | ¬(testbit ^ BO[1])
+if ¬predicate_bit & ¬SVRMmode.sz then
+ if ¬BO[2] & CTRtest & ¬CTi then
+ CTR = CTR - 1
+ stop # instruction finishes here
+if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
+lr_ok <- SVRMmode.LRu
+if ctr_ok & cond_ok then
+ if AA then NIA <-iea EXTS(BD || 0b00)
+ else NIA <-iea CIA + EXTS(BD || 0b00)
+ lr_ok <- 0b1
+if LK & lr_ok then LR <-iea CIA + 4
+```
+
+Below is the pseudocode for SVP64 Branches, which is a little less
+obvious but identical to the above. The lack of obviousness is down
+to the early-exit opportunities.
Pseudocode for Horizontal-First Mode:
```
+if (mode_is_64bit) then M <- 0
+else M <- 32
cond_ok = not SVRMmode.ALL
for srcstep in range(VL):
# select predicate bit or zero/one
testbit = CRbits[BI & 0b11]
# testbit = CR[BI+32+srcstep*4]
else if not SVRMmode.sz:
- continue
+ # inverted CTR test skip mode
+ if ¬BO[2] & CTRtest & ¬CTI then
+ CTR = CTR - 1
+ continue # skip to next element
else
testbit = SVRMmode.SNZ
# actual element test here
+ ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
+ # check if CTR dec should occur
+ ctrdec = ¬BO[2]
+ if CTRtest & (el_cond_ok ^ CTi) then
+ ctrdec = 0b0
+ if ctrdec then CTR <- CTR - 1
# merge in the test
if SVRMmode.ALL:
- cond_ok &= el_cond_ok
+ cond_ok &= (el_cond_ok & ctr_ok)
else
- cond_ok |= el_cond_ok
+ cond_ok |= (el_cond_ok & ctr_ok)
# test for VL to be set (and exit)
- if VLSET and VSb = el_cond_ok then
+ if VLSET and VSb = (el_cond_ok & ctr_ok) then
if SVRMmode.VLI
SVSTATE.VL = srcstep+1
else
SVSTATE.VL = srcstep
break
# early exit?
- if SVRMmode.ALL:
- if ~el_cond_ok:
- break
- else
- if el_cond_ok:
- break
+ if SVRMmode.ALL != (el_cond_ok & ctr_ok):
+ break
+ # SVP64 rules about Scalar registers still apply!
if SVCRf.scalar:
break
+# loop finally done, now test if branch (and update LR)
+lr_ok <- SVRMmode.LRu
+if cond_ok then
+ if AA then NIA <-iea EXTS(BD || 0b00)
+ else NIA <-iea CIA + EXTS(BD || 0b00)
+ lr_ok <- 0b1
+if LK & lr_ok then LR <-iea CIA + 4
```
Pseudocode for Vertical-First Mode:
CR{SVCRf+srcstep} = CRbits
testbit = CRbits[BI & 0b11]
else if not SVRMmode.sz:
+ # inverted CTR test skip mode
+ if ¬BO[2] & CTRtest & ¬CTI then
+ CTR = CTR - 1
SVSTATE.srcstep = new_srcstep
exit # no branch testing
else
SVSTATE.VL = new_srcstep
```
-v3.0B branch pseudocode including LRu
-
-```
-if (mode_is_64bit) then M <- 0
-else M <- 32
-if ¬BO[2] then CTR <- CTR - 1
-ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
-cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
-lr_ok <- SVRMmode.LRu
-if ctr_ok & cond_ok then
- if AA then NIA <-iea EXTS(BD || 0b00)
- else NIA <-iea CIA + EXTS(BD || 0b00)
- lr_ok <- 0b1
-if LK & lr_ok then LR <-iea CIA + 4
-```
-
# Example Shader code
```
+// assume f() g() or h() modify a and/or b
while(a > 2) {
if(b < 5)
f();
vec<i32> a, b;
// ...
pred loop_pred = a > 2;
+// loop continues while any of a elements greater than 2
while(loop_pred.any()) {
+ // vector of predicate bits
pred if_pred = loop_pred & (b < 5);
+ // only call f() if at least 1 bit set
if(if_pred.any()) {
f(if_pred);
}
label1:
+ // loop mask ANDs with inverted if-test
pred else_pred = loop_pred & ~if_pred;
+ // only call g() if at least 1 bit set
if(else_pred.any()) {
g(else_pred);
}
which will end up as:
```
- sv.cmpi CR60.v a.v, 2 # vector compare a into CR60 vector
- sv.crweird r30, CR60.GT # transfer GT vector to r30
+ # start from while loop test point
+ b looptest
while_loop:
sv.cmpi CR80.v, b.v, 5 # vector compare b into CR64 Vector
sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
skip_g:
# conditionally call h(r30) if any loop pred set
sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
+looptest:
+ sv.cmpi CR60.v a.v, 2 # vector compare a into CR60 vector
+ sv.crweird r30, CR60.GT # transfer GT vector to r30
sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
+end:
```