expected to be Micro-coded by most Hardware implementations.
---------
-
-\newpage{}
-
-# SVP64 Branch Conditional behaviour
-
-Please note: although similar, SVP64 Branch instructions should be
-considered completely separate and distinct from standard scalar
-OpenPOWER-approved v3.0B branches. **v3.0B branches are in no way
-impacted, altered, changed or modified in any way, shape or form by the
-SVP64 Vectorised Variants**.
-
-It is also extremely important to note that Branches are the sole
-pseudo-exception in SVP64 to `Scalar Identity Behaviour`. SVP64 Branches
-contain additional modes that are useful for scalar operations (i.e. even
-when VL=1 or when using single-bit predication).
-
-**Rationale**
-
-Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test
-a Condition Register. However for parallel processing it is simply
-impossible to perform multiple independent branches: the Program
-Counter simply cannot branch to multiple destinations based on multiple
-conditions. The best that can be done is to test multiple Conditions
-and make a decision of a *single* branch, based on analysis of a *Vector*
-of CR Fields which have just been calculated from a *Vector* of results.
-
-In 3D Shader binaries, which are inherently parallelised and predicated,
-testing all or some results and branching based on multiple tests is
-extremely common, and a fundamental part of Shader Compilers. Example:
-without such multi-condition test-and-branch, if a predicate mask is
-all zeros a large batch of instructions may be masked out to `nop`,
-and it would waste CPU cycles to run them. 3D GPU ISAs can test for
-this scenario and, with the appropriate predicate-analysis instruction,
-jump over fully-masked-out operations, by spotting that *all* Conditions
-are false.
-
-Unless Branches are aware and capable of such analysis, additional
-instructions would be required which perform Horizontal Cumulative
-analysis of Vectorised Condition Register Fields, in order to reduce
-the Vector of CR Fields down to one single yes or no decision that a
-Scalar-only v3.0B Branch-Conditional could cope with. Such instructions
-would be unavoidable, required, and costly by comparison to a single
-Vector-aware Branch. Therefore, in order to be commercially competitive,
-`sv.bc` and other Vector-aware Branch Conditional instructions are a
-high priority for 3D GPU (and OpenCL-style) workloads.
-
-Given that Power ISA v3.0B is already quite powerful, particularly
-the Condition Registers and their interaction with Branches, there are
-opportunities to create extremely flexible and compact Vectorised Branch
-behaviour. In addition, the side-effects (updating of CTR, truncation
-of VL, described below) make it a useful instruction even if the branch
-points to the next instruction (no actual branch).
-
-## Overview
-
-When considering an "array" of branch-tests, there are four
-primarily-useful modes: AND, OR, NAND and NOR of all Conditions.
-NAND and NOR may be synthesised from AND and OR by inverting `BO[1]`
-which just leaves two modes:
-
-* Branch takes place on the **first** CR Field test to succeed
- (a Great Big OR of all condition tests). Exit occurs
- on the first **successful** test.
-* Branch takes place only if **all** CR field tests succeed:
- a Great Big AND of all condition tests. Exit occurs
- on the first **failed** test.
-
-Early-exit is enacted such that the Vectorised Branch does not
-perform needless extra tests, which will help reduce reads on
-the Condition Register file.
-
-*Note: Early-exit is **MANDATORY** (required) behaviour. Branches
-**MUST** exit at the first sequentially-encountered failure point,
-for exactly the same reasons for which it is mandatory in programming
-languages doing early-exit: to avoid damaging side-effects and to provide
-deterministic behaviour. Speculative testing of Condition Register
-Fields is permitted, as is speculative calculation of CTR, as long as,
-as usual in any Out-of-Order microarchitecture, that speculative testing
-is cancelled should an early-exit occur. i.e. the speculation must be
-"precise": Program Order must be preserved*
-
-Also note that when early-exit occurs in Horizontal-first Mode, srcstep,
-dststep etc. are all reset, ready to begin looping from the beginning
-for the next instruction. However for Vertical-first Mode srcstep
-etc. are incremented "as usual" i.e. an early-exit has no special impact,
-regardless of whether the branch occurred or not. This can leave srcstep
-etc. in what may be considered an unusual state on exit from a loop and
-it is up to the programmer to reset srcstep, dststep etc. to known-good
-values *(easily achieved with `setvl`)*.
-
-Additional useful behaviour involves two primary Modes (both of which
-may be enabled and combined):
-
-* **VLSET Mode**: identical to Data-Dependent Fail-First Mode
- for Arithmetic SVP64 operations, with more
- flexibility and a close interaction and integration into the
- underlying base Scalar v3.0B Branch instruction.
- Truncation of VL takes place around the early-exit point.
-* **CTR-test Mode**: gives much more flexibility over when and why
- CTR is decremented, including options to decrement if a Condition
- test succeeds *or if it fails*.
-
-With these side-effects, basic Boolean Logic Analysis advises that it
-is important to provide a means to enact them each based on whether
-testing succeeds *or fails*. This results in a not-insignificant number
-of additional Mode Augmentation bits, accompanying VLSET and CTR-test
-Modes respectively.
-
-Predicate skipping or zeroing may, as usual with SVP64, be controlled by
-`sz`. Where the predicate is masked out and zeroing is enabled, then in
-such circumstances the same Boolean Logic Analysis dictates that rather
-than testing only against zero, the option to test against one is also
-prudent. This introduces a new immediate field, `SNZ`, which works in
-conjunction with `sz`.
-
-Vectorised Branches can be used in either SVP64 Horizontal-First or
-Vertical-First Mode. Essentially, at an element level, the behaviour
-is identical in both Modes, although the `ALL` bit is meaningless in
-Vertical-First Mode.
-
-It is also important to bear in mind that, fundamentally, Vectorised
-Branch-Conditional is still extremely close to the Scalar v3.0B
-Branch-Conditional instructions, and that the same v3.0B Scalar
-Branch-Conditional instructions are still *completely separate and
-independent*, being unaltered and unaffected by their SVP64 variants in
-every conceivable way.
-
-*Programming note: One important point is that SVP64 instructions are
-64 bit. (8 bytes not 4). This needs to be taken into consideration
-when computing branch offsets: the offset is relative to the start of
-the instruction, which **includes** the SVP64 Prefix*
-
-## Format and fields
-
-With element-width overrides being meaningless for Condition Register
-Fields, bits 4 thru 7 of SVP64 RM may be used for additional Mode bits.
-
-SVP64 RM `MODE` (includes repurposing `ELWIDTH` bits 4:5, and
-`ELWIDTH_SRC` bits 6-7 for *alternate* uses) for Branch Conditional:
-
-| 4 | 5 | 6 | 7 | 17 | 18 | 19 | 20 | 21 | 22 23 | description |
-| - | - | - | - | -- | -- | -- | -- | --- |--------|----------------- |
-|ALL|SNZ| / | / | SL |SLu | 0 | 0 | / | LRu sz | simple mode |
-|ALL|SNZ| / |VSb| SL |SLu | 0 | 1 | VLI | LRu sz | VLSET mode |
-|ALL|SNZ|CTi| / | SL |SLu | 1 | 0 | / | LRu sz | CTR-test mode |
-|ALL|SNZ|CTi|VSb| SL |SLu | 1 | 1 | VLI | LRu sz | CTR-test+VLSET mode |
-
-Brief description of fields:
-
-* **sz=1** if predication is enabled and `sz=1` and a predicate
- element bit is zero, `SNZ` will
- be substituted in place of the CR bit selected by `BI`,
- as the Condition tested.
- Contrast this with
- normal SVP64 `sz=1` behaviour, where *only* a zero is put in
- place of masked-out predicate bits.
-* **sz=0** When `sz=0` skipping occurs as usual on
- masked-out elements, but unlike all
- other SVP64 behaviour which entirely skips an element with
- no related side-effects at all, there are certain
- special circumstances where CTR
- may be decremented. See CTR-test Mode, below.
-* **ALL** when set, all branch conditional tests must pass in order for
- the branch to succeed. When clear, it is the first sequentially
- encountered successful test that causes the branch to succeed.
- This is identical behaviour to how programming languages perform
- early-exit on Boolean Logic chains.
-* **VLI** VLSET is identical to Data-dependent Fail-First mode.
- In VLSET mode, VL *may* (depending on `VSb`) be truncated.
- If VLI (Vector Length Inclusive) is clear,
- VL is truncated to *exclude* the current element, otherwise it is
- included. SVSTATE.MVL is not altered: only VL.
-* **SL** identical to `LR` except applicable to SVSTATE. If `SL`
- is set, SVSTATE is transferred to SVLR (conditionally on
- whether `SLu` is set).
-* **SLu**: SVSTATE Link Update, like `LRu` except applies to SVSTATE.
-* **LRu**: Link Register Update, used in conjunction with LK=1
- to make LR update conditional
-* **VSb** In VLSET Mode, after testing,
- if VSb is set, VL is truncated if the test succeeds. If VSb is clear,
- VL is truncated if a test *fails*. Masked-out (skipped)
- bits are not considered
- part of testing when `sz=0`
-* **CTi** CTR inversion. CTR-test Mode normally decrements per element
- tested. CTR inversion decrements if a test *fails*. Only relevant
- in CTR-test Mode.
-
-LRu and CTR-test modes are where SVP64 Branches subtly differ from
-Scalar v3.0B Branches. `sv.bcl` for example will always update LR,
-whereas `sv.bcl/lru` will only update LR if the branch succeeds.
-
-Of special interest is that when using ALL Mode (Great Big AND of all
-Condition Tests), if `VL=0`, which is rare but can occur in Data-Dependent
-Modes, the Branch will always take place because there will be no failing
-Condition Tests to prevent it. Likewise when not using ALL Mode (Great
-Big OR of all Condition Tests) and `VL=0` the Branch is guaranteed not
-to occur because there will be no *successful* Condition Tests to make
-it happen.
-
-## Vectorised CR Field numbering, and Scalar behaviour
-
-It is important to keep in mind that just like all SVP64 instructions,
-the `BI` field of the base v3.0B Branch Conditional instruction may be
-extended by SVP64 EXTRA augmentation, as well as be marked as either
-Scalar or Vector. It is also crucially important to keep in mind that for
-CRs, SVP64 sequentially increments the CR *Field* numbers. CR *Fields*
-are treated as elements, not bit-numbers of the CR *register*.
-
-The `BI` operand of Branch Conditional operations is five bits, in scalar
-v3.0B this would select one bit of the 32 bit CR, comprising eight CR
-Fields of 4 bits each. In SVP64 there are 16 32 bit CRs, containing
-128 4-bit CR Fields. Therefore, the 2 LSBs of `BI` select the bit from
-the CR Field (EQ LT GT SO), and the top 3 bits are extended to either
-scalar or vector and to select CR Fields 0..127 as specified in SVP64
-[[sv/svp64/appendix]].
-
-When the CR Fields selected by SVP64-Augmented `BI` is marked as scalar,
-then as the usual SVP64 rules apply: the Vector loop ends at the first
-element tested (the first CR *Field*), after taking predication into
-consideration. Thus, also as usual, when a predicate mask is given, and
-`BI` marked as scalar, and `sz` is zero, srcstep skips forward to the
-first non-zero predicated element, and only that one element is tested.
-
-In other words, the fact that this is a Branch Operation (instead of an
-arithmetic one) does not result, ultimately, in significant changes as
-to how SVP64 is fundamentally applied, except with respect to:
-
-* the unique properties associated with conditionally
- changing the Program Counter (aka "a Branch"), resulting in early-out
- opportunities
-* CTR-testing
-
-Both are outlined below, in later sections.
-
-## Horizontal-First and Vertical-First Modes
-
-In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
-AND) results in early exit: no more updates to CTR occur (if requested);
-no branch occurs, and LR is not updated (if requested). Likewise for
-non-ALL mode (Great Big Or) on first success early exit also occurs,
-however this time with the Branch proceeding. In both cases the testing
-of the Vector of CRs should be done in linear sequential order (or in
-REMAP re-sequenced order): such that tests that are sequentially beyond
-the exit point are *not* carried out. (*Note: it is standard practice
-in Programming languages to exit early from conditional tests, however a
-little unusual to consider in an ISA that is designed for Parallel Vector
-Processing. The reason is to have strictly-defined guaranteed behaviour*)
-
-In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
-behaviour. Given that only one element is being tested at a time in
-Vertical-First Mode, a test designed to be done on multiple bits is
-meaningless.
-
-## Description and Modes
-
-Predication in both INT and CR modes may be applied to `sv.bc` and other
-SVP64 Branch Conditional operations, exactly as they may be applied to
-other SVP64 operations. When `sz` is zero, any masked-out Branch-element
-operations are not included in condition testing, exactly like all other
-SVP64 operations, *including* side-effects such as potentially updating
-LR or CTR, which will also be skipped. There is *one* exception here,
-which is when `BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
-predicate mask bit is also zero: under these special circumstances CTR
-will also decrement.
-
-When `sz` is non-zero, this normally requests insertion of a zero in
-place of the input data, when the relevant predicate mask bit is zero.
-This would mean that a zero is inserted in place of `CR[BI+32]` for
-testing against `BO`, which may not be desirable in all circumstances.
-Therefore, an extra field is provided `SNZ`, which, if set, will insert
-a **one** in place of a masked-out element, instead of a zero.
-
-(*Note: Both options are provided because it is useful to deliberately
-cause the Branch-Conditional Vector testing to fail at a specific point,
-controlled by the Predicate mask. This is particularly useful in `VLSET`
-mode, which will truncate SVSTATE.VL at the point of the first failed
-test.*)
-
-Normally, CTR mode will decrement once per Condition Test, resulting under
-normal circumstances that CTR reduces by up to VL in Horizontal-First
-Mode. Just as when v3.0B Branch-Conditional saves at least one instruction
-on tight inner loops through auto-decrementation of CTR, likewise it
-is also possible to save instruction count for SVP64 loops in both
-Vertical-First and Horizontal-First Mode, particularly in circumstances
-where there is conditional interaction between the element computation
-and testing, and the continuation (or otherwise) of a given loop. The
-potential combinations of interactions is why CTR testing options have
-been added.
-
-Also, the unconditional bit `BO[0]` is still relevant when Predication
-is applied to the Branch because in `ALL` mode all nonmasked bits have
-to be tested, and when `sz=0` skipping occurs. Even when VLSET mode is
-not used, CTR may still be decremented by the total number of nonmasked
-elements, acting in effect as either a popcount or cntlz depending
-on which mode bits are set. In short, Vectorised Branch becomes an
-extremely powerful tool.
-
-**Micro-Architectural Implementation Note**: *when implemented on top
-of a Multi-Issue Out-of-Order Engine it is possible to pass a copy of
-the predicate and the prerequisite CR Fields to all Branch Units, as
-well as the current value of CTR at the time of multi-issue, and for
-each Branch Unit to compute how many times CTR would be subtracted,
-in a fully-deterministic and parallel fashion. A SIMD-based Branch
-Unit, receiving and processing multiple CR Fields covered by multiple
-predicate bits, would do the exact same thing. Obviously, however, if
-CTR is modified within any given loop (mtctr) the behaviour of CTR is
-no longer deterministic.*
-
-### Link Register Update
-
-For a Scalar Branch, unconditional updating of the Link Register LR
-is useful and practical. However, if a loop of CR Fields is tested,
-unconditional updating of LR becomes problematic.
-
-For example when using `bclr` with `LRu=1,LK=0` in Horizontal-First Mode,
-LR's value will be unconditionally overwritten after the first element,
-such that for execution (testing) of the second element, LR has the value
-`CIA+8`. This is covered in the `bclrl` example, in a later section.
-
-The addition of a LRu bit modifies behaviour in conjunction with LK,
-as follows:
-
-* `sv.bc` When LRu=0,LK=0, Link Register is not updated
-* `sv.bcl` When LRu=0,LK=1, Link Register is updated unconditionally
-* `sv.bcl/lru` When LRu=1,LK=1, Link Register will
- only be updated if the Branch Condition fails.
-* `sv.bc/lru` When LRu=1,LK=0, Link Register will only be updated if
- the Branch Condition succeeds.
-
-This avoids destruction of LR during loops (particularly Vertical-First
-ones).
-
-**SVLR and SVSTATE**
-
-For precisely the reasons why `LK=1` was added originally to the Power
-ISA, with SVSTATE being a peer of the Program Counter it becomes necessary
-to also add an SVLR (SVSTATE Link Register) and corresponding control bits
-`SL` and `SLu`.
-
-### CTR-test
-
-Where a standard Scalar v3.0B branch unconditionally decrements CTR when
-`BO[2]` is clear, CTR-test Mode introduces more flexibility which allows
-CTR to be used for many more types of Vector loops constructs.
-
-CTR-test mode and CTi interaction is as follows: note that `BO[2]`
-is still required to be clear for CTR decrements to be considered,
-exactly as is the case in Scalar Power ISA v3.0B
-
-* **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
- if `BO[2]` is zero. Masked-out elements when `sz=0` are
- skipped (i.e. CTR is *not* decremented when the predicate
- bit is zero and `sz=0`).
-* **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
- if `BO[2]` is zero and a masked-out element is skipped
- (`sz=0` and predicate bit is zero). This one special case is the
- **opposite** of other combinations, as well as being
- completely different from normal SVP64 `sz=0` behaviour)
-* **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
- if `BO[2]` is zero and the Condition Test succeeds.
- Masked-out elements when `sz=0` are skipped (including
- not decrementing CTR)
-* **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
- if `BO[2]` is zero and the Condition Test *fails*.
- Masked-out elements when `sz=0` are skipped (including
- not decrementing CTR)
-
-`CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the
-only time in the entirety of SVP64 that has side-effects when
-a predicate mask bit is clear. **All** other SVP64 operations
-entirely skip an element when sz=0 and a predicate mask bit is zero.
-It is also critical to emphasise that in this unusual mode,
-no other side-effects occur: **only** CTR is decremented, i.e. the
-rest of the Branch operation is skipped.
-
-### VLSET Mode
-
-VLSET Mode truncates the Vector Length so that subsequent instructions
-operate on a reduced Vector Length. This is similar to Data-dependent
-Fail-First and LD/ST Fail-First, where for VLSET the truncation occurs
-at the Branch decision-point.
-
-Interestingly, due to the side-effects of `VLSET` mode it is actually
-useful to use Branch Conditional even to perform no actual branch
-operation, i.e to point to the instruction after the branch. Truncation of
-VL would thus conditionally occur yet control flow alteration would not.
-
-`VLSET` mode with Vertical-First is particularly unusual. Vertical-First
-is designed to be used for explicit looping, where an explicit call to
-`svstep` is required to move both srcstep and dststep on to the next
-element, until VL (or other condition) is reached. Vertical-First Looping
-is expected (required) to terminate if the end of the Vector, VL, is
-reached. If however that loop is terminated early because VL is truncated,
-VLSET with Vertical-First becomes meaningless. Resolving this would
-require two branches: one Conditional, the other branching unconditionally
-to create the loop, where the Conditional one jumps over it.
-
-Therefore, with `VSb`, the option to decide whether truncation should
-occur if the branch succeeds *or* if the branch condition fails allows
-for the flexibility required. This allows a Vertical-First Branch to
-*either* be used as a branch-back (loop) *or* as part of a conditional
-exit or function call from *inside* a loop, and for VLSET to be integrated
-into both types of decision-making.
-
-In the case of a Vertical-First branch-back (loop), with `VSb=0` the
-branch takes place if success conditions are met, but on exit from that
-loop (branch condition fails), VL will be truncated. This is extremely
-useful.
-
-`VLSET` mode with Horizontal-First when `VSb=0` is still useful, because
-it can be used to truncate VL to the first predicated (non-masked-out)
-element.
-
-The truncation point for VL, when VLi is clear, must not include skipped
-elements that preceded the current element being tested. Example:
-`sz=0, VLi=0, predicate mask = 0b110010` and the Condition Register
-failure point is at CR Field element 4.
-
-* Testing at element 0 is skipped because its predicate bit is zero
-* Testing at element 1 passed
-* Testing elements 2 and 3 are skipped because their
- respective predicate mask bits are zero
-* Testing element 4 fails therefore VL is truncated to **2**
- not 4 due to elements 2 and 3 being skipped.
-
-If `sz=1` in the above example *then* VL would have been set to 4 because
-in non-zeroing mode the zero'd elements are still effectively part of the
-Vector (with their respective elements set to `SNZ`)
-
-If `VLI=1` then VL would be set to 5 regardless of sz, due to being inclusive
-of the element actually being tested.
-
-### VLSET and CTR-test combined
-
-If both CTR-test and VLSET Modes are requested, it is important to
-observe the correct order. What occurs depends on whether VLi is enabled,
-because VLi affects the length, VL.
-
-If VLi (VL truncate inclusive) is set:
-
-1. compute the test including whether CTR triggers
-2. (optionally) decrement CTR
-3. (optionally) truncate VL (VSb inverts the decision)
-4. decide (based on step 1) whether to terminate looping
- (including not executing step 5)
-5. decide whether to branch.
-
-If VLi is clear, then when a test fails that element
-and any following it
-should **not** be considered part of the Vector. Consequently:
-
-1. compute the branch test including whether CTR triggers
-2. if the test fails against VSb, truncate VL to the *previous*
- element, and terminate looping. No further steps executed.
-3. (optionally) decrement CTR
-4. decide whether to branch.
-
-## Boolean Logic combinations
-
-In a Scalar ISA, Branch-Conditional testing even of vector results may be
-performed through inversion of tests. NOR of all tests may be performed
-by inversion of the scalar condition and branching *out* from the scalar
-loop around elements, using scalar operations.
-
-In a parallel (Vector) ISA it is the ISA itself which must perform
-the prerequisite logic manipulation. Thus for SVP64 there are an
-extraordinary number of nesessary combinations which provide completely
-different and useful behaviour. Available options to combine:
-
-* `BO[0]` to make an unconditional branch would seem irrelevant if
- it were not for predication and for side-effects (CTR Mode
- for example)
-* Enabling CTR-test Mode and setting `BO[2]` can still result in the
- Branch
- taking place, not because the Condition Test itself failed, but
- because CTR reached zero **because**, as required by CTR-test mode,
- CTR was decremented as a **result** of Condition Tests failing.
-* `BO[1]` to select whether the CR bit being tested is zero or nonzero
-* `R30` and `~R30` and other predicate mask options including CR and
- inverted CR bit testing
-* `sz` and `SNZ` to insert either zeros or ones in place of masked-out
- predicate bits
-* `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
- `OR` of all tests, respectively.
-* Predicate Mask bits, which combine in effect with the CR being
- tested.
-* Inversion of Predicate Masks (`~r3` instead of `r3`, or using
- `NE` rather than `EQ`) which results in an additional
- level of possible ANDing, ORing etc. that would otherwise
- need explicit instructions.
-
-The most obviously useful combinations here are to set `BO[1]` to zero
-in order to turn `ALL` into Great-Big-NAND and `ANY` into Great-Big-NOR.
-Other Mode bits which perform behavioural inversion then have to work
-round the fact that the Condition Testing is NOR or NAND. The alternative
-to not having additional behavioural inversion (`SNZ`, `VSb`, `CTi`)
-would be to have a second (unconditional) branch directly after the first,
-which the first branch jumps over. This contrivance is avoided by the
-behavioural inversion bits.
-
-## Pseudocode and examples
-
-Please see the SVP64 appendix regarding CR bit ordering and for
-the definition of `CR{n}`
-
-For comparative purposes this is a copy of the v3.0B `bc` pseudocode
-
-```
- if (mode_is_64bit) then M <- 0
- else M <- 32
- if ¬BO[2] then CTR <- CTR - 1
- ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
- cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
- if ctr_ok & cond_ok then
- if AA then NIA <-iea EXTS(BD || 0b00)
- else NIA <-iea CIA + EXTS(BD || 0b00)
- if LK then LR <-iea CIA + 4
-```
-
-Simplified pseudocode including LRu and CTR skipping, which illustrates
-clearly that SVP64 Scalar Branches (VL=1) are **not** identical to
-v3.0B Scalar Branches. The key areas where differences occur are the
-inclusion of predication (which can still be used when VL=1), in when and
-why CTR is decremented (CTRtest Mode) and whether LR is updated (which
-is unconditional in v3.0B when LK=1, and conditional in SVP64 when LRu=1).
-
-Inline comments highlight the fact that the Scalar Branch behaviour and
-pseudocode is still clearly visible and embedded within the Vectorised
-variant:
-
-```
- if (mode_is_64bit) then M <- 0
- else M <- 32
- # the bit of CR to test, if the predicate bit is zero,
- # is overridden
- testbit = CR[BI+32]
- if ¬predicate_bit then testbit = SVRMmode.SNZ
- # otherwise apart from the override ctr_ok and cond_ok
- # are exactly the same
- ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
- cond_ok <- BO[0] | ¬(testbit ^ BO[1])
- if ¬predicate_bit & ¬SVRMmode.sz then
- # this is entirely new: CTR-test mode still decrements CTR
- # even when predicate-bits are zero
- if ¬BO[2] & CTRtest & ¬CTi then
- CTR = CTR - 1
- # instruction finishes here
- else
- # usual BO[2] CTR-mode now under CTR-test mode as well
- if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
- # new VLset mode, conditional test truncates VL
- if VLSET and VSb = (cond_ok & ctr_ok) then
- if SVRMmode.VLI then SVSTATE.VL = srcstep+1
- else SVSTATE.VL = srcstep
- # usual LR is now conditional, but also joined by SVLR
- lr_ok <- LK
- svlr_ok <- SVRMmode.SL
- if ctr_ok & cond_ok then
- if AA then NIA <-iea EXTS(BD || 0b00)
- else NIA <-iea CIA + EXTS(BD || 0b00)
- if SVRMmode.LRu then lr_ok <- ¬lr_ok
- if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
- if lr_ok then LR <-iea CIA + 4
- if svlr_ok then SVLR <- SVSTATE
-```
-
-Below is the pseudocode for SVP64 Branches, which is a little less
-obvious but identical to the above. The lack of obviousness is down to
-the early-exit opportunities.
-
-Effective pseudocode for Horizontal-First Mode:
-
-```
- if (mode_is_64bit) then M <- 0
- else M <- 32
- cond_ok = not SVRMmode.ALL
- for srcstep in range(VL):
- # select predicate bit or zero/one
- if predicate[srcstep]:
- # get SVP64 extended CR field 0..127
- SVCRf = SVP64EXTRA(BI>>2)
- CRbits = CR{SVCRf}
- testbit = CRbits[BI & 0b11]
- # testbit = CR[BI+32+srcstep*4]
- else if not SVRMmode.sz:
- # inverted CTR test skip mode
- if ¬BO[2] & CTRtest & ¬CTI then
- CTR = CTR - 1
- continue # skip to next element
- else
- testbit = SVRMmode.SNZ
- # actual element test here
- ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
- el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
- # check if CTR dec should occur
- ctrdec = ¬BO[2]
- if CTRtest & (el_cond_ok ^ CTi) then
- ctrdec = 0b0
- if ctrdec then CTR <- CTR - 1
- # merge in the test
- if SVRMmode.ALL:
- cond_ok &= (el_cond_ok & ctr_ok)
- else
- cond_ok |= (el_cond_ok & ctr_ok)
- # test for VL to be set (and exit)
- if VLSET and VSb = (el_cond_ok & ctr_ok) then
- if SVRMmode.VLI then SVSTATE.VL = srcstep+1
- else SVSTATE.VL = srcstep
- break
- # early exit?
- if SVRMmode.ALL != (el_cond_ok & ctr_ok):
- break
- # SVP64 rules about Scalar registers still apply!
- if SVCRf.scalar:
- break
- # loop finally done, now test if branch (and update LR)
- lr_ok <- LK
- svlr_ok <- SVRMmode.SL
- if cond_ok then
- if AA then NIA <-iea EXTS(BD || 0b00)
- else NIA <-iea CIA + EXTS(BD || 0b00)
- if SVRMmode.LRu then lr_ok <- ¬lr_ok
- if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
- if lr_ok then LR <-iea CIA + 4
- if svlr_ok then SVLR <- SVSTATE
-```
-
-Pseudocode for Vertical-First Mode:
-
-```
- # get SVP64 extended CR field 0..127
- SVCRf = SVP64EXTRA(BI>>2)
- CRbits = CR{SVCRf}
- # select predicate bit or zero/one
- if predicate[srcstep]:
- if BRc = 1 then # CR0 vectorised
- CR{SVCRf+srcstep} = CRbits
- testbit = CRbits[BI & 0b11]
- else if not SVRMmode.sz:
- # inverted CTR test skip mode
- if ¬BO[2] & CTRtest & ¬CTI then
- CTR = CTR - 1
- SVSTATE.srcstep = new_srcstep
- exit # no branch testing
- else
- testbit = SVRMmode.SNZ
- # actual element test here
- cond_ok <- BO[0] | ¬(testbit ^ BO[1])
- # test for VL to be set (and exit)
- if VLSET and cond_ok = VSb then
- if SVRMmode.VLI
- SVSTATE.VL = new_srcstep+1
- else
- SVSTATE.VL = new_srcstep
-```
-
-### Example Shader code
-
-```
- // assume f() g() or h() modify a and/or b
- while(a > 2) {
- if(b < 5)
- f();
- else
- g();
- h();
- }
-```
-
-which compiles to something like:
-
-```
- vec<i32> a, b;
- // ...
- pred loop_pred = a > 2;
- // loop continues while any of a elements greater than 2
- while(loop_pred.any()) {
- // vector of predicate bits
- pred if_pred = loop_pred & (b < 5);
- // only call f() if at least 1 bit set
- if(if_pred.any()) {
- f(if_pred);
- }
- label1:
- // loop mask ANDs with inverted if-test
- pred else_pred = loop_pred & ~if_pred;
- // only call g() if at least 1 bit set
- if(else_pred.any()) {
- g(else_pred);
- }
- h(loop_pred);
- }
-```
-
-which will end up as:
-
-```
- # start from while loop test point
- b looptest
- while_loop:
- sv.cmpi CR80.v, b.v, 5 # vector compare b into CR64 Vector
- sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
- # only calculate loop_pred & pred_b because needed in f()
- sv.crand CR80.v.SO, CR60.v.GT, CR80.V.LT # if = loop & pred_b
- f(CR80.v.SO)
- skip_f:
- # illustrate inversion of pred_b. invert r30, test ALL
- # rather than SOME, but masked-out zero test would FAIL,
- # therefore masked-out instead is tested against 1 not 0
- sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
- # else = loop & ~pred_b, need this because used in g()
- sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.V.LT
- g(CR80.v.SO)
- skip_g:
- # conditionally call h(r30) if any loop pred set
- sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
- looptest:
- sv.cmpi CR60.v a.v, 2 # vector compare a into CR60 vector
- sv.crweird r30, CR60.GT # transfer GT vector to r30
- sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
- end:
-```
-
-### LRu example
-
-show why LRu would be useful in a loop. Imagine the following
-c code:
-
-```
- for (int i = 0; i < 8; i++) {
- if (x < y) break;
- }
-```
-
-Under these circumstances exiting from the loop is not only based on
-CTR it has become conditional on a CR result. Thus it is desirable that
-NIA *and* LR only be modified if the conditions are met
-
-v3.0 pseudocode for `bclrl`:
-
-```
- if (mode_is_64bit) then M <- 0
- else M <- 32
- if ¬BO[2] then CTR <- CTR - 1
- ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
- cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
- if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
- if LK then LR <-iea CIA + 4
-```
-
-the latter part for SVP64 `bclrl` becomes:
-
-```
- for i in 0 to VL-1:
- ...
- ...
- cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
- lr_ok <- LK
- if ctr_ok & cond_ok then
- NIA <-iea LR[0:61] || 0b00
- if SVRMmode.LRu then lr_ok <- ¬lr_ok
- if lr_ok then LR <-iea CIA + 4
- # if NIA modified exit loop
-```
-
-The reason why should be clear from this being a Vector loop:
-unconditional destruction of LR when LK=1 makes `sv.bclrl` ineffective,
-because the intention going into the loop is that the branch should be to
-the copy of LR set at the *start* of the loop, not half way through it.
-However if the change to LR only occurs if the branch is taken then it
-becomes a useful instruction.
-
-The following pseudocode should **not** be implemented because it
-violates the fundamental principle of SVP64 which is that SVP64 looping
-is a thin wrapper around Scalar Instructions. The pseducode below is
-more an actual Vector ISA Branch and as such is not at all appropriate:
-
-```
- for i in 0 to VL-1:
- ...
- ...
- cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
- if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
- # only at the end of looping is LK checked.
- # this completely violates the design principle of SVP64
- # and would actually need to be a separate (scalar)
- # instruction "set LR to CIA+4 but retrospectively"
- # which is clearly impossible
- if LK then LR <-iea CIA + 4
-```
-
[[!tag opf_rfc]]
--------