-[[!tag standards]]
# SVP64 Branch Conditional behaviour
**DRAFT STATUS**
Please note: although similar, SVP64 Branch instructions should be
-considered completely separate and distinct from
-standard scalar OpenPOWER-approved v3.0B branches.
-**v3.0B branches are in no way impacted, altered,
-changed or modified in any way, shape or form by
-the SVP64 Vectorised Variants**.
-
-It is also
-extremely important to note that Branches are the
-sole pseudo-exception in SVP64 to `Scalar Identity Behaviour`.
-SVP64 Branches contain additional modes that are useful
-for scalar operations (i.e. even when VL=1 or when
-using single-bit predication).
+considered completely separate and distinct from standard scalar
+OpenPOWER-approved v3.0B branches. **v3.0B branches are in no way
+impacted, altered, changed or modified in any way, shape or form by the
+SVP64 Vectorised Variants**.
+
+It is also extremely important to note that Branches are the sole
+pseudo-exception in SVP64 to `Scalar Identity Behaviour`. SVP64 Branches
+contain additional modes that are useful for scalar operations (i.e. even
+when VL=1 or when using single-bit predication).
Links
* [[sv/cr_int_predication]]
* [TODO](https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=fa99590eeb61e63b2d2ea81f303b9b4320e3bbe1)
-# Rationale
-
-Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test a
-Condition Register. However for parallel processing it is simply impossible
-to perform multiple independent branches: the Program Counter simply
-cannot branch to multiple destinations based on multiple conditions.
-The best that can be done is
-to test multiple Conditions and make a decision of a *single* branch,
-based on analysis of a *Vector* of CR Fields
-which have just been calculated from a *Vector* of results.
-
-In 3D Shader
-binaries, which are inherently parallelised and predicated, testing all or
-some results and branching based on multiple tests is extremely common,
-and a fundamental part of Shader Compilers. Example:
-without such multi-condition
-test-and-branch, if a predicate mask is all zeros a large batch of
-instructions may be masked out to `nop`, and it would waste
-CPU cycles to run them. 3D GPU ISAs can test for this scenario
-and, with the appropriate predicate-analysis instruction,
-jump over fully-masked-out operations, by spotting that
-*all* Conditions are false.
+## Rationale
+
+Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test
+a Condition Register. However for parallel processing it is simply
+impossible to perform multiple independent branches: the Program
+Counter simply cannot branch to multiple destinations based on multiple
+conditions. The best that can be done is to test multiple Conditions
+and make a decision of a *single* branch, based on analysis of a *Vector*
+of CR Fields which have just been calculated from a *Vector* of results.
+
+In 3D Shader binaries, which are inherently parallelised and predicated,
+testing all or some results and branching based on multiple tests is
+extremely common, and a fundamental part of Shader Compilers. Example:
+without such multi-condition test-and-branch, if a predicate mask is
+all zeros a large batch of instructions may be masked out to `nop`,
+and it would waste CPU cycles to run them. 3D GPU ISAs can test for
+this scenario and, with the appropriate predicate-analysis instruction,
+jump over fully-masked-out operations, by spotting that *all* Conditions
+are false.
Unless Branches are aware and capable of such analysis, additional
instructions would be required which perform Horizontal Cumulative
-analysis of Vectorised Condition Register Fields, in order to
-reduce the Vector of CR Fields down to one single yes or no
-decision that a Scalar-only v3.0B Branch-Conditional could cope with.
-Such instructions would be unavoidable, required, and costly
-by comparison to a single Vector-aware Branch.
-Therefore, in order to be commercially competitive, `sv.bc` and
-other Vector-aware Branch Conditional instructions are a high priority
-for 3D GPU (and OpenCL-style) workloads.
+analysis of Vectorised Condition Register Fields, in order to reduce
+the Vector of CR Fields down to one single yes or no decision that a
+Scalar-only v3.0B Branch-Conditional could cope with. Such instructions
+would be unavoidable, required, and costly by comparison to a single
+Vector-aware Branch. Therefore, in order to be commercially competitive,
+`sv.bc` and other Vector-aware Branch Conditional instructions are a
+high priority for 3D GPU (and OpenCL-style) workloads.
Given that Power ISA v3.0B is already quite powerful, particularly
-the Condition Registers and their interaction with Branches, there
-are opportunities to create extremely flexible and compact
-Vectorised Branch behaviour. In addition, the side-effects (updating
-of CTR, truncation of VL, described below) make it a useful instruction
-even if the branch points to the next instruction (no actual branch).
+the Condition Registers and their interaction with Branches, there are
+opportunities to create extremely flexible and compact Vectorised Branch
+behaviour. In addition, the side-effects (updating of CTR, truncation
+of VL, described below) make it a useful instruction even if the branch
+points to the next instruction (no actual branch).
-# Overview
+## Overview
When considering an "array" of branch-tests, there are four
-primarily-useful modes:
-AND, OR, NAND and NOR of all Conditions.
-NAND and NOR may be synthesised from AND and OR by
-inverting `BO[1]` which just leaves two modes:
+primarily-useful modes: AND, OR, NAND and NOR of all Conditions.
+NAND and NOR may be synthesised from AND and OR by inverting `BO[1]`
+which just leaves two modes:
* Branch takes place on the **first** CR Field test to succeed
(a Great Big OR of all condition tests). Exit occurs
perform needless extra tests, which will help reduce reads on
the Condition Register file.
-*Note: Early-exit is **MANDATORY** (required) behaviour.
-Branches **MUST** exit at the first sequentially-encountered
-failure point, for
-exactly the same reasons for which it is mandatory in
-programming languages doing early-exit: to avoid
-damaging side-effects and to provide deterministic
-behaviour. Speculative testing of Condition
-Register Fields is permitted, as is speculative calculation
-of CTR, as long as, as usual in any Out-of-Order microarchitecture,
-that speculative testing is cancelled should an early-exit occur.
-i.e. the speculation must be "precise": Program Order must be preserved*
-
-Also note that when early-exit occurs in Horizontal-first Mode,
-srcstep, dststep etc. are all reset, ready to begin looping from the
-beginning for the next instruction. However for Vertical-first
-Mode srcstep etc. are incremented "as usual" i.e. an early-exit
-has no special impact, regardless of whether the branch
-occurred or not. This can leave srcstep etc. in what may be
-considered an unusual
-state on exit from a loop and it is up to the programmer to
-reset srcstep, dststep etc. to known-good values
-*(easily achieved with `setvl`)*.
-
-Additional useful behaviour involves two primary Modes (both of
-which may be enabled and combined):
+*Note: Early-exit is **MANDATORY** (required) behaviour. Branches
+**MUST** exit at the first sequentially-encountered failure point,
+for exactly the same reasons for which it is mandatory in programming
+languages doing early-exit: to avoid damaging side-effects and to provide
+deterministic behaviour. Speculative testing of Condition Register
+Fields is permitted, as is speculative calculation of CTR, as long as,
+as usual in any Out-of-Order microarchitecture, that speculative testing
+is cancelled should an early-exit occur, i.e. the speculation must be
+"precise": Program Order must be preserved.*
+
+Also note that when early-exit occurs in Horizontal-first Mode, srcstep,
+dststep etc. are all reset, ready to begin looping from the beginning
+for the next instruction. However for Vertical-first Mode srcstep
+etc. are incremented "as usual" i.e. an early-exit has no special impact,
+regardless of whether the branch occurred or not. This can leave srcstep
+etc. in what may be considered an unusual state on exit from a loop and
+it is up to the programmer to reset srcstep, dststep etc. to known-good
+values *(easily achieved with `setvl`)*.
+
+Additional useful behaviour involves two primary Modes (both of which
+may be enabled and combined):
* **VLSET Mode**: identical to Data-Dependent Fail-First Mode
for Arithmetic SVP64 operations, with more
CTR is decremented, including options to decrement if a Condition
test succeeds *or if it fails*.
-With these side-effects, basic Boolean Logic Analysis advises that
-it is important to provide a means
-to enact them each based on whether testing succeeds *or fails*. This
-results in a not-insignificant number of additional Mode Augmentation bits,
-accompanying VLSET and CTR-test Modes respectively.
-
-Predicate skipping or zeroing may, as usual with SVP64, be controlled
-by `sz`.
-Where the predicate is masked out and
-zeroing is enabled, then in such circumstances
-the same Boolean Logic Analysis dictates that
-rather than testing only against zero, the option to test
-against one is also prudent. This introduces a new
-immediate field, `SNZ`, which works in conjunction with
-`sz`.
-
-
-Vectorised Branches can be used
-in either SVP64 Horizontal-First or Vertical-First Mode. Essentially,
-at an element level, the behaviour is identical in both Modes,
-although the `ALL` bit is meaningless in Vertical-First Mode.
-
-It is also important
-to bear in mind that, fundamentally, Vectorised Branch-Conditional
-is still extremely close to the Scalar v3.0B Branch-Conditional
-instructions, and that the same v3.0B Scalar Branch-Conditional
-instructions are still
-*completely separate and independent*, being unaltered and
-unaffected by their SVP64 variants in every conceivable way.
-
-*Programming note: One important point is that SVP64 instructions are 64 bit.
-(8 bytes not 4). This needs to be taken into consideration when computing
-branch offsets: the offset is relative to the start of the instruction,
-which **includes** the SVP64 Prefix*
-
-# Format and fields
-
-With element-width overrides being meaningless for Condition
-Register Fields, bits 4 thru 7 of SVP64 RM may be used for additional
-Mode bits.
-
-SVP64 RM `MODE` (includes repurposing `ELWIDTH` bits 4:5,
-and `ELWIDTH_SRC` bits 6-7 for *alternate* uses) for Branch
-Conditional:
+With these side-effects, basic Boolean Logic Analysis advises that it
+is important to provide a means to enact them each based on whether
+testing succeeds *or fails*. This results in a not-insignificant number
+of additional Mode Augmentation bits, accompanying VLSET and CTR-test
+Modes respectively.
+
+Predicate skipping or zeroing may, as usual with SVP64, be controlled by
+`sz`. Where the predicate is masked out and zeroing is enabled, then in
+such circumstances the same Boolean Logic Analysis dictates that rather
+than testing only against zero, the option to test against one is also
+prudent. This introduces a new immediate field, `SNZ`, which works in
+conjunction with `sz`.
+
+Vectorised Branches can be used in either SVP64 Horizontal-First or
+Vertical-First Mode. Essentially, at an element level, the behaviour
+is identical in both Modes, although the `ALL` bit is meaningless in
+Vertical-First Mode.
+
+It is also important to bear in mind that, fundamentally, Vectorised
+Branch-Conditional is still extremely close to the Scalar v3.0B
+Branch-Conditional instructions, and that the same v3.0B Scalar
+Branch-Conditional instructions are still *completely separate and
+independent*, being unaltered and unaffected by their SVP64 variants in
+every conceivable way.
+
+*Programming note: One important point is that SVP64 instructions are
+64 bit (8 bytes, not 4). This needs to be taken into consideration
+when computing branch offsets: the offset is relative to the start of
+the instruction, which **includes** the SVP64 Prefix*
+
+## Format and fields
+
+With element-width overrides being meaningless for Condition Register
+Fields, bits 4 thru 7 of SVP64 RM may be used for additional Mode bits.
+
+SVP64 RM `MODE` (includes repurposing `ELWIDTH` bits 4:5, and
+`ELWIDTH_SRC` bits 6-7 for *alternate* uses) for Branch Conditional:
| 4 | 5 | 6 | 7 | 17 | 18 | 19 | 20 | 21 | 22 23 | description |
| - | - | - | - | -- | -- | -- | -- | --- |--------|----------------- |
in CTR-test Mode.
LRu and CTR-test modes are where SVP64 Branches subtly differ from
-Scalar v3.0B Branches. `sv.bcl` for example will always update LR, whereas
-`sv.bcl/lru` will only update LR if the branch succeeds.
+Scalar v3.0B Branches. `sv.bcl` for example will always update LR,
+whereas `sv.bcl/lru` will only update LR if the branch succeeds.
-Of special interest is that when using ALL Mode (Great Big AND
-of all Condition Tests), if `VL=0`,
-which is rare but can occur in Data-Dependent Modes, the Branch
-will always take place because there will be no failing Condition
-Tests to prevent it. Likewise when not using ALL Mode (Great Big OR
-of all Condition Tests) and `VL=0` the Branch is guaranteed not
-to occur because there will be no *successful* Condition Tests
-to make it happen.
+Of special interest is that when using ALL Mode (Great Big AND of all
+Condition Tests), if `VL=0`, which is rare but can occur in Data-Dependent
+Modes, the Branch will always take place because there will be no failing
+Condition Tests to prevent it. Likewise when not using ALL Mode (Great
+Big OR of all Condition Tests) and `VL=0` the Branch is guaranteed not
+to occur because there will be no *successful* Condition Tests to make
+it happen.
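
The `VL=0` corner case follows directly from how an AND or an OR over an empty set of tests behaves. A minimal sketch in Python, where `branch_taken` is a hypothetical helper (not part of any SVP64 tooling) that reduces a list of per-element Condition Test results to one decision:

```python
def branch_taken(cond_tests, all_mode):
    """Reduce per-element condition test results to one branch decision.

    all_mode=True  -> Great Big AND: every test must succeed.
    all_mode=False -> Great Big OR: one successful test suffices.
    """
    return all(cond_tests) if all_mode else any(cond_tests)


# VL=0 means there are no elements and therefore no tests at all.
print(branch_taken([], all_mode=True))   # True: nothing failed
print(branch_taken([], all_mode=False))  # False: nothing succeeded
```

Predication, CTR and VLSET side-effects are deliberately left out: the sketch only shows why the empty Vector gives opposite answers in the two modes.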
-# Vectorised CR Field numbering, and Scalar behaviour
+## Vectorised CR Field numbering, and Scalar behaviour
It is important to keep in mind that just like all SVP64 instructions,
-the `BI` field of the base v3.0B Branch Conditional instruction
-may be extended by SVP64 EXTRA augmentation, as well as be marked
-as either Scalar or Vector. It is also crucially important to keep in mind
-that for CRs, SVP64 sequentially increments the CR *Field* numbers.
-CR *Fields* are treated as elements, not bit-numbers of the CR *register*.
+the `BI` field of the base v3.0B Branch Conditional instruction may be
+extended by SVP64 EXTRA augmentation, as well as be marked as either
+Scalar or Vector. It is also crucially important to keep in mind that for
+CRs, SVP64 sequentially increments the CR *Field* numbers. CR *Fields*
+are treated as elements, not bit-numbers of the CR *register*.
The `BI` operand of Branch Conditional operations is five bits: in scalar
-v3.0B this would select one bit of the 32 bit CR,
-comprising eight CR Fields of 4 bits each. In SVP64 there are
-16 32 bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
-`BI` select the bit from the CR Field (EQ LT GT SO), and the top 3 bits
-are extended to either scalar or vector and to select CR Fields 0..127
-as specified in SVP64 [[sv/svp64/appendix]].
+v3.0B this would select one bit of the 32 bit CR, comprising eight CR
+Fields of 4 bits each. In SVP64 there are 16 32 bit CRs, containing
+128 4-bit CR Fields. Therefore, the 2 LSBs of `BI` select the bit from
+the CR Field (EQ LT GT SO), and the top 3 bits are extended to either
+scalar or vector and to select CR Fields 0..127 as specified in SVP64
+[[sv/svp64/appendix]].
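
As an illustration of the split described above, the following sketch separates an unaugmented five-bit `BI` into its two parts. The function name `split_bi` is hypothetical, and the SVP64 EXTRA extension of the upper three bits to CR Fields 0..127 is not modelled here (see [[sv/svp64/appendix]] for the actual rules):

```python
def split_bi(bi):
    """Split a 5-bit BI operand into (CR Field number, bit within Field).

    Only the unaugmented v3.0B layout is modelled: the top 3 bits select
    one of eight CR Fields, the 2 LSBs select one bit inside that Field.
    """
    if not 0 <= bi < 32:
        raise ValueError("BI is a five-bit operand")
    cr_field = bi >> 2        # top 3 bits
    bit_in_field = bi & 0b11  # 2 LSBs
    return cr_field, bit_in_field


print(split_bi(6))   # (1, 2)
print(split_bi(31))  # (7, 3)
```

When `BI` is marked as Vector, it is the CR Field number (the first value) that SVP64 increments per element; the bit-within-Field selection stays the same.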
When the CR Field selected by SVP64-Augmented `BI` is marked as scalar,
-then as the usual SVP64 rules apply:
-the Vector loop ends at the first element tested
-(the first CR *Field*), after taking
-predication into consideration. Thus, also as usual, when a predicate mask is
-given, and `BI` marked as scalar, and `sz` is zero, srcstep
-skips forward to the first non-zero predicated element, and only that
-one element is tested.
-
-In other words, the fact that this is a Branch
-Operation (instead of an arithmetic one) does not result, ultimately,
-in significant changes as to
-how SVP64 is fundamentally applied, except with respect to:
+then the usual SVP64 rules apply: the Vector loop ends at the first
+element tested (the first CR *Field*), after taking predication into
+consideration. Thus, also as usual, when a predicate mask is given, and
+`BI` marked as scalar, and `sz` is zero, srcstep skips forward to the
+first non-zero predicated element, and only that one element is tested.
+
+In other words, the fact that this is a Branch Operation (instead of an
+arithmetic one) does not result, ultimately, in significant changes as
+to how SVP64 is fundamentally applied, except with respect to:
* the unique properties associated with conditionally
- changing the Program
-Counter (aka "a Branch"), resulting in early-out
-opportunities
+ changing the Program Counter (aka "a Branch"), resulting in early-out
+ opportunities
* CTR-testing
Both are outlined below, in later sections.
-# Horizontal-First and Vertical-First Modes
+## Horizontal-First and Vertical-First Modes
In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
AND) results in early exit: no more updates to CTR occur (if requested);
however this time with the Branch proceeding. In both cases the testing
of the Vector of CRs should be done in linear sequential order (or in
REMAP re-sequenced order): such that tests that are sequentially beyond
-the exit point are *not* carried out. (*Note: it is standard practice in
-Programming languages to exit early from conditional tests, however
-a little unusual to consider in an ISA that is designed for Parallel
-Vector Processing. The reason is to have strictly-defined guaranteed
-behaviour*)
+the exit point are *not* carried out. (*Note: it is standard practice
+in Programming languages to exit early from conditional tests, however a
+little unusual to consider in an ISA that is designed for Parallel Vector
+Processing. The reason is to have strictly-defined guaranteed behaviour*)
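
The sequential early-exit rule can be sketched as a simple loop. This is an illustrative model only: `horizontal_first_branch` is a hypothetical name, the input is assumed to be the already-computed per-element test results in element order, and predication, CTR updates and VLSET are omitted.

```python
def horizontal_first_branch(cond_tests, all_mode):
    """Return (branch_taken, tests_performed) with mandatory early exit.

    cond_tests: per-element test results in sequential element order.
    all_mode:   True for Great Big AND, False for Great Big OR.
    """
    performed = 0
    for passed in cond_tests:
        performed += 1
        if all_mode and not passed:
            return False, performed  # first failure: exit, no branch
        if not all_mode and passed:
            return True, performed   # first success: exit, branch taken
    # No early exit: AND saw no failures, OR saw no successes.
    return all_mode, performed


print(horizontal_first_branch([True, False, True], all_mode=True))   # (False, 2)
print(horizontal_first_branch([False, True, True], all_mode=False))  # (True, 2)
```

In both calls the third element is never examined, which is the behaviour required of tests that lie sequentially beyond the exit point.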
-In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
-behaviour. Given that only one element is being tested at a time
-in Vertical-First Mode, a test designed to be done on multiple
-bits is meaningless.
+In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
+behaviour. Given that only one element is being tested at a time in
+Vertical-First Mode, a test designed to be done on multiple bits is
+meaningless.
-# Description and Modes
+## Description and Modes
Predication in both INT and CR modes may be applied to `sv.bc` and other
SVP64 Branch Conditional operations, exactly as they may be applied to
operations are not included in condition testing, exactly like all other
SVP64 operations, *including* side-effects such as potentially updating
LR or CTR, which will also be skipped. There is *one* exception here,
-which is when
-`BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
-predicate mask bit is also zero:
-under these special circumstances CTR will also decrement.
+which is when `BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
+predicate mask bit is also zero: under these special circumstances CTR
+will also decrement.
-When `sz` is non-zero, this normally requests insertion of a zero
-in place of the input data, when the relevant predicate mask bit is zero.
+When `sz` is non-zero, this normally requests insertion of a zero in
+place of the input data, when the relevant predicate mask bit is zero.
This would mean that a zero is inserted in place of `CR[BI+32]` for
testing against `BO`, which may not be desirable in all circumstances.
Therefore, an extra field, `SNZ`, is provided which, if set, will insert
mode, which will truncate SVSTATE.VL at the point of the first failed
test.*)
-Normally, CTR mode will decrement once per Condition Test, resulting
-under normal circumstances that CTR reduces by up to VL in Horizontal-First
-Mode. Just as when v3.0B Branch-Conditional saves at
-least one instruction on tight inner loops through auto-decrementation
-of CTR, likewise it is also possible to save instruction count for
-SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
-in circumstances where there is conditional interaction between the
-element computation and testing, and the continuation (or otherwise)
-of a given loop. The potential combinations of interactions is why CTR
-testing options have been added.
+Normally, CTR mode will decrement once per Condition Test, so that under
+normal circumstances CTR reduces by up to VL in Horizontal-First
+Mode. Just as when v3.0B Branch-Conditional saves at least one instruction
+on tight inner loops through auto-decrementation of CTR, likewise it
+is also possible to save instruction count for SVP64 loops in both
+Vertical-First and Horizontal-First Mode, particularly in circumstances
+where there is conditional interaction between the element computation
+and testing, and the continuation (or otherwise) of a given loop. The
+potential combinations of interactions are why CTR testing options have
+been added.
Also, the unconditional bit `BO[0]` is still relevant when Predication
is applied to the Branch because in `ALL` mode all nonmasked bits have
-to be tested, and when `sz=0` skipping occurs.
-Even when VLSET mode is not used, CTR
-may still be decremented by the total number of nonmasked elements,
-acting in effect as either a popcount or cntlz depending on which
-mode bits are set.
-In short, Vectorised Branch becomes an extremely powerful tool.
-
-**Micro-Architectural Implementation Note**: *when implemented on
-top of a Multi-Issue Out-of-Order Engine it is possible to pass
-a copy of the predicate and the prerequisite CR Fields to all
-Branch Units, as well as the current value of CTR at the time of
-multi-issue, and for each Branch Unit to compute how many times
-CTR would be subtracted, in a fully-deterministic and parallel
-fashion. A SIMD-based Branch Unit, receiving and processing
-multiple CR Fields covered by multiple predicate bits, would
-do the exact same thing. Obviously, however, if CTR is modified
-within any given loop (mtctr) the behaviour of CTR is no longer
-deterministic.*
-
-## Link Register Update
-
-For a Scalar Branch, unconditional updating of the Link Register
-LR is useful and practical. However, if a loop of CR Fields is
-tested, unconditional updating of LR becomes problematic.
+to be tested, and when `sz=0` skipping occurs. Even when VLSET mode is
+not used, CTR may still be decremented by the total number of nonmasked
+elements, acting in effect as either a popcount or cntlz depending
+on which mode bits are set. In short, Vectorised Branch becomes an
+extremely powerful tool.
+
+**Micro-Architectural Implementation Note**: *when implemented on top
+of a Multi-Issue Out-of-Order Engine it is possible to pass a copy of
+the predicate and the prerequisite CR Fields to all Branch Units, as
+well as the current value of CTR at the time of multi-issue, and for
+each Branch Unit to compute how many times CTR would be subtracted,
+in a fully-deterministic and parallel fashion. A SIMD-based Branch
+Unit, receiving and processing multiple CR Fields covered by multiple
+predicate bits, would do the exact same thing. Obviously, however, if
+CTR is modified within any given loop (mtctr) the behaviour of CTR is
+no longer deterministic.*
+
+### Link Register Update
+
+For a Scalar Branch, unconditional updating of the Link Register LR
+is useful and practical. However, if a loop of CR Fields is tested,
+unconditional updating of LR becomes problematic.
For example when using `bclr` with `LRu=1,LK=0` in Horizontal-First Mode,
LR's value will be unconditionally overwritten after the first element,
-such that for execution (testing) of the second element, LR
-has the value `CIA+8`. This is covered in the `bclrl` example, in
-a later section.
+such that for execution (testing) of the second element, LR has the value
+`CIA+8`. This is covered in the `bclrl` example, in a later section.
-The addition of a LRu bit modifies behaviour in conjunction
-with LK, as follows:
+The addition of an LRu bit modifies behaviour in conjunction with LK,
+as follows:
* `sv.bc` When LRu=0,LK=0, Link Register is not updated
* `sv.bcl` When LRu=0,LK=1, Link Register is updated unconditionally
* `sv.bc/lru` When LRu=1,LK=0, Link Register will only be updated if
the Branch Condition succeeds.
-This avoids
-destruction of LR during loops (particularly Vertical-First
+This avoids destruction of LR during loops (particularly Vertical-First
ones).
**SVLR and SVSTATE**
For precisely the reasons why `LK=1` was added originally to the Power
-ISA, with SVSTATE being a peer of the Program Counter it becomes
-necessary to also add an SVLR (SVSTATE Link Register)
-and corresponding control bits `SL` and `SLu`.
+ISA, with SVSTATE being a peer of the Program Counter it becomes necessary
+to also add an SVLR (SVSTATE Link Register) and corresponding control bits
+`SL` and `SLu`.
-## CTR-test
+### CTR-test
-Where a standard Scalar v3.0B branch unconditionally decrements
-CTR when `BO[2]` is clear, CTR-test Mode introduces more flexibility
-which allows CTR to be used for many more types of Vector loops
-constructs.
+Where a standard Scalar v3.0B branch unconditionally decrements CTR when
+`BO[2]` is clear, CTR-test Mode introduces more flexibility which allows
+CTR to be used for many more types of Vector loop constructs.
-CTR-test mode and CTi interaction is as follows: note that
-`BO[2]` is still required to be clear for CTR decrements to be
-considered, exactly as is the case in Scalar Power ISA v3.0B
+CTR-test mode and CTi interaction is as follows: note that `BO[2]`
+is still required to be clear for CTR decrements to be considered,
+exactly as is the case in Scalar Power ISA v3.0B
* **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
if `BO[2]` is zero. Masked-out elements when `sz=0` are
Masked-out elements when `sz=0` are skipped (including
not decrementing CTR)
-`CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the
-only time in the entirety of SVP64 that has side-effects when
-a predicate mask bit is clear. **All** other SVP64 operations
-entirely skip an element when sz=0 and a predicate mask bit is zero.
-It is also critical to emphasise that in this unusual mode,
-no other side-effects occur: **only** CTR is decremented, i.e. the
-rest of the Branch operation is skipped.
+`CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the only
+time in the entirety of SVP64 that has side-effects when a predicate mask
+bit is clear. **All** other SVP64 operations entirely skip an element
+when sz=0 and a predicate mask bit is zero. It is also critical to
+emphasise that in this unusual mode, no other side-effects occur: **only**
+CTR is decremented, i.e. the rest of the Branch operation is skipped.
-## VLSET Mode
+### VLSET Mode
VLSET Mode truncates the Vector Length so that subsequent instructions
-operate on a reduced Vector Length. This is similar to
-Data-dependent Fail-First and LD/ST Fail-First, where for VLSET the
-truncation occurs at the Branch decision-point.
+operate on a reduced Vector Length. This is similar to Data-dependent
+Fail-First and LD/ST Fail-First, where for VLSET the truncation occurs
+at the Branch decision-point.
-Interestingly, due to the side-effects of `VLSET` mode
-it is actually useful to use Branch Conditional even
-to perform no actual branch operation, i.e to point to the instruction
-after the branch. Truncation of VL would thus conditionally occur yet control
-flow alteration would not.
+Interestingly, due to the side-effects of `VLSET` mode it is actually
+useful to use Branch Conditional even to perform no actual branch
+operation, i.e. to point to the instruction after the branch. Truncation of
+VL would thus conditionally occur yet control flow alteration would not.
`VLSET` mode with Vertical-First is particularly unusual. Vertical-First
is designed to be used for explicit looping, where an explicit call to
-`svstep` is required to move both srcstep and dststep on to
-the next element, until VL (or other condition) is reached.
-Vertical-First Looping is expected (required) to terminate if the end
-of the Vector, VL, is reached. If however that loop is terminated early
-because VL is truncated, VLSET with Vertical-First becomes meaningless.
-Resolving this would require two branches: one Conditional, the other
-branching unconditionally to create the loop, where the Conditional
-one jumps over it.
-
-Therefore, with `VSb`, the option to decide whether truncation should occur if the
-branch succeeds *or* if the branch condition fails allows for the flexibility
-required. This allows a Vertical-First Branch to *either* be used as
-a branch-back (loop) *or* as part of a conditional exit or function
-call from *inside* a loop, and for VLSET to be integrated into both
-types of decision-making.
-
-In the case of a Vertical-First branch-back (loop), with `VSb=0` the branch takes
-place if success conditions are met, but on exit from that loop
-(branch condition fails), VL will be truncated. This is extremely
+`svstep` is required to move both srcstep and dststep on to the next
+element, until VL (or other condition) is reached. Vertical-First Looping
+is expected (required) to terminate if the end of the Vector, VL, is
+reached. If however that loop is terminated early because VL is truncated,
+VLSET with Vertical-First becomes meaningless. Resolving this would
+require two branches: one Conditional, the other branching unconditionally
+to create the loop, where the Conditional one jumps over it.
+
+Therefore, with `VSb`, the option to decide whether truncation should
+occur if the branch succeeds *or* if the branch condition fails allows
+for the flexibility required. This allows a Vertical-First Branch to
+*either* be used as a branch-back (loop) *or* as part of a conditional
+exit or function call from *inside* a loop, and for VLSET to be integrated
+into both types of decision-making.
+
+In the case of a Vertical-First branch-back (loop), with `VSb=0` the
+branch takes place if success conditions are met, but on exit from that
+loop (branch condition fails), VL will be truncated. This is extremely
useful.
-`VLSET` mode with Horizontal-First when `VSb=0` is still
-useful, because it can be used to truncate VL to the first predicated
-(non-masked-out) element.
+`VLSET` mode with Horizontal-First when `VSb=0` is still useful, because
+it can be used to truncate VL to the first predicated (non-masked-out)
+element.
The truncation point for VL, when VLi is clear, must not include skipped
-elements that preceded the current element being tested.
-Example: `sz=0, VLi=0, predicate mask = 0b110010` and the Condition
-Register failure point is at CR Field element 4.
+elements that preceded the current element being tested. Example:
+`sz=0, VLi=0, predicate mask = 0b110010` and the Condition Register
+failure point is at CR Field element 4.
* Testing at element 0 is skipped because its predicate bit is zero
* Testing at element 1 passed
not 4 due to elements 2 and 3 being skipped.
If `sz=1` in the above example *then* VL would have been set to 4 because
-in non-zeroing mode the zero'd elements are still effectively part of the
-Vector (with their respective elements set to `SNZ`)
+in zeroing mode (`sz=1`) the zeroed elements are still effectively part of
+the Vector (with their respective elements set to `SNZ`)
-If `VLI=1` then VL would be set to 5 regardless of sz, due to being inclusive
-of the element actually being tested.
+If `VLi=1` then VL would be set to 5 regardless of sz, due to being
+inclusive of the element actually being tested.
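
The three outcomes in this example can be reproduced with a small model. This is a sketch of one reading of the rule above, not a normative definition: `truncated_vl` is a hypothetical helper, element 0 is assumed to be the least-significant bit of the predicate mask, and for `sz=0, VLi=0` the truncation point is assumed to step back past any skipped elements immediately preceding the failure.

```python
def truncated_vl(pred_mask, fail_index, sz, vli):
    """Model the VL chosen when the test at element fail_index fails.

    pred_mask:  predicate mask, element 0 in the least-significant bit.
    fail_index: element number at which the Condition Test failed.
    sz:         True if zeroing is enabled (masked elements still count).
    vli:        True if truncation is inclusive of the failing element.
    """
    if vli:
        return fail_index + 1  # include the element being tested
    if sz:
        return fail_index      # zeroed elements remain part of the Vector
    # sz=0: do not count skipped elements that precede the failing one.
    vl = fail_index
    while vl > 0 and not (pred_mask >> (vl - 1)) & 1:
        vl -= 1
    return vl


mask = 0b110010
print(truncated_vl(mask, 4, sz=False, vli=False))  # 2
print(truncated_vl(mask, 4, sz=True, vli=False))   # 4
print(truncated_vl(mask, 4, sz=False, vli=True))   # 5
```

Under these assumptions the `sz=0` case stops at 2 because elements 2 and 3 were skipped, while `sz=1` and `VLi=1` give 4 and 5 as stated in the text.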
-## VLSET and CTR-test combined
+### VLSET and CTR-test combined
-If both CTR-test and VLSET Modes are requested, it's important to
-observe the correct order. What occurs depends on whether VLi
-is enabled, because VLi affects the length, VL.
+If both CTR-test and VLSET Modes are requested, it is important to
+observe the correct order. What occurs depends on whether VLi is enabled,
+because VLi affects the length, VL.
If VLi (VL truncate inclusive) is set:
(including not executing step 5)
5. decide whether to branch.
-If VLi is clear, then when a test fails that element
-and any following it
-should **not** be considered part of the Vector. Consequently:
+If VLi is clear, then when a test fails that element and any following
+it should **not** be considered part of the Vector. Consequently:
1. compute the branch test including whether CTR triggers
2. if the test fails against VSb, truncate VL to the *previous*
3. (optionally) decrement CTR
4. decide whether to branch.
-# Boolean Logic combinations
+## Boolean Logic combinations
-In a Scalar ISA, Branch-Conditional testing even of vector
-results may be performed through inversion of tests. NOR of
-all tests may be performed by inversion of the scalar condition
-and branching *out* from the scalar loop around elements,
-using scalar operations.
+In a Scalar ISA, Branch-Conditional testing even of vector results may be
+performed through inversion of tests. NOR of all tests may be performed
+by inversion of the scalar condition and branching *out* from the scalar
+loop around elements, using scalar operations.
In a parallel (Vector) ISA it is the ISA itself which must perform
-the prerequisite logic manipulation.
-Thus for SVP64 there are an extraordinary number of nesessary combinations
-which provide completely different and useful behaviour.
-Available options to combine:
+the prerequisite logic manipulation. Thus for SVP64 there are an
+extraordinary number of necessary combinations which provide completely
+different and useful behaviour. Available options to combine:
* `BO[0]` to make an unconditional branch would seem irrelevant if
it were not for predication and for side-effects (CTR Mode
need explicit instructions.
The most obviously useful combinations here are to set `BO[1]` to zero
-in order to turn `ALL` into Great-Big-NAND and `ANY` into
-Great-Big-NOR. Other Mode bits which perform behavioural inversion then
-have to work round the fact that the Condition Testing is NOR or NAND.
-The alternative to not having additional behavioural inversion
-(`SNZ`, `VSb`, `CTi`) would be to have a second (unconditional)
-branch directly after the first, which the first branch jumps over.
-This contrivance is avoided by the behavioural inversion bits.
+in order to turn `ALL` into Great-Big-NAND and `ANY` into Great-Big-NOR.
+Other Mode bits which perform behavioural inversion then have to work
+round the fact that the Condition Testing is NOR or NAND. Without the
+additional behavioural inversion bits (`SNZ`, `VSb`, `CTi`), the
+alternative would be a second (unconditional) branch directly after the
+first, which the first branch jumps over. The behavioural inversion bits
+avoid this contrivance.
-# Pseudocode and examples
+## Pseudocode and examples
Please see [[svp64/appendix]] regarding CR bit ordering and for
the definition of `CR{n}`
For comparative purposes this is a copy of the v3.0B `bc` pseudocode
```
-if (mode_is_64bit) then M <- 0
-else M <- 32
-if ¬BO[2] then CTR <- CTR - 1
-ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
-cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
-if ctr_ok & cond_ok then
- if AA then NIA <-iea EXTS(BD || 0b00)
- else NIA <-iea CIA + EXTS(BD || 0b00)
-if LK then LR <-iea CIA + 4
+ if (mode_is_64bit) then M <- 0
+ else M <- 32
+ if ¬BO[2] then CTR <- CTR - 1
+ ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
+ cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
+ if ctr_ok & cond_ok then
+ if AA then NIA <-iea EXTS(BD || 0b00)
+ else NIA <-iea CIA + EXTS(BD || 0b00)
+ if LK then LR <-iea CIA + 4
```
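
For readers who prefer something executable, the v3.0B pseudocode above
can be modelled roughly as follows. This is an illustrative sketch only:
`bc_v30b` and its argument layout (BO passed as a list of five bits with
`bo[0]` the most significant, following Power ISA bit numbering) are
assumptions made here, and only the branch decision and the CTR update
are modelled, not NIA or LR.

```python
MASK64 = (1 << 64) - 1


def bc_v30b(bo, cr_bit, ctr, mode_is_64bit=True):
    """Model the decision part of scalar v3.0B `bc` (illustrative).

    bo     -- list of 5 ints (0/1), bo[0] is the most significant bit
    cr_bit -- value (0/1) of the CR bit selected by BI
    ctr    -- current CTR value
    Returns (branch_taken, new_ctr).
    """
    # CTR is decremented unconditionally when BO[2] is clear
    if not bo[2]:
        ctr = (ctr - 1) & MASK64
    # CTR[M:63]: all 64 bits in 64-bit mode, low 32 bits otherwise
    ctr_low = ctr if mode_is_64bit else ctr & 0xFFFFFFFF
    ctr_ok = bool(bo[2]) or ((ctr_low != 0) != bool(bo[3]))
    # not (cr_bit xor BO[1]) is the same as cr_bit == BO[1]
    cond_ok = bool(bo[0]) or (cr_bit == bo[1])
    return (ctr_ok and cond_ok), ctr


if __name__ == "__main__":
    # BO=0b10000: decrement CTR, branch if CTR is then non-zero
    print(bc_v30b([1, 0, 0, 0, 0], 0, 2))
    # BO=0b01100: ignore CTR, branch if the CR bit is set
    print(bc_v30b([0, 1, 1, 0, 0], 1, 7))
```
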
Simplified pseudocode including LRu and CTR skipping, which illustrates
clearly that SVP64 Scalar Branches (VL=1) are **not** identical to
-v3.0B Scalar Branches. The key areas where differences occur are
-the inclusion of predication (which can still be used when VL=1), in
-when and why CTR is decremented (CTRtest Mode) and whether LR is
-updated (which is unconditional in v3.0B when LK=1, and conditional
-in SVP64 when LRu=1).
+v3.0B Scalar Branches. The key areas where differences occur are the
+inclusion of predication (which can still be used when VL=1), in when and
+why CTR is decremented (CTRtest Mode) and whether LR is updated (which
+is unconditional in v3.0B when LK=1, and conditional in SVP64 when LRu=1).
-Inline comments highlight the fact that the Scalar Branch behaviour
-and pseudocode is still clearly visible and embedded within the
-Vectorised variant:
+Inline comments highlight the fact that the Scalar Branch behaviour and
+pseudocode are still clearly visible and embedded within the Vectorised
+variant:
```
-if (mode_is_64bit) then M <- 0
-else M <- 32
-# the bit of CR to test, if the predicate bit is zero,
-# is overridden
-testbit = CR[BI+32]
-if ¬predicate_bit then testbit = SVRMmode.SNZ
-# otherwise apart from the override ctr_ok and cond_ok
-# are exactly the same
-ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
-cond_ok <- BO[0] | ¬(testbit ^ BO[1])
-if ¬predicate_bit & ¬SVRMmode.sz then
- # this is entirely new: CTR-test mode still decrements CTR
- # even when predicate-bits are zero
- if ¬BO[2] & CTRtest & ¬CTi then
- CTR = CTR - 1
- # instruction finishes here
-else
- # usual BO[2] CTR-mode now under CTR-test mode as well
- if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
- # new VLset mode, conditional test truncates VL
- if VLSET and VSb = (cond_ok & ctr_ok) then
- if SVRMmode.VLI then SVSTATE.VL = srcstep+1
- else SVSTATE.VL = srcstep
- # usual LR is now conditional, but also joined by SVLR
- lr_ok <- LK
- svlr_ok <- SVRMmode.SL
- if ctr_ok & cond_ok then
- if AA then NIA <-iea EXTS(BD || 0b00)
- else NIA <-iea CIA + EXTS(BD || 0b00)
- if SVRMmode.LRu then lr_ok <- ¬lr_ok
- if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
- if lr_ok then LR <-iea CIA + 4
- if svlr_ok then SVLR <- SVSTATE
+ if (mode_is_64bit) then M <- 0
+ else M <- 32
+ # the bit of CR to test, if the predicate bit is zero,
+ # is overridden
+ testbit = CR[BI+32]
+ if ¬predicate_bit then testbit = SVRMmode.SNZ
+ # otherwise apart from the override ctr_ok and cond_ok
+ # are exactly the same
+ ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
+ cond_ok <- BO[0] | ¬(testbit ^ BO[1])
+ if ¬predicate_bit & ¬SVRMmode.sz then
+ # this is entirely new: CTR-test mode still decrements CTR
+ # even when predicate-bits are zero
+ if ¬BO[2] & CTRtest & ¬CTi then
+ CTR = CTR - 1
+ # instruction finishes here
+ else
+ # usual BO[2] CTR-mode now under CTR-test mode as well
+ if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
+ # new VLset mode, conditional test truncates VL
+ if VLSET and VSb = (cond_ok & ctr_ok) then
+ if SVRMmode.VLI then SVSTATE.VL = srcstep+1
+ else SVSTATE.VL = srcstep
+ # usual LR is now conditional, but also joined by SVLR
+ lr_ok <- LK
+ svlr_ok <- SVRMmode.SL
+ if ctr_ok & cond_ok then
+ if AA then NIA <-iea EXTS(BD || 0b00)
+ else NIA <-iea CIA + EXTS(BD || 0b00)
+ if SVRMmode.LRu then lr_ok <- ¬lr_ok
+ if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
+ if lr_ok then LR <-iea CIA + 4
+ if svlr_ok then SVLR <- SVSTATE
```
Below is the pseudocode for SVP64 Branches, which is a little less
-obvious but identical to the above. The lack of obviousness is down
-to the early-exit opportunities.
+obvious but identical to the above. The lack of obviousness is down to
+the early-exit opportunities.
Effective pseudocode for Horizontal-First Mode:
```
-if (mode_is_64bit) then M <- 0
-else M <- 32
-cond_ok = not SVRMmode.ALL
-for srcstep in range(VL):
+ if (mode_is_64bit) then M <- 0
+ else M <- 32
+ cond_ok = SVRMmode.ALL # identity of the merge: 1 for ALL (AND), 0 for ANY (OR)
+ for srcstep in range(VL):
+ # select predicate bit or zero/one
+ if predicate[srcstep]:
+ # get SVP64 extended CR field 0..127
+ SVCRf = SVP64EXTRA(BI>>2)
+ CRbits = CR{SVCRf}
+ testbit = CRbits[BI & 0b11]
+ # testbit = CR[BI+32+srcstep*4]
+ else if not SVRMmode.sz:
+ # inverted CTR test skip mode
+ if ¬BO[2] & CTRtest & ¬CTi then
+ CTR = CTR - 1
+ continue # skip to next element
+ else
+ testbit = SVRMmode.SNZ
+ # actual element test here
+ ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
+ el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
+ # check if CTR dec should occur
+ ctrdec = ¬BO[2]
+ if CTRtest & (el_cond_ok ^ CTi) then
+ ctrdec = 0b0
+ if ctrdec then CTR <- CTR - 1
+ # merge in the test
+ if SVRMmode.ALL:
+ cond_ok &= (el_cond_ok & ctr_ok)
+ else
+ cond_ok |= (el_cond_ok & ctr_ok)
+ # test for VL to be set (and exit)
+ if VLSET and VSb = (el_cond_ok & ctr_ok) then
+ if SVRMmode.VLI then SVSTATE.VL = srcstep+1
+ else SVSTATE.VL = srcstep
+ break
+ # early exit?
+ if SVRMmode.ALL != (el_cond_ok & ctr_ok):
+ break
+ # SVP64 rules about Scalar registers still apply!
+ if SVCRf.scalar:
+ break
+ # loop finally done, now test if branch (and update LR)
+ lr_ok <- LK
+ svlr_ok <- SVRMmode.SL
+ if cond_ok then
+ if AA then NIA <-iea EXTS(BD || 0b00)
+ else NIA <-iea CIA + EXTS(BD || 0b00)
+ if SVRMmode.LRu then lr_ok <- ¬lr_ok
+ if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
+ if lr_ok then LR <-iea CIA + 4
+ if svlr_ok then SVLR <- SVSTATE
+```
+
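
The ALL/ANY merging and early-exit behaviour of the Horizontal-First loop
can be sketched in Python. This is a deliberately reduced model: CTR
handling and VLSET are left out, and the name `hf_cond_ok` and its
parameters are hypothetical. The accumulator is seeded with the identity
of the reduction (true for ALL, false for ANY); that seeding is a choice
made by this sketch.

```python
def hf_cond_ok(cr_bits, predicate, all_mode, sz=False, snz=0,
               bo0=0, bo1=1):
    """Reduce per-element CR tests to one branch decision (sketch).

    cr_bits   -- list of 0/1, the CR bit under test for each element
    predicate -- list of 0/1, predicate mask bit for each element
    all_mode  -- True for ALL (AND-reduce), False for ANY (OR-reduce)
    sz, snz   -- when sz is set, masked-out elements are tested as snz
    bo0, bo1  -- BO[0] (ignore the CR test) and BO[1] (expected value)
    """
    cond_ok = bool(all_mode)  # ASSUMPTION: start at reduction identity
    for bit, pred in zip(cr_bits, predicate):
        if pred:
            testbit = bit
        elif not sz:
            continue  # masked-out element is skipped entirely
        else:
            testbit = snz  # masked-out element is tested as SNZ
        # not (testbit xor BO[1]) is the same as testbit == BO[1]
        el_cond_ok = bool(bo0) or (testbit == bo1)
        if all_mode:
            cond_ok = cond_ok and el_cond_ok
        else:
            cond_ok = cond_ok or el_cond_ok
        # early exit: first failure under ALL, first success under ANY
        if bool(all_mode) != el_cond_ok:
            break
    return cond_ok


if __name__ == "__main__":
    print(hf_cond_ok([0, 1, 0], [1, 1, 1], all_mode=False))
    print(hf_cond_ok([0, 1, 0], [1, 1, 1], all_mode=True))
```

With `sz` set and `snz=0`, a single masked-out element makes an ALL test
fail; setting `snz=1` lets it pass.
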
+Pseudocode for Vertical-First Mode:
+
+```
+ # get SVP64 extended CR field 0..127
+ SVCRf = SVP64EXTRA(BI>>2)
+ CRbits = CR{SVCRf}
# select predicate bit or zero/one
if predicate[srcstep]:
- # get SVP64 extended CR field 0..127
- SVCRf = SVP64EXTRA(BI>>2)
- CRbits = CR{SVCRf}
+ if BRc = 1 then # CR0 vectorised
+ CR{SVCRf+srcstep} = CRbits
testbit = CRbits[BI & 0b11]
- # testbit = CR[BI+32+srcstep*4]
else if not SVRMmode.sz:
# inverted CTR test skip mode
if ¬BO[2] & CTRtest & ¬CTI then
- CTR = CTR - 1
- continue # skip to next element
+ CTR = CTR - 1
+ SVSTATE.srcstep = new_srcstep
+ exit # no branch testing
else
testbit = SVRMmode.SNZ
# actual element test here
- ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
- el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
- # check if CTR dec should occur
- ctrdec = ¬BO[2]
- if CTRtest & (el_cond_ok ^ CTi) then
- ctrdec = 0b0
- if ctrdec then CTR <- CTR - 1
- # merge in the test
- if SVRMmode.ALL:
- cond_ok &= (el_cond_ok & ctr_ok)
- else
- cond_ok |= (el_cond_ok & ctr_ok)
+ cond_ok <- BO[0] | ¬(testbit ^ BO[1])
# test for VL to be set (and exit)
- if VLSET and VSb = (el_cond_ok & ctr_ok) then
- if SVRMmode.VLI then SVSTATE.VL = srcstep+1
- else SVSTATE.VL = srcstep
- break
- # early exit?
- if SVRMmode.ALL != (el_cond_ok & ctr_ok):
- break
- # SVP64 rules about Scalar registers still apply!
- if SVCRf.scalar:
- break
-# loop finally done, now test if branch (and update LR)
-lr_ok <- LK
-svlr_ok <- SVRMmode.SL
-if cond_ok then
- if AA then NIA <-iea EXTS(BD || 0b00)
- else NIA <-iea CIA + EXTS(BD || 0b00)
- if SVRMmode.LRu then lr_ok <- ¬lr_ok
- if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
-if lr_ok then LR <-iea CIA + 4
-if svlr_ok then SVLR <- SVSTATE
+ if VLSET and cond_ok = VSb then
+ if SVRMmode.VLI
+ SVSTATE.VL = new_srcstep+1
+ else
+ SVSTATE.VL = new_srcstep
```
-Pseudocode for Vertical-First Mode:
-
-```
-# get SVP64 extended CR field 0..127
-SVCRf = SVP64EXTRA(BI>>2)
-CRbits = CR{SVCRf}
-# select predicate bit or zero/one
-if predicate[srcstep]:
- if BRc = 1 then # CR0 vectorised
- CR{SVCRf+srcstep} = CRbits
- testbit = CRbits[BI & 0b11]
-else if not SVRMmode.sz:
- # inverted CTR test skip mode
- if ¬BO[2] & CTRtest & ¬CTI then
- CTR = CTR - 1
- SVSTATE.srcstep = new_srcstep
- exit # no branch testing
-else
- testbit = SVRMmode.SNZ
-# actual element test here
-cond_ok <- BO[0] | ¬(testbit ^ BO[1])
-# test for VL to be set (and exit)
-if VLSET and cond_ok = VSb then
- if SVRMmode.VLI
- SVSTATE.VL = new_srcstep+1
- else
- SVSTATE.VL = new_srcstep
-```
-
-# Example Shader code
+### Example Shader code
```
-// assume f() g() or h() modify a and/or b
-while(a > 2) {
- if(b < 5)
- f();
- else
- g();
- h();
-}
+ // assume f() g() or h() modify a and/or b
+ while(a > 2) {
+ if(b < 5)
+ f();
+ else
+ g();
+ h();
+ }
```
which compiles to something like:
```
-vec<i32> a, b;
-// ...
-pred loop_pred = a > 2;
-// loop continues while any of a elements greater than 2
-while(loop_pred.any()) {
- // vector of predicate bits
- pred if_pred = loop_pred & (b < 5);
- // only call f() if at least 1 bit set
- if(if_pred.any()) {
- f(if_pred);
- }
-label1:
- // loop mask ANDs with inverted if-test
- pred else_pred = loop_pred & ~if_pred;
- // only call g() if at least 1 bit set
- if(else_pred.any()) {
- g(else_pred);
+ vec<i32> a, b;
+ // ...
+ pred loop_pred = a > 2;
+ // loop continues while any of a elements greater than 2
+ while(loop_pred.any()) {
+ // vector of predicate bits
+ pred if_pred = loop_pred & (b < 5);
+ // only call f() if at least 1 bit set
+ if(if_pred.any()) {
+ f(if_pred);
+ }
+ label1:
+ // loop mask ANDs with inverted if-test
+ pred else_pred = loop_pred & ~if_pred;
+ // only call g() if at least 1 bit set
+ if(else_pred.any()) {
+ g(else_pred);
+ }
+ h(loop_pred);
}
- h(loop_pred);
-}
```
which will end up as:
```
- # start from while loop test point
- b looptest
-while_loop:
- sv.cmpi CR80.v, b.v, 5 # vector compare b into CR64 Vector
- sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
- # only calculate loop_pred & pred_b because needed in f()
- sv.crand CR80.v.SO, CR60.v.GT, CR80.V.LT # if = loop & pred_b
- f(CR80.v.SO)
-skip_f:
- # illustrate inversion of pred_b. invert r30, test ALL
- # rather than SOME, but masked-out zero test would FAIL,
- # therefore masked-out instead is tested against 1 not 0
- sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
- # else = loop & ~pred_b, need this because used in g()
- sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.V.LT
- g(CR80.v.SO)
-skip_g:
- # conditionally call h(r30) if any loop pred set
- sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
-looptest:
- sv.cmpi CR60.v a.v, 2 # vector compare a into CR60 vector
- sv.crweird r30, CR60.GT # transfer GT vector to r30
- sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
-end:
+ # start from while loop test point
+ b looptest
+ while_loop:
+ sv.cmpi CR80.v, b.v, 5 # vector compare b into CR80 Vector
+ sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
+ # only calculate loop_pred & pred_b because needed in f()
+ sv.crand CR80.v.SO, CR60.v.GT, CR80.v.LT # if = loop & pred_b
+ f(CR80.v.SO)
+ skip_f:
+ # illustrate inversion of pred_b. invert r30, test ALL
+ # rather than SOME, but masked-out zero test would FAIL,
+ # therefore masked-out instead is tested against 1 not 0
+ sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
+ # else = loop & ~pred_b, need this because used in g()
+ sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.v.LT
+ g(CR80.v.SO)
+ skip_g:
+ # conditionally call h(r30) if any loop pred set
+ sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
+ looptest:
+ sv.cmpi CR60.v, a.v, 2 # vector compare a into CR60 vector
+ sv.crweird r30, CR60.GT # transfer GT vector to r30
+ sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
+ end:
```
-# TODO LRu example
+
+### LRu example
show why LRu would be useful in a loop. Imagine the following
c code:
```
-for (int i = 0; i < 8; i++) {
- if (x < y) break;
-}
+ for (int i = 0; i < 8; i++) {
+ if (x < y) break;
+ }
```
-Under these circumstances exiting from the loop is not only
-based on CTR it has become conditional on a CR result.
-Thus it is desirable that NIA *and* LR only be modified
-if the conditions are met
+Under these circumstances exiting from the loop is no longer based only
+on CTR: it has also become conditional on a CR result. Thus it is
+desirable that NIA *and* LR only be modified if the conditions are met.
v3.0 pseudocode for `bclrl`:
```
-if (mode_is_64bit) then M <- 0
-else M <- 32
-if ¬BO[2] then CTR <- CTR - 1
-ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
-cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
-if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
-if LK then LR <-iea CIA + 4
+ if (mode_is_64bit) then M <- 0
+ else M <- 32
+ if ¬BO[2] then CTR <- CTR - 1
+ ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
+ cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
+ if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
+ if LK then LR <-iea CIA + 4
```
the latter part for SVP64 `bclrl` becomes:
```
-for i in 0 to VL-1:
- ...
- ...
- cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
- lr_ok <- LK
- if ctr_ok & cond_ok then
- NIA <-iea LR[0:61] || 0b00
- if SVRMmode.LRu then lr_ok <- ¬lr_ok
- if lr_ok then LR <-iea CIA + 4
- # if NIA modified exit loop
+ for i in 0 to VL-1:
+ ...
+ ...
+ cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
+ lr_ok <- LK
+ if ctr_ok & cond_ok then
+ NIA <-iea LR[0:61] || 0b00
+ if SVRMmode.LRu then lr_ok <- ¬lr_ok
+ if lr_ok then LR <-iea CIA + 4
+ # if NIA modified exit loop
```
The reason why should be clear from this being a Vector loop:
-unconditional destruction of LR when LK=1 makes `sv.bclrl`
-ineffective, because the intention going into the loop is
-that the branch should be to the copy of LR set at the *start*
-of the loop, not half way through it.
-However if the change to LR only occurs if
-the branch is taken then it becomes a useful instruction.
-
-The following pseudocode should **not** be implemented because
-it violates the fundamental principle of SVP64 which is that
-SVP64 looping is a thin wrapper around Scalar Instructions.
-The pseducode below is more an actual Vector ISA Branch and
-as such is not at all appropriate:
+unconditional destruction of LR when LK=1 makes `sv.bclrl` ineffective,
+because the intention going into the loop is that the branch should be to
+the copy of LR set at the *start* of the loop, not half way through it.
+However if the change to LR only occurs if the branch is taken then it
+becomes a useful instruction.
+
+The following pseudocode should **not** be implemented because it
+violates the fundamental principle of SVP64 which is that SVP64 looping
+is a thin wrapper around Scalar Instructions. The pseudocode below is
+more an actual Vector ISA Branch and as such is not at all appropriate:
```
-for i in 0 to VL-1:
- ...
- ...
- cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
- if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
-# only at the end of looping is LK checked.
-# this completely violates the design principle of SVP64
-# and would actually need to be a separate (scalar)
-# instruction "set LR to CIA+4 but retrospectively"
-# which is clearly impossible
-if LK then LR <-iea CIA + 4
+ for i in 0 to VL-1:
+ ...
+ ...
+ cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
+ if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
+ # only at the end of looping is LK checked.
+ # this completely violates the design principle of SVP64
+ # and would actually need to be a separate (scalar)
+ # instruction "set LR to CIA+4 but retrospectively"
+ # which is clearly impossible
+ if LK then LR <-iea CIA + 4
```
+
+--------
+
+\newpage{}
+
+[[!tag standards]]