From 00ad6892edcbe7d08ac5d2f78b799a57428e20c5 Mon Sep 17 00:00:00 2001 From: Luke Kenneth Casson Leighton Date: Sun, 2 Apr 2023 18:23:56 +0100 Subject: [PATCH] remove branches from ls010 and re-add using pandoc --- openpower/sv/rfc/Makefile | 4 +- openpower/sv/rfc/ls010.mdwn | 792 ------------------------------------ 2 files changed, 2 insertions(+), 794 deletions(-) diff --git a/openpower/sv/rfc/Makefile b/openpower/sv/rfc/Makefile index a05267f03..27df9704f 100644 --- a/openpower/sv/rfc/Makefile +++ b/openpower/sv/rfc/Makefile @@ -1,6 +1,6 @@ all: ls001.pdf ls002.pdf ls003.pdf ls004.pdf ls005.pdf ls006.pdf ls007.pdf -ls010.pdf: ls010.mdwn ../ldst.mdwn +ls010.pdf: ls010.mdwn ../ldst.mdwn ../branches.mdwn pandoc \ -V margin-top=0.9in \ -V margin-bottom=0.9in \ @@ -9,7 +9,7 @@ ls010.pdf: ls010.mdwn ../ldst.mdwn -V fontsize=9pt \ -V papersize=legal \ -V linkcolor=blue \ - -f markdown ls010.mdwn ../ldst.mdwn \ + -f markdown ls010.mdwn ../ldst.mdwn ../branches.mdwn \ -s --self-contained \ --mathjax \ -o ls010.pdf diff --git a/openpower/sv/rfc/ls010.mdwn b/openpower/sv/rfc/ls010.mdwn index acdba62ab..4ce796f0e 100644 --- a/openpower/sv/rfc/ls010.mdwn +++ b/openpower/sv/rfc/ls010.mdwn @@ -1532,798 +1532,6 @@ treat *individual bits* of a GPR effectively as elements. They are expected to be Micro-coded by most Hardware implementations. --------- - -\newpage{} - -# SVP64 Branch Conditional behaviour - -Please note: although similar, SVP64 Branch instructions should be -considered completely separate and distinct from standard scalar -OpenPOWER-approved v3.0B branches. **v3.0B branches are in no way -impacted, altered, changed or modified in any way, shape or form by the -SVP64 Vectorised Variants**. - -It is also extremely important to note that Branches are the sole -pseudo-exception in SVP64 to `Scalar Identity Behaviour`. SVP64 Branches -contain additional modes that are useful for scalar operations (i.e. even -when VL=1 or when using single-bit predication). - -**Rationale** - -Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test -a Condition Register. However for parallel processing it is simply -impossible to perform multiple independent branches: the Program -Counter simply cannot branch to multiple destinations based on multiple -conditions. The best that can be done is to test multiple Conditions -and make a decision of a *single* branch, based on analysis of a *Vector* -of CR Fields which have just been calculated from a *Vector* of results. - -In 3D Shader binaries, which are inherently parallelised and predicated, -testing all or some results and branching based on multiple tests is -extremely common, and a fundamental part of Shader Compilers. Example: -without such multi-condition test-and-branch, if a predicate mask is -all zeros a large batch of instructions may be masked out to `nop`, -and it would waste CPU cycles to run them. 3D GPU ISAs can test for -this scenario and, with the appropriate predicate-analysis instruction, -jump over fully-masked-out operations, by spotting that *all* Conditions -are false. - -Unless Branches are aware and capable of such analysis, additional -instructions would be required which perform Horizontal Cumulative -analysis of Vectorised Condition Register Fields, in order to reduce -the Vector of CR Fields down to one single yes or no decision that a -Scalar-only v3.0B Branch-Conditional could cope with. Such instructions -would be unavoidable, required, and costly by comparison to a single -Vector-aware Branch. Therefore, in order to be commercially competitive, -`sv.bc` and other Vector-aware Branch Conditional instructions are a -high priority for 3D GPU (and OpenCL-style) workloads. - -Given that Power ISA v3.0B is already quite powerful, particularly -the Condition Registers and their interaction with Branches, there are -opportunities to create extremely flexible and compact Vectorised Branch -behaviour. In addition, the side-effects (updating of CTR, truncation -of VL, described below) make it a useful instruction even if the branch -points to the next instruction (no actual branch). - -## Overview - -When considering an "array" of branch-tests, there are four -primarily-useful modes: AND, OR, NAND and NOR of all Conditions. -NAND and NOR may be synthesised from AND and OR by inverting `BO[1]` -which just leaves two modes: - -* Branch takes place on the **first** CR Field test to succeed - (a Great Big OR of all condition tests). Exit occurs - on the first **successful** test. -* Branch takes place only if **all** CR field tests succeed: - a Great Big AND of all condition tests. Exit occurs - on the first **failed** test. - -Early-exit is enacted such that the Vectorised Branch does not -perform needless extra tests, which will help reduce reads on -the Condition Register file. - -*Note: Early-exit is **MANDATORY** (required) behaviour. Branches -**MUST** exit at the first sequentially-encountered failure point, -for exactly the same reasons for which it is mandatory in programming -languages doing early-exit: to avoid damaging side-effects and to provide -deterministic behaviour. Speculative testing of Condition Register -Fields is permitted, as is speculative calculation of CTR, as long as, -as usual in any Out-of-Order microarchitecture, that speculative testing -is cancelled should an early-exit occur. i.e. the speculation must be -"precise": Program Order must be preserved* - -Also note that when early-exit occurs in Horizontal-first Mode, srcstep, -dststep etc. are all reset, ready to begin looping from the beginning -for the next instruction. However for Vertical-first Mode srcstep -etc. are incremented "as usual" i.e. an early-exit has no special impact, -regardless of whether the branch occurred or not. This can leave srcstep -etc. in what may be considered an unusual state on exit from a loop and -it is up to the programmer to reset srcstep, dststep etc. to known-good -values *(easily achieved with `setvl`)*. - -Additional useful behaviour involves two primary Modes (both of which -may be enabled and combined): - -* **VLSET Mode**: identical to Data-Dependent Fail-First Mode - for Arithmetic SVP64 operations, with more - flexibility and a close interaction and integration into the - underlying base Scalar v3.0B Branch instruction. - Truncation of VL takes place around the early-exit point. -* **CTR-test Mode**: gives much more flexibility over when and why - CTR is decremented, including options to decrement if a Condition - test succeeds *or if it fails*. - -With these side-effects, basic Boolean Logic Analysis advises that it -is important to provide a means to enact them each based on whether -testing succeeds *or fails*. This results in a not-insignificant number -of additional Mode Augmentation bits, accompanying VLSET and CTR-test -Modes respectively. - -Predicate skipping or zeroing may, as usual with SVP64, be controlled by -`sz`. Where the predicate is masked out and zeroing is enabled, then in -such circumstances the same Boolean Logic Analysis dictates that rather -than testing only against zero, the option to test against one is also -prudent. This introduces a new immediate field, `SNZ`, which works in -conjunction with `sz`. - -Vectorised Branches can be used in either SVP64 Horizontal-First or -Vertical-First Mode. Essentially, at an element level, the behaviour -is identical in both Modes, although the `ALL` bit is meaningless in -Vertical-First Mode. - -It is also important to bear in mind that, fundamentally, Vectorised -Branch-Conditional is still extremely close to the Scalar v3.0B -Branch-Conditional instructions, and that the same v3.0B Scalar -Branch-Conditional instructions are still *completely separate and -independent*, being unaltered and unaffected by their SVP64 variants in -every conceivable way. - -*Programming note: One important point is that SVP64 instructions are -64 bit. (8 bytes not 4). This needs to be taken into consideration -when computing branch offsets: the offset is relative to the start of -the instruction, which **includes** the SVP64 Prefix* - -## Format and fields - -With element-width overrides being meaningless for Condition Register -Fields, bits 4 thru 7 of SVP64 RM may be used for additional Mode bits. - -SVP64 RM `MODE` (includes repurposing `ELWIDTH` bits 4:5, and -`ELWIDTH_SRC` bits 6-7 for *alternate* uses) for Branch Conditional: - -| 4 | 5 | 6 | 7 | 17 | 18 | 19 | 20 | 21 | 22 23 | description | -| - | - | - | - | -- | -- | -- | -- | --- |--------|----------------- | -|ALL|SNZ| / | / | SL |SLu | 0 | 0 | / | LRu sz | simple mode | -|ALL|SNZ| / |VSb| SL |SLu | 0 | 1 | VLI | LRu sz | VLSET mode | -|ALL|SNZ|CTi| / | SL |SLu | 1 | 0 | / | LRu sz | CTR-test mode | -|ALL|SNZ|CTi|VSb| SL |SLu | 1 | 1 | VLI | LRu sz | CTR-test+VLSET mode | - -Brief description of fields: - -* **sz=1** if predication is enabled and `sz=1` and a predicate - element bit is zero, `SNZ` will - be substituted in place of the CR bit selected by `BI`, - as the Condition tested. - Contrast this with - normal SVP64 `sz=1` behaviour, where *only* a zero is put in - place of masked-out predicate bits. -* **sz=0** When `sz=0` skipping occurs as usual on - masked-out elements, but unlike all - other SVP64 behaviour which entirely skips an element with - no related side-effects at all, there are certain - special circumstances where CTR - may be decremented. See CTR-test Mode, below. -* **ALL** when set, all branch conditional tests must pass in order for - the branch to succeed. When clear, it is the first sequentially - encountered successful test that causes the branch to succeed. - This is identical behaviour to how programming languages perform - early-exit on Boolean Logic chains. -* **VLI** VLSET is identical to Data-dependent Fail-First mode. - In VLSET mode, VL *may* (depending on `VSb`) be truncated. - If VLI (Vector Length Inclusive) is clear, - VL is truncated to *exclude* the current element, otherwise it is - included. SVSTATE.MVL is not altered: only VL. -* **SL** identical to `LR` except applicable to SVSTATE. If `SL` - is set, SVSTATE is transferred to SVLR (conditionally on - whether `SLu` is set). -* **SLu**: SVSTATE Link Update, like `LRu` except applies to SVSTATE. -* **LRu**: Link Register Update, used in conjunction with LK=1 - to make LR update conditional -* **VSb** In VLSET Mode, after testing, - if VSb is set, VL is truncated if the test succeeds. If VSb is clear, - VL is truncated if a test *fails*. Masked-out (skipped) - bits are not considered - part of testing when `sz=0` -* **CTi** CTR inversion. CTR-test Mode normally decrements per element - tested. CTR inversion decrements if a test *fails*. Only relevant - in CTR-test Mode. - -LRu and CTR-test modes are where SVP64 Branches subtly differ from -Scalar v3.0B Branches. `sv.bcl` for example will always update LR, -whereas `sv.bcl/lru` will only update LR if the branch succeeds. - -Of special interest is that when using ALL Mode (Great Big AND of all -Condition Tests), if `VL=0`, which is rare but can occur in Data-Dependent -Modes, the Branch will always take place because there will be no failing -Condition Tests to prevent it. Likewise when not using ALL Mode (Great -Big OR of all Condition Tests) and `VL=0` the Branch is guaranteed not -to occur because there will be no *successful* Condition Tests to make -it happen. - -## Vectorised CR Field numbering, and Scalar behaviour - -It is important to keep in mind that just like all SVP64 instructions, -the `BI` field of the base v3.0B Branch Conditional instruction may be -extended by SVP64 EXTRA augmentation, as well as be marked as either -Scalar or Vector. It is also crucially important to keep in mind that for -CRs, SVP64 sequentially increments the CR *Field* numbers. CR *Fields* -are treated as elements, not bit-numbers of the CR *register*. - -The `BI` operand of Branch Conditional operations is five bits, in scalar -v3.0B this would select one bit of the 32 bit CR, comprising eight CR -Fields of 4 bits each. In SVP64 there are 16 32 bit CRs, containing -128 4-bit CR Fields. Therefore, the 2 LSBs of `BI` select the bit from -the CR Field (EQ LT GT SO), and the top 3 bits are extended to either -scalar or vector and to select CR Fields 0..127 as specified in SVP64 -[[sv/svp64/appendix]]. - -When the CR Fields selected by SVP64-Augmented `BI` is marked as scalar, -then as the usual SVP64 rules apply: the Vector loop ends at the first -element tested (the first CR *Field*), after taking predication into -consideration. Thus, also as usual, when a predicate mask is given, and -`BI` marked as scalar, and `sz` is zero, srcstep skips forward to the -first non-zero predicated element, and only that one element is tested. - -In other words, the fact that this is a Branch Operation (instead of an -arithmetic one) does not result, ultimately, in significant changes as -to how SVP64 is fundamentally applied, except with respect to: - -* the unique properties associated with conditionally - changing the Program Counter (aka "a Branch"), resulting in early-out - opportunities -* CTR-testing - -Both are outlined below, in later sections. - -## Horizontal-First and Vertical-First Modes - -In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big -AND) results in early exit: no more updates to CTR occur (if requested); -no branch occurs, and LR is not updated (if requested). Likewise for -non-ALL mode (Great Big Or) on first success early exit also occurs, -however this time with the Branch proceeding. In both cases the testing -of the Vector of CRs should be done in linear sequential order (or in -REMAP re-sequenced order): such that tests that are sequentially beyond -the exit point are *not* carried out. (*Note: it is standard practice -in Programming languages to exit early from conditional tests, however a -little unusual to consider in an ISA that is designed for Parallel Vector -Processing. The reason is to have strictly-defined guaranteed behaviour*) - -In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED` -behaviour. Given that only one element is being tested at a time in -Vertical-First Mode, a test designed to be done on multiple bits is -meaningless. - -## Description and Modes - -Predication in both INT and CR modes may be applied to `sv.bc` and other -SVP64 Branch Conditional operations, exactly as they may be applied to -other SVP64 operations. When `sz` is zero, any masked-out Branch-element -operations are not included in condition testing, exactly like all other -SVP64 operations, *including* side-effects such as potentially updating -LR or CTR, which will also be skipped. There is *one* exception here, -which is when `BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element -predicate mask bit is also zero: under these special circumstances CTR -will also decrement. - -When `sz` is non-zero, this normally requests insertion of a zero in -place of the input data, when the relevant predicate mask bit is zero. -This would mean that a zero is inserted in place of `CR[BI+32]` for -testing against `BO`, which may not be desirable in all circumstances. -Therefore, an extra field is provided `SNZ`, which, if set, will insert -a **one** in place of a masked-out element, instead of a zero. - -(*Note: Both options are provided because it is useful to deliberately -cause the Branch-Conditional Vector testing to fail at a specific point, -controlled by the Predicate mask. This is particularly useful in `VLSET` -mode, which will truncate SVSTATE.VL at the point of the first failed -test.*) - -Normally, CTR mode will decrement once per Condition Test, resulting under -normal circumstances that CTR reduces by up to VL in Horizontal-First -Mode. Just as when v3.0B Branch-Conditional saves at least one instruction -on tight inner loops through auto-decrementation of CTR, likewise it -is also possible to save instruction count for SVP64 loops in both -Vertical-First and Horizontal-First Mode, particularly in circumstances -where there is conditional interaction between the element computation -and testing, and the continuation (or otherwise) of a given loop. The -potential combinations of interactions is why CTR testing options have -been added. - -Also, the unconditional bit `BO[0]` is still relevant when Predication -is applied to the Branch because in `ALL` mode all nonmasked bits have -to be tested, and when `sz=0` skipping occurs. Even when VLSET mode is -not used, CTR may still be decremented by the total number of nonmasked -elements, acting in effect as either a popcount or cntlz depending -on which mode bits are set. In short, Vectorised Branch becomes an -extremely powerful tool. - -**Micro-Architectural Implementation Note**: *when implemented on top -of a Multi-Issue Out-of-Order Engine it is possible to pass a copy of -the predicate and the prerequisite CR Fields to all Branch Units, as -well as the current value of CTR at the time of multi-issue, and for -each Branch Unit to compute how many times CTR would be subtracted, -in a fully-deterministic and parallel fashion. A SIMD-based Branch -Unit, receiving and processing multiple CR Fields covered by multiple -predicate bits, would do the exact same thing. Obviously, however, if -CTR is modified within any given loop (mtctr) the behaviour of CTR is -no longer deterministic.* - -### Link Register Update - -For a Scalar Branch, unconditional updating of the Link Register LR -is useful and practical. However, if a loop of CR Fields is tested, -unconditional updating of LR becomes problematic. - -For example when using `bclr` with `LRu=1,LK=0` in Horizontal-First Mode, -LR's value will be unconditionally overwritten after the first element, -such that for execution (testing) of the second element, LR has the value -`CIA+8`. This is covered in the `bclrl` example, in a later section. - -The addition of a LRu bit modifies behaviour in conjunction with LK, -as follows: - -* `sv.bc` When LRu=0,LK=0, Link Register is not updated -* `sv.bcl` When LRu=0,LK=1, Link Register is updated unconditionally -* `sv.bcl/lru` When LRu=1,LK=1, Link Register will - only be updated if the Branch Condition fails. -* `sv.bc/lru` When LRu=1,LK=0, Link Register will only be updated if - the Branch Condition succeeds. - -This avoids destruction of LR during loops (particularly Vertical-First -ones). - -**SVLR and SVSTATE** - -For precisely the reasons why `LK=1` was added originally to the Power -ISA, with SVSTATE being a peer of the Program Counter it becomes necessary -to also add an SVLR (SVSTATE Link Register) and corresponding control bits -`SL` and `SLu`. - -### CTR-test - -Where a standard Scalar v3.0B branch unconditionally decrements CTR when -`BO[2]` is clear, CTR-test Mode introduces more flexibility which allows -CTR to be used for many more types of Vector loops constructs. - -CTR-test mode and CTi interaction is as follows: note that `BO[2]` -is still required to be clear for CTR decrements to be considered, -exactly as is the case in Scalar Power ISA v3.0B - -* **CTR-test=0, CTi=0**: CTR decrements on a per-element basis - if `BO[2]` is zero. Masked-out elements when `sz=0` are - skipped (i.e. CTR is *not* decremented when the predicate - bit is zero and `sz=0`). -* **CTR-test=0, CTi=1**: CTR decrements on a per-element basis - if `BO[2]` is zero and a masked-out element is skipped - (`sz=0` and predicate bit is zero). This one special case is the - **opposite** of other combinations, as well as being - completely different from normal SVP64 `sz=0` behaviour) -* **CTR-test=1, CTi=0**: CTR decrements on a per-element basis - if `BO[2]` is zero and the Condition Test succeeds. - Masked-out elements when `sz=0` are skipped (including - not decrementing CTR) -* **CTR-test=1, CTi=1**: CTR decrements on a per-element basis - if `BO[2]` is zero and the Condition Test *fails*. - Masked-out elements when `sz=0` are skipped (including - not decrementing CTR) - -`CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the -only time in the entirety of SVP64 that has side-effects when -a predicate mask bit is clear. **All** other SVP64 operations -entirely skip an element when sz=0 and a predicate mask bit is zero. -It is also critical to emphasise that in this unusual mode, -no other side-effects occur: **only** CTR is decremented, i.e. the -rest of the Branch operation is skipped. - -### VLSET Mode - -VLSET Mode truncates the Vector Length so that subsequent instructions -operate on a reduced Vector Length. This is similar to Data-dependent -Fail-First and LD/ST Fail-First, where for VLSET the truncation occurs -at the Branch decision-point. - -Interestingly, due to the side-effects of `VLSET` mode it is actually -useful to use Branch Conditional even to perform no actual branch -operation, i.e to point to the instruction after the branch. Truncation of -VL would thus conditionally occur yet control flow alteration would not. - -`VLSET` mode with Vertical-First is particularly unusual. Vertical-First -is designed to be used for explicit looping, where an explicit call to -`svstep` is required to move both srcstep and dststep on to the next -element, until VL (or other condition) is reached. Vertical-First Looping -is expected (required) to terminate if the end of the Vector, VL, is -reached. If however that loop is terminated early because VL is truncated, -VLSET with Vertical-First becomes meaningless. Resolving this would -require two branches: one Conditional, the other branching unconditionally -to create the loop, where the Conditional one jumps over it. - -Therefore, with `VSb`, the option to decide whether truncation should -occur if the branch succeeds *or* if the branch condition fails allows -for the flexibility required. This allows a Vertical-First Branch to -*either* be used as a branch-back (loop) *or* as part of a conditional -exit or function call from *inside* a loop, and for VLSET to be integrated -into both types of decision-making. - -In the case of a Vertical-First branch-back (loop), with `VSb=0` the -branch takes place if success conditions are met, but on exit from that -loop (branch condition fails), VL will be truncated. This is extremely -useful. - -`VLSET` mode with Horizontal-First when `VSb=0` is still useful, because -it can be used to truncate VL to the first predicated (non-masked-out) -element. - -The truncation point for VL, when VLi is clear, must not include skipped -elements that preceded the current element being tested. Example: -`sz=0, VLi=0, predicate mask = 0b110010` and the Condition Register -failure point is at CR Field element 4. - -* Testing at element 0 is skipped because its predicate bit is zero -* Testing at element 1 passed -* Testing elements 2 and 3 are skipped because their - respective predicate mask bits are zero -* Testing element 4 fails therefore VL is truncated to **2** - not 4 due to elements 2 and 3 being skipped. - -If `sz=1` in the above example *then* VL would have been set to 4 because -in non-zeroing mode the zero'd elements are still effectively part of the -Vector (with their respective elements set to `SNZ`) - -If `VLI=1` then VL would be set to 5 regardless of sz, due to being inclusive -of the element actually being tested. - -### VLSET and CTR-test combined - -If both CTR-test and VLSET Modes are requested, it is important to -observe the correct order. What occurs depends on whether VLi is enabled, -because VLi affects the length, VL. - -If VLi (VL truncate inclusive) is set: - -1. compute the test including whether CTR triggers -2. (optionally) decrement CTR -3. (optionally) truncate VL (VSb inverts the decision) -4. decide (based on step 1) whether to terminate looping - (including not executing step 5) -5. decide whether to branch. - -If VLi is clear, then when a test fails that element -and any following it -should **not** be considered part of the Vector. Consequently: - -1. compute the branch test including whether CTR triggers -2. if the test fails against VSb, truncate VL to the *previous* - element, and terminate looping. No further steps executed. -3. (optionally) decrement CTR -4. decide whether to branch. - -## Boolean Logic combinations - -In a Scalar ISA, Branch-Conditional testing even of vector results may be -performed through inversion of tests. NOR of all tests may be performed -by inversion of the scalar condition and branching *out* from the scalar -loop around elements, using scalar operations. - -In a parallel (Vector) ISA it is the ISA itself which must perform -the prerequisite logic manipulation. Thus for SVP64 there are an -extraordinary number of nesessary combinations which provide completely -different and useful behaviour. Available options to combine: - -* `BO[0]` to make an unconditional branch would seem irrelevant if - it were not for predication and for side-effects (CTR Mode - for example) -* Enabling CTR-test Mode and setting `BO[2]` can still result in the - Branch - taking place, not because the Condition Test itself failed, but - because CTR reached zero **because**, as required by CTR-test mode, - CTR was decremented as a **result** of Condition Tests failing. -* `BO[1]` to select whether the CR bit being tested is zero or nonzero -* `R30` and `~R30` and other predicate mask options including CR and - inverted CR bit testing -* `sz` and `SNZ` to insert either zeros or ones in place of masked-out - predicate bits -* `ALL` or `ANY` behaviour corresponding to `AND` of all tests and - `OR` of all tests, respectively. -* Predicate Mask bits, which combine in effect with the CR being - tested. -* Inversion of Predicate Masks (`~r3` instead of `r3`, or using - `NE` rather than `EQ`) which results in an additional - level of possible ANDing, ORing etc. that would otherwise - need explicit instructions. - -The most obviously useful combinations here are to set `BO[1]` to zero -in order to turn `ALL` into Great-Big-NAND and `ANY` into Great-Big-NOR. -Other Mode bits which perform behavioural inversion then have to work -round the fact that the Condition Testing is NOR or NAND. The alternative -to not having additional behavioural inversion (`SNZ`, `VSb`, `CTi`) -would be to have a second (unconditional) branch directly after the first, -which the first branch jumps over. This contrivance is avoided by the -behavioural inversion bits. - -## Pseudocode and examples - -Please see the SVP64 appendix regarding CR bit ordering and for -the definition of `CR{n}` - -For comparative purposes this is a copy of the v3.0B `bc` pseudocode - -``` - if (mode_is_64bit) then M <- 0 - else M <- 32 - if ¬BO[2] then CTR <- CTR - 1 - ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3]) - cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1]) - if ctr_ok & cond_ok then - if AA then NIA <-iea EXTS(BD || 0b00) - else NIA <-iea CIA + EXTS(BD || 0b00) - if LK then LR <-iea CIA + 4 -``` - -Simplified pseudocode including LRu and CTR skipping, which illustrates -clearly that SVP64 Scalar Branches (VL=1) are **not** identical to -v3.0B Scalar Branches. The key areas where differences occur are the -inclusion of predication (which can still be used when VL=1), in when and -why CTR is decremented (CTRtest Mode) and whether LR is updated (which -is unconditional in v3.0B when LK=1, and conditional in SVP64 when LRu=1). - -Inline comments highlight the fact that the Scalar Branch behaviour and -pseudocode is still clearly visible and embedded within the Vectorised -variant: - -``` - if (mode_is_64bit) then M <- 0 - else M <- 32 - # the bit of CR to test, if the predicate bit is zero, - # is overridden - testbit = CR[BI+32] - if ¬predicate_bit then testbit = SVRMmode.SNZ - # otherwise apart from the override ctr_ok and cond_ok - # are exactly the same - ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3]) - cond_ok <- BO[0] | ¬(testbit ^ BO[1]) - if ¬predicate_bit & ¬SVRMmode.sz then - # this is entirely new: CTR-test mode still decrements CTR - # even when predicate-bits are zero - if ¬BO[2] & CTRtest & ¬CTi then - CTR = CTR - 1 - # instruction finishes here - else - # usual BO[2] CTR-mode now under CTR-test mode as well - if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1 - # new VLset mode, conditional test truncates VL - if VLSET and VSb = (cond_ok & ctr_ok) then - if SVRMmode.VLI then SVSTATE.VL = srcstep+1 - else SVSTATE.VL = srcstep - # usual LR is now conditional, but also joined by SVLR - lr_ok <- LK - svlr_ok <- SVRMmode.SL - if ctr_ok & cond_ok then - if AA then NIA <-iea EXTS(BD || 0b00) - else NIA <-iea CIA + EXTS(BD || 0b00) - if SVRMmode.LRu then lr_ok <- ¬lr_ok - if SVRMmode.SLu then svlr_ok <- ¬svlr_ok - if lr_ok then LR <-iea CIA + 4 - if svlr_ok then SVLR <- SVSTATE -``` - -Below is the pseudocode for SVP64 Branches, which is a little less -obvious but identical to the above. The lack of obviousness is down to -the early-exit opportunities. - -Effective pseudocode for Horizontal-First Mode: - -``` - if (mode_is_64bit) then M <- 0 - else M <- 32 - cond_ok = not SVRMmode.ALL - for srcstep in range(VL): - # select predicate bit or zero/one - if predicate[srcstep]: - # get SVP64 extended CR field 0..127 - SVCRf = SVP64EXTRA(BI>>2) - CRbits = CR{SVCRf} - testbit = CRbits[BI & 0b11] - # testbit = CR[BI+32+srcstep*4] - else if not SVRMmode.sz: - # inverted CTR test skip mode - if ¬BO[2] & CTRtest & ¬CTI then - CTR = CTR - 1 - continue # skip to next element - else - testbit = SVRMmode.SNZ - # actual element test here - ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3]) - el_cond_ok <- BO[0] | ¬(testbit ^ BO[1]) - # check if CTR dec should occur - ctrdec = ¬BO[2] - if CTRtest & (el_cond_ok ^ CTi) then - ctrdec = 0b0 - if ctrdec then CTR <- CTR - 1 - # merge in the test - if SVRMmode.ALL: - cond_ok &= (el_cond_ok & ctr_ok) - else - cond_ok |= (el_cond_ok & ctr_ok) - # test for VL to be set (and exit) - if VLSET and VSb = (el_cond_ok & ctr_ok) then - if SVRMmode.VLI then SVSTATE.VL = srcstep+1 - else SVSTATE.VL = srcstep - break - # early exit? - if SVRMmode.ALL != (el_cond_ok & ctr_ok): - break - # SVP64 rules about Scalar registers still apply! - if SVCRf.scalar: - break - # loop finally done, now test if branch (and update LR) - lr_ok <- LK - svlr_ok <- SVRMmode.SL - if cond_ok then - if AA then NIA <-iea EXTS(BD || 0b00) - else NIA <-iea CIA + EXTS(BD || 0b00) - if SVRMmode.LRu then lr_ok <- ¬lr_ok - if SVRMmode.SLu then svlr_ok <- ¬svlr_ok - if lr_ok then LR <-iea CIA + 4 - if svlr_ok then SVLR <- SVSTATE -``` - -Pseudocode for Vertical-First Mode: - -``` - # get SVP64 extended CR field 0..127 - SVCRf = SVP64EXTRA(BI>>2) - CRbits = CR{SVCRf} - # select predicate bit or zero/one - if predicate[srcstep]: - if BRc = 1 then # CR0 vectorised - CR{SVCRf+srcstep} = CRbits - testbit = CRbits[BI & 0b11] - else if not SVRMmode.sz: - # inverted CTR test skip mode - if ¬BO[2] & CTRtest & ¬CTI then - CTR = CTR - 1 - SVSTATE.srcstep = new_srcstep - exit # no branch testing - else - testbit = SVRMmode.SNZ - # actual element test here - cond_ok <- BO[0] | ¬(testbit ^ BO[1]) - # test for VL to be set (and exit) - if VLSET and cond_ok = VSb then - if SVRMmode.VLI - SVSTATE.VL = new_srcstep+1 - else - SVSTATE.VL = new_srcstep -``` - -### Example Shader code - -``` - // assume f() g() or h() modify a and/or b - while(a > 2) { - if(b < 5) - f(); - else - g(); - h(); - } -``` - -which compiles to something like: - -``` - vec a, b; - // ... - pred loop_pred = a > 2; - // loop continues while any of a elements greater than 2 - while(loop_pred.any()) { - // vector of predicate bits - pred if_pred = loop_pred & (b < 5); - // only call f() if at least 1 bit set - if(if_pred.any()) { - f(if_pred); - } - label1: - // loop mask ANDs with inverted if-test - pred else_pred = loop_pred & ~if_pred; - // only call g() if at least 1 bit set - if(else_pred.any()) { - g(else_pred); - } - h(loop_pred); - } -``` - -which will end up as: - -``` - # start from while loop test point - b looptest - while_loop: - sv.cmpi CR80.v, b.v, 5 # vector compare b into CR64 Vector - sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none - # only calculate loop_pred & pred_b because needed in f() - sv.crand CR80.v.SO, CR60.v.GT, CR80.V.LT # if = loop & pred_b - f(CR80.v.SO) - skip_f: - # illustrate inversion of pred_b. invert r30, test ALL - # rather than SOME, but masked-out zero test would FAIL, - # therefore masked-out instead is tested against 1 not 0 - sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g - # else = loop & ~pred_b, need this because used in g() - sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.V.LT - g(CR80.v.SO) - skip_g: - # conditionally call h(r30) if any loop pred set - sv.bclr/m=r30/~ALL/sz BO[1]=1 h() - looptest: - sv.cmpi CR60.v a.v, 2 # vector compare a into CR60 vector - sv.crweird r30, CR60.GT # transfer GT vector to r30 - sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop - end: -``` - -### LRu example - -show why LRu would be useful in a loop. Imagine the following -c code: - -``` - for (int i = 0; i < 8; i++) { - if (x < y) break; - } -``` - -Under these circumstances exiting from the loop is not only based on -CTR it has become conditional on a CR result. Thus it is desirable that -NIA *and* LR only be modified if the conditions are met - -v3.0 pseudocode for `bclrl`: - -``` - if (mode_is_64bit) then M <- 0 - else M <- 32 - if ¬BO[2] then CTR <- CTR - 1 - ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3]) - cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1]) - if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00 - if LK then LR <-iea CIA + 4 -``` - -the latter part for SVP64 `bclrl` becomes: - -``` - for i in 0 to VL-1: - ... - ... - cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1]) - lr_ok <- LK - if ctr_ok & cond_ok then - NIA <-iea LR[0:61] || 0b00 - if SVRMmode.LRu then lr_ok <- ¬lr_ok - if lr_ok then LR <-iea CIA + 4 - # if NIA modified exit loop -``` - -The reason why should be clear from this being a Vector loop: -unconditional destruction of LR when LK=1 makes `sv.bclrl` ineffective, -because the intention going into the loop is that the branch should be to -the copy of LR set at the *start* of the loop, not half way through it. -However if the change to LR only occurs if the branch is taken then it -becomes a useful instruction. - -The following pseudocode should **not** be implemented because it -violates the fundamental principle of SVP64 which is that SVP64 looping -is a thin wrapper around Scalar Instructions. The pseducode below is -more an actual Vector ISA Branch and as such is not at all appropriate: - -``` - for i in 0 to VL-1: - ... - ... - cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1]) - if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00 - # only at the end of looping is LK checked. - # this completely violates the design principle of SVP64 - # and would actually need to be a separate (scalar) - # instruction "set LR to CIA+4 but retrospectively" - # which is clearly impossible - if LK then LR <-iea CIA + 4 -``` - [[!tag opf_rfc]] -------- -- 2.30.2