X-Git-Url: https://git.libre-soc.org/?a=blobdiff_plain;f=openpower%2Fsv%2Fbranches.mdwn;h=75edd273c526e41571d77d17c955014c2eff0da4;hb=8bb0390b0b5edb11f1ce8b7d211dfb0b8a7e39ca;hp=ceb5022cca27dc33b1a448780075b2365600f150;hpb=5970ba7f029867fbb34096d93d525c7c10b4ad44;p=libreriscv.git

diff --git a/openpower/sv/branches.mdwn b/openpower/sv/branches.mdwn
index ceb5022cc..75edd273c 100644
--- a/openpower/sv/branches.mdwn
+++ b/openpower/sv/branches.mdwn
@@ -21,6 +21,7 @@ Links
 
 *
 *
+*
 * [[openpower/isa/branch]]
 * [[sv/cr_int_predication]]
 
@@ -56,7 +57,7 @@ Such instructions would be unavoidable, required, and costly
 by comparison to a single Vector-aware Branch.
 Therefore, in order to be commercially competitive, `sv.bc` and
 other Vector-aware Branch Conditional instructions are a high priority
-for 3D GPU (and CUDA) workloads.
+for 3D GPU (and OpenCL-style) workloads.
 
 Given that Power ISA v3.0B is already quite powerful, particularly
 the Condition Registers and their interaction with Branches, there
@@ -160,12 +161,14 @@ Mode bits.
 SVP64 RM `MODE` (includes `ELWIDTH` and `ELWIDTH_SRC` bits) for Branch
 Conditional:
 
-| 4 | 5 | 6 | 7 | 19 | 20 | 21 | 22 23 | description |
-| - | - | - | - | -- | -- | --- |---------|----------------- |
-|ALL|SNZ| / | / | 0 | 0 | / | LRu sz | normal mode |
-|ALL|SNZ| / |VSb| 0 | 1 | VLI | LRu sz | VLSET mode |
-|ALL|SNZ|CTi| / | 1 | 0 | / | LRu sz | CTR-test mode |
-|ALL|SNZ|CTi|VSb| 1 | 1 | VLI | LRu sz | CTR-test+VLSET mode |
+| 4 | 5 | 6 | 7 | 17 | 18 | 19 | 20 | 21 | 22 23 | description |
+| - | - | - | - | -- | -- | -- | -- | --- |--------|----------------- |
+|ALL|SNZ| / | / | | | 0 | 0 | / | LRu sz | normal mode |
+|ALL|SNZ| / |VSb| | | 0 | 1 | VLI | LRu sz | VLSET mode |
+|ALL|SNZ|CTi| / | | | 1 | 0 | / | LRu sz | CTR-test mode |
+|ALL|SNZ|CTi|VSb| | | 1 | 1 | VLI | LRu sz | CTR-test+VLSET mode |
+
+TODO bits 17,18 for SVSTATE-variant of LR and LRu.
 
 Brief description of fields:
 
@@ -192,31 +195,20 @@ Brief description of fields:
   If VLI (Vector Length Inclusive) is clear,
   VL is truncated to *exclude* the current element, otherwise it is
   included. SVSTATE.MVL is not altered: only VL.
-* **LRu**: Link Register Update, used in conjunction with LK=1.
-  When LRu=1,LK=0, Link Register is updated unconditionally.
-  When LRu=1,LK=1, Link Register will
-  only be updated if the Branch Condition succeeds.
-  When LRu=0,LK=1, Link Register will only be updated if
-  the Branch Condition fails. This avoids
-  destruction of LR during loops (particularly Vertical-First
-  ones).
+* **LRu**: Link Register Update, used in conjunction with LK=1
+  to make LR update conditional
 * **VSb** In VLSET Mode, after testing,
   if VSb is set, VL is truncated if the test succeeds. If VSb is
   clear, VL is truncated if a test *fails*. Masked-out (skipped)
   bits are not considered
-  part of testing.
+  part of testing when `sz=0`
 * **CTi** CTR inversion. CTR-test Mode normally decrements per element
   tested. CTR inversion decrements if a test *fails*. Only relevant
   in CTR-test Mode.
 
 LRu and CTR-test modes are where SVP64 Branches subtly differ from
-Scalar v3.0B Branches. `bclr` for example will always update LR, whereas
-`sv.bclr/lru` will only update LR if the branch succeeds.
-
-*Programmer's Note: when using `bclr` with `LRu=1,LK=0` in Horizontal-First Mode,
-LR's value will be unconditionally overwritten after the first element,
-such that for execution (testing) of the second element, LR
-has the value `CIA+8`. This is covered in the `bclrl` example, below.
+Scalar v3.0B Branches. `sv.bcl` for example will always update LR, whereas
+`sv.bcl/lru` will only update LR if the branch succeeds.
 
 Of special interest is that when using ALL Mode (Great Big AND
 of all Condition Tests), if `VL=0`,
@@ -264,7 +256,7 @@ Counter (aka "a Branch"), resulting in early-out opportunities
 
 * CTR-testing
 
-Both are outlined below.
+Both are outlined below, in later sections.
 
 # Horizontal-First and Vertical-First Modes
 
@@ -332,7 +324,7 @@ acting in effect as either a popcount or cntlz depending on which
 mode bits are set. In short, Vectorised Branch becomes an extremely
 powerful tool.
 
-* **Micro-Architectural Implementation Note**: when implemented on
+**Micro-Architectural Implementation Note**: *when implemented on
 top of a Multi-Issue Out-of-Order Engine it is possible to pass
 a copy of the predicate and the prerequisite CR Fields to all
 Branch Units, as well as the current value of CTR at the time of
@@ -340,7 +332,35 @@ multi-issue, and for each Branch Unit to compute how many times
 CTR would be subtracted, in a fully-deterministic and parallel
 fashion. A SIMD-based Branch Unit, receiving and processing
 multiple CR Fields covered by multiple predicate bits, would
-do the exact same thing.*
+do the exact same thing. Obviously, however, if CTR is modified
+within any given loop (mtctr) the behaviour of CTR is no longer
+deterministic.*
+
+## Link Register Update
+
+For a Scalar Branch, unconditional updating of the Link Register
+LR is useful and practical. However, if a loop of CR Fields is
+tested, unconditional updating of LR becomes problematic.
+
+For example when using `bclr` with `LRu=1,LK=0` in Horizontal-First Mode,
+LR's value will be unconditionally overwritten after the first element,
+such that for execution (testing) of the second element, LR
+has the value `CIA+8`. This is covered in the `bclrl` example, in
+a later section.
+
+The addition of a LRu bit modifies behaviour in conjunction
+with LK, as follows:
+
+* `sv.bc` When LRu=0,LK=0, Link Register is not updated
+* `sv.bcl` When LRu=0,LK=1, Link Register is updated unconditionally
+* `sv.bcl/lru` When LRu=1,LK=1, Link Register will
+  only be updated if the Branch Condition fails.
+* `sv.bc/lru` When LRu=1,LK=0, Link Register will only be updated if
+  the Branch Condition succeeds.
+
+This avoids
+destruction of LR during loops (particularly Vertical-First
+ones).
 
 ## CTR-test
 
@@ -379,7 +399,7 @@ It is also critical to emphasise that in this unusual mode,
 no other side-effects occur: **only** CTR is decremented, i.e. the
 rest of the Branch operation is skipped.
 
-# VLSET Mode
+## VLSET Mode
 
 VLSET Mode truncates the Vector Length so that subsequent instructions
 operate on a reduced Vector Length. This is similar to
@@ -545,21 +565,25 @@ cond_ok <- BO[0] | ¬(testbit ^ BO[1])
 if ¬predicate_bit & ¬SVRMmode.sz then
     if ¬BO[2] & CTRtest & ¬CTi then
         CTR = CTR - 1
-    stop # instruction finishes here
-if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
-lr_ok <- SVRMmode.LRu
-if ctr_ok & cond_ok then
-    if AA then NIA <-iea EXTS(BD || 0b00)
-    else NIA <-iea CIA + EXTS(BD || 0b00)
-    lr_ok <- ¬lr_ok
-if (LK & lr_ok) | (¬LK & lr_ok) then LR <-iea CIA + 4
+    # instruction finishes here
+else
+    if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
+    if VLSET and VSb = (cond_ok & ctr_ok) then
+        if SVRMmode.VLI then SVSTATE.VL = srcstep+1
+        else SVSTATE.VL = srcstep
+    lr_ok <- LK
+    if ctr_ok & cond_ok then
+        if AA then NIA <-iea EXTS(BD || 0b00)
+        else NIA <-iea CIA + EXTS(BD || 0b00)
+        if SVRMmode.LRu then lr_ok <- ¬lr_ok
+    if lr_ok then LR <-iea CIA + 4
 ```
 
 Below is the pseudocode for SVP64 Branches, which is a little less
 obvious but identical to the above. The lack of obviousness is down
 to the early-exit opportunities.
 
-Pseudocode for Horizontal-First Mode:
+Effective pseudocode for Horizontal-First Mode:
 
 ```
 if (mode_is_64bit) then M <- 0
@@ -574,10 +598,10 @@ for srcstep in range(VL):
         testbit = CRbits[BI & 0b11]
         # testbit = CR[BI+32+srcstep*4]
     else if not SVRMmode.sz:
-      # inverted CTR test skip mode
-      if ¬BO[2] & CTRtest & ¬CTI then
-         CTR = CTR - 1
-      continue # skip to next element
+        # inverted CTR test skip mode
+        if ¬BO[2] & CTRtest & ¬CTI then
+            CTR = CTR - 1
+        continue # skip to next element
     else
         testbit = SVRMmode.SNZ
     # actual element test here
@@ -595,10 +619,8 @@ for srcstep in range(VL):
         cond_ok |= (el_cond_ok & ctr_ok)
     # test for VL to be set (and exit)
     if VLSET and VSb = (el_cond_ok & ctr_ok) then
-        if SVRMmode.VLI
-            SVSTATE.VL = srcstep+1
-        else
-            SVSTATE.VL = srcstep
+        if SVRMmode.VLI then SVSTATE.VL = srcstep+1
+        else SVSTATE.VL = srcstep
         break
     # early exit?
     if SVRMmode.ALL != (el_cond_ok & ctr_ok):
@@ -607,12 +629,12 @@ for srcstep in range(VL):
     if SVCRf.scalar:
         break
 # loop finally done, now test if branch (and update LR)
-lr_ok <- SVRMmode.LRu
+lr_ok <- LK
 if cond_ok then
     if AA then NIA <-iea EXTS(BD || 0b00)
     else NIA <-iea CIA + EXTS(BD || 0b00)
-    lr_ok <- ¬lr_ok
-if (LK & lr_ok) | (¬LK & lr_ok) then LR <-iea CIA + 4
+    if SVRMmode.LRu then lr_ok <- ¬lr_ok
+if lr_ok then LR <-iea CIA + 4
 ```
 
 Pseudocode for Vertical-First Mode:
@@ -712,7 +734,20 @@ end:
 ```
 
 # TODO LRu example
-show why LRu would be useful in a loop.
+show why LRu would be useful in a loop. Imagine the following
+c code:
+
+```
+for (int i = 0; i < 8; i++) {
+    if (x < y) break;
+}
+```
+
+Under these circumstances exiting from the loop is not only
+based on CTR it has become conditional on a CR result.
+Thus it is desirable that NIA *and* LR only be modified
+if the conditions are met
+
 
 v3.0 pseudocode for `bclrl`:
 
@@ -733,14 +768,16 @@ for i in 0 to VL-1:
     ...
     ...
     cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
+    lr_ok <- LK
     if ctr_ok & cond_ok then
         NIA <-iea LR[0:61] || 0b00
-        lr_ok = ¬lr_ok
-    if (LK & lr_ok) | (¬LK & lr_ok) then LR <-iea CIA + 4
+        if SVRMmode.LRu then lr_ok <- ¬lr_ok
+    if lr_ok then LR <-iea CIA + 4
+    # if NIA modified exit loop
 ```
 
 The reason why should be clear from this being a Vector loop:
-unconditional destruction of LR when LK=1 makes `bclrl`
+unconditional destruction of LR when LK=1 makes `sv.bclrl`
 ineffective, because the intention going into the loop is
 that the branch should be to the copy of LR set at the *start*
 of the loop, not half way through it.
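
Note (not part of the patch): the revised LR-update rule in the `+` pseudocode can be modelled in a few lines of Python. This is a minimal sketch under stated assumptions, not the specification: `lr_updated`, `lr_update_table` and their boolean parameters are hypothetical names introduced here for illustration, and `branch_taken` stands in for `ctr_ok & cond_ok` (or `cond_ok` after the Horizontal-First loop).

```python
from itertools import product


def lr_updated(lk: bool, lru: bool, branch_taken: bool) -> bool:
    """Return True if LR would be written, following the '+' pseudocode:

        lr_ok <- LK
        if branch taken and SVRMmode.LRu then lr_ok <- not lr_ok
        if lr_ok then LR <- CIA + 4

    All names here are illustrative assumptions, not spec identifiers.
    """
    lr_ok = lk
    if branch_taken and lru:
        lr_ok = not lr_ok
    return lr_ok


def lr_update_table() -> dict:
    """Enumerate all eight (LK, LRu, branch_taken) combinations."""
    return {
        (lk, lru, taken): lr_updated(lk, lru, taken)
        for lk, lru, taken in product((False, True), repeat=3)
    }


if __name__ == "__main__":
    for (lk, lru, taken), updated in sorted(lr_update_table().items()):
        print(f"LK={int(lk)} LRu={int(lru)} taken={int(taken)} "
              f"-> LR updated: {updated}")
```

Under this reading, `LRu=1,LK=1` writes LR only when the branch is not taken and `LRu=1,LK=0` writes LR only when it is taken, which is what the bullet list in the "Link Register Update" hunk states.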