* <https://bugs.libre-soc.org/show_bug.cgi?id=664>
* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html>
+* <https://lists.libre-soc.org/pipermail/libre-soc-dev/2022-April/004678.html>
* [[openpower/isa/branch]]
* [[sv/cr_int_predication]]
by comparison to a single Vector-aware Branch.
Therefore, in order to be commercially competitive, `sv.bc` and
other Vector-aware Branch Conditional instructions are a high priority
-for 3D GPU (and CUDA) workloads.
+for 3D GPU (and OpenCL-style) workloads.
Given that Power ISA v3.0B is already quite powerful, particularly
the Condition Registers and their interaction with Branches, there
SVP64 RM `MODE` (includes `ELWIDTH` and `ELWIDTH_SRC` bits) for Branch
Conditional:
-| 4 | 5 | 6 | 7 | 19 | 20 | 21 | 22 23 | description |
-| - | - | - | - | -- | -- | --- |---------|----------------- |
-|ALL|SNZ| / | / | 0 | 0 | / | LRu sz | normal mode |
-|ALL|SNZ| / |VSb| 0 | 1 | VLI | LRu sz | VLSET mode |
-|ALL|SNZ|CTi| / | 1 | 0 | / | LRu sz | CTR-test mode |
-|ALL|SNZ|CTi|VSb| 1 | 1 | VLI | LRu sz | CTR-test+VLSET mode |
+| 4 | 5 | 6 | 7 | 17 | 18 | 19 | 20 | 21 | 22 23 | description |
+| - | - | - | - | -- | -- | -- | -- | --- |--------|----------------- |
+|ALL|SNZ| / | / | | | 0 | 0 | / | LRu sz | normal mode |
+|ALL|SNZ| / |VSb| | | 0 | 1 | VLI | LRu sz | VLSET mode |
+|ALL|SNZ|CTi| / | | | 1 | 0 | / | LRu sz | CTR-test mode |
+|ALL|SNZ|CTi|VSb| | | 1 | 1 | VLI | LRu sz | CTR-test+VLSET mode |
+
+TODO bits 17,18 for SVSTATE-variant of LR and LRu.
Brief description of fields:
If VLI (Vector Length Inclusive) is clear,
VL is truncated to *exclude* the current element, otherwise it is
included. SVSTATE.MVL is not altered: only VL.
-* **LRu**: Link Register Update, used in conjunction with LK=1.
- When LRu=1,LK=0, Link Register is updated unconditionally.
- When LRu=1,LK=1, Link Register will
- only be updated if the Branch Condition succeeds.
- When LRu=0,LK=1, Link Register will only be updated if
- the Branch Condition fails. This avoids
- destruction of LR during loops (particularly Vertical-First
- ones).
+* **LRu**: Link Register Update, used in conjunction with LK=1
+ to make the LR update conditional on the Branch Condition.
* **VSb** In VLSET Mode, after testing,
if VSb is set, VL is truncated if the test succeeds. If VSb is clear,
VL is truncated if a test *fails*. Masked-out (skipped)
bits are not considered
- part of testing.
+ part of testing when `sz=0`.
* **CTi** CTR inversion. CTR-test Mode normally decrements per element
tested. CTR inversion decrements if a test *fails*. Only relevant
in CTR-test Mode.
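
The VLI and VSb descriptions above can be made concrete with a small sketch. This is a minimal Python model, not the specification: the helper name `vlset_truncate` and its `results` list (one boolean per element, standing for `el_cond_ok & ctr_ok`) are assumptions made for illustration, and predicate masking (`sz`) is deliberately ignored.

```python
def vlset_truncate(results, vl, vsb, vli):
    """Return the new VL after a VLSET-mode scan (hypothetical helper).

    results: per-element test outcome (el_cond_ok & ctr_ok) as booleans
    vl:      current Vector Length
    vsb:     truncate when an element's outcome equals this value
    vli:     True to include the current element in the new VL
    """
    for srcstep in range(vl):
        if results[srcstep] == vsb:
            # VLI set: include the current element, otherwise exclude it
            return srcstep + 1 if vli else srcstep
    # no element matched VSb: VL is left unaltered
    return vl


if __name__ == "__main__":
    tests = [True, True, False, True]
    # VSb clear: truncate at the first failing test (element 2)
    print(vlset_truncate(tests, 4, vsb=False, vli=False))
    print(vlset_truncate(tests, 4, vsb=False, vli=True))
```

With `vsb=False` the scan stops at the first failing element, and `vli` decides whether that element is kept in the reduced Vector Length.
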
LRu and CTR-test modes are where SVP64 Branches subtly differ from
-Scalar v3.0B Branches. `bclr` for example will always update LR, whereas
-`sv.bclr/lru` will only update LR if the branch succeeds.
-
-*Programmer's Note: when using `bclr` with `LRu=1,LK=0` in Horizontal-First Mode,
-LR's value will be unconditionally overwritten after the first element,
-such that for execution (testing) of the second element, LR
-has the value `CIA+8`. This is covered in the `bclrl` example, below.
+Scalar v3.0B Branches. `sv.bcl` for example will always update LR, whereas
+`sv.bc/lru` will only update LR if the branch succeeds.
Of special interest is that when using ALL Mode (Great Big AND
of all Condition Tests), if `VL=0`,
opportunities
* CTR-testing
-Both are outlined below.
+Both are outlined in later sections.
# Horizontal-First and Vertical-First Modes
mode bits are set.
In short, Vectorised Branch becomes an extremely powerful tool.
-* **Micro-Architectural Implementation Note**: when implemented on
+**Micro-Architectural Implementation Note**: *when implemented on
top of a Multi-Issue Out-of-Order Engine it is possible to pass
a copy of the predicate and the prerequisite CR Fields to all
Branch Units, as well as the current value of CTR at the time of
CTR would be subtracted, in a fully-deterministic and parallel
fashion. A SIMD-based Branch Unit, receiving and processing
multiple CR Fields covered by multiple predicate bits, would
-do the exact same thing.*
+do the exact same thing. Obviously, however, if CTR is modified
+within any given loop (for example by `mtctr`) the behaviour of CTR
+is no longer deterministic.*
+
+## Link Register Update
+
+For a Scalar Branch, unconditional updating of the Link Register
+LR is useful and practical. However, if a loop of CR Fields is
+tested, unconditional updating of LR becomes problematic.
+
+For example when using `sv.bclrl` with `LRu=0,LK=1` in Horizontal-First Mode,
+LR's value will be unconditionally overwritten after the first element,
+such that for execution (testing) of the second element, LR
+has the value `CIA+8`. This is covered in the `bclrl` example, in
+a later section.
+
+The addition of a LRu bit modifies behaviour in conjunction
+with LK, as follows:
+
+* `sv.bc`: when LRu=0,LK=0, Link Register is not updated.
+* `sv.bcl`: when LRu=0,LK=1, Link Register is updated unconditionally.
+* `sv.bcl/lru`: when LRu=1,LK=1, Link Register will
+ only be updated if the Branch Condition fails.
+* `sv.bc/lru`: when LRu=1,LK=0, Link Register will only be updated if
+ the Branch Condition succeeds.
+
+This avoids
+destruction of LR during loops (particularly Vertical-First
+ones).
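
The four LRu/LK combinations listed above can be checked with a short model. The following is only a sketch in Python, assuming that "Branch Condition succeeds" means `ctr_ok & cond_ok` is true; the function name `lr_is_updated` is hypothetical and exists purely for illustration.

```python
from itertools import product


def lr_is_updated(lk: bool, lru: bool, branch_taken: bool) -> bool:
    """Return True if LR would be written with CIA + 4 (hypothetical helper)."""
    lr_ok = lk                    # start from the scalar behaviour: LK decides
    if branch_taken and lru:      # LRu only has an effect when the branch succeeds
        lr_ok = not lr_ok         # ... where it inverts the decision
    return lr_ok


if __name__ == "__main__":
    # print the full truth table for all eight input combinations
    for lru, lk, taken in product((False, True), repeat=3):
        print(f"LRu={int(lru)} LK={int(lk)} taken={int(taken)} "
              f"-> LR updated: {lr_is_updated(lk, lru, taken)}")
```

Under this model `LRu=1,LK=0` writes LR only on a taken branch and `LRu=1,LK=1` only on a branch that is not taken, which matches the list.
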
## CTR-test
no other side-effects occur: **only** CTR is decremented, i.e. the
rest of the Branch operation is skipped.
-# VLSET Mode
+## VLSET Mode
VLSET Mode truncates the Vector Length so that subsequent instructions
operate on a reduced Vector Length. This is similar to
if ¬predicate_bit & ¬SVRMmode.sz then
if ¬BO[2] & CTRtest & ¬CTi then
CTR = CTR - 1
- stop # instruction finishes here
-if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
-lr_ok <- SVRMmode.LRu
-if ctr_ok & cond_ok then
- if AA then NIA <-iea EXTS(BD || 0b00)
- else NIA <-iea CIA + EXTS(BD || 0b00)
- lr_ok <- ¬lr_ok
-if (LK & lr_ok) | (¬LK & lr_ok) then LR <-iea CIA + 4
+ # instruction finishes here
+else
+ if ¬BO[2] & ¬(CTRtest & (cond_ok ^ CTi)) then CTR <- CTR - 1
+ if VLSET and VSb = (cond_ok & ctr_ok) then
+ if SVRMmode.VLI then SVSTATE.VL = srcstep+1
+ else SVSTATE.VL = srcstep
+ lr_ok <- LK
+ if ctr_ok & cond_ok then
+ if AA then NIA <-iea EXTS(BD || 0b00)
+ else NIA <-iea CIA + EXTS(BD || 0b00)
+ if SVRMmode.LRu then lr_ok <- ¬lr_ok
+ if lr_ok then LR <-iea CIA + 4
```
Below is the pseudocode for SVP64 Branches, which is a little less
obvious but identical to the above. The lack of obviousness is down
to the early-exit opportunities.
-Pseudocode for Horizontal-First Mode:
+Effective pseudocode for Horizontal-First Mode:
```
if (mode_is_64bit) then M <- 0
testbit = CRbits[BI & 0b11]
# testbit = CR[BI+32+srcstep*4]
else if not SVRMmode.sz:
- # inverted CTR test skip mode
- if ¬BO[2] & CTRtest & ¬CTI then
- CTR = CTR - 1
- continue # skip to next element
+ # inverted CTR test skip mode
+ if ¬BO[2] & CTRtest & ¬CTi then
+ CTR = CTR - 1
+ continue # skip to next element
else
testbit = SVRMmode.SNZ
# actual element test here
cond_ok |= (el_cond_ok & ctr_ok)
# test for VL to be set (and exit)
if VLSET and VSb = (el_cond_ok & ctr_ok) then
- if SVRMmode.VLI
- SVSTATE.VL = srcstep+1
- else
- SVSTATE.VL = srcstep
+ if SVRMmode.VLI then SVSTATE.VL = srcstep+1
+ else SVSTATE.VL = srcstep
break
# early exit?
if SVRMmode.ALL != (el_cond_ok & ctr_ok):
if SVCRf.scalar:
break
# loop finally done, now test if branch (and update LR)
-lr_ok <- SVRMmode.LRu
+lr_ok <- LK
if cond_ok then
if AA then NIA <-iea EXTS(BD || 0b00)
else NIA <-iea CIA + EXTS(BD || 0b00)
- lr_ok <- ¬lr_ok
-if (LK & lr_ok) | (¬LK & lr_ok) then LR <-iea CIA + 4
+ if SVRMmode.LRu then lr_ok <- ¬lr_ok
+if lr_ok then LR <-iea CIA + 4
```
Pseudocode for Vertical-First Mode:
```
# TODO LRu example
-show why LRu would be useful in a loop.
+show why LRu would be useful in a loop. Imagine the following
+C code:
+
+```
+for (int i = 0; i < 8; i++) {
+ if (x < y) break;
+}
+```
+
+Under these circumstances exiting from the loop is not only
+based on CTR: it has also become conditional on a CR result.
+Thus it is desirable that NIA *and* LR only be modified
+if the conditions are met.
+
v3.0 pseudocode for `bclrl`:
...
...
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
+ lr_ok <- LK
if ctr_ok & cond_ok then
NIA <-iea LR[0:61] || 0b00
- lr_ok = ¬lr_ok
- if (LK & lr_ok) | (¬LK & lr_ok) then LR <-iea CIA + 4
+ if SVRMmode.LRu then lr_ok <- ¬lr_ok
+ if lr_ok then LR <-iea CIA + 4
+ # if NIA modified exit loop
```
The reason why should be clear from this being a Vector loop:
-unconditional destruction of LR when LK=1 makes `bclrl`
+unconditional destruction of LR when LK=1 makes `sv.bclrl`
ineffective, because the intention going into the loop is
that the branch should be to the copy of LR set at the *start*
of the loop, not half way through it.