# SVP64 Branch Conditional behaviour

Please note: although similar, SVP64 Branch instructions should be considered completely separate and distinct from standard scalar OpenPOWER-approved v3.0B branches. **v3.0B branches are in no way impacted, altered, changed or modified in any way, shape or form by the SVP64 Vectorized Variants**.

It is also extremely important to note that Branches are the sole pseudo-exception in SVP64 to `Scalar Identity Behaviour`. SVP64 Branches contain additional modes that are useful for scalar operations (i.e. even when VL=1 or when using single-bit predication).

Links

* fix error where ending on scalar BI source.
* Branch Divergence
* [[openpower/isa/branch]]
* [[sv/cr_int_predication]]
* [TODO](https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=fa99590eeb61e63b2d2ea81f303b9b4320e3bbe1)

## Rationale

Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test a Condition Register. However for parallel processing it is simply impossible to perform multiple independent branches: the Program Counter simply cannot branch to multiple destinations based on multiple conditions. The best that can be done is to test multiple Conditions and make a decision of a *single* branch, based on analysis of a *Vector* of CR Fields which have just been calculated from a *Vector* of results.

In 3D Shader binaries, which are inherently parallelised and predicated, testing all or some results and branching based on multiple tests is extremely common, and a fundamental part of Shader Compilers.

Example: without such multi-condition test-and-branch, if a predicate mask is all zeros a large batch of instructions may be masked out to `nop`, and it would waste CPU cycles to run them. 3D GPU ISAs can test for this scenario and, with the appropriate predicate-analysis instruction, jump over fully-masked-out operations, by spotting that *all* Conditions are false.
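The idea of collapsing a Vector of conditions into one branch decision can be sketched in Python. This is only an illustrative model and not part of the specification: the names `any_lane_active` and `run_batch_if_needed` are hypothetical, and a plain list of booleans stands in for a predicate mask or a Vector of CR Field bits.

```python
from typing import Callable, Sequence


def any_lane_active(predicate: Sequence[bool]) -> bool:
    """Reduce a vector of per-lane conditions to a single yes/no answer.

    Hypothetical helper: stops at the first set bit, so no further
    lanes are examined once the outcome is known.
    """
    for bit in predicate:
        if bit:
            return True
    return False


def run_batch_if_needed(predicate: Sequence[bool],
                        batch: Callable[[Sequence[bool]], None]) -> bool:
    """Run `batch` only if at least one lane is enabled.

    Returns False when the whole batch was jumped over, which models a
    single branch taken because *all* conditions were false.
    """
    if not any_lane_active(predicate):
        return False          # every lane masked out: skip the batch
    batch(predicate)          # some lane is live: the predicated work runs
    return True
```

With an all-zero mask, `run_batch_if_needed` returns `False` without calling `batch`, which is the cycle saving the example above describes.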
Unless Branches are aware and capable of such analysis, additional instructions would be required which perform Horizontal Cumulative analysis of Vectorized Condition Register Fields, in order to reduce the Vector of CR Fields down to one single yes or no decision that a Scalar-only v3.0B Branch-Conditional could cope with. Such instructions would be unavoidable, required, and costly by comparison to a single Vector-aware Branch.

Therefore, in order to be commercially competitive, `sv.bc` and other Vector-aware Branch Conditional instructions are a high priority for 3D GPU (and OpenCL-style) workloads.

Given that Power ISA v3.0B is already quite powerful, particularly the Condition Registers and their interaction with Branches, there are opportunities to create extremely flexible and compact Vectorized Branch behaviour. In addition, the side-effects (updating of CTR, truncation of VL, described below) make it a useful instruction even if the branch points to the next instruction (no actual branch).

## Overview

When considering an "array" of branch-tests, there are four primarily-useful modes: AND, OR, NAND and NOR of all Conditions. NAND and NOR may be synthesised from AND and OR by inverting `BO[1]` which just leaves two modes:

* Branch takes place on the **first** CR Field test to succeed (a Great Big OR of all condition tests). Exit occurs on the first **successful** test.
* Branch takes place only if **all** CR field tests succeed: a Great Big AND of all condition tests. Exit occurs on the first **failed** test.

Early-exit is enacted such that the Vectorized Branch does not perform needless extra tests, which will help reduce reads on the Condition Register file.

*Note: Early-exit is **MANDATORY** (required) behaviour. Branches **MUST** exit at the first sequentially-encountered failure point, for exactly the same reasons for which it is mandatory in programming languages doing early-exit: to avoid damaging side-effects and to provide deterministic behaviour.
Speculative testing of Condition Register Fields is permitted, as is speculative calculation of CTR, as long as, as usual in any Out-of-Order microarchitecture, that speculative testing is cancelled should an early-exit occur. i.e. the speculation must be "precise": Program Order must be preserved*

Also note that when early-exit occurs in Horizontal-first Mode, srcstep, dststep etc. are all reset, ready to begin looping from the beginning for the next instruction. However for Vertical-first Mode srcstep etc. are incremented "as usual" i.e. an early-exit has no special impact, regardless of whether the branch occurred or not. This can leave srcstep etc. in what may be considered an unusual state on exit from a loop and it is up to the programmer to reset srcstep, dststep etc. to known-good values *(easily achieved with `setvl`)*.

Additional useful behaviour involves two primary Modes (both of which may be enabled and combined):

* **VLSET Mode**: identical to Data-Dependent Fail-First Mode for Arithmetic SVP64 operations, with more flexibility and a close interaction and integration into the underlying base Scalar v3.0B Branch instruction. Truncation of VL takes place around the early-exit point.
* **CTR-test Mode**: gives much more flexibility over when and why CTR is decremented, including options to decrement if a Condition test succeeds *or if it fails*.

With these side-effects, basic Boolean Logic Analysis advises that it is important to provide a means to enact them each based on whether testing succeeds *or fails*. This results in a not-insignificant number of additional Mode Augmentation bits, accompanying VLSET and CTR-test Modes respectively.

Predicate skipping or zeroing may, as usual with SVP64, be controlled by `sz`. Where the predicate is masked out and zeroing is enabled, then in such circumstances the same Boolean Logic Analysis dictates that rather than testing only against zero, the option to test against one is also prudent.
This introduces a new immediate field, `SNZ`, which works in conjunction with `sz`.

Vectorized Branches can be used in either SVP64 Horizontal-First or Vertical-First Mode. Essentially, at an element level, the behaviour is identical in both Modes, although the `ALL` bit is meaningless in Vertical-First Mode.

It is also important to bear in mind that, fundamentally, Vectorized Branch-Conditional is still extremely close to the Scalar v3.0B Branch-Conditional instructions, and that the same v3.0B Scalar Branch-Conditional instructions are still *completely separate and independent*, being unaltered and unaffected by their SVP64 variants in every conceivable way.

*Programming note: One important point is that SVP64 instructions are 64 bit (8 bytes, not 4). This needs to be taken into consideration when computing branch offsets: the offset is relative to the start of the instruction, which **includes** the SVP64 Prefix.*

*Programming note: SV Branch-conditional instructions have no destination register, only a source (`BI`). Therefore the looping will occur even on Scalar BI (`sv.bc/all 16, 0, location`).
If this is not desirable behaviour and only a single scalar test is required, use a single-bit unary predicate mask such as `sm=1<>2)

```
        CRbits = CR{SVCRf}
        testbit = CRbits[BI & 0b11]
        # testbit = CR[BI+32+srcstep*4]
    else if not SVRMmode.sz:
        # inverted CTR test skip mode
        if ¬BO[2] & CTRtest & ¬CTi then
           CTR = CTR - 1
        continue # skip to next element
    else
        testbit = SVRMmode.SNZ
    # actual element test here
    ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
    el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
    # check if CTR dec should occur
    ctrdec = ¬BO[2]
    if CTRtest & (el_cond_ok ^ CTi) then
       ctrdec = 0b0
    if ctrdec then CTR <- CTR - 1
    # merge in the test
    if SVRMmode.ALL:
        cond_ok &= (el_cond_ok & ctr_ok)
    else
        cond_ok |= (el_cond_ok & ctr_ok)
    # test for VL to be set (and exit)
    if VLSET and VSb = (el_cond_ok & ctr_ok) then
        if SVRMmode.VLI then SVSTATE.VL = srcstep+1
        else                 SVSTATE.VL = srcstep
        break
    # early exit?
    if SVRMmode.ALL != (el_cond_ok & ctr_ok):
        break
    # SVP64 rules about Scalar registers still apply!
    if SVCRf.scalar:
        break
# loop finally done, now test if branch (and update LR)
lr_ok <- LK
svlr_ok <- SVRMmode.SL
if cond_ok then
    if AA then NIA <-iea EXTS(BD || 0b00)
    else       NIA <-iea CIA + EXTS(BD || 0b00)
    if SVRMmode.LRu then lr_ok <- ¬lr_ok
    if SVRMmode.SLu then svlr_ok <- ¬svlr_ok
if lr_ok then LR <-iea CIA + 4
if svlr_ok then SVLR <- SVSTATE
```

Pseudocode for Vertical-First Mode:

```
# get SVP64 extended CR field 0..127
SVCRf = SVP64EXTRA(BI>>2)
CRbits = CR{SVCRf}
# select predicate bit or zero/one
if predicate[srcstep]:
    if BRc = 1 then # CR0 vectorized
        CR{SVCRf+srcstep} = CRbits
    testbit = CRbits[BI & 0b11]
else if not SVRMmode.sz:
    # inverted CTR test skip mode
    if ¬BO[2] & CTRtest & ¬CTi then
       CTR = CTR - 1
    SVSTATE.srcstep = new_srcstep
    exit # no branch testing
else
    testbit = SVRMmode.SNZ
# actual element test here
cond_ok <- BO[0] | ¬(testbit ^ BO[1])
# test for VL to be set (and exit)
if VLSET and cond_ok = VSb then
    if SVRMmode.VLI then
        SVSTATE.VL = new_srcstep+1
    else
        SVSTATE.VL = new_srcstep
```

### Example Shader code

```
// assume f() g() or h() modify a and/or b
while(a > 2) {
    if(b < 5)
        f();
    else
        g();
    h();
}
```

which compiles to something like:

```
vec a, b;
// ...
pred loop_pred = a > 2;
// loop continues while any of a elements greater than 2
while(loop_pred.any()) {
    // vector of predicate bits
    pred if_pred = loop_pred & (b < 5);
    // only call f() if at least 1 bit set
    if(if_pred.any()) {
        f(if_pred);
    }
    label1:
    // loop mask ANDs with inverted if-test
    pred else_pred = loop_pred & ~if_pred;
    // only call g() if at least 1 bit set
    if(else_pred.any()) {
        g(else_pred);
    }
    h(loop_pred);
}
```

which will end up as:

```
   # start from while loop test point
   b looptest
while_loop:
   sv.cmpi CR80.v, b.v, 5     # vector compare b into CR80 Vector
   sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
   # only calculate loop_pred & pred_b because needed in f()
   sv.crand CR80.v.SO, CR60.v.GT, CR80.v.LT # if = loop & pred_b
   f(CR80.v.SO)
skip_f:
   # illustrate inversion of pred_b. invert r30, test ALL
   # rather than SOME, but masked-out zero test would FAIL,
   # therefore masked-out instead is tested against 1 not 0
   sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
   # else = loop & ~pred_b, need this because used in g()
   sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.v.LT
   g(CR80.v.SO)
skip_g:
   # conditionally call h(r30) if any loop pred set
   sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
looptest:
   sv.cmpi CR60.v, a.v, 2     # vector compare a into CR60 vector
   sv.crweird r30, CR60.GT    # transfer GT vector to r30
   sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
end:
```

### LRu example

This example shows why LRu would be useful in a loop. Imagine the following C code:

```
for (int i = 0; i < 8; i++) {
    if (x < y) break;
}
```

Under these circumstances exiting from the loop is not only based on CTR: it has become conditional on a CR result.
Thus it is desirable that NIA *and* LR only be modified if the conditions are met.

v3.0 pseudocode for `bclrl`:

```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
if LK then LR <-iea CIA + 4
```

the latter part for SVP64 `bclrl` becomes:

```
for i in 0 to VL-1:
    ...
    ...
    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
    lr_ok <- LK
    if ctr_ok & cond_ok then
       NIA <-iea LR[0:61] || 0b00
       if SVRMmode.LRu then lr_ok <- ¬lr_ok
    if lr_ok then LR <-iea CIA + 4
    # if NIA modified exit loop
```

The reason why should be clear from this being a Vector loop: unconditional destruction of LR when LK=1 makes `sv.bclrl` ineffective, because the intention going into the loop is that the branch should be to the copy of LR set at the *start* of the loop, not half way through it. However if the change to LR only occurs if the branch is taken then it becomes a useful instruction.

The following pseudocode should **not** be implemented because it violates the fundamental principle of SVP64 which is that SVP64 looping is a thin wrapper around Scalar Instructions. The pseudocode below is more an actual Vector ISA Branch and as such is not at all appropriate:

```
for i in 0 to VL-1:
    ...
    ...
    cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
    if ctr_ok & cond_ok then NIA <-iea LR[0:61] || 0b00
# only at the end of looping is LK checked.
# this completely violates the design principle of SVP64
# and would actually need to be a separate (scalar)
# instruction "set LR to CIA+4 but retrospectively"
# which is clearly impossible
if LK then LR <-iea CIA + 4
```

--------

\newpage{}

[[!tag standards]]