1 # SVP64 Branch Conditional behaviour
5 Please note: SVP64 Branch instructions should be
6 considered completely separate and distinct from
7 standard scalar OpenPOWER-approved v3.0B branches.
8 **v3.0B branches are in no way impacted, altered,
9 changed or modified in any way, shape or form by
10 the SVP64 Vectorised Variants**.
14 * <https://bugs.libre-soc.org/show_bug.cgi?id=664>
15 * <http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html>
16 * [[openpower/isa/branch]]
20 Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test a
21 Condition Register. However for parallel processing it is simply impossible
22 to perform multiple independent branches: the Program Counter simply
23 cannot branch to multiple destinations based on multiple conditions.
24 The best that can be done is
25 to test multiple Conditions and make a decision of a *single* branch,
26 based on analysis of a *Vector* of CR Fields
27 which have just been calculated from a *Vector* of results.
30 binaries, which are inherently parallelised and predicated, testing all or
31 some results and branching based on multiple tests is extremely common,
32 and a fundamental part of Shader Compilers. Example:
33 without such multi-condition
34 test-and-branch, if a predicate mask is all zeros a large batch of
35 instructions may be masked out to `nop`, and it would waste
36 CPU cycles to run them. 3D GPU ISAs can test for this scenario
37 and, with the appropriate predicate-analysis instruction,
38 jump over fully-masked-out operations, by spotting that
39 *all* Conditions are false.
41 Unless Branches are aware and capable of such analysis, additional
42 instructions would be required which perform Horizontal Cumulative
43 analysis of Vectorised Condition Register Fields, in order to
44 reduce the Vector of CR Fields down to one single yes or no
45 decision that a Scalar-only v3.0B Branch-Conditional could cope with.
46 Such instructions would be unavoidable, required, and costly
47 by comparison to a single Vector-aware Branch.
48 Therefore, in order to be commercially competitive, `sv.bc` and
49 other Vector-aware Branch Conditional instructions are a high priority
52 Given that Power ISA v3.0B is already quite powerful, particularly
53 the Condition Registers and their interaction with Branches, there
54 are opportunities to create extremely flexible and compact
55 Vectorised Branch behaviour. In addition, the side-effects (updating
56 of CTR, truncation of VL, described below) make it a useful instruction
57 even if the branch points to the next instruction (no actual branch).
61 When considering an "array" of branch-tests, there are four useful modes:
62 AND, OR, NAND and NOR of all Conditions.
63 NAND and NOR may be synthesised from AND and OR by
64 inverting `BO[2]` which just leaves two modes:
66 * Branch takes place on the first CR Field test to succeed
67 (a Great Big OR of all condition tests)
68 * Branch takes place only if **all** CR field tests succeed:
69 a Great Big AND of all condition tests
70 (including those where the predicate is masked out
71 and the corresponding CR Field is considered to be
74 Early-exit is enacted such that the Vectorised Branch does not
75 perform needless extra tests, which will help reduce reads on
76 the Condition Register file.
78 *Note: Early-exit is **MANDATORY** (required) behaviour.
79 Branches **MUST** exit at the first failure point, for
80 exactly the same reasons for which it is mandatory in
81 programming languages doing early-exit: to avoid
82 damaging side-effects. Speculative testing of Condition
83 Register Fields is permitted, as is speculative updating
84 of CTR, as long as, as usual in any Out-of-Order microarchitecture,
85 that speculative testing is cancelled should an early-exit occur.*
87 Additional useful behaviour involves two primary Modes (both of
88 which may be enabled and combined):
90 * **VLSET Mode**: identical to Data-Dependent Fail-First Mode
91 for Arithmetic SVP64 operations, with more
92 flexibility and a close interaction and integration into the
93 underlying base Scalar v3.0B Branch instruction.
94 * **CTR-test Mode**: gives much more flexibility over when and why
95 CTR is decremented, including options to decrement if a Condition
96 test succeeds *or if it fails*.
98 With these side-effects, basic Boolean Logic Analysis advises that
99 it is important to provide a means
100 to enact them each based on whether testing succeeds *or fails*. This
101 results in a not-insignificant number of additional Mode Augmentation bits,
102 accompanying VLSET and CTR-test Modes respectively.
104 Vectorised Branches can be used
105 in either SVP64 Horizontal-First or Vertical-First Mode. Essentially,
106 at an element level, the behaviour is identical in both Modes,
107 although the `ALL` bit is meaningless in Vertical-First Mode.
110 to bear in mind that, fundamentally, Vectorised Branch-Conditional
111 is still extremely close to the Scalar v3.0B Branch-Conditional
112 instructions, and that the same v3.0B Scalar Branch-Conditional
113 instructions are still
114 *completely separate and independent*, being unaltered and
115 unaffected by their SVP64 variants in every conceivable way.
119 With element-width overrides being meaningless for Condition
120 Register Fields, bits 4 thru 7 of SVP64 RM may be used for additional
123 SVP64 RM `MODE` (includes `ELWIDTH` and `ELWIDTH_SRC` bits) for Branch
126 | 4 | 5 | 6 | 7 | 19 | 20 | 21 | 22 23 | description |
127 | - | - | - | - | -- | -- | --- |---------|----------------- |
128 |ALL|LRu| / | / | 0 | 0 | / | SNZ sz | normal mode |
129 |ALL|LRu| / |VSb| 0 | 1 | VLI | SNZ sz | VLSET mode |
130 |ALL|LRu|CTi| / | 1 | 0 | / | SNZ sz | CTR-test mode |
131 |ALL|LRu|CTi|VSb| 1 | 1 | VLI | SNZ sz | CTR-test+VLSET mode |
133 Brief description of fields:
135 * **sz=1** if predication is enabled and `sz=1` and a predicate
136 element bit is zero, `SNZ` will
137 be substituted in place of the CR bit selected by `BI`,
138 as the Condition tested.
140 normal SVP64 `sz=1` behaviour, where *only* a zero is put in
141 place of masked-out predicate bits.
142 * **sz=0** When `sz=0` skipping occurs as usual on
143 masked-out elements, but unlike all
144 other SVP64 behaviour which entirely skips an element with
145 no related side-effects at all, there are certain
146 special circumstances where CTR
147 may be decremented. See CTR-test Mode, below.
148 * **ALL** when set, all branch conditional tests must pass in order for
149 the branch to succeed. When clear, it is the first sequentially
150 encountered successful test that causes the branch to succeed.
151 This is identical behaviour to how programming languages perform
152 early-exit on Boolean Logic chains.
153 * **VLI** VLSET is identical to Data-dependent Fail-First mode.
154 In VLSET mode, VL is set equal (truncated) to the first point
155 where, assuming Conditions are tested sequentially, the branch succeeds
156 *or fails* depending if VSb is set.
157 If VLI (Vector Length Inclusive) is clear,
158 VL is truncated to *exclude* the current element, otherwise it is
159 included. SVSTATE.MVL is not changed: only VL.
160 * **LRu**: Link Register Update. When set, Link Register will
161 only be updated if the Branch Condition succeeds. This avoids
162 destruction of LR during loops (particularly Vertical-First
164 * **VSb** is most relevant for Vertical-First VLSET Mode. After testing,
165 if VSb is set, VL is truncated if the branch succeeds. If VSb is clear,
166 VL is truncated if the branch did **not** take place.
167 * **CTi** CTR inversion. CTR-test Mode normally decrements per element
168 tested. CTR inversion decrements if a test *fails*. Only relevant
171 # Vectorised CR Field numbering, and Scalar behaviour
173 It is important to keep in mind that just like all SVP64 instructions,
174 the `BI` field of the base v3.0B Branch Conditional instruction
175 may be extended by SVP64 EXTRA augmentation, as well as be marked
176 as either Scalar or Vector.
178 The `BI` field of Branch Conditional operations is five bits, in scalar
179 v3.0B this would select one bit of the 32 bit CR,
180 comprising eight CR Fields of 4 bits each. In SVP64 there are
181 16 32 bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
182 `BI` select the bit from the CR Field (EQ LT GT SO), and the top 3 bits
183 are extended to either scalar or vector and to select CR Fields 0..127
184 as specified in SVP64 [[sv/svp64/appendix]].
186 When the CR Fields selected by SVP64-Augmented `BI` is marked as scalar,
187 then as the usual SVP64 rules apply:
188 the Vector loop ends at the first element tested, after taking
189 predication into consideration. Thus, also as usual, when a predicate mask is
190 given, and `BI` marked as scalar, and `sz` is zero, srcstep
191 skips forward to the first non-zero predicated element, and only that
192 one element is tested.
194 In other words, the fact that this is a Branch
195 Operation (instead of an arithmetic one) does not result, ultimately,
196 in significant changes as to
197 how SVP64 is fundamentally applied, except with respect to
198 the unique properties associated with conditionally
200 Counter (aka "a Branch"), resulting in early-out
201 opportunities and CTR-testing, which are outlined below.
203 # Horizontal-First and Vertical-First Modes
205 In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
206 AND) results in early exit: no more updates to CTR occur (if requested);
207 no branch occurs, and LR is not updated (if requested). Likewise for
208 non-ALL mode (Great Big Or) on first success early exit also occurs,
209 however this time with the Branch proceeding. In both cases the testing
210 of the Vector of CRs should be done in linear sequential order (or in
211 REMAP re-sequenced order): such that tests that are sequentially beyond
212 the exit point are *not* carried out. (*Note: it is standard practice in
213 Programming languages to exit early from conditional tests, however
214 a little unusual to consider in an ISA that is designed for Parallel
215 Vector Processing. The reason is to have strictly-defined guaranteed
218 In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
219 behaviour. Given that only one element is being tested at a time
220 in Vertical-First Mode, a test designed to be done on multiple
223 # Description and Modes
225 Predication in both INT and CR modes may be applied to `sv.bc` and other
226 SVP64 Branch Conditional operations, exactly as they may be applied to
227 other SVP64 operations. When `sz` is zero, any masked-out Branch-element
228 operations are not included in condition testing, exactly like all other
229 SVP64 operations, *including* side-effects such as potentially updating
230 LR or CTR, which will also be skipped. There is *one* exception here,
232 `BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
233 predicate mask bit is also zero:
234 under these special circumstances CTR will also decrement.
236 When `sz` is non-zero, this normally requests insertion of a zero
237 in place of the input data, when the relevant predicate mask bit is zero.
238 This would mean that a zero is inserted in place of `CR[BI+32]` for
239 testing against `BO`, which may not be desirable in all circumstances.
240 Therefore, an extra field is provided `SNZ`, which, if set, will insert
241 a **one** in place of a masked-out element, instead of a zero.
243 (*Note: Both options are provided because it is useful to deliberately
244 cause the Branch-Conditional Vector testing to fail at a specific point,
245 controlled by the Predicate mask. This is particularly useful in `VLSET`
246 mode, which will truncate SVSTATE.VL at the point of the first failed
249 Normally, CTR mode will decrement once per Condition Test, resulting
250 under normal circumstances that CTR reduces by up to VL in Horizontal-First
251 Mode. Just as when v3.0B Branch-Conditional saves at
252 least one instruction on tight inner loops through auto-decrementation
253 of CTR, likewise it is also possible to save instruction count for
254 SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
255 in circumstances where there is conditional interaction between the
256 element computation and testing, and the continuation (or otherwise)
257 of a given loop. The potential combinations of interactions is why CTR
258 testing options have been added.
260 If both CTR-test and VLSET Modes are requested, then because the CTR decrement is on a per element basis, the total amount that CTR is decremented
261 by will end up being VL *after* truncation (should that occur). In
262 other words, the order is strictly (as can be seen in pseudocode, below):
265 2. (optionally) decrement CTR
266 3. (optionally) truncate VL
267 4. decide (based on step 1) whether to terminate looping
268 (including not executing step 5)
269 5. decide whether to branch.
271 CTR-test mode and CTi interaction is as follows: note that
272 `BO[2]` is still required to be clear for decrements to be
275 * **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
276 if `BO[2]` is zero. Masked-out elements when `sz=0` are
277 skipped (i.e. CTR is *not* decremented when the predicate
279 * **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
280 if `BO[2]` is zero and a masked-out element is skipped
281 (`sz=0` and predicate bit is zero). This one special case is the
282 **opposite** of other combinations, as well as being
283 completely different from normal SVP64 `sz=0` behaviour)
284 * **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
285 if `BO[2]` is zero and the Condition Test succeeds.
286 Masked-out elements when `sz=0` are skipped (including
287 not decrementing CTR)
288 * **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
289 if `BO[2]` is zero and the Condition Test *fails*.
290 Masked-out elements when `sz=0` are skipped (including
291 not decrementing CTR)
293 `CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the
294 only time in the entirety of SVP64 that has side-effects when
295 a predicate mask bit is clear. **All** other SVP64 operations
296 entirely skip an element when sz=0 and a predicate mask bit is zero.
298 Interestingly, due to the side-effects of `VLSET` mode
299 it is actually useful to use Branch Conditional even
300 to perform no actual branch operation, i.e to point to the instruction
301 after the branch. Truncation of VL would thus conditionally occur yet control
302 flow alteration would not.
304 Also, the unconditional bit `BO[0]` is still relevant when Predication
305 is applied to the Branch because in `ALL` mode all nonmasked bits have
306 to be tested. Even when VLSET mode is not used, CTR
307 may still be decremented by the total number of nonmasked elements.
308 In short, Vectorised Branch becomes an extremely powerful tool.
310 `VLSET` mode with Vertical-First is particularly unusual. Vertical-First
311 is designed to be used for explicit looping, where an explicit call to
312 `svstep` is required to move both srcstep and dststep on to
313 the next element, until VL (or other condition) is reached.
314 Vertical-First Looping is expected (required) to terminate if the end
315 of the Vector, VL, is reached. If however that loop is terminated early
316 because VL is truncated, VLSET with Vertical-First becomes meaningless.
317 Resolving this would require two branches: one Conditional, the other
318 branching unconditionally to create the loop, where the Conditional
321 Therefore, with `VSb`, the option to decide whether truncation should occur if the
322 branch succeeds *or* if the branch condition fails allows for the flexibility
323 required. This allows a Vertical-First Branch to *either* be used as
324 a branch-back (loop) *or* as part of a conditional exit or function
325 call from *inside* a loop, and for VLSET to be integrated into both
326 types of decision-making.
328 `VLSET` mode with Horizontal-First when `VSb` is clear is still
329 useful, because it can be used to truncate VL to the first predicated
330 (non-masked-out) element.
332 *Programming note: One important point is that SVP64 instructions are 64 bit.
333 (8 bytes not 4). This needs to be taken into consideration when computing
334 branch offsets: the offset is relative to the start of the instruction,
335 which includes the SVP64 Prefix*
337 # Boolean Logic combinations
339 There are an extraordinary number of different combinations which
340 provide completely different and useful behaviour.
341 Available options to combine:
343 * `BO[0]` to make an unconditional branch would seem irrelevant if
344 it were not for predication and for side-effects (CTR Mode
346 * Enabling CTR-test Mode and setting `BO[2]` can still result in the
348 taking place, not because the Condition Test itself failed, but
349 because CTR reached zero **because**, as required by CTR-test mode,
350 CTR was decremented as a **result** of Condition Tests failing.
351 * `BO[1]` to select whether the CR bit being tested is zero or nonzero
352 * `R30` and `~R30` and other predicate mask options including CR and
353 inverted CR bit testing
354 * `sz` and `SNZ` to insert either zeros or ones in place of masked-out
356 * `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
357 `OR` of all tests, respectively.
358 * Predicate Mask bits, which combine in effect with the CR being
360 * Inversion of Predicate Masks (`~r3` instead of `r3`, or using
361 `NE` rather than `EQ`) which results in an additional
362 level of possible ANDing, ORing etc. that would otherwise
363 need explicit instructions.
365 The most obviously useful combinations here are to set `BO[1]` to zero
366 in order to turn `ALL` into Great-Big-NAND and `ANY` into
367 Great-Big-NOR. Other Mode bits which perform behavioural inversion then
368 have to work round the fact that the Condition Testing is NOR or NAND.
369 The alternative to not having additional behavioural inversion
370 (`SNZ`, `VSb`, `CTi`) would be to have a second (unconditional)
371 branch directly after the first, which the first branch jumps over.
372 This contrived construct is avoided by the behavioural inversion bits.
374 # Pseudocode and examples
376 Pseudocode for Horizontal-First Mode:
379 cond_ok = not SVRMmode.ALL
380 for srcstep in range(VL):
381 # select predicate bit or zero/one
382 if predicate[srcstep]:
383 # get SVP64 extended CR field 0..127
384 SVCRf = SVP64EXTRA(BI>>2)
386 testbit = CRbits[BI & 0b11]
387 # testbit = CR[BI+32+srcstep*4]
388 else if not SVRMmode.sz:
389 # inverted CTR test skip mode
390 if ¬BO[2] & CTRtest & ¬CTI then
394 testbit = SVRMmode.SNZ
395 # actual element test here
396 el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
399 cond_ok &= el_cond_ok
401 cond_ok |= el_cond_ok
402 # test for VL to be set (and exit)
403 if VLSET and VSb = el_cond_ok then
405 SVSTATE.VL = srcstep+1
420 Pseudocode for Vertical-First Mode:
423 # get SVP64 extended CR field 0..127
424 SVCRf = SVP64EXTRA(BI>>2)
426 # select predicate bit or zero/one
427 if predicate[srcstep]:
428 if BRc = 1 then # CR0 vectorised
429 CR{SVCRf+srcstep} = CRbits
430 testbit = CRbits[BI & 0b11]
431 else if not SVRMmode.sz:
432 # inverted CTR test skip mode
433 if ¬BO[2] & CTRtest & ¬CTI then
435 SVSTATE.srcstep = new_srcstep
436 exit # no branch testing
438 testbit = SVRMmode.SNZ
439 # actual element test here
440 cond_ok <- BO[0] | ¬(testbit ^ BO[1])
441 # test for VL to be set (and exit)
442 if VLSET and cond_ok = VSb then
444 SVSTATE.VL = new_srcstep+1
446 SVSTATE.VL = new_srcstep
449 v3.0B branch pseudocode including LRu and CTR skipping
452 if (mode_is_64bit) then M <- 0
454 cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
456 if CTRtest & (cond_ok ^ CTi) then
458 if ctrdec then CTR <- CTR - 1
459 ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
460 lr_ok <- SVRMmode.LRu
461 if ctr_ok & cond_ok then
462 if AA then NIA <-iea EXTS(BD || 0b00)
463 else NIA <-iea CIA + EXTS(BD || 0b00)
465 if LK & lr_ok then LR <-iea CIA + 4
468 # Example Shader code
480 which compiles to something like:
485 pred loop_pred = a > 2;
486 while(loop_pred.any()) {
487 pred if_pred = loop_pred & (b < 5);
492 pred else_pred = loop_pred & ~if_pred;
493 if(else_pred.any()) {
500 which will end up as:
503 sv.cmpi CR60.v a.v, 2 # vector compare a into CR60 vector
504 sv.crweird r30, CR60.GT # transfer GT vector to r30
506 sv.cmpi CR80.v, b.v, 5 # vector compare b into CR64 Vector
507 sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
508 # only calculate loop_pred & pred_b because needed in f()
509 sv.crand CR80.v.SO, CR60.v.GT, CR80.V.LT # if = loop & pred_b
512 # illustrate inversion of pred_b. invert r30, test ALL
513 # rather than SOME, but masked-out zero test would FAIL,
514 # therefore masked-out instead is tested against 1 not 0
515 sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
516 # else = loop & ~pred_b, need this because used in g()
517 sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.V.LT
520 # conditionally call h(r30) if any loop pred set
521 sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
522 sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop