(no commit message)
[libreriscv.git] / openpower / sv / branches.mdwn
1 # SVP64 Branch Conditional behaviour
2
3 **DRAFT STATUS**
4
5 Please note: SVP64 Branch instructions should be
6 considered completely separate and distinct from
7 standard scalar OpenPOWER-approved v3.0B branches.
8 **v3.0B branches are in no way impacted, altered,
9 changed or modified in any way, shape or form by
10 the SVP64 Vectorised Variants**.
11
12 Links
13
14 * <https://bugs.libre-soc.org/show_bug.cgi?id=664>
15 * <http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html>
16 * [[openpower/isa/branch]]
17
18 # Rationale
19
20 Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test a
21 Condition Register. However for parallel processing it is simply impossible
22 to perform multiple independent branches: the Program Counter simply
23 cannot branch to multiple destinations based on multiple conditions.
24 The best that can be done is
25 to test multiple Conditions and make a decision of a *single* branch,
26 based on analysis of a *Vector* of CR Fields
27 which have just been calculated from a *Vector* of results.
28
29 In 3D Shader
30 binaries, which are inherently parallelised and predicated, testing all or
31 some results and branching based on multiple tests is extremely common,
32 and a fundamental part of Shader Compilers. Example:
33 without such multi-condition
34 test-and-branch, if a predicate mask is all zeros a large batch of
35 instructions may be masked out to `nop`, and it would waste
36 CPU cycles to run them. 3D GPU ISAs can test for this scenario
37 and, with the appropriate predicate-analysis instruction,
38 jump over fully-masked-out operations, by spotting that
39 *all* Conditions are false.
40
41 Unless Branches are aware and capable of such analysis, additional
42 instructions would be required which perform Horizontal Cumulative
43 analysis of Vectorised Condition Register Fields, in order to
44 reduce the Vector of CR Fields down to one single yes or no
45 decision that a Scalar-only v3.0B Branch-Conditional could cope with.
46 Such instructions would be unavoidable, required, and costly
47 by comparison to a single Vector-aware Branch.
48 Therefore, in order to be commercially competitive, `sv.bc` and
49 other Vector-aware Branch Conditional instructions are a high priority
50 for 3D GPU workloads.
51
52 Given that Power ISA v3.0B is already quite powerful, particularly
53 the Condition Registers and their interaction with Branches, there
54 are opportunities to create extremely flexible and compact
55 Vectorised Branch behaviour. In addition, the side-effects (updating
56 of CTR, truncation of VL, described below) make it a useful instruction
57 even if the branch points to the next instruction (no actual branch).
58
59 # Overview
60
61 When considering an "array" of branch-tests, there are four useful modes:
62 AND, OR, NAND and NOR of all Conditions.
63 NAND and NOR may be synthesised from AND and OR by
64 inverting `BO[1]` which just leaves two modes:
65
66 * Branch takes place on the first CR Field test to succeed
67 (a Great Big OR of all condition tests)
68 * Branch takes place only if **all** CR field tests succeed:
69 a Great Big AND of all condition tests
70 (including those where the predicate is masked out
71 and the corresponding CR Field is considered to be
72 set to `SNZ`)
73
74 Early-exit is enacted such that the Vectorised Branch does not
75 perform needless extra tests, which will help reduce reads on
76 the Condition Register file.
77
78 *Note: Early-exit is **MANDATORY** (required) behaviour.
79 Branches **MUST** exit at the first failure point, for
80 exactly the same reasons for which it is mandatory in
81 programming languages doing early-exit: to avoid
82 damaging side-effects. Speculative testing of Condition
83 Register Fields is permitted, as is speculative updating
84 of CTR, as long as, as usual in any Out-of-Order microarchitecture,
85 that speculative testing is cancelled should an early-exit occur.*
86
87 Additional useful behaviour involves two primary Modes (both of
88 which may be enabled and combined):
89
90 * **VLSET Mode**: identical to Data-Dependent Fail-First Mode
91 for Arithmetic SVP64 operations, with more
92 flexibility and a close interaction and integration into the
93 underlying base Scalar v3.0B Branch instruction.
94 * **CTR-test Mode**: gives much more flexibility over when and why
95 CTR is decremented, including options to decrement if a Condition
96 test succeeds *or if it fails*.
97
98 With these side-effects, basic Boolean Logic Analysis advises that
99 it is important to provide a means
100 to enact them each based on whether testing succeeds *or fails*. This
101 results in a not-insignificant number of additional Mode Augmentation bits,
102 accompanying VLSET and CTR-test Modes respectively.
103
104 Vectorised Branches can be used
105 in either SVP64 Horizontal-First or Vertical-First Mode. Essentially,
106 at an element level, the behaviour is identical in both Modes,
107 although the `ALL` bit is meaningless in Vertical-First Mode.
108
109 It is also important
110 to bear in mind that, fundamentally, Vectorised Branch-Conditional
111 is still extremely close to the Scalar v3.0B Branch-Conditional
112 instructions, and that the same v3.0B Scalar Branch-Conditional
113 instructions are still
114 *completely separate and independent*, being unaltered and
115 unaffected by their SVP64 variants in every conceivable way.
116
117 # Format and fields
118
119 With element-width overrides being meaningless for Condition
120 Register Fields, bits 4 thru 7 of SVP64 RM may be used for additional
121 Mode bits.
122
123 SVP64 RM `MODE` (includes `ELWIDTH` and `ELWIDTH_SRC` bits) for Branch
124 Conditional:
125
126 | 4 | 5 | 6 | 7 | 19 | 20 | 21 | 22 23 | description |
127 | - | - | - | - | -- | -- | --- |---------|----------------- |
128 |ALL|LRu| / | / | 0 | 0 | / | SNZ sz | normal mode |
129 |ALL|LRu| / |VSb| 0 | 1 | VLI | SNZ sz | VLSET mode |
130 |ALL|LRu|CTi| / | 1 | 0 | / | SNZ sz | CTR-test mode |
131 |ALL|LRu|CTi|VSb| 1 | 1 | VLI | SNZ sz | CTR-test+VLSET mode |
132
133 Brief description of fields:
134
135 * **sz=1** if predication is enabled and `sz=1` and a predicate
136 element bit is zero, `SNZ` will
137 be substituted in place of the CR bit selected by `BI`,
138 as the Condition tested.
139 Contrast this with
140 normal SVP64 `sz=1` behaviour, where *only* a zero is put in
141 place of masked-out predicate bits.
142 * **sz=0** When `sz=0` skipping occurs as usual on
143 masked-out elements, but unlike all
144 other SVP64 behaviour which entirely skips an element with
145 no related side-effects at all, there are certain
146 special circumstances where CTR
147 may be decremented. See CTR-test Mode, below.
148 * **ALL** when set, all branch conditional tests must pass in order for
149 the branch to succeed. When clear, it is the first sequentially
150 encountered successful test that causes the branch to succeed.
151 This is identical behaviour to how programming languages perform
152 early-exit on Boolean Logic chains.
153 * **VLI** VLSET is identical to Data-dependent Fail-First mode.
154 In VLSET mode, VL is set equal (truncated) to the first point
155 where, assuming Conditions are tested sequentially, the branch succeeds
156 *or fails* depending if VSb is set.
157 If VLI (Vector Length Inclusive) is clear,
158 VL is truncated to *exclude* the current element, otherwise it is
159 included. SVSTATE.MVL is not changed: only VL.
160 * **LRu**: Link Register Update. When set, Link Register will
161 only be updated if the Branch Condition succeeds. This avoids
162 destruction of LR during loops (particularly Vertical-First
163 ones).
164 * **VSb** is most relevant for Vertical-First VLSET Mode. After testing,
165 if VSb is set, VL is truncated if the branch succeeds. If VSb is clear,
166 VL is truncated if the branch did **not** take place.
167 * **CTi** CTR inversion. CTR-test Mode normally decrements per element
168 tested. CTR inversion decrements if a test *fails*. Only relevant
169 in CTR-test Mode.
170
171 # Vectorised CR Field numbering, and Scalar behaviour
172
173 It is important to keep in mind that just like all SVP64 instructions,
174 the `BI` field of the base v3.0B Branch Conditional instruction
175 may be extended by SVP64 EXTRA augmentation, as well as be marked
176 as either Scalar or Vector.
177
178 The `BI` field of Branch Conditional operations is five bits, in scalar
179 v3.0B this would select one bit of the 32 bit CR,
180 comprising eight CR Fields of 4 bits each. In SVP64 there are
181 16 32 bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
182 `BI` select the bit from the CR Field (EQ LT GT SO), and the top 3 bits
183 are extended to either scalar or vector and to select CR Fields 0..127
184 as specified in SVP64 [[sv/svp64/appendix]].
185
186 When the CR Fields selected by SVP64-Augmented `BI` is marked as scalar,
187 then as the usual SVP64 rules apply:
188 the Vector loop ends at the first element tested, after taking
189 predication into consideration. Thus, also as usual, when a predicate mask is
190 given, and `BI` marked as scalar, and `sz` is zero, srcstep
191 skips forward to the first non-zero predicated element, and only that
192 one element is tested.
193
194 In other words, the fact that this is a Branch
195 Operation (instead of an arithmetic one) does not result, ultimately,
196 in significant changes as to
197 how SVP64 is fundamentally applied, except with respect to
198 the unique properties associated with conditionally
199 changing the Program
200 Counter (aka "a Branch"), resulting in early-out
201 opportunities and CTR-testing, which are outlined below.
202
203 # Horizontal-First and Vertical-First Modes
204
205 In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
206 AND) results in early exit: no more updates to CTR occur (if requested);
207 no branch occurs, and LR is not updated (if requested). Likewise for
208 non-ALL mode (Great Big Or) on first success early exit also occurs,
209 however this time with the Branch proceeding. In both cases the testing
210 of the Vector of CRs should be done in linear sequential order (or in
211 REMAP re-sequenced order): such that tests that are sequentially beyond
212 the exit point are *not* carried out. (*Note: it is standard practice in
213 Programming languages to exit early from conditional tests, however
214 a little unusual to consider in an ISA that is designed for Parallel
215 Vector Processing. The reason is to have strictly-defined guaranteed
216 behaviour*)
217
218 In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
219 behaviour. Given that only one element is being tested at a time
220 in Vertical-First Mode, a test designed to be done on multiple
221 bits is meaningless.
222
223 # Description and Modes
224
225 Predication in both INT and CR modes may be applied to `sv.bc` and other
226 SVP64 Branch Conditional operations, exactly as they may be applied to
227 other SVP64 operations. When `sz` is zero, any masked-out Branch-element
228 operations are not included in condition testing, exactly like all other
229 SVP64 operations, *including* side-effects such as potentially updating
230 LR or CTR, which will also be skipped. There is *one* exception here,
231 which is when
232 `BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
233 predicate mask bit is also zero:
234 under these special circumstances CTR will also decrement.
235
236 When `sz` is non-zero, this normally requests insertion of a zero
237 in place of the input data, when the relevant predicate mask bit is zero.
238 This would mean that a zero is inserted in place of `CR[BI+32]` for
239 testing against `BO`, which may not be desirable in all circumstances.
240 Therefore, an extra field is provided `SNZ`, which, if set, will insert
241 a **one** in place of a masked-out element, instead of a zero.
242
243 (*Note: Both options are provided because it is useful to deliberately
244 cause the Branch-Conditional Vector testing to fail at a specific point,
245 controlled by the Predicate mask. This is particularly useful in `VLSET`
246 mode, which will truncate SVSTATE.VL at the point of the first failed
247 test.*)
248
249 Normally, CTR mode will decrement once per Condition Test, resulting
250 under normal circumstances that CTR reduces by up to VL in Horizontal-First
251 Mode. Just as when v3.0B Branch-Conditional saves at
252 least one instruction on tight inner loops through auto-decrementation
253 of CTR, likewise it is also possible to save instruction count for
254 SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
255 in circumstances where there is conditional interaction between the
256 element computation and testing, and the continuation (or otherwise)
257 of a given loop. The potential combinations of interactions is why CTR
258 testing options have been added.
259
260 If both CTR-test and VLSET Modes are requested, it's important to
261 observe the correct order: testing the Condition and
262 decrementing of CTR is done *before* truncation of VL. In
263 other words, the order is strictly (as can be seen in pseudocode, below):
264
265 1. compute the test
266 2. (optionally) decrement CTR
267 3. (optionally) truncate VL
268 4. decide (based on step 1) whether to terminate looping
269 (including not executing step 5)
270 5. decide whether to branch.
271
272 CTR-test mode and CTi interaction is as follows: note that
273 `BO[2]` is still required to be clear for decrements to be
274 considered.
275
276 * **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
277 if `BO[2]` is zero. Masked-out elements when `sz=0` are
278 skipped (i.e. CTR is *not* decremented when the predicate
279 bit is zero and `sz=0`).
280 * **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
281 if `BO[2]` is zero and a masked-out element is skipped
282 (`sz=0` and predicate bit is zero). This one special case is the
283 **opposite** of other combinations, as well as being
284 completely different from normal SVP64 `sz=0` behaviour)
285 * **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
286 if `BO[2]` is zero and the Condition Test succeeds.
287 Masked-out elements when `sz=0` are skipped (including
288 not decrementing CTR)
289 * **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
290 if `BO[2]` is zero and the Condition Test *fails*.
291 Masked-out elements when `sz=0` are skipped (including
292 not decrementing CTR)
293
294 `CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the
295 only time in the entirety of SVP64 that has side-effects when
296 a predicate mask bit is clear. **All** other SVP64 operations
297 entirely skip an element when sz=0 and a predicate mask bit is zero.
298
299 Interestingly, due to the side-effects of `VLSET` mode
300 it is actually useful to use Branch Conditional even
301 to perform no actual branch operation, i.e to point to the instruction
302 after the branch. Truncation of VL would thus conditionally occur yet control
303 flow alteration would not.
304
305 Also, the unconditional bit `BO[0]` is still relevant when Predication
306 is applied to the Branch because in `ALL` mode all nonmasked bits have
307 to be tested. Even when VLSET mode is not used, CTR
308 may still be decremented by the total number of nonmasked elements.
309 In short, Vectorised Branch becomes an extremely powerful tool.
310
311 `VLSET` mode with Vertical-First is particularly unusual. Vertical-First
312 is designed to be used for explicit looping, where an explicit call to
313 `svstep` is required to move both srcstep and dststep on to
314 the next element, until VL (or other condition) is reached.
315 Vertical-First Looping is expected (required) to terminate if the end
316 of the Vector, VL, is reached. If however that loop is terminated early
317 because VL is truncated, VLSET with Vertical-First becomes meaningless.
318 Resolving this would require two branches: one Conditional, the other
319 branching unconditionally to create the loop, where the Conditional
320 one jumps over it.
321
322 Therefore, with `VSb`, the option to decide whether truncation should occur if the
323 branch succeeds *or* if the branch condition fails allows for the flexibility
324 required. This allows a Vertical-First Branch to *either* be used as
325 a branch-back (loop) *or* as part of a conditional exit or function
326 call from *inside* a loop, and for VLSET to be integrated into both
327 types of decision-making.
328
329 `VLSET` mode with Horizontal-First when `VSb` is clear is still
330 useful, because it can be used to truncate VL to the first predicated
331 (non-masked-out) element.
332
333 *Programming note: One important point is that SVP64 instructions are 64 bit.
334 (8 bytes not 4). This needs to be taken into consideration when computing
335 branch offsets: the offset is relative to the start of the instruction,
336 which includes the SVP64 Prefix*
337
338 # Boolean Logic combinations
339
340 There are an extraordinary number of different combinations which
341 provide completely different and useful behaviour.
342 Available options to combine:
343
344 * `BO[0]` to make an unconditional branch would seem irrelevant if
345 it were not for predication and for side-effects (CTR Mode
346 for example)
347 * Enabling CTR-test Mode and setting `BO[2]` can still result in the
348 Branch
349 taking place, not because the Condition Test itself failed, but
350 because CTR reached zero **because**, as required by CTR-test mode,
351 CTR was decremented as a **result** of Condition Tests failing.
352 * `BO[1]` to select whether the CR bit being tested is zero or nonzero
353 * `R30` and `~R30` and other predicate mask options including CR and
354 inverted CR bit testing
355 * `sz` and `SNZ` to insert either zeros or ones in place of masked-out
356 predicate bits
357 * `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
358 `OR` of all tests, respectively.
359 * Predicate Mask bits, which combine in effect with the CR being
360 tested.
361 * Inversion of Predicate Masks (`~r3` instead of `r3`, or using
362 `NE` rather than `EQ`) which results in an additional
363 level of possible ANDing, ORing etc. that would otherwise
364 need explicit instructions.
365
366 The most obviously useful combinations here are to set `BO[1]` to zero
367 in order to turn `ALL` into Great-Big-NAND and `ANY` into
368 Great-Big-NOR. Other Mode bits which perform behavioural inversion then
369 have to work round the fact that the Condition Testing is NOR or NAND.
370 The alternative to not having additional behavioural inversion
371 (`SNZ`, `VSb`, `CTi`) would be to have a second (unconditional)
372 branch directly after the first, which the first branch jumps over.
373 This contrived construct is avoided by the behavioural inversion bits.
374
375 # Pseudocode and examples
376
377 Pseudocode for Horizontal-First Mode:
378
379 ```
380 cond_ok = not SVRMmode.ALL
381 for srcstep in range(VL):
382 # select predicate bit or zero/one
383 if predicate[srcstep]:
384 # get SVP64 extended CR field 0..127
385 SVCRf = SVP64EXTRA(BI>>2)
386 CRbits = CR{SVCRf}
387 testbit = CRbits[BI & 0b11]
388 # testbit = CR[BI+32+srcstep*4]
389 else if not SVRMmode.sz:
390 # inverted CTR test skip mode
391 if ¬BO[2] & CTRtest & ¬CTI then
392 CTR = CTR - 1
393 continue
394 else
395 testbit = SVRMmode.SNZ
396 # actual element test here
397 el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
398 # merge in the test
399 if SVRMmode.ALL:
400 cond_ok &= el_cond_ok
401 else
402 cond_ok |= el_cond_ok
403 # test for VL to be set (and exit)
404 if VLSET and VSb = el_cond_ok then
405 if SVRMmode.VLI
406 SVSTATE.VL = srcstep+1
407 else
408 SVSTATE.VL = srcstep
409 break
410 # early exit?
411 if SVRMmode.ALL:
412 if ~el_cond_ok:
413 break
414 else
415 if el_cond_ok:
416 break
417 if SVCRf.scalar:
418 break
419 ```
420
421 Pseudocode for Vertical-First Mode:
422
423 ```
424 # get SVP64 extended CR field 0..127
425 SVCRf = SVP64EXTRA(BI>>2)
426 CRbits = CR{SVCRf}
427 # select predicate bit or zero/one
428 if predicate[srcstep]:
429 if BRc = 1 then # CR0 vectorised
430 CR{SVCRf+srcstep} = CRbits
431 testbit = CRbits[BI & 0b11]
432 else if not SVRMmode.sz:
433 # inverted CTR test skip mode
434 if ¬BO[2] & CTRtest & ¬CTI then
435 CTR = CTR - 1
436 SVSTATE.srcstep = new_srcstep
437 exit # no branch testing
438 else
439 testbit = SVRMmode.SNZ
440 # actual element test here
441 cond_ok <- BO[0] | ¬(testbit ^ BO[1])
442 # test for VL to be set (and exit)
443 if VLSET and cond_ok = VSb then
444 if SVRMmode.VLI
445 SVSTATE.VL = new_srcstep+1
446 else
447 SVSTATE.VL = new_srcstep
448 ```
449
450 v3.0B branch pseudocode including LRu and CTR skipping
451
452 ```
453 if (mode_is_64bit) then M <- 0
454 else M <- 32
455 cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
456 ctrdec = ¬BO[2]
457 if CTRtest & (cond_ok ^ CTi) then
458 ctrdec = 0b0
459 if ctrdec then CTR <- CTR - 1
460 ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
461 lr_ok <- SVRMmode.LRu
462 if ctr_ok & cond_ok then
463 if AA then NIA <-iea EXTS(BD || 0b00)
464 else NIA <-iea CIA + EXTS(BD || 0b00)
465 lr_ok <- 0b1
466 if LK & lr_ok then LR <-iea CIA + 4
467 ```
468
469 # Example Shader code
470
471 ```
472 while(a > 2) {
473 if(b < 5)
474 f();
475 else
476 g();
477 h();
478 }
479 ```
480
481 which compiles to something like:
482
483 ```
484 vec<i32> a, b;
485 // ...
486 pred loop_pred = a > 2;
487 while(loop_pred.any()) {
488 pred if_pred = loop_pred & (b < 5);
489 if(if_pred.any()) {
490 f(if_pred);
491 }
492 label1:
493 pred else_pred = loop_pred & ~if_pred;
494 if(else_pred.any()) {
495 g(else_pred);
496 }
497 h(loop_pred);
498 }
499 ```
500
501 which will end up as:
502
503 ```
504 sv.cmpi CR60.v a.v, 2 # vector compare a into CR60 vector
505 sv.crweird r30, CR60.GT # transfer GT vector to r30
506 while_loop:
507 sv.cmpi CR80.v, b.v, 5 # vector compare b into CR64 Vector
508 sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
509 # only calculate loop_pred & pred_b because needed in f()
510 sv.crand CR80.v.SO, CR60.v.GT, CR80.V.LT # if = loop & pred_b
511 f(CR80.v.SO)
512 skip_f:
513 # illustrate inversion of pred_b. invert r30, test ALL
514 # rather than SOME, but masked-out zero test would FAIL,
515 # therefore masked-out instead is tested against 1 not 0
516 sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
517 # else = loop & ~pred_b, need this because used in g()
518 sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.V.LT
519 g(CR80.v.SO)
520 skip_g:
521 # conditionally call h(r30) if any loop pred set
522 sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
523 sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
524 ```