(no commit message)
[libreriscv.git] / openpower / sv / branches.mdwn
1 # SVP64 Branch Conditional behaviour
2
3 **DRAFT STATUS**
4
5 Please note: SVP64 Branch instructions should be
6 considered completely separate and distinct from
7 standard scalar OpenPOWER-approved v3.0B branches.
8 **v3.0B branches are in no way impacted, altered,
9 changed or modified in any way, shape or form by
10 the SVP64 Vectorised Variants**.
11
12 Links
13
14 * <https://bugs.libre-soc.org/show_bug.cgi?id=664>
15 * <http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html>
16 * [[openpower/isa/branch]]
17
18 # Rationale
19
20 Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test a
21 Condition Register. However for parallel processing it is simply impossible
22 to perform multiple independent branches: the Program Counter simply
23 cannot branch to multiple destinations based on multiple conditions.
24 The best that can be done is
25 to test multiple Conditions and make a decision of a *single* branch,
26 based on analysis of a *Vector* of CR Fields
27 which have just been calculated from a *Vector* of results.
28
29 In 3D Shader
30 binaries, which are inherently parallelised and predicated, testing all or
31 some results and branching based on multiple tests is extremely common,
32 and a fundamental part of Shader Compilers. Example:
33 without such multi-condition
34 test-and-branch, if a predicate mask is all zeros a large batch of
35 instructions may be masked out to `nop`, and it would waste
36 CPU cycles to run them. 3D GPU ISAs can test for this scenario
37 and, with the appropriate predicate-analysis instruction,
38 jump over fully-masked-out operations, by spotting that
39 *all* Conditions are false.
40
41 Unless Branches are aware and capable of such analysis, additional
42 instructions would be required which perform Horizontal Cumulative
43 analysis of Vectorised Condition Register Fields, in order to
44 reduce the Vector of CR Fields down to one single yes or no
45 decision that a Scalar-only v3.0B Branch-Conditional could cope with.
46 Such instructions would be unavoidable, required, and costly
47 by comparison to a single Vector-aware Branch.
48 Therefore, in order to be commercially competitive, `sv.bc` and
49 other Vector-aware Branch Conditional instructions are a high priority
50 for 3D GPU workloads.
51
52 Given that Power ISA v3.0B is already quite powerful, particularly
53 the Condition Registers and their interaction with Branches, there
54 are opportunities to create extremely flexible and compact
55 Vectorised Branch behaviour. In addition, the side-effects (updating
56 of CTR, truncation of VL, described below) make it a useful instruction
57 even if the branch points to the next instruction (no actual branch).
58
59 # Overview
60
61 When considering an "array" of branch-tests, there are four useful modes:
62 AND, OR, NAND and NOR of all Conditions.
63 NAND and NOR may be synthesised from AND and OR by
64 inverting `BO[2]` which just leaves two modes:
65
66 * Branch takes place on the first CR Field test to succeed
67 (a Great Big OR of all condition tests)
68 * Branch takes place only if **all** CR field tests succeed:
69 a Great Big AND of all condition tests
70 (including those where the predicate is masked out
71 and the corresponding CR Field is considered to be
72 set to `SNZ`)
73
74 Early-exit is enacted such that the Vectorised Branch does not
75 perform needless extra tests, which will help reduce reads on
76 the Condition Register file.
77
78 *Note: Early-exit is **MANDATORY** (required) behaviour.
79 Branches **MUST** exit at the first failure point, for
80 exactly the same reasons for which it is mandatory in
81 programming languages doing early-exit: to avoid
82 damaging side-effects. Speculative testing of Condition
83 Register Fields is permitted, as is speculative updating
84 of CTR, as long as, as usual in any Out-of-Order microarchitecture,
85 that speculative testing is cancelled should an early-exit occur.*
86
87 Additional useful behaviour involves two primary Modes (both of
88 which may be enabled and combined):
89
90 * **VLSET Mode**: identical to Data-Dependent Fail-First Mode
91 for Arithmetic SVP64 operations, with more
92 flexibility and a close interaction and integration into the
93 underlying base Scalar v3.0B Branch instruction.
94 * **CTR-test Mode**: gives much more flexibility over when and why
95 CTR is decremented, including options to decrement if a Condition
96 test succeeds *or if it fails*.
97
98 With these side-effects, basic Boolean Logic Analysis advises that
99 it is important to provide a means
100 to enact them each based on whether testing succeeds *or fails*. This
101 results in a not-insignificant number of additional Mode Augmentation bits,
102 accompanying VLSET and CTR-test Modes respectively.
103
104 Vectorised Branches can be used
105 in either SVP64 Horizontal-First or Vertical-First Mode. Essentially,
106 at an element level, the behaviour is identical in both Modes,
107 although the `ALL` bit is meaningless in Vertical-First Mode.
108
109 It is also important
110 to bear in mind that, fundamentally, Vectorised Branch-Conditional
111 is still extremely close to the Scalar v3.0B Branch-Conditional
112 instructions, and that the same v3.0B Scalar Branch-Conditional
113 instructions are still
114 *completely separate and independent*, being unaltered and
115 unaffected by their SVP64 variants in every conceivable way.
116
117 # Format and fields
118
119 With element-width overrides being meaningless for Condition
120 Register Fields, bits 4 thru 7 of SVP64 RM may be used for additional
121 Mode bits.
122
123 SVP64 RM `MODE` (includes `ELWIDTH` and `ELWIDTH_SRC` bits) for Branch
124 Conditional:
125
126 | 4 | 5 | 6 | 7 | 19 | 20 | 21 | 22 23 | description |
127 | - | - | - | - | -- | -- | --- |---------|----------------- |
128 |ALL|LRu| / | / | 0 | 0 | / | SNZ sz | normal mode |
129 |ALL|LRu| / |VSb| 0 | 1 | VLI | SNZ sz | VLSET mode |
130 |ALL|LRu|CTi| / | 1 | 0 | / | SNZ sz | CTR-test mode |
131 |ALL|LRu|CTi|VSb| 1 | 1 | VLI | SNZ sz | CTR-test+VLSET mode |
132
133 Brief description of fields:
134
135 * **sz=1** if predication is enabled and `sz=1` and a predicate
136 element bit is zero, `SNZ` will
137 be substituted in place of the CR bit selected by `BI`,
138 as the Condition tested.
139 Contrast this with
140 normal SVP64 `sz=1` behaviour, where *only* a zero is put in
141 place of masked-out predicate bits.
142 * **sz=0** When `sz=0` skipping occurs as usual on
143 masked-out elements, but unlike all
144 other SVP64 behaviour which entirely skips an element with
145 no related side-effects at all, there are certain
146 special circumstances where CTR
147 may be decremented. See CTR-test Mode, below.
148 * **ALL** when set, all branch conditional tests must pass in order for
149 the branch to succeed. When clear, it is the first sequentially
150 encountered successful test that causes the branch to succeed.
151 This is identical behaviour to how programming languages perform
152 early-exit on Boolean Logic chains.
153 * **VLI** VLSET is identical to Data-dependent Fail-First mode.
154 In VLSET mode, VL is set equal (truncated) to the first point
155 where, assuming Conditions are tested sequentially, the branch succeeds
156 *or fails* depending if VSb is set.
157 If VLI (Vector Length Inclusive) is clear,
158 VL is truncated to *exclude* the current element, otherwise it is
159 included. SVSTATE.MVL is not changed: only VL.
160 * **LRu**: Link Register Update. When set, Link Register will
161 only be updated if the Branch Condition succeeds. This avoids
162 destruction of LR during loops (particularly Vertical-First
163 ones).
164 * **VSb** is most relevant for Vertical-First VLSET Mode. After testing,
165 if VSb is set, VL is truncated if the branch succeeds. If VSb is clear,
166 VL is truncated if the branch did **not** take place.
167 * **CTi** CTR inversion. CTR-test Mode normally decrements per element
168 tested. CTR inversion decrements if a test *fails*. Only relevant
169 in CTR-test Mode.
170
171 # Vectorised CR Field numbering, and Scalar behaviour
172
173 It is important to keep in mind that just like all SVP64 instructions,
174 the `BI` field of the base v3.0B Branch Conditional instruction
175 may be extended by SVP64 EXTRA augmentation, as well as be marked
176 as either Scalar or Vector.
177
178 The `BI` field of Branch Conditional operations is five bits, in scalar
179 v3.0B this would select one bit of the 32 bit CR,
180 comprising eight CR Fields of 4 bits each. In SVP64 there are
181 16 32 bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
182 `BI` select the bit from the CR Field (EQ LT GT SO), and the top 3 bits
183 are extended to either scalar or vector and to select CR Fields 0..127
184 as specified in SVP64 [[sv/svp64/appendix]].
185
186 When the CR Fields selected by SVP64-Augmented `BI` is marked as scalar,
187 then as the usual SVP64 rules apply:
188 the Vector loop ends at the first element tested, after taking
189 predication into consideration. Thus, also as usual, when a predicate mask is
190 given, and `BI` marked as scalar, and `sz` is zero, srcstep
191 skips forward to the first non-zero predicated element, and only that
192 one element is tested.
193
194 In other words, the fact that this is a Branch
195 Operation (instead of an arithmetic one) does not result, ultimately,
196 in significant changes as to
197 how SVP64 is fundamentally applied, except with respect to
198 the unique properties associated with conditionally
199 changing the Program
200 Counter (aka "a Branch"), resulting in early-out
201 opportunities and CTR-testing, which are outlined below.
202
203 # Horizontal-First and Vertical-First Modes
204
205 In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
206 AND) results in early exit: no more updates to CTR occur (if requested);
207 no branch occurs, and LR is not updated (if requested). Likewise for
208 non-ALL mode (Great Big Or) on first success early exit also occurs,
209 however this time with the Branch proceeding. In both cases the testing
210 of the Vector of CRs should be done in linear sequential order (or in
211 REMAP re-sequenced order): such that tests that are sequentially beyond
212 the exit point are *not* carried out. (*Note: it is standard practice in
213 Programming languages to exit early from conditional tests, however
214 a little unusual to consider in an ISA that is designed for Parallel
215 Vector Processing. The reason is to have strictly-defined guaranteed
216 behaviour*)
217
218 In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
219 behaviour. Given that only one element is being tested at a time
220 in Vertical-First Mode, a test designed to be done on multiple
221 bits is meaningless.
222
223 # Description and Modes
224
225 Predication in both INT and CR modes may be applied to `sv.bc` and other
226 SVP64 Branch Conditional operations, exactly as they may be applied to
227 other SVP64 operations. When `sz` is zero, any masked-out Branch-element
228 operations are not included in condition testing, exactly like all other
229 SVP64 operations, *including* side-effects such as potentially updating
230 LR or CTR, which will also be skipped. There is *one* exception here,
231 which is when
232 `BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
233 predicate mask bit is also zero:
234 under these special circumstances CTR will also decrement.
235
236 When `sz` is non-zero, this normally requests insertion of a zero
237 in place of the input data, when the relevant predicate mask bit is zero.
238 This would mean that a zero is inserted in place of `CR[BI+32]` for
239 testing against `BO`, which may not be desirable in all circumstances.
240 Therefore, an extra field is provided `SNZ`, which, if set, will insert
241 a **one** in place of a masked-out element, instead of a zero.
242
243 (*Note: Both options are provided because it is useful to deliberately
244 cause the Branch-Conditional Vector testing to fail at a specific point,
245 controlled by the Predicate mask. This is particularly useful in `VLSET`
246 mode, which will truncate SVSTATE.VL at the point of the first failed
247 test.*)
248
249 Normally, CTR mode will decrement once per Condition Test, resulting
250 under normal circumstances that CTR reduces by up to VL in Horizontal-First
251 Mode. Just as when v3.0B Branch-Conditional saves at
252 least one instruction on tight inner loops through auto-decrementation
253 of CTR, likewise it is also possible to save instruction count for
254 SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
255 in circumstances where there is conditional interaction between the
256 element computation and testing, and the continuation (or otherwise)
257 of a given loop. The potential combinations of interactions is why CTR
258 testing options have been added.
259
260 If both CTR-test and VLSET Modes are requested, then because the CTR decrement is on a per element basis, the total amount that CTR is decremented
261 by will end up being VL *after* truncation (should that occur). In
262 other words, the order is strictly (as can be seen in pseudocode, below):
263
264 1. compute the test
265 2. (optionally) decrement CTR
266 3. (optionally) truncate VL
267 4. decide (based on step 1) whether to terminate looping
268 (including not executing step 5)
269 5. decide whether to branch.
270
271 CTR-test mode and CTi interaction is as follows: note that
272 `BO[2]` is still required to be clear for decrements to be
273 considered.
274
275 * **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
276 if `BO[2]` is zero. Masked-out elements when `sz=0` are
277 skipped (i.e. CTR is *not* decremented when the predicate
278 bit is zero).
279 * **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
280 if `BO[2]` is zero and a masked-out element is skipped
281 (`sz=0` and predicate bit is zero). This one special case is the
282 **opposite** of other combinations, as well as being
283 completely different from normal SVP64 `sz=0` behaviour)
284 * **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
285 if `BO[2]` is zero and the Condition Test succeeds.
286 Masked-out elements when `sz=0` are skipped (including
287 not decrementing CTR)
288 * **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
289 if `BO[2]` is zero and the Condition Test *fails*.
290 Masked-out elements when `sz=0` are skipped (including
291 not decrementing CTR)
292
293 `CTR-test=0, CTi=1, sz=0` requires special emphasis because it is the
294 only time in the entirety of SVP64 that has side-effects when
295 a predicate mask bit is clear. **All** other SVP64 operations
296 entirely skip an element when sz=0 and a predicate mask bit is zero.
297
298 Interestingly, due to the side-effects of `VLSET` mode
299 it is actually useful to use Branch Conditional even
300 to perform no actual branch operation, i.e to point to the instruction
301 after the branch. Truncation of VL would thus conditionally occur yet control
302 flow alteration would not.
303
304 Also, the unconditional bit `BO[0]` is still relevant when Predication
305 is applied to the Branch because in `ALL` mode all nonmasked bits have
306 to be tested. Even when VLSET mode is not used, CTR
307 may still be decremented by the total number of nonmasked elements.
308 In short, Vectorised Branch becomes an extremely powerful tool.
309
310 `VLSET` mode with Vertical-First is particularly unusual. Vertical-First
311 is designed to be used for explicit looping, where an explicit call to
312 `svstep` is required to move both srcstep and dststep on to
313 the next element, until VL (or other condition) is reached.
314 Vertical-First Looping is expected (required) to terminate if the end
315 of the Vector, VL, is reached. If however that loop is terminated early
316 because VL is truncated, VLSET with Vertical-First becomes meaningless.
317 Resolving this would require two branches: one Conditional, the other
318 branching unconditionally to create the loop, where the Conditional
319 one jumps over it.
320
321 Therefore, with `VSb`, the option to decide whether truncation should occur if the
322 branch succeeds *or* if the branch condition fails allows for the flexibility
323 required. This allows a Vertical-First Branch to *either* be used as
324 a branch-back (loop) *or* as part of a conditional exit or function
325 call from *inside* a loop, and for VLSET to be integrated into both
326 types of decision-making.
327
328 `VLSET` mode with Horizontal-First when `VSb` is clear is still
329 useful, because it can be used to truncate VL to the first predicated
330 (non-masked-out) element.
331
332 *Programming note: One important point is that SVP64 instructions are 64 bit.
333 (8 bytes not 4). This needs to be taken into consideration when computing
334 branch offsets: the offset is relative to the start of the instruction,
335 which includes the SVP64 Prefix*
336
337 # Boolean Logic combinations
338
339 There are an extraordinary number of different combinations which
340 provide completely different and useful behaviour.
341 Available options to combine:
342
343 * `BO[0]` to make an unconditional branch would seem irrelevant if
344 it were not for predication and for side-effects (CTR Mode
345 for example)
346 * Enabling CTR-test Mode and setting `BO[2]` can still result in the
347 Branch
348 taking place, not because the Condition Test itself failed, but
349 because CTR reached zero **because**, as required by CTR-test mode,
350 CTR was decremented as a **result** of Condition Tests failing.
351 * `BO[1]` to select whether the CR bit being tested is zero or nonzero
352 * `R30` and `~R30` and other predicate mask options including CR and
353 inverted CR bit testing
354 * `sz` and `SNZ` to insert either zeros or ones in place of masked-out
355 predicate bits
356 * `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
357 `OR` of all tests, respectively.
358 * Predicate Mask bits, which combine in effect with the CR being
359 tested.
360 * Inversion of Predicate Masks (`~r3` instead of `r3`, or using
361 `NE` rather than `EQ`) which results in an additional
362 level of possible ANDing, ORing etc. that would otherwise
363 need explicit instructions.
364
365 The most obviously useful combinations here are to set `BO[1]` to zero
366 in order to turn `ALL` into Great-Big-NAND and `ANY` into
367 Great-Big-NOR. Other Mode bits which perform behavioural inversion then
368 have to work round the fact that the Condition Testing is NOR or NAND.
369 The alternative to not having additional behavioural inversion
370 (`SNZ`, `VSb`, `CTi`) would be to have a second (unconditional)
371 branch directly after the first, which the first branch jumps over.
372 This contrived construct is avoided by the behavioural inversion bits.
373
374 # Pseudocode and examples
375
376 Pseudocode for Horizontal-First Mode:
377
378 ```
379 cond_ok = not SVRMmode.ALL
380 for srcstep in range(VL):
381 # select predicate bit or zero/one
382 if predicate[srcstep]:
383 # get SVP64 extended CR field 0..127
384 SVCRf = SVP64EXTRA(BI>>2)
385 CRbits = CR{SVCRf}
386 testbit = CRbits[BI & 0b11]
387 # testbit = CR[BI+32+srcstep*4]
388 else if not SVRMmode.sz:
389 # inverted CTR test skip mode
390 if ¬BO[2] & CTRtest & ¬CTI then
391 CTR = CTR - 1
392 continue
393 else
394 testbit = SVRMmode.SNZ
395 # actual element test here
396 el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
397 # merge in the test
398 if SVRMmode.ALL:
399 cond_ok &= el_cond_ok
400 else
401 cond_ok |= el_cond_ok
402 # test for VL to be set (and exit)
403 if VLSET and VSb = el_cond_ok then
404 if SVRMmode.VLI
405 SVSTATE.VL = srcstep+1
406 else
407 SVSTATE.VL = srcstep
408 break
409 # early exit?
410 if SVRMmode.ALL:
411 if ~el_cond_ok:
412 break
413 else
414 if el_cond_ok:
415 break
416 if SVCRf.scalar:
417 break
418 ```
419
420 Pseudocode for Vertical-First Mode:
421
422 ```
423 # get SVP64 extended CR field 0..127
424 SVCRf = SVP64EXTRA(BI>>2)
425 CRbits = CR{SVCRf}
426 # select predicate bit or zero/one
427 if predicate[srcstep]:
428 if BRc = 1 then # CR0 vectorised
429 CR{SVCRf+srcstep} = CRbits
430 testbit = CRbits[BI & 0b11]
431 else if not SVRMmode.sz:
432 # inverted CTR test skip mode
433 if ¬BO[2] & CTRtest & ¬CTI then
434 CTR = CTR - 1
435 SVSTATE.srcstep = new_srcstep
436 exit # no branch testing
437 else
438 testbit = SVRMmode.SNZ
439 # actual element test here
440 cond_ok <- BO[0] | ¬(testbit ^ BO[1])
441 # test for VL to be set (and exit)
442 if VLSET and cond_ok = VSb then
443 if SVRMmode.VLI
444 SVSTATE.VL = new_srcstep+1
445 else
446 SVSTATE.VL = new_srcstep
447 ```
448
449 v3.0B branch pseudocode including LRu and CTR skipping
450
451 ```
452 if (mode_is_64bit) then M <- 0
453 else M <- 32
454 cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
455 ctrdec = ¬BO[2]
456 if CTRtest & (cond_ok ^ CTi) then
457 ctrdec = 0b0
458 if ctrdec then CTR <- CTR - 1
459 ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
460 lr_ok <- SVRMmode.LRu
461 if ctr_ok & cond_ok then
462 if AA then NIA <-iea EXTS(BD || 0b00)
463 else NIA <-iea CIA + EXTS(BD || 0b00)
464 lr_ok <- 0b1
465 if LK & lr_ok then LR <-iea CIA + 4
466 ```
467
468 # Example Shader code
469
470 ```
471 while(a > 2) {
472 if(b < 5)
473 f();
474 else
475 g();
476 h();
477 }
478 ```
479
480 which compiles to something like:
481
482 ```
483 vec<i32> a, b;
484 // ...
485 pred loop_pred = a > 2;
486 while(loop_pred.any()) {
487 pred if_pred = loop_pred & (b < 5);
488 if(if_pred.any()) {
489 f(if_pred);
490 }
491 label1:
492 pred else_pred = loop_pred & ~if_pred;
493 if(else_pred.any()) {
494 g(else_pred);
495 }
496 h(loop_pred);
497 }
498 ```
499
500 which will end up as:
501
502 ```
503 sv.cmpi CR60.v a.v, 2 # vector compare a into CR60 vector
504 sv.crweird r30, CR60.GT # transfer GT vector to r30
505 while_loop:
506 sv.cmpi CR80.v, b.v, 5 # vector compare b into CR64 Vector
507 sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
508 # only calculate loop_pred & pred_b because needed in f()
509 sv.crand CR80.v.SO, CR60.v.GT, CR80.V.LT # if = loop & pred_b
510 f(CR80.v.SO)
511 skip_f:
512 # illustrate inversion of pred_b. invert r30, test ALL
513 # rather than SOME, but masked-out zero test would FAIL,
514 # therefore masked-out instead is tested against 1 not 0
515 sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
516 # else = loop & ~pred_b, need this because used in g()
517 sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.V.LT
518 g(CR80.v.SO)
519 skip_g:
520 # conditionally call h(r30) if any loop pred set
521 sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
522 sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
523 ```