(no commit message)
[libreriscv.git] / openpower / sv / branches.mdwn
1 # SVP64 Branch Conditional behaviour
2
3 **DRAFT STATUS**
4
5 Please note: SVP64 Branch instructions should be
6 considered completely separate and distinct from
7 standard scalar OpenPOWER-approved v3.0B branches.
8 **v3.0B branches are in no way impacted, altered,
9 changed or modified in any way, shape or form by
10 the SVP64 Vectorised Variants**.
11
12 Links
13
14 * <https://bugs.libre-soc.org/show_bug.cgi?id=664>
15 * <http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html>
16 * [[openpower/isa/branch]]
17
18 # Rationale
19
20 Scalar 3.0B Branch Conditional operations, `bc`, `bctar` etc. test a
21 Condition Register. However for parallel processing it is simply impossible
22 to perform multiple independent branches: the Program Counter simply
23 cannot branch to multiple destinations based on multiple conditions.
24 The best that can be done is
25 to test multiple Conditions and make a decision of a *single* branch,
26 based on analysis of a *Vector* of CR Fields
27 which have just been calculated from a *Vector* of results.
28
29 In 3D Shader
30 binaries, which are inherently parallelised and predicated, testing all or
31 some results and branching based on multiple tests is extremely common,
32 and a fundamental part of Shader Compilers. Example:
33 without such multi-condition
34 test-and-branch, if a predicate mask is all zeros a large batch of
35 instructions may be masked out to `nop`, and it would waste
36 CPU cycles not only to run them but also to load the predicate
37 mask repeatedly for each one. 3D GPU ISAs can test for this scenario
38 and jump over the fully-masked-out operations, by spotting that
39 *all* Conditions are false. Or, conversely, they only call the function if at least
40 one Condition is set.
41 Therefore, in order to be commercially competitive, `sv.bc` and
42 other Vector-aware Branch Conditional instructions are a high priority
43 for 3D GPU workloads.
44
45 Given that Power ISA v3.0B is already quite powerful, particularly
46 the Condition Registers and their interaction with Branches, there
47 are opportunities to create an extremely flexible and compact
48 Vectorised Branch behaviour. In addition, the side-effects (updating
49 of CTR, truncation of VL, described below) make it a useful instruction
50 even if the branch points to the next instruction (no actual branch).
51
52 # Overview
53
54 When considering an "array" of branch-tests, there are four useful modes:
55 AND, OR, NAND and NOR of all Conditions.
56 NAND and NOR may be synthesised by
57 inverting `BO[2]` which just leaves two modes:
58
59 * Branch takes place on the first CR Field test to succeed
60 (a Great Big OR of all condition tests)
61 * Branch takes place only if **all** CR field tests succeed:
62 a Great Big AND of all condition tests
63 (including those where the predicate is masked out
64 and the corresponding CR Field is considered to be
65 set to `SNZ`)
66
67 Early-exit is enacted such that the Vectorised Branch does not
68 perform needless extra tests, which will help reduce reads on
69 the Condition Register file.
70
71 Additional useful behaviour involves two primary Modes (both of
72 which may be enabled and combined):
73
74 * **VLSET Mode**: identical to Data-Dependent Fail-First Mode
75 for Arithmetic SVP64 operations, with more
76 flexibility and a close interaction and integration into the
77 underlying base Scalar v3.0B Branch instruction.
78 * **CTR-test Mode**: gives much more flexibility over when and why
79 CTR is decremented, including options to decrement if a Condition
80 test succeeds *or if it fails*.
81
82 It is also important to note that Vectorised Branches can be used
83 in either SVP64 Horizontal-First or Vertical-First Mode. Essentially
84 the behaviour is identical in both Modes. It is also important
85 to bear in mind that, fundamentally, Vectorised Branch-Conditional
86 is still extremely close to the Scalar v3.0B Branch-Conditional
87 instructions, and that the same v3.0B Scalar Branch-Conditional
88 instructions are still
89 *completely separate and independent*, being unaltered and
90 ubaffected by their SVP64 variants in every conceivable way.
91
92 # Format and fields
93
94 SVP64 RM `MODE` (includes `ELWIDTH` and `ELWIDTH_SRC` bits) for Branch
95 Conditional:
96
97 | 4 | 5 | 6 | 7 | 19 | 20 | 21 | 22 23 | description |
98 | - | - | - | - | -- | -- | --- |---------|----------------- |
99 |ALL|LRu| / | / | 0 | 0 | / | SNZ sz | normal mode |
100 |ALL|LRu| / |VSb| 0 | 1 | VLI | SNZ sz | VLSET mode |
101 |ALL|LRu|CTi| / | 1 | 0 | / | SNZ sz | CTR-test mode |
102 |ALL|LRu|CTi|VSb| 1 | 1 | VLI | SNZ sz | CTR-test+VLSET mode |
103
104 Brief description of fields:
105
106 * **sz** if predication is enabled will put 4 copies of `SNZ` in place of
107 the src CR Field when the predicate bit is zero. otherwise the element
108 is ignored or skipped, depending on context.
109 * **ALL** when set, all branch conditional tests must pass in order for
110 the branch to succeed. When clear, it is the first sequentially
111 encountered successful test that causes the branch to succeed.
112 This is identical behaviour to how programming languages perform
113 early-exit on Boolean Logic chains.
114 * **VLI** VLSET is identical to Data-dependent Fail-First mode.
115 In VLSET mode, VL is set equal (truncated) to the first point
116 where, assuming Conditions are tested sequentially, the branch succeeds
117 *or fails* depending if VSb is set.
118 If VLI (Vector Length Inclusive) is clear,
119 VL is truncated to *exclude* the current element, otherwise it is
120 included. SVSTATE.MVL is not changed: only VL.
121 * **LRu**: Link Register Update. When set, Link Register will
122 only be updated if the Branch Condition succeeds. This avoids
123 destruction of LR during loops (particularly Vertical-First
124 ones).
125 * **VSb** is most relevant for Vertical-First VLSET Mode. After testing,
126 if VSb is set, VL is truncated if the branch succeeds. If VSb is clear,
127 VL is truncated if the branch did **not** take place.
128 * **CTi** CTR inversion. CTR-test Mode normally decrements per element
129 tested. CTR inversion decrements if a test *fails*. Only relevant
130 in CTR-test Mode.
131
132 # Vectorised CR Field numbering, and Scalar behaviour
133
134 It is important to keep in mind that just like all SVP64 instructions,
135 the `BI` field of the base v3.0B Branch Conditional instruction
136 may be extended by SVP64 EXTRA augmentation, as well as be marked
137 as either Scalar or Vector.
138
139 The `BI` field of Branch Conditional operations is five bits, in scalar
140 v3.0B this would select one bit of the 32 bit CR,
141 comprising eight CR Fields of 4 bits each. In SVP64 there are
142 16 32 bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
143 `BI` select the bit from the CR Field (EQ LT GT SO), and the top 3 bits
144 are extended to either scalar or vector and to select CR Fields 0..127
145 as specified in SVP64 [[sv/svp64/appendix]].
146
147 When the CR Fields selected by SVP64-Augmented `BI` is marked as scalar,
148 then as the usual SVP64 rules apply:
149 the loop ends at the first element tested, after taking
150 predication into consideration. Thus, also as usual, when a predicate mask is
151 given, and `BI` marked as scalar, and `sz` is zero, srcstep
152 skips forward to the first non-zero predicated element, and only that
153 one element is tested.
154
155 In other words, the fact that this is a Branch
156 Operation (instead of an arithmetic one) does not result, ultimately,
157 in significant changes as to
158 how SVP64 is fundamentally applied, except with respect to early-out
159 opportunities and CTR-testing, which are outlined below.
160
161 # Description and Modes
162
163 In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
164 AND) results in early exit: no more updates to CTR occur (if requested);
165 no branch occurs, and LR is not updated (if requested). Likewise for
166 non-ALL mode (Great Big Or) on first success early exit also occurs,
167 however this time with the Branch proceeding. In both cases the testing
168 of the Vector of CRs should be done in linear sequential order (or in
169 REMAP re-sequenced order): such that tests that are sequentially beyond
170 the exit point are *not* carried out. (*Note: it is standard practice in
171 Programming languages to exit early from conditional tests, however
172 a little unusual to consider in an ISA that is designed for Parallel
173 Vector Processing. The reason is to have strictly-defined guaranteed
174 behaviour*)
175
176 In Vertical-First Mode, setting the `ALL` bit results in `UNDEFINED`
177 behaviour. Given that only one element is being tested at a time
178 in Vertical-First Mode, a test designed to be done on multiple
179 bits is meaningless.
180
181 Predication in both INT and CR modes may be applied to `sv.bc` and other
182 SVP64 Branch Conditional operations, exactly as they may be applied to
183 other SVP64 operations. When `sz` is zero, any masked-out Branch-element
184 operations are not included in condition testing, exactly like all other
185 SVP64 operations, *including* side-effects such as potentially updating
186 LR or CTR, which will also be skipped. There is *one* exception here,
187 which is when
188 `BO[2]=0, sz=0, CTR-test=0, CTi=1` and the relevant element
189 predicate mask bit is also zero:
190 under these special circumstances CTR will also decrement.
191
192 When `sz` is non-zero, this normally requests insertion of a zero
193 in place of the input data, when the relevant predicate mask bit is zero.
194 This would mean that a zero is inserted in place of `CR[BI+32]` for
195 testing against `BO`, which may not be desirable in all circumstances.
196 Therefore, an extra field is provided `SNZ`, which, if set, will insert
197 a **one** in place of a masked-out element, instead of a zero.
198
199 (*Note: Both options are provided because it is useful to deliberately
200 cause the Branch-Conditional Vector testing to fail at a specific point,
201 controlled by the Predicate mask. This is particularly useful in `VLSET`
202 mode, which will truncate SVSTATE.VL at the point of the first failed
203 test.*)
204
205
206 Normally, CTR mode will decrement once per Condition Test, resulting
207 under normal circumstances that CTR reduces by up to VL in Horizontal-First
208 Mode. Just as when v3.0B Branch-Conditional saves at
209 least one instruction on tight inner loops through auto-decrementation
210 of CTR, likewise it is also possible to save instruction count for
211 SVP64 loops in both Vertical-First and Horizontal-First Mode, particularly
212 in circumstances where there is conditional interaction between the
213 element computation and testing, and the continuation (or otherwise)
214 of a given loop. The potential combinations of interactions is why CTR
215 testing options have been added.
216
217 If both CTR-test and VLSET Modes are requested, then because the CTR decrement is on a per element basis, the total amount that CTR is decremented
218 by will end up being VL *after* truncation (should that occur). In
219 other words, the order is strictly (as can be seen in pseudocode, below):
220
221 1. compute the test
222 2. (optionally) decrement CTR
223 3. (optionally) truncate VL
224 4. decide (based on step 1) whether to terminate looping
225 (including not executing step 5)
226 5. decide whether to branch.
227
228 CTR-test mode and CTi interaction is as follows: note that
229 `BO[2]` is still required to be clear for decrements to be
230 considered.
231
232 * **CTR-test=0, CTi=0**: CTR decrements on a per-element basis
233 if `BO[2]` is zero. Masked-out elements when `sz=0` are
234 skipped.
235 * **CTR-test=0, CTi=1**: CTR decrements on a per-element basis
236 if `BO[2]` is zero and a masked-out element is skipped
237 (`sz=0` and predicate bit is zero). This one special case is the
238 **opposite** of other combinations.
239 * **CTR-test=1, CTi=0**: CTR decrements on a per-element basis
240 if `BO[2]` is zero and the Condition Test succeeds.
241 Masked-out elements when `sz=0` are skipped.
242 * **CTR-test=1, CTi=1**: CTR decrements on a per-element basis
243 if `BO[2]` is zero and the Condition Test *fails*.
244 Masked-out elements when `sz=0` are skipped.
245
246 Note that, interestingly, due to the side-effects of `VLSET` mode
247 it is actually useful to use Branch Conditional even
248 to perform no actual branch operation, i.e to point to the instruction
249 after the branch. Truncation of VL would thus conditionally occur yet control
250 flow alteration would not.
251
252 Also, the unconditional bit `BO[0]` is still relevant when Predication
253 is applied to the Branch because in `ALL` mode all nonmasked bits have
254 to be tested. Even when svstep mode or VLSET mode are not used, CTR
255 may still be decremented by the total number of nonmasked elements.
256 In short, Vectorised Branch becomes an extremely powerful tool.
257
258 `VLSET` mode with Vertical-First is particularly unusual. Vertical-First
259 is used for explicit looping, where the looping is to terminate if the end
260 of the Vector, VL, is reached. If however that loop is terminated early
261 because VL is truncated, VLSET with Vertical-First becomes meaningless.
262 Therefore, with `VSb`, the option to decide whether truncation should occur if the
263 branch succeeds *or* if the branch condition fails allows for flexibility
264 required.
265
266 `VLSET` mode with Horizontal-First when `VSb` is clear is still
267 useful, because it can be used to truncate VL to the first predicated
268 (non-masked-out) element.
269
270 Available options to combine:
271
272 * `BO[0]` to make an unconditional branch would seem irrelevant if
273 it were not for predication and for side-effects.
274 * `BO[1]` to select whether the CR bit being tested is zero or nonzero
275 * `R30` and `~R30` and other predicate mask options including CR and
276 inverted CR bit testing
277 * `sz` and `SNZ` to insert either zeros or ones in place of masked-out
278 predicate bits
279 * `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
280 `OR` of all tests, respectively.
281
282 In addition to the above, it is necessary to select whether, in `svstep`
283 mode, the Vector CR Field is to be overwritten or not: in some cases it
284 is useful to know but in others all that is needed is the branch itself.
285
286 *Programming note: One important point is that SVP64 instructions are 64 bit.
287 (8 bytes not 4). This needs to be taken into consideration when computing
288 branch offsets: the offset is relative to the start of the instruction,
289 which includes the SVP64 Prefix*
290
291 # Pseudocode and examples
292
293 Pseudocode for Horizontal-First Mode:
294
295 ```
296 cond_ok = not SVRMmode.ALL
297 for srcstep in range(VL):
298 # select predicate bit or zero/one
299 if predicate[srcstep]:
300 # get SVP64 extended CR field 0..127
301 SVCRf = SVP64EXTRA(BI>>2)
302 CRbits = CR{SVCRf}
303 testbit = CRbits[BI & 0b11]
304 # testbit = CR[BI+32+srcstep*4]
305 else if not SVRMmode.sz:
306 # inverted CTR test skip mode
307 if ¬BO[2] & CTRtest & ¬CTI then
308 CTR = CTR - 1
309 continue
310 else
311 testbit = SVRMmode.SNZ
312 # actual element test here
313 el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
314 # merge in the test
315 if SVRMmode.ALL:
316 cond_ok &= el_cond_ok
317 else
318 cond_ok |= el_cond_ok
319 # test for VL to be set (and exit)
320 if VLSET and VSb = el_cond_ok then
321 if SVRMmode.VLI
322 SVSTATE.VL = srcstep+1
323 else
324 SVSTATE.VL = srcstep
325 break
326 # early exit?
327 if SVRMmode.ALL:
328 if ~el_cond_ok:
329 break
330 else
331 if el_cond_ok:
332 break
333 if SVCRf.scalar:
334 break
335 ```
336
337 Pseudocode for Vertical-First Mode:
338
339 ```
340 # get SVP64 extended CR field 0..127
341 SVCRf = SVP64EXTRA(BI>>2)
342 CRbits = CR{SVCRf}
343 # select predicate bit or zero/one
344 if predicate[srcstep]:
345 if BRc = 1 then # CR0 vectorised
346 CR{SVCRf+srcstep} = CRbits
347 testbit = CRbits[BI & 0b11]
348 else if not SVRMmode.sz:
349 # inverted CTR test skip mode
350 if ¬BO[2] & CTRtest & ¬CTI then
351 CTR = CTR - 1
352 SVSTATE.srcstep = new_srcstep
353 exit # no branch testing
354 else
355 testbit = SVRMmode.SNZ
356 # actual element test here
357 cond_ok <- BO[0] | ¬(testbit ^ BO[1])
358 # test for VL to be set (and exit)
359 if VLSET and cond_ok = VSb then
360 if SVRMmode.VLI
361 SVSTATE.VL = new_srcstep+1
362 else
363 SVSTATE.VL = new_srcstep
364 ```
365
366 v3.0B branch pseudocode including LRu and CTR skipping
367
368 ```
369 if (mode_is_64bit) then M <- 0
370 else M <- 32
371 cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
372 ctrdec = ¬BO[2]
373 if CTRtest & (cond_ok ^ CTi) then
374 ctrdec = 0b0
375 if ctrdec then CTR <- CTR - 1
376 ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
377 lr_ok <- SVRMmode.LRu
378 if ctr_ok & cond_ok then
379 if AA then NIA <-iea EXTS(BD || 0b00)
380 else NIA <-iea CIA + EXTS(BD || 0b00)
381 lr_ok <- 0b1
382 if LK & lr_ok then LR <-iea CIA + 4
383 ```
384
385 # Example Shader code
386
387 ```
388 while(a > 2) {
389 if(b < 5)
390 f();
391 else
392 g();
393 h();
394 }
395 ```
396
397 which compiles to something like:
398
399 ```
400 vec<i32> a, b;
401 // ...
402 pred loop_pred = a > 2;
403 while(loop_pred.any()) {
404 pred if_pred = loop_pred & (b < 5);
405 if(if_pred.any()) {
406 f(if_pred);
407 }
408 label1:
409 pred else_pred = loop_pred & ~if_pred;
410 if(else_pred.any()) {
411 g(else_pred);
412 }
413 h(loop_pred);
414 }
415 ```
416
417 which will end up as:
418
419 ```
420 sv.cmpi CR60.v a.v, 2 # vector compare a into CR60 vector
421 sv.crweird r30, CR60.GT # transfer GT vector to r30
422 while_loop:
423 sv.cmpi CR80.v, b.v, 5 # vector compare b into CR64 Vector
424 sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
425 # only calculate loop_pred & pred_b because needed in f()
426 sv.crand CR80.v.SO, CR60.v.GT, CR80.V.LT # if = loop & pred_b
427 f(CR80.v.SO)
428 skip_f:
429 # illustrate inversion of pred_b. invert r30, test ALL
430 # rather than SOME, but masked-out zero test would FAIL,
431 # therefore masked-out instead is tested against 1 not 0
432 sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
433 # else = loop & ~pred_b, need this because used in g()
434 sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.V.LT
435 g(CR80.v.SO)
436 skip_g:
437 # conditionally call h(r30) if any loop pred set
438 sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
439 sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
440 ```