# SVP64 Branch Conditional behaviour

**DRAFT STATUS**

Please note: SVP64 Branch instructions should be
considered completely separate and distinct from
standard scalar OpenPOWER-approved v3.0B branches.
**v3.0B branches are in no way impacted, altered,
changed or modified in any way, shape or form by
the SVP64 Vectorised Variants**.

Links:

* <https://bugs.libre-soc.org/show_bug.cgi?id=664>
* <http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-August/003416.html>
* [[openpower/isa/branch]]

Scalar v3.0B Branch Conditional operations, `bc`, `bctar` etc. test a
Condition Register. When doing so in a Vector Context, it is quite
reasonable and logical to test and Branch on a *Vector* of CR Fields
which have just been calculated from a *Vector* of results. In 3D Shader
binaries, which are inherently parallelised and predicated, testing all or
some results and branching based on multiple tests is extremely common,
and a fundamental part of Shader Compilers. Therefore, `sv.bc` and
other Vector-aware Branch Conditional instructions are a high priority
for 3D GPUs.

The `BI` field of Branch Conditional operations is five bits. In scalar
v3.0B it selects one bit of the 32-bit CR. In SVP64 there are
16 32-bit CRs, containing 128 4-bit CR Fields. Therefore, the 2 LSBs of
`BI` select the bit within the CR Field (LT, GT, EQ or SO), and the top
3 bits are extended, as either scalar or vector, to select CR Fields 0..127
as specified in SVP64 [[sv/svp64/appendix]].
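As a minimal sketch of the split described above (not of the SVP64 EXTRA extension itself, which the appendix defines), the following Python separates a 5-bit `BI` into its 3-bit CR Field selector and the bit within the field. The names `split_bi` and `CR_BIT_NAMES` are illustrative only; the bit order LT, GT, EQ, SO is the standard Power ISA CR Field layout.

```python
# Illustrative helper, not part of any Libre-SOC codebase.
CR_BIT_NAMES = ("LT", "GT", "EQ", "SO")  # bits 0..3 of a CR Field


def split_bi(bi: int) -> tuple[int, str]:
    """Split a 5-bit BI into (3-bit CR Field selector, bit name)."""
    if not 0 <= bi < 32:
        raise ValueError("BI is a 5-bit field")
    # top 3 bits: field selector (before any SVP64 extension)
    # low 2 bits: which bit of the 4-bit CR Field is tested
    return bi >> 2, CR_BIT_NAMES[bi & 0b11]


print(split_bi(0b01010))
```

In scalar v3.0B terms, `BI=0b01010` therefore refers to the EQ bit of CR Field 2.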

When considering an "array" of branches, there are four useful modes:
AND, OR, NAND and NOR of all Conditions.
NAND and NOR may be synthesised by
inverting `BO[1]`, which leaves just two modes:

* Branch takes place on the first CR test to succeed
  (a Great Big OR of all condition tests)
* Branch takes place only if **all** CR tests succeed:
  a Great Big AND of all condition tests
  (including those where the predicate is masked out
  and the corresponding CR Field is considered to be
  set to `SNZ`)
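The reduction from four modes to two rests on De Morgan's laws: inverting the sense of every element test turns an OR of tests into a NAND of the originals, and an AND into a NOR. The sketch below checks this exhaustively for a 4-element vector. It assumes the per-element expression `BO[0] | ¬(testbit ^ BO[1])` used in the pseudocode on this page; the function name `el_cond_ok` is borrowed from that pseudocode, and everything else is illustrative.

```python
from itertools import product


def el_cond_ok(testbit: int, bo0: int, bo1: int) -> bool:
    """Per-element test: BO[0] | not (testbit xor BO[1])."""
    return bool(bo0) or testbit == bo1


# Exhaustively check every 4-element vector of CR test bits.
for bits in product((0, 1), repeat=4):
    plain = [el_cond_ok(b, 0, 1) for b in bits]     # "bit is set"
    inverted = [el_cond_ok(b, 0, 0) for b in bits]  # sense inverted
    assert any(inverted) == (not all(plain))  # OR of inverted  == NAND
    assert all(inverted) == (not any(plain))  # AND of inverted == NOR

print("NAND and NOR are covered by ANY/ALL with inverted tests")
```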

When the CR Field selected by the SVP64-Augmented `BI` is marked as scalar,
then as usual the loop ends at the first element tested, after taking
predication into consideration. Thus, as usual, when `sz` is zero, srcstep
skips forward to the first non-zero predicated element, and only that
one element is tested.

In SVP64 Horizontal-First Mode, the first failure in ALL mode (Great Big
AND) results in early exit: no more updates to CTR occur (if requested);
no branch occurs, and LR is not updated (if requested). Likewise for
non-ALL mode (Great Big OR), early exit also occurs on first success,
but this time with the Branch proceeding. In both cases the testing
of the Vector of CRs should be done in linear sequential order (or in
REMAP re-sequenced order), such that tests that are sequentially beyond
the exit point are *not* carried out. (*Note: it is standard practice in
programming languages to exit early from conditional tests, but
a little unusual to consider in an ISA that is designed for Parallel
Vector Processing. The reason is to have strictly-defined guaranteed
behaviour.*)
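The early-exit rule can be illustrated with a small Python sketch. It assumes the per-element test results are already available as a list of booleans and ignores predication, CTR and LR; `vector_branch` is a hypothetical name, and the count of tests performed is returned only to make the "tests beyond the exit point are not carried out" property visible.

```python
def vector_branch(tests, all_mode):
    """Return (branch_taken, number_of_tests_performed)."""
    performed = 0
    for ok in tests:
        performed += 1
        if all_mode and not ok:
            return False, performed  # ALL: first failure, no branch
        if not all_mode and ok:
            return True, performed   # ANY: first success, branch
    # No early exit: ALL passed every test, ANY found no success.
    return all_mode, performed


results = [True, True, False, True]
print(vector_branch(results, all_mode=True))
print(vector_branch(results, all_mode=False))
```

With these inputs, ALL mode stops at the third element without branching, and ANY mode branches after testing only the first.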

In Vertical-First Mode, the `ALL` bit still applies, but to the elements
that are executed up to the Hint length, in parallel batches. See
[[sv/setvl]] for the definition of Vertical-First Hint.

Predication in both INT and CR modes may be applied to `sv.bc` and other
SVP64 Branch Conditional operations, exactly as it may be applied to
other SVP64 operations. When `sz` is zero, any masked-out Branch-element
operations are not included in condition testing, exactly like all other
SVP64 operations. This *includes* side-effects such as decrementing of
CTR, which is also skipped on masked-out CR Field elements when `sz`
is zero.

However when `sz` is non-zero, this normally requests insertion of a zero
in place of the input data, when the relevant predicate mask bit is zero.
This would mean that a zero is inserted in place of `CR[BI+32]` for
testing against `BO`, which may not be desirable in all circumstances.
Therefore, an extra field, `SNZ`, is provided which, if set, will insert
a **one** in place of a masked-out element, instead of a zero.
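A hedged sketch of how `sz` and `SNZ` determine which bits are actually tested: the function below takes a vector of CR test bits and a predicate mask (both plain Python lists, which is an assumption made purely for illustration) and returns the sequence of bits that reach the element test. `element_test_bits` is a hypothetical name.

```python
def element_test_bits(cr_bits, mask, sz, snz):
    """Return the bits that are tested, in element order."""
    tested = []
    for bit, enabled in zip(cr_bits, mask):
        if enabled:
            tested.append(bit)   # predicate bit set: test the CR bit
        elif sz:
            tested.append(snz)   # masked out, sz set: substitute SNZ
        # masked out, sz clear: element is skipped, nothing is tested
    return tested


cr_bits = [1, 0, 1, 1]
mask = [1, 0, 1, 0]
print(element_test_bits(cr_bits, mask, sz=0, snz=0))
print(element_test_bits(cr_bits, mask, sz=1, snz=0))
print(element_test_bits(cr_bits, mask, sz=1, snz=1))
```

With `sz=1, snz=0` the masked-out positions test as zero, which makes an ALL-mode "bit is set" test fail at the first masked-out element; with `snz=1` the same test passes.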

(*Note: Both options are provided because it is useful to deliberately
cause the Branch-Conditional Vector testing to fail at a specific point,
controlled by the Predicate mask. This is particularly useful in `VLSET`
mode, which will truncate SVSTATE.VL at the point of the first failed
test.*)

SVP64 RM `MODE` (includes `ELWIDTH` and `ELWIDTH_SRC` bits) for Branch
Conditional:

| 4 | 5 | 6 | 7 | 19 | 20 | 21  | 22 23  | description    |
| - | - | - | - | -- | -- | --- | ------ | -------------- |
|ALL|LRu| / | / | 0  | 0  | /   | SNZ sz | normal mode    |
|ALL|LRu| / |VSb| 0  | 1  | VLI | SNZ sz | VLSET mode     |
|ALL|LRu|CVh| / | 1  | 0  | /   | SNZ sz | CTR mode       |
|ALL|LRu|CVh|VSb| 1  | 1  | VLI | SNZ sz | CTR+VLSET mode |

Fields:

* **sz**: when set and predication is enabled, 4 copies of `SNZ` are put
  in place of the src CR Field when the predicate bit is zero. Otherwise
  the element is ignored or skipped, depending on context.
* **ALL**: when set, all branch conditional tests must pass in order for
  the branch to succeed. When clear, it is the first sequentially
  encountered successful test that causes the branch to succeed.
* **VLI**: VLSET is identical to Data-dependent Fail-First mode.
  In VLSET mode, VL is set equal (truncated) to the first point
  where, assuming Conditions are tested sequentially, the branch succeeds
  *or fails*, depending on whether VSb is set.
  If VLI (Vector Length Inclusive) is clear,
  VL is truncated to *exclude* the current element, otherwise it is
  included. SVSTATE.MVL is not changed: only VL.
* **LRu**: Link Register Update. When set, Link Register will
  only be updated if the Branch Condition succeeds. This avoids
  destruction of LR during loops (particularly Vertical-First
  ones).
* **VSb**: most relevant for Vertical-First VLSET Mode. After testing,
  if VSb is set, VL is truncated if the branch succeeds. If VSb is clear,
  VL is truncated if the branch did **not** take place.
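The interaction of `VSb` and `VLI` can be sketched in isolation as follows. This is only the truncation rule: the ALL/ANY early exit (which can end the loop before a truncation point is reached) and predication are deliberately left out, and the element results are assumed to be given as a list. `vlset_truncate` is a hypothetical name.

```python
def vlset_truncate(el_results, vl, vsb, vli):
    """Return the new VL after sequential testing in VLSET mode."""
    for step, ok in enumerate(el_results[:vl]):
        # VSb set: truncate on success; VSb clear: truncate on failure
        if bool(ok) == bool(vsb):
            # VLI set: include the current element in the new VL
            return step + 1 if vli else step
    return vl  # no truncation point found: VL is unchanged


results = [1, 1, 0, 1]
print(vlset_truncate(results, vl=4, vsb=0, vli=0))
print(vlset_truncate(results, vl=4, vsb=0, vli=1))
```

Here the first failing element is at index 2, so VL becomes 2 when the element is excluded and 3 when it is included.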

CTR mode will subtract VL (or VLHint) from CTR rather than just decrement
CTR by one. Just as v3.0B Branch-Conditional saves at
least one instruction on tight inner loops through auto-decrementation
of CTR, it is likewise possible to save instruction count for
SVP64 loops in both Vertical-First and Horizontal-First Mode.

Note that, interestingly, due to the side-effects of `VLSET` mode
it is actually useful to use Branch Conditional even
to perform no actual branch operation, i.e. to point to the instruction
after the branch.
If VLSET mode was requested with REMAP, VL will have been set to the
length of one of the loop endpoints, as specified by the bit from
the Branch `BI` field.

Also, the unconditional bit `BO[0]` is still relevant when Predication
is applied to the Branch because in `ALL` mode all nonmasked bits have
to be tested. Even when svstep mode or VLSET mode are not used, CTR
may still be decremented by the total number of nonmasked elements.
In short, Vectorised Branch becomes an extremely powerful tool.

`VLSET` mode with Vertical-First is particularly unusual. Vertical-First
is used for explicit looping, where the looping is to terminate if the end
of the Vector, VL, is reached. If however that loop is terminated early
because VL is truncated, VLSET with Vertical-First becomes meaningless.
Therefore, the option to decide whether truncation should occur if the
branch succeeds *or* if the branch condition fails provides the
flexibility required.

`VLSET` mode with Horizontal-First when `VSb` is clear is still
useful, because it can be used to truncate VL to the first predicated
(non-masked-out) element.

Available options to combine:

* `BO[0]` to make an unconditional branch would seem irrelevant if
  it were not for predication and for side-effects.
* `BO[1]` to select whether the CR bit being tested is zero or nonzero
* `R30` and `~R30` and other predicate mask options including CR and
  inverted CR bit testing
* `sz` and `SNZ` to insert either zeros or ones in place of masked-out
  predicate bits
* `ALL` or `ANY` behaviour corresponding to `AND` of all tests and
  `OR` of all tests, respectively.

In addition to the above, it is necessary to select whether, in `svstep`
mode, the Vector CR Field is to be overwritten or not: in some cases it
is useful to know but in others all that is needed is the branch itself.

Pseudocode for Horizontal-First Mode:

```
# AND-reduction (ALL) starts from 1, OR-reduction (ANY) from 0
cond_ok = SVRMmode.ALL
for srcstep in range(VL):
    # select predicate bit or zero/one
    if predicate[srcstep]:
        # get SVP64 extended CR field 0..127
        SVCRf = SVP64EXTRA(BI>>2)
        CRbits = CR{SVCRf}
        testbit = CRbits[BI & 0b11]
        # testbit = CR[BI+32+srcstep*4]
    else if not SVRMmode.sz:
        continue
    else
        testbit = SVRMmode.SNZ
    # actual element test here
    el_cond_ok <- BO[0] | ¬(testbit ^ BO[1])
    # merge in the test
    if SVRMmode.ALL:
        cond_ok &= el_cond_ok
    else
        cond_ok |= el_cond_ok
    # test for VL to be set (and exit)
    if VLSET and VSb = el_cond_ok then
        if SVRMmode.VLI
            SVSTATE.VL = srcstep+1
        else
            SVSTATE.VL = srcstep
        break
    # early exit?
    if SVRMmode.ALL:
        if ~el_cond_ok:
            break
    else
        if el_cond_ok:
            break
    if SVCRf.scalar:
        break
```

Pseudocode for Vertical-First Mode:

```
# get SVP64 extended CR field 0..127
SVCRf = SVP64EXTRA(BI>>2)
CRbits = CR{SVCRf}
# select predicate bit or zero/one
if predicate[srcstep]:
    if BRc = 1 then # CR0 vectorised
        CR{SVCRf+srcstep} = CRbits
    testbit = CRbits[BI & 0b11]
else if not SVRMmode.sz:
    SVSTATE.srcstep = new_srcstep
    exit # no branch testing
else
    testbit = SVRMmode.SNZ
# actual element test here
cond_ok <- BO[0] | ¬(testbit ^ BO[1])
# test for VL to be set (and exit)
if VLSET and cond_ok = VSb then
    if SVRMmode.VLI
        SVSTATE.VL = new_srcstep+1
    else
        SVSTATE.VL = new_srcstep
```

v3.0B branch pseudocode including LRu:

```
if (mode_is_64bit) then M <- 0
else M <- 32
if ¬BO[2] then CTR <- CTR - 1
ctr_ok <- BO[2] | ((CTR[M:63] != 0) ^ BO[3])
cond_ok <- BO[0] | ¬(CR[BI+32] ^ BO[1])
lr_ok <- SVRMmode.LRu
if ctr_ok & cond_ok then
  if AA then NIA <-iea EXTS(BD || 0b00)
  else       NIA <-iea CIA + EXTS(BD || 0b00)
  lr_ok <- 0b1
if LK & lr_ok then LR <-iea CIA + 4
```
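The CTR and condition half of the pseudocode above can be exercised with a short Python model. This is a sketch under stated assumptions: 64-bit mode only (`M = 0`), `BO` passed as a 5-bit integer with `BO[0]` as the most significant bit, the tested CR bit passed in directly, and LR handling omitted entirely. `bc_decision` is a hypothetical name.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def bc_decision(bo: int, cr_bit: int, ctr: int):
    """Return (branch_taken, new_ctr) for a scalar bc in 64-bit mode."""
    # BO[0] is the MSB of the 5-bit field; BO[4] (the hint) is unused
    bo0, bo1, bo2, bo3 = [(bo >> (4 - i)) & 1 for i in range(4)]
    if not bo2:
        ctr = (ctr - 1) & MASK64           # CTR <- CTR - 1
    ctr_ok = bool(bo2) or ((ctr != 0) != bool(bo3))
    cond_ok = bool(bo0) or cr_bit == bo1   # BO[0] | not (CR ^ BO[1])
    return ctr_ok and cond_ok, ctr


# BO=0b10000: decrement CTR, branch if CTR != 0, ignore the CR bit
print(bc_decision(0b10000, cr_bit=0, ctr=2))
print(bc_decision(0b10000, cr_bit=0, ctr=1))
# BO=0b01100: leave CTR alone, branch if the CR bit is set
print(bc_decision(0b01100, cr_bit=1, ctr=5))
```

The first call branches with CTR left at 1, the second falls through with CTR at 0, and the third branches purely on the CR bit.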

# Example Shader code

```
while(a > 2) {
    if(b < 5)
        f();
    else
        g();
    h();
}
```

which compiles to something like:

```
vec<i32> a, b;
// ...
pred loop_pred = a > 2;
while(loop_pred.any()) {
    pred if_pred = loop_pred & (b < 5);
    if(if_pred.any()) {
        f(if_pred);
    }
label1:
    pred else_pred = loop_pred & ~if_pred;
    if(else_pred.any()) {
        g(else_pred);
    }
    h(loop_pred);
}
```

which will end up as:

```
        sv.cmpi CR60.v, a.v, 2  # vector compare a into CR60 vector
        sv.crweird r30, CR60.GT # transfer GT vector to r30
while_loop:
        sv.cmpi CR80.v, b.v, 5  # vector compare b into CR80 vector
        sv.bc/m=r30/~ALL/sz CR80.v.LT skip_f # skip when none
        # only calculate loop_pred & pred_b because needed in f()
        sv.crand CR80.v.SO, CR60.v.GT, CR80.v.LT # if = loop & pred_b
        f(CR80.v.SO)
skip_f:
        # illustrate inversion of pred_b. invert r30, test ALL
        # rather than SOME, but masked-out zero test would FAIL,
        # therefore masked-out instead is tested against 1 not 0
        sv.bc/m=~r30/ALL/SNZ CR80.v.LT skip_g
        # else = loop & ~pred_b, need this because used in g()
        sv.crternari(A&~B) CR80.v.SO, CR60.v.GT, CR80.v.LT
        g(CR80.v.SO)
skip_g:
        # conditionally call h(r30) if any loop pred set
        sv.bclr/m=r30/~ALL/sz BO[1]=1 h()
        sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
```
310 sv.bc/m=r30/~ALL/sz BO[1]=1 while_loop
311 ```