Merge remote-tracking branch 'mesa-public/master' into vulkan
[mesa.git] / src / gallium / drivers / r600 / sb / notes.markdown
1 r600-sb
2 =======
3
4 * * * * *
5
6 Debugging
7 ---------
8
9 ### Environment variables
10
11 - **R600\_DEBUG**
12
13 There are new flags:
14
15 - **sb** - Enable optimization of graphics shaders
16 - **sbcl** - Enable optimization of compute shaders (experimental)
17 - **sbdry** - Dry run, optimize but use source bytecode -
18 useful if you only want to check shader dumps
19 without the risk of lockups and other problems
20 - **sbstat** - Print optimization statistics (only time so far)
21 - **sbdump** - Print IR after some passes.
22
23 ### Regression debugging
24
25 If there are any regressions as compared to the default backend
26 (R600\_SB=0), it's possible to use the following environment variables
27 to find the incorrectly optimized shader that causes the regression.
28
29 - **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some
30 shaders
31 - 0 - disabled (default)
32 - 1 - skip optimization for the shaders in the range
33 [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is,
34 optimize only the shaders that are not in this range
35 - 2 - optimize only the shaders in the range
36 [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END]
37
38 - **R600\_SB\_DSKIP\_START** - start of the range (1-based)
39
40 - **R600\_SB\_DSKIP\_END** - end of the range (1-based)
41
42 Example - optimize only the shaders 5, 6, and 7:
43
44 R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2
45
46 All shaders compiled by the application are numbered starting from 1,
47 the number of shaders used by the application may be obtained by running
48 it with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#"
49 for each compiled shader.
50
51 After figuring out the total number of shaders used by the application,
52 the variables above allow to use bisection to find the shader that is
53 the cause of regression. E.g. if the application uses 100 shaders, we
54 can divide the range [1; 100] and run the application with the
55 optimization enabled only for the first half of the shaders:
56
57 R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app>
58
59 If the regression is reproduced with these parameters, then the failing
60 shader is in the range [1; 50], if it's not reproduced - then it's in
61 the range [51; 100]. Then we can divide the new range again and repeat
62 the testing, until we'll reduce the range to a single failing shader.
63
64 *NOTE: This method relies on the assumption that the application
65 produces the same sequence of the shaders on each run. It's not always
66 true - some applications may produce different sequences of the shaders,
67 in such cases the tools like apitrace may be used to record the trace
68 with the application, then this method may be applied when replaying the
69 trace - also this may be faster and/or more convenient than testing the
70 application itself.*
71
72 * * * * *
73
74 Intermediate Representation
75 ---------------------------
76
77 ### Values
78
79 All kinds of the operands (literal constants, references to kcache
80 constants, references to GPRs, etc) are currently represented by the
81 **value** class (possibly it makes sense to switch to hierarchy of
82 classes derived from **value** instead, to save some memory).
83
84 All values (except some pseudo values like the exec\_mask or predicate
85 register) represent 32bit scalar values - there are no vector values,
86 CF/FETCH instructions use groups of 4 values for src and dst operands.
87
88 ### Nodes
89
90 Shader programs are represented using the tree data structure, some
91 nodes contain a list of subnodes.
92
93 #### Control flow nodes
94
95 Control flow information is represented using four special node types
96 (based on the ideas from [[1]](#references) )
97
98 - **region\_node** - single-entry, single-exit region.
99
100 All loops and if's in the program are enclosed in region nodes.
101 Region nodes have two containers for phi nodes -
102 region\_node::loop\_phi contains the phi expressions to be executed
103 at the region entry, region\_node::phi contains the phi expressions
104 to be executed at the region exit. It's the only type of the node
105 that contains associated phi expressions.
106
107 - **depart\_node** - "depart region \$id after { ... }"
108
109 Depart target region (jump to exit point) after executing contained
110 code.
111
112 - **repeat\_node** - "repeat region \$id after { ... }"
113
114 Repeat target region (jump to entry point) after executing contained
115 code.
116
117 - **if\_node** - "if (cond) { ... }"
118
119 Execute contained code if condition is true. The difference from
120 [[1]](#references) is that we don't have associated phi expressions
121 for the **if\_node**, we enclose **if\_node** in the
122 **region\_node** and store corresponding phi's in the
123 **region\_node**, this allows more uniform handling.
124
125 The target region of depart and repeat nodes is always the region where
126 they are located (possibly in the nested region), there are no arbitrary
127 jumps/goto's - control flow in the program is always structured.
128
129 Typical control flow constructs can be represented as in the following
130 examples:
131
132 GLSL:
133
134 if (cond) {
135 < 1 >
136 } else {
137 < 2 >
138 }
139
140 IR:
141
142 region #0 {
143 depart region #0 after {
144 if (cond) {
145 depart region #0 after {
146 < 1 >
147 }
148 }
149 < 2 >
150 }
151 <region #0 phi nodes >
152 }
153
154 GLSL:
155
156 while (cond) {
157 < 1 >
158 }
159
160 IR:
161
162 region #0 {
163 <region #0 loop_phi nodes>
164 repeat region #0 after {
165 region #1 {
166 depart region #1 after {
167 if (!cond) {
168 depart region #0
169 }
170 }
171 }
172 < 1 >
173 }
174 <region #0 phi nodes>
175 }
176
177 'Break' and 'continue' inside the loops are directly translated to the
178 depart and repeat nodes for the corresponding loop region.
179
180 This may look a bit too complicated, but in fact this allows more simple
181 and uniform handling of the control flow.
182
183 All loop\_phi and phi nodes for some region always have the same number
184 of source operands. The number of source operands for
185 region\_node::loop\_phi nodes is 1 + number of repeat nodes that
186 reference this region as a target. The number of source operands for
187 region\_node::phi nodes is equal to the number of depart nodes that
188 reference this region as a target. All depart/repeat nodes for the
189 region have unique indices equal to the index of source operand for
190 phi/loop\_phi nodes.
191
192 First source operand for region\_node::loop\_phi nodes (src[0]) is an
193 incoming value that enters the region from the outside. Each remaining
194 source operand comes from the corresponding repeat node.
195
196 More complex example:
197
198 GLSL:
199
200 a = 1;
201 while (a < 5) {
202 a = a * 2;
203 if (b == 3) {
204 continue;
205 } else {
206 a = 6;
207 }
208 if (c == 4)
209 break;
210 a = a + 1;
211 }
212
213 IR with SSA form:
214
215 a.1 = 1;
216 region #0 {
217 // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2
218 region#0 loop_phi: a.2 = phi a.1, a.6, a.3
219
220 repeat_1 region #0 after {
221 a.3 = a.2 * 2;
222 cond1 = (b == 3);
223 region #1 {
224 depart_0 region #1 after {
225 if (cond1) {
226 repeat_2 region #0;
227 }
228 }
229 a.4 = 6;
230
231 region #1 phi: a.5 = phi a.4; // src[0] - from depart_0
232 }
233 cond2 = (c == 4);
234 region #2 {
235 depart_0 region #2 after {
236 if (cond2) {
237 depart_0 region #0;
238 }
239 }
240 }
241 a.6 = a.5 + 1;
242 }
243
244 region #0 phi: a.7 = phi a.5 // src[0] from depart_0
245 }
246
247 Phi nodes with single source operand are just copies, they are not
248 really necessary, but this allows to handle all **depart\_node**s in the
249 uniform way.
250
251 #### Instruction nodes
252
253 Instruction nodes represent different kinds of instructions -
254 **alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains
255 the "bc" structure where all fields of the bytecode are stored (the type
256 is **bc\_alu** for **alu\_node**, etc). The operands are represented
257 using the vectors of pointers to **value** class (node::src, node::dst)
258
259 #### SSA-specific nodes
260
261 Phi nodes currently don't have special node class, they are stored as
262 **node**. Destination vector contains a single destination value, source
263 vector contains 1 or more source values.
264
265 Psi nodes [[5], [6]](#references) also don't have a special node class
266 and stored as **node**. Source vector contains 3 values for each source
267 operand - the **value** of predicate, **value** of corresponding
268 PRED\_SEL field, and the source **value** itself.
269
270 ### Indirect addressing
271
272 Special kind of values (VLK\_RELREG) is used to represent indirect
273 operands. These values don't have SSA versions. The representation is
274 mostly based on the [[2]](#references). Indirect operand contains the
275 "offset/address" value (value::rel), (e.g. some SSA version of the AR
276 register value, though after some passes it may be any value - constant,
277 register, etc), also it contains the maydef and mayuse vectors of
278 pointers to **value**s (similar to dst/src vectors in the **node**) to
279 represent the effects of aliasing in the SSA form.
280
281 E.g. if we have the array R5.x ... R8.x and the following instruction :
282
283 MOV R0.x, R[5 + AR].x
284
285 then source indirect operand is represented with the VLK\_RELREG value,
286 value::rel is AR, value::maydef is empty (in fact it always contain the
287 same number of elements as mayuse to simplify the handling, but they are
288 NULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the
289 corresponding SSA versions after ssa\_rename).
290
291 Additional "virtual variables" as in [HSSA [2]](#references) are not
292 used, also there is no special handling for "zero versions". Typical
293 programs in our case are small, indirect addressing is rare, array sizes
294 are limited by max gpr number, so we don't really need to use special
295 tricks to avoid the explosion of value versions. Also this allows more
296 precise liveness computation for array elements without modifications to
297 the algorithms.
298
299 With the following instruction:
300
301 MOV R[5+AR].x, R0.x
302
303 we'll have both maydef and mayuse vectors for dst operand filled with
304 array values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename
305 pass mayuse will contain previous versions, maydef will contain new
306 potentially-defined versions.
307
308 * * * * *
309
310 Passes
311 ------
312
313 - **bc\_parser** - creates the IR from the source bytecode,
314 initializes src and dst value vectors for instruction nodes. Most
315 ALU nodes have one dst operand and the number of source operands is
316 equal to the number of source operands for the ISA instruction.
317 Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst
318 gpr as in the original instruction, other two are pseudo-operands
319 that represent possibly updated predicate and exec\_mask. Predicate
320 values are used in the predicated alu instructions (node::pred),
321 exec\_mask values are used in the if\_nodes (if\_node::cond). Each
322 vector operand in the CF/TEX/VTX instructions is represented with 4
323 values - components of the vector.
324
325 - **ssa\_prepare** - creates phi expressions.
326
327 - **ssa\_rename** - renames the values (assigns versions).
328
329 - **liveness** - liveness computation, sets 'dead' flag for unused
330 nodes and values, optionally computes interference information for
331 the values.
332
333 - **dce\_cleanup** - eliminates 'dead' nodes, also removes some
334 unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP
335 instructions in the source, containers for ALU groups (they were
336 only needed for the ssa\_rename pass)
337
338 - **if\_conversion** - converts control flow with if\_nodes to the
339 data flow in cases where it can improve performance (small alu-only
340 branches). Both branches are executed speculatively and the phi
341 expressions are replaced with conditional moves (CNDxx) to select
342 the final value using the same condition predicate as was used by
343 the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx
344 instruction, CNDxx now uses dst[0] from the same PREDSETxx
345 instruction.
346
347 - **peephole** - peephole optimizations
348
349 - **gvn** - Global Value Numbering [[2]](#references),
350 [[3]](#references)
351
352 - **gcm** - Global Code Motion [[3]](#references). Also performs
353 grouping of the instructions of the same kind (CF/FETCH/ALU).
354
355 - register allocation passes, some ideas are used from
356 [[4]](#references), but implementation is simplified to make it more
357 efficient in terms of the compilation speed (e.g. no recursive
358 recoloring) while achieving good enough results.
359
360 - **ra\_split** - prepares the program to register allocation.
361 Splits live ranges for constrained values by inserting the
362 copies to/from temporary values, so that the live range of the
363 constrained values becomes minimal.
364
365 - **ra\_coalesce** - performs global allocation on registers used
366 in CF/FETCH instructions. It's performed first to make sure they
367 end up in the same GPR. Also tries to allocate all values
368 involved in copies (inserted by the ra\_split pass) to the same
369 register, so that the copies may be eliminated.
370
371 - **ra\_init** - allocates gpr arrays (if indirect addressing is
372 used), and remaining values.
373
374 - **post\_scheduler** - ALU scheduler, handles VLIW packing and
375 performs the final register allocation for local values inside ALU
376 clauses. Eliminates all coalesced copies (if src and dst of the copy
377 are allocated to the same register).
378
379 - **ra\_checker** - optional debugging pass that tries to catch basic
380 errors of the scheduler or regalloc,
381
382 - **bc\_finalize** - propagates the regalloc information from values
383 in node::src and node::dst vectors to the bytecode fields, converts
384 control flow structure (region/depart/repeat) to the target
385 instructions (JUMP/ELSE/POP,
386 LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK).
387
388 - **bc\_builder** - builds final bytecode,
389
390 * * * * *
391
392 References
393 ----------
394
395 [1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl
396 McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf)
397
398 [2] ["Effective Representation of Aliases and Indirect Memory Operations
399 in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark
400 Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf)
401
402 [3] ["Global Code Motion. Global Value Numbering.", Cliff
403 Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf)
404
405 [4] ["Register Allocation for Programs in SSA Form", Sebastian
406 Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532)
407
408 [5] ["An extension to the SSA representation for predicated code",
409 Francois de
410 Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf)
411
412 [6] ["Improvements to the Psi-SSA Representation", F. de
413 Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf)