gallium: remove PIPE_CAP_USER_CONSTANT_BUFFERS
[mesa.git] / src / gallium / drivers / r600 / sb / notes.markdown
1 r600-sb
2 =======
3
4 * * * * *
5
6 Debugging
7 ---------
8
9 ### Environment variables
10
11 - **R600\_DEBUG**
12
13 There are new flags:
14
15 - **nosb** - Disable sb backend for graphics shaders
16 - **sbcl** - Enable optimization of compute shaders (experimental)
17 - **sbdry** - Dry run, optimize but use source bytecode -
18 useful if you only want to check shader dumps
19 without the risk of lockups and other problems
20 - **sbstat** - Print optimization statistics (only time so far)
21 - **sbdump** - Print IR after some passes.
22 - **sbnofallback** - Abort on errors instead of fallback
23 - **sbdisasm** - Use sb disassembler for shader dumps
24 - **sbsafemath** - Disable unsafe math optimizations
25
26 ### Regression debugging
27
28 If there are any regressions as compared to the default backend
29 (R600\_SB=0), it's possible to use the following environment variables
30 to find the incorrectly optimized shader that causes the regression.
31
32 - **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some
33 shaders
34 - 0 - disabled (default)
35 - 1 - skip optimization for the shaders in the range
36 [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is,
37 optimize only the shaders that are not in this range
38 - 2 - optimize only the shaders in the range
39 [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END]
40
41 - **R600\_SB\_DSKIP\_START** - start of the range (1-based)
42
43 - **R600\_SB\_DSKIP\_END** - end of the range (1-based)
44
45 Example - optimize only the shaders 5, 6, and 7:
46
47 R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2
48
49 All shaders compiled by the application are numbered starting from 1,
50 the number of shaders used by the application may be obtained by running
51 it with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#"
52 for each compiled shader.
53
54 After figuring out the total number of shaders used by the application,
55 the variables above allow to use bisection to find the shader that is
56 the cause of regression. E.g. if the application uses 100 shaders, we
57 can divide the range [1; 100] and run the application with the
58 optimization enabled only for the first half of the shaders:
59
60 R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app>
61
62 If the regression is reproduced with these parameters, then the failing
63 shader is in the range [1; 50], if it's not reproduced - then it's in
64 the range [51; 100]. Then we can divide the new range again and repeat
65 the testing, until we'll reduce the range to a single failing shader.
66
67 *NOTE: This method relies on the assumption that the application
68 produces the same sequence of the shaders on each run. It's not always
69 true - some applications may produce different sequences of the shaders,
70 in such cases the tools like apitrace may be used to record the trace
71 with the application, then this method may be applied when replaying the
72 trace - also this may be faster and/or more convenient than testing the
73 application itself.*
74
75 * * * * *
76
77 Intermediate Representation
78 ---------------------------
79
80 ### Values
81
82 All kinds of the operands (literal constants, references to kcache
83 constants, references to GPRs, etc) are currently represented by the
84 **value** class (possibly it makes sense to switch to hierarchy of
85 classes derived from **value** instead, to save some memory).
86
87 All values (except some pseudo values like the exec\_mask or predicate
88 register) represent 32bit scalar values - there are no vector values,
89 CF/FETCH instructions use groups of 4 values for src and dst operands.
90
91 ### Nodes
92
93 Shader programs are represented using the tree data structure, some
94 nodes contain a list of subnodes.
95
96 #### Control flow nodes
97
98 Control flow information is represented using four special node types
99 (based on the ideas from [[1]](#references) )
100
101 - **region\_node** - single-entry, single-exit region.
102
103 All loops and if's in the program are enclosed in region nodes.
104 Region nodes have two containers for phi nodes -
105 region\_node::loop\_phi contains the phi expressions to be executed
106 at the region entry, region\_node::phi contains the phi expressions
107 to be executed at the region exit. It's the only type of the node
108 that contains associated phi expressions.
109
110 - **depart\_node** - "depart region \$id after { ... }"
111
112 Depart target region (jump to exit point) after executing contained
113 code.
114
115 - **repeat\_node** - "repeat region \$id after { ... }"
116
117 Repeat target region (jump to entry point) after executing contained
118 code.
119
120 - **if\_node** - "if (cond) { ... }"
121
122 Execute contained code if condition is true. The difference from
123 [[1]](#references) is that we don't have associated phi expressions
124 for the **if\_node**, we enclose **if\_node** in the
125 **region\_node** and store corresponding phi's in the
126 **region\_node**, this allows more uniform handling.
127
128 The target region of depart and repeat nodes is always the region where
129 they are located (possibly in the nested region), there are no arbitrary
130 jumps/goto's - control flow in the program is always structured.
131
132 Typical control flow constructs can be represented as in the following
133 examples:
134
135 GLSL:
136
137 if (cond) {
138 < 1 >
139 } else {
140 < 2 >
141 }
142
143 IR:
144
145 region #0 {
146 depart region #0 after {
147 if (cond) {
148 depart region #0 after {
149 < 1 >
150 }
151 }
152 < 2 >
153 }
154 <region #0 phi nodes >
155 }
156
157 GLSL:
158
159 while (cond) {
160 < 1 >
161 }
162
163 IR:
164
165 region #0 {
166 <region #0 loop_phi nodes>
167 repeat region #0 after {
168 region #1 {
169 depart region #1 after {
170 if (!cond) {
171 depart region #0
172 }
173 }
174 }
175 < 1 >
176 }
177 <region #0 phi nodes>
178 }
179
180 'Break' and 'continue' inside the loops are directly translated to the
181 depart and repeat nodes for the corresponding loop region.
182
183 This may look a bit too complicated, but in fact this allows more simple
184 and uniform handling of the control flow.
185
186 All loop\_phi and phi nodes for some region always have the same number
187 of source operands. The number of source operands for
188 region\_node::loop\_phi nodes is 1 + number of repeat nodes that
189 reference this region as a target. The number of source operands for
190 region\_node::phi nodes is equal to the number of depart nodes that
191 reference this region as a target. All depart/repeat nodes for the
192 region have unique indices equal to the index of source operand for
193 phi/loop\_phi nodes.
194
195 First source operand for region\_node::loop\_phi nodes (src[0]) is an
196 incoming value that enters the region from the outside. Each remaining
197 source operand comes from the corresponding repeat node.
198
199 More complex example:
200
201 GLSL:
202
203 a = 1;
204 while (a < 5) {
205 a = a * 2;
206 if (b == 3) {
207 continue;
208 } else {
209 a = 6;
210 }
211 if (c == 4)
212 break;
213 a = a + 1;
214 }
215
216 IR with SSA form:
217
218 a.1 = 1;
219 region #0 {
220 // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2
221 region#0 loop_phi: a.2 = phi a.1, a.6, a.3
222
223 repeat_1 region #0 after {
224 a.3 = a.2 * 2;
225 cond1 = (b == 3);
226 region #1 {
227 depart_0 region #1 after {
228 if (cond1) {
229 repeat_2 region #0;
230 }
231 }
232 a.4 = 6;
233
234 region #1 phi: a.5 = phi a.4; // src[0] - from depart_0
235 }
236 cond2 = (c == 4);
237 region #2 {
238 depart_0 region #2 after {
239 if (cond2) {
240 depart_0 region #0;
241 }
242 }
243 }
244 a.6 = a.5 + 1;
245 }
246
247 region #0 phi: a.7 = phi a.5 // src[0] from depart_0
248 }
249
250 Phi nodes with single source operand are just copies, they are not
251 really necessary, but this allows to handle all **depart\_node**s in the
252 uniform way.
253
254 #### Instruction nodes
255
256 Instruction nodes represent different kinds of instructions -
257 **alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains
258 the "bc" structure where all fields of the bytecode are stored (the type
259 is **bc\_alu** for **alu\_node**, etc). The operands are represented
260 using the vectors of pointers to **value** class (node::src, node::dst)
261
262 #### SSA-specific nodes
263
264 Phi nodes currently don't have special node class, they are stored as
265 **node**. Destination vector contains a single destination value, source
266 vector contains 1 or more source values.
267
268 Psi nodes [[5], [6]](#references) also don't have a special node class
269 and stored as **node**. Source vector contains 3 values for each source
270 operand - the **value** of predicate, **value** of corresponding
271 PRED\_SEL field, and the source **value** itself.
272
273 ### Indirect addressing
274
275 Special kind of values (VLK\_RELREG) is used to represent indirect
276 operands. These values don't have SSA versions. The representation is
277 mostly based on the [[2]](#references). Indirect operand contains the
278 "offset/address" value (value::rel), (e.g. some SSA version of the AR
279 register value, though after some passes it may be any value - constant,
280 register, etc), also it contains the maydef and mayuse vectors of
281 pointers to **value**s (similar to dst/src vectors in the **node**) to
282 represent the effects of aliasing in the SSA form.
283
284 E.g. if we have the array R5.x ... R8.x and the following instruction :
285
286 MOV R0.x, R[5 + AR].x
287
288 then source indirect operand is represented with the VLK\_RELREG value,
289 value::rel is AR, value::maydef is empty (in fact it always contain the
290 same number of elements as mayuse to simplify the handling, but they are
291 NULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the
292 corresponding SSA versions after ssa\_rename).
293
294 Additional "virtual variables" as in [HSSA [2]](#references) are not
295 used, also there is no special handling for "zero versions". Typical
296 programs in our case are small, indirect addressing is rare, array sizes
297 are limited by max gpr number, so we don't really need to use special
298 tricks to avoid the explosion of value versions. Also this allows more
299 precise liveness computation for array elements without modifications to
300 the algorithms.
301
302 With the following instruction:
303
304 MOV R[5+AR].x, R0.x
305
306 we'll have both maydef and mayuse vectors for dst operand filled with
307 array values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename
308 pass mayuse will contain previous versions, maydef will contain new
309 potentially-defined versions.
310
311 * * * * *
312
313 Passes
314 ------
315
316 - **bc\_parser** - creates the IR from the source bytecode,
317 initializes src and dst value vectors for instruction nodes. Most
318 ALU nodes have one dst operand and the number of source operands is
319 equal to the number of source operands for the ISA instruction.
320 Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst
321 gpr as in the original instruction, other two are pseudo-operands
322 that represent possibly updated predicate and exec\_mask. Predicate
323 values are used in the predicated alu instructions (node::pred),
324 exec\_mask values are used in the if\_nodes (if\_node::cond). Each
325 vector operand in the CF/TEX/VTX instructions is represented with 4
326 values - components of the vector.
327
328 - **ssa\_prepare** - creates phi expressions.
329
330 - **ssa\_rename** - renames the values (assigns versions).
331
332 - **liveness** - liveness computation, sets 'dead' flag for unused
333 nodes and values, optionally computes interference information for
334 the values.
335
336 - **dce\_cleanup** - eliminates 'dead' nodes, also removes some
337 unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP
338 instructions in the source, containers for ALU groups (they were
339 only needed for the ssa\_rename pass)
340
341 - **if\_conversion** - converts control flow with if\_nodes to the
342 data flow in cases where it can improve performance (small alu-only
343 branches). Both branches are executed speculatively and the phi
344 expressions are replaced with conditional moves (CNDxx) to select
345 the final value using the same condition predicate as was used by
346 the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx
347 instruction, CNDxx now uses dst[0] from the same PREDSETxx
348 instruction.
349
350 - **peephole** - peephole optimizations
351
352 - **gvn** - Global Value Numbering [[2]](#references),
353 [[3]](#references)
354
355 - **gcm** - Global Code Motion [[3]](#references). Also performs
356 grouping of the instructions of the same kind (CF/FETCH/ALU).
357
358 - register allocation passes, some ideas are used from
359 [[4]](#references), but implementation is simplified to make it more
360 efficient in terms of the compilation speed (e.g. no recursive
361 recoloring) while achieving good enough results.
362
363 - **ra\_split** - prepares the program to register allocation.
364 Splits live ranges for constrained values by inserting the
365 copies to/from temporary values, so that the live range of the
366 constrained values becomes minimal.
367
368 - **ra\_coalesce** - performs global allocation on registers used
369 in CF/FETCH instructions. It's performed first to make sure they
370 end up in the same GPR. Also tries to allocate all values
371 involved in copies (inserted by the ra\_split pass) to the same
372 register, so that the copies may be eliminated.
373
374 - **ra\_init** - allocates gpr arrays (if indirect addressing is
375 used), and remaining values.
376
377 - **post\_scheduler** - ALU scheduler, handles VLIW packing and
378 performs the final register allocation for local values inside ALU
379 clauses. Eliminates all coalesced copies (if src and dst of the copy
380 are allocated to the same register).
381
382 - **ra\_checker** - optional debugging pass that tries to catch basic
383 errors of the scheduler or regalloc,
384
385 - **bc\_finalize** - propagates the regalloc information from values
386 in node::src and node::dst vectors to the bytecode fields, converts
387 control flow structure (region/depart/repeat) to the target
388 instructions (JUMP/ELSE/POP,
389 LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK).
390
391 - **bc\_builder** - builds final bytecode,
392
393 * * * * *
394
395 References
396 ----------
397
398 [1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl
399 McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf)
400
401 [2] ["Effective Representation of Aliases and Indirect Memory Operations
402 in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark
403 Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf)
404
405 [3] ["Global Code Motion. Global Value Numbering.", Cliff
406 Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf)
407
408 [4] ["Register Allocation for Programs in SSA Form", Sebastian
409 Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532)
410
411 [5] ["An extension to the SSA representation for predicated code",
412 Francois de
413 Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf)
414
415 [6] ["Improvements to the Psi-SSA Representation", F. de
416 Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf)