9 ### Environment variables
15 - **sb** - Enable optimization of graphics shaders
16 - **sbcl** - Enable optimization of compute shaders (experimental)
17 - **sbdry** - Dry run, optimize but use source bytecode -
18 useful if you only want to check shader dumps
19 without the risk of lockups and other problems
20 - **sbstat** - Print optimization statistics (only time so far)
21 - **sbdump** - Print IR after some passes.
23 ### Regression debugging
25 If there are any regressions as compared to the default backend
26 (R600\_SB=0), it's possible to use the following environment variables
27 to find the incorrectly optimized shader that causes the regression.
29 - **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some
31 - 0 - disabled (default)
32 - 1 - skip optimization for the shaders in the range
33 [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is,
34 optimize only the shaders that are not in this range
35 - 2 - optimize only the shaders in the range
36 [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END]
38 - **R600\_SB\_DSKIP\_START** - start of the range (1-based)
40 - **R600\_SB\_DSKIP\_END** - end of the range (1-based)
42 Example - optimize only the shaders 5, 6, and 7:
44 R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2
46 All shaders compiled by the application are numbered starting from 1,
47 the number of shaders used by the application may be obtained by running
48 it with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#"
49 for each compiled shader.
51 After figuring out the total number of shaders used by the application,
52 the variables above allow to use bisection to find the shader that is
53 the cause of regression. E.g. if the application uses 100 shaders, we
54 can divide the range [1; 100] and run the application with the
55 optimization enabled only for the first half of the shaders:
57 R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app>
59 If the regression is reproduced with these parameters, then the failing
60 shader is in the range [1; 50], if it's not reproduced - then it's in
61 the range [51; 100]. Then we can divide the new range again and repeat
62 the testing, until we'll reduce the range to a single failing shader.
64 *NOTE: This method relies on the assumption that the application
65 produces the same sequence of the shaders on each run. It's not always
66 true - some applications may produce different sequences of the shaders,
67 in such cases the tools like apitrace may be used to record the trace
68 with the application, then this method may be applied when replaying the
69 trace - also this may be faster and/or more convenient than testing the
74 Intermediate Representation
75 ---------------------------
79 All kinds of the operands (literal constants, references to kcache
80 constants, references to GPRs, etc) are currently represented by the
81 **value** class (possibly it makes sense to switch to hierarchy of
82 classes derived from **value** instead, to save some memory).
84 All values (except some pseudo values like the exec\_mask or predicate
85 register) represent 32bit scalar values - there are no vector values,
86 CF/FETCH instructions use groups of 4 values for src and dst operands.
90 Shader programs are represented using the tree data structure, some
91 nodes contain a list of subnodes.
93 #### Control flow nodes
95 Control flow information is represented using four special node types
96 (based on the ideas from [[1]](#references) )
98 - **region\_node** - single-entry, single-exit region.
100 All loops and if's in the program are enclosed in region nodes.
101 Region nodes have two containers for phi nodes -
102 region\_node::loop\_phi contains the phi expressions to be executed
103 at the region entry, region\_node::phi contains the phi expressions
104 to be executed at the region exit. It's the only type of the node
105 that contains associated phi expressions.
107 - **depart\_node** - "depart region \$id after { ... }"
109 Depart target region (jump to exit point) after executing contained
112 - **repeat\_node** - "repeat region \$id after { ... }"
114 Repeat target region (jump to entry point) after executing contained
117 - **if\_node** - "if (cond) { ... }"
119 Execute contained code if condition is true. The difference from
120 [[1]](#references) is that we don't have associated phi expressions
121 for the **if\_node**, we enclose **if\_node** in the
122 **region\_node** and store corresponding phi's in the
123 **region\_node**, this allows more uniform handling.
125 The target region of depart and repeat nodes is always the region where
126 they are located (possibly in the nested region), there are no arbitrary
127 jumps/goto's - control flow in the program is always structured.
129 Typical control flow constructs can be represented as in the following
143 depart region #0 after {
145 depart region #0 after {
151 <region #0 phi nodes >
163 <region #0 loop_phi nodes>
164 repeat region #0 after {
166 depart region #1 after {
174 <region #0 phi nodes>
177 'Break' and 'continue' inside the loops are directly translated to the
178 depart and repeat nodes for the corresponding loop region.
180 This may look a bit too complicated, but in fact this allows more simple
181 and uniform handling of the control flow.
183 All loop\_phi and phi nodes for some region always have the same number
184 of source operands. The number of source operands for
185 region\_node::loop\_phi nodes is 1 + number of repeat nodes that
186 reference this region as a target. The number of source operands for
187 region\_node::phi nodes is equal to the number of depart nodes that
188 reference this region as a target. All depart/repeat nodes for the
189 region have unique indices equal to the index of source operand for
192 First source operand for region\_node::loop\_phi nodes (src[0]) is an
193 incoming value that enters the region from the outside. Each remaining
194 source operand comes from the corresponding repeat node.
196 More complex example:
217 // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2
218 region#0 loop_phi: a.2 = phi a.1, a.6, a.3
220 repeat_1 region #0 after {
224 depart_0 region #1 after {
231 region #1 phi: a.5 = phi a.4; // src[0] - from depart_0
235 depart_0 region #2 after {
244 region #0 phi: a.7 = phi a.5 // src[0] from depart_0
247 Phi nodes with single source operand are just copies, they are not
248 really necessary, but this allows to handle all **depart\_node**s in the
251 #### Instruction nodes
253 Instruction nodes represent different kinds of instructions -
254 **alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains
255 the "bc" structure where all fields of the bytecode are stored (the type
256 is **bc\_alu** for **alu\_node**, etc). The operands are represented
257 using the vectors of pointers to **value** class (node::src, node::dst)
259 #### SSA-specific nodes
261 Phi nodes currently don't have special node class, they are stored as
262 **node**. Destination vector contains a single destination value, source
263 vector contains 1 or more source values.
265 Psi nodes [[5], [6]](#references) also don't have a special node class
266 and stored as **node**. Source vector contains 3 values for each source
267 operand - the **value** of predicate, **value** of corresponding
268 PRED\_SEL field, and the source **value** itself.
270 ### Indirect addressing
272 Special kind of values (VLK\_RELREG) is used to represent indirect
273 operands. These values don't have SSA versions. The representation is
274 mostly based on the [[2]](#references). Indirect operand contains the
275 "offset/address" value (value::rel), (e.g. some SSA version of the AR
276 register value, though after some passes it may be any value - constant,
277 register, etc), also it contains the maydef and mayuse vectors of
278 pointers to **value**s (similar to dst/src vectors in the **node**) to
279 represent the effects of aliasing in the SSA form.
281 E.g. if we have the array R5.x ... R8.x and the following instruction :
283 MOV R0.x, R[5 + AR].x
285 then source indirect operand is represented with the VLK\_RELREG value,
286 value::rel is AR, value::maydef is empty (in fact it always contain the
287 same number of elements as mayuse to simplify the handling, but they are
288 NULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the
289 corresponding SSA versions after ssa\_rename).
291 Additional "virtual variables" as in [HSSA [2]](#references) are not
292 used, also there is no special handling for "zero versions". Typical
293 programs in our case are small, indirect addressing is rare, array sizes
294 are limited by max gpr number, so we don't really need to use special
295 tricks to avoid the explosion of value versions. Also this allows more
296 precise liveness computation for array elements without modifications to
299 With the following instruction:
303 we'll have both maydef and mayuse vectors for dst operand filled with
304 array values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename
305 pass mayuse will contain previous versions, maydef will contain new
306 potentially-defined versions.
313 - **bc\_parser** - creates the IR from the source bytecode,
314 initializes src and dst value vectors for instruction nodes. Most
315 ALU nodes have one dst operand and the number of source operands is
316 equal to the number of source operands for the ISA instruction.
317 Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst
318 gpr as in the original instruction, other two are pseudo-operands
319 that represent possibly updated predicate and exec\_mask. Predicate
320 values are used in the predicated alu instructions (node::pred),
321 exec\_mask values are used in the if\_nodes (if\_node::cond). Each
322 vector operand in the CF/TEX/VTX instructions is represented with 4
323 values - components of the vector.
325 - **ssa\_prepare** - creates phi expressions.
327 - **ssa\_rename** - renames the values (assigns versions).
329 - **liveness** - liveness computation, sets 'dead' flag for unused
330 nodes and values, optionally computes interference information for
333 - **dce\_cleanup** - eliminates 'dead' nodes, also removes some
334 unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP
335 instructions in the source, containers for ALU groups (they were
336 only needed for the ssa\_rename pass)
338 - **if\_conversion** - converts control flow with if\_nodes to the
339 data flow in cases where it can improve performance (small alu-only
340 branches). Both branches are executed speculatively and the phi
341 expressions are replaced with conditional moves (CNDxx) to select
342 the final value using the same condition predicate as was used by
343 the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx
344 instruction, CNDxx now uses dst[0] from the same PREDSETxx
347 - **peephole** - peephole optimizations
349 - **gvn** - Global Value Numbering [[2]](#references),
352 - **gcm** - Global Code Motion [[3]](#references). Also performs
353 grouping of the instructions of the same kind (CF/FETCH/ALU).
355 - register allocation passes, some ideas are used from
356 [[4]](#references), but implementation is simplified to make it more
357 efficient in terms of the compilation speed (e.g. no recursive
358 recoloring) while achieving good enough results.
360 - **ra\_split** - prepares the program to register allocation.
361 Splits live ranges for constrained values by inserting the
362 copies to/from temporary values, so that the live range of the
363 constrained values becomes minimal.
365 - **ra\_coalesce** - performs global allocation on registers used
366 in CF/FETCH instructions. It's performed first to make sure they
367 end up in the same GPR. Also tries to allocate all values
368 involved in copies (inserted by the ra\_split pass) to the same
369 register, so that the copies may be eliminated.
371 - **ra\_init** - allocates gpr arrays (if indirect addressing is
372 used), and remaining values.
374 - **post\_scheduler** - ALU scheduler, handles VLIW packing and
375 performs the final register allocation for local values inside ALU
376 clauses. Eliminates all coalesced copies (if src and dst of the copy
377 are allocated to the same register).
379 - **ra\_checker** - optional debugging pass that tries to catch basic
380 errors of the scheduler or regalloc,
382 - **bc\_finalize** - propagates the regalloc information from values
383 in node::src and node::dst vectors to the bytecode fields, converts
384 control flow structure (region/depart/repeat) to the target
385 instructions (JUMP/ELSE/POP,
386 LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK).
388 - **bc\_builder** - builds final bytecode,
395 [1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl
396 McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf)
398 [2] ["Effective Representation of Aliases and Indirect Memory Operations
399 in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark
400 Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf)
402 [3] ["Global Code Motion. Global Value Numbering.", Cliff
403 Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf)
405 [4] ["Register Allocation for Programs in SSA Form", Sebastian
406 Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532)
408 [5] ["An extension to the SSA representation for predicated code",
410 Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf)
412 [6] ["Improvements to the Psi-SSA Representation", F. de
413 Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf)