9 ### Environment variables
15 - **nosb** - Disable sb backend for graphics shaders
16 - **sbcl** - Enable optimization of compute shaders (experimental)
17 - **sbdry** - Dry run, optimize but use source bytecode -
18 useful if you only want to check shader dumps
19 without the risk of lockups and other problems
20 - **sbstat** - Print optimization statistics (only time so far)
21 - **sbdump** - Print IR after some passes.
22 - **sbnofallback** - Abort on errors instead of fallback
23 - **sbdisasm** - Use sb disassembler for shader dumps
24 - **sbsafemath** - Disable unsafe math optimizations
26 ### Regression debugging
28 If there are any regressions as compared to the default backend
29 (R600\_SB=0), it's possible to use the following environment variables
30 to find the incorrectly optimized shader that causes the regression.
32 - **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some
34 - 0 - disabled (default)
35 - 1 - skip optimization for the shaders in the range
36 [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is,
37 optimize only the shaders that are not in this range
38 - 2 - optimize only the shaders in the range
39 [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END]
41 - **R600\_SB\_DSKIP\_START** - start of the range (1-based)
43 - **R600\_SB\_DSKIP\_END** - end of the range (1-based)
45 Example - optimize only the shaders 5, 6, and 7:
47 R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2
49 All shaders compiled by the application are numbered starting from 1,
50 the number of shaders used by the application may be obtained by running
51 it with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#"
52 for each compiled shader.
54 After figuring out the total number of shaders used by the application,
55 the variables above allow to use bisection to find the shader that is
56 the cause of regression. E.g. if the application uses 100 shaders, we
57 can divide the range [1; 100] and run the application with the
58 optimization enabled only for the first half of the shaders:
60 R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app>
62 If the regression is reproduced with these parameters, then the failing
63 shader is in the range [1; 50], if it's not reproduced - then it's in
64 the range [51; 100]. Then we can divide the new range again and repeat
65 the testing, until we'll reduce the range to a single failing shader.
67 *NOTE: This method relies on the assumption that the application
68 produces the same sequence of the shaders on each run. It's not always
69 true - some applications may produce different sequences of the shaders,
70 in such cases the tools like apitrace may be used to record the trace
71 with the application, then this method may be applied when replaying the
72 trace - also this may be faster and/or more convenient than testing the
77 Intermediate Representation
78 ---------------------------
82 All kinds of the operands (literal constants, references to kcache
83 constants, references to GPRs, etc) are currently represented by the
84 **value** class (possibly it makes sense to switch to hierarchy of
85 classes derived from **value** instead, to save some memory).
87 All values (except some pseudo values like the exec\_mask or predicate
88 register) represent 32bit scalar values - there are no vector values,
89 CF/FETCH instructions use groups of 4 values for src and dst operands.
93 Shader programs are represented using the tree data structure, some
94 nodes contain a list of subnodes.
96 #### Control flow nodes
98 Control flow information is represented using four special node types
99 (based on the ideas from [[1]](#references) )
101 - **region\_node** - single-entry, single-exit region.
103 All loops and if's in the program are enclosed in region nodes.
104 Region nodes have two containers for phi nodes -
105 region\_node::loop\_phi contains the phi expressions to be executed
106 at the region entry, region\_node::phi contains the phi expressions
107 to be executed at the region exit. It's the only type of the node
108 that contains associated phi expressions.
110 - **depart\_node** - "depart region \$id after { ... }"
112 Depart target region (jump to exit point) after executing contained
115 - **repeat\_node** - "repeat region \$id after { ... }"
117 Repeat target region (jump to entry point) after executing contained
120 - **if\_node** - "if (cond) { ... }"
122 Execute contained code if condition is true. The difference from
123 [[1]](#references) is that we don't have associated phi expressions
124 for the **if\_node**, we enclose **if\_node** in the
125 **region\_node** and store corresponding phi's in the
126 **region\_node**, this allows more uniform handling.
128 The target region of depart and repeat nodes is always the region where
129 they are located (possibly in the nested region), there are no arbitrary
130 jumps/goto's - control flow in the program is always structured.
132 Typical control flow constructs can be represented as in the following
146 depart region #0 after {
148 depart region #0 after {
154 <region #0 phi nodes >
166 <region #0 loop_phi nodes>
167 repeat region #0 after {
169 depart region #1 after {
177 <region #0 phi nodes>
180 'Break' and 'continue' inside the loops are directly translated to the
181 depart and repeat nodes for the corresponding loop region.
183 This may look a bit too complicated, but in fact this allows more simple
184 and uniform handling of the control flow.
186 All loop\_phi and phi nodes for some region always have the same number
187 of source operands. The number of source operands for
188 region\_node::loop\_phi nodes is 1 + number of repeat nodes that
189 reference this region as a target. The number of source operands for
190 region\_node::phi nodes is equal to the number of depart nodes that
191 reference this region as a target. All depart/repeat nodes for the
192 region have unique indices equal to the index of source operand for
195 First source operand for region\_node::loop\_phi nodes (src[0]) is an
196 incoming value that enters the region from the outside. Each remaining
197 source operand comes from the corresponding repeat node.
199 More complex example:
220 // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2
221 region#0 loop_phi: a.2 = phi a.1, a.6, a.3
223 repeat_1 region #0 after {
227 depart_0 region #1 after {
234 region #1 phi: a.5 = phi a.4; // src[0] - from depart_0
238 depart_0 region #2 after {
247 region #0 phi: a.7 = phi a.5 // src[0] from depart_0
250 Phi nodes with single source operand are just copies, they are not
251 really necessary, but this allows to handle all **depart\_node**s in the
254 #### Instruction nodes
256 Instruction nodes represent different kinds of instructions -
257 **alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains
258 the "bc" structure where all fields of the bytecode are stored (the type
259 is **bc\_alu** for **alu\_node**, etc). The operands are represented
260 using the vectors of pointers to **value** class (node::src, node::dst)
262 #### SSA-specific nodes
264 Phi nodes currently don't have special node class, they are stored as
265 **node**. Destination vector contains a single destination value, source
266 vector contains 1 or more source values.
268 Psi nodes [[5], [6]](#references) also don't have a special node class
269 and stored as **node**. Source vector contains 3 values for each source
270 operand - the **value** of predicate, **value** of corresponding
271 PRED\_SEL field, and the source **value** itself.
273 ### Indirect addressing
275 Special kind of values (VLK\_RELREG) is used to represent indirect
276 operands. These values don't have SSA versions. The representation is
277 mostly based on the [[2]](#references). Indirect operand contains the
278 "offset/address" value (value::rel), (e.g. some SSA version of the AR
279 register value, though after some passes it may be any value - constant,
280 register, etc), also it contains the maydef and mayuse vectors of
281 pointers to **value**s (similar to dst/src vectors in the **node**) to
282 represent the effects of aliasing in the SSA form.
284 E.g. if we have the array R5.x ... R8.x and the following instruction :
286 MOV R0.x, R[5 + AR].x
288 then source indirect operand is represented with the VLK\_RELREG value,
289 value::rel is AR, value::maydef is empty (in fact it always contain the
290 same number of elements as mayuse to simplify the handling, but they are
291 NULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the
292 corresponding SSA versions after ssa\_rename).
294 Additional "virtual variables" as in [HSSA [2]](#references) are not
295 used, also there is no special handling for "zero versions". Typical
296 programs in our case are small, indirect addressing is rare, array sizes
297 are limited by max gpr number, so we don't really need to use special
298 tricks to avoid the explosion of value versions. Also this allows more
299 precise liveness computation for array elements without modifications to
302 With the following instruction:
306 we'll have both maydef and mayuse vectors for dst operand filled with
307 array values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename
308 pass mayuse will contain previous versions, maydef will contain new
309 potentially-defined versions.
316 - **bc\_parser** - creates the IR from the source bytecode,
317 initializes src and dst value vectors for instruction nodes. Most
318 ALU nodes have one dst operand and the number of source operands is
319 equal to the number of source operands for the ISA instruction.
320 Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst
321 gpr as in the original instruction, other two are pseudo-operands
322 that represent possibly updated predicate and exec\_mask. Predicate
323 values are used in the predicated alu instructions (node::pred),
324 exec\_mask values are used in the if\_nodes (if\_node::cond). Each
325 vector operand in the CF/TEX/VTX instructions is represented with 4
326 values - components of the vector.
328 - **ssa\_prepare** - creates phi expressions.
330 - **ssa\_rename** - renames the values (assigns versions).
332 - **liveness** - liveness computation, sets 'dead' flag for unused
333 nodes and values, optionally computes interference information for
336 - **dce\_cleanup** - eliminates 'dead' nodes, also removes some
337 unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP
338 instructions in the source, containers for ALU groups (they were
339 only needed for the ssa\_rename pass)
341 - **if\_conversion** - converts control flow with if\_nodes to the
342 data flow in cases where it can improve performance (small alu-only
343 branches). Both branches are executed speculatively and the phi
344 expressions are replaced with conditional moves (CNDxx) to select
345 the final value using the same condition predicate as was used by
346 the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx
347 instruction, CNDxx now uses dst[0] from the same PREDSETxx
350 - **peephole** - peephole optimizations
352 - **gvn** - Global Value Numbering [[2]](#references),
355 - **gcm** - Global Code Motion [[3]](#references). Also performs
356 grouping of the instructions of the same kind (CF/FETCH/ALU).
358 - register allocation passes, some ideas are used from
359 [[4]](#references), but implementation is simplified to make it more
360 efficient in terms of the compilation speed (e.g. no recursive
361 recoloring) while achieving good enough results.
363 - **ra\_split** - prepares the program to register allocation.
364 Splits live ranges for constrained values by inserting the
365 copies to/from temporary values, so that the live range of the
366 constrained values becomes minimal.
368 - **ra\_coalesce** - performs global allocation on registers used
369 in CF/FETCH instructions. It's performed first to make sure they
370 end up in the same GPR. Also tries to allocate all values
371 involved in copies (inserted by the ra\_split pass) to the same
372 register, so that the copies may be eliminated.
374 - **ra\_init** - allocates gpr arrays (if indirect addressing is
375 used), and remaining values.
377 - **post\_scheduler** - ALU scheduler, handles VLIW packing and
378 performs the final register allocation for local values inside ALU
379 clauses. Eliminates all coalesced copies (if src and dst of the copy
380 are allocated to the same register).
382 - **ra\_checker** - optional debugging pass that tries to catch basic
383 errors of the scheduler or regalloc,
385 - **bc\_finalize** - propagates the regalloc information from values
386 in node::src and node::dst vectors to the bytecode fields, converts
387 control flow structure (region/depart/repeat) to the target
388 instructions (JUMP/ELSE/POP,
389 LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK).
391 - **bc\_builder** - builds final bytecode,
398 [1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl
399 McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf)
401 [2] ["Effective Representation of Aliases and Indirect Memory Operations
402 in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark
403 Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf)
405 [3] ["Global Code Motion. Global Value Numbering.", Cliff
406 Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf)
408 [4] ["Register Allocation for Programs in SSA Form", Sebastian
409 Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532)
411 [5] ["An extension to the SSA representation for predicated code",
413 Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf)
415 [6] ["Improvements to the Psi-SSA Representation", F. de
416 Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf)