src/gallium/drivers/r600/sb/notes.markdown

   1 r600-sb
   2 =======
   3
   4 * * * * *
   5
   6 Debugging
   7 ---------
   8
   9 ### Environment variables
  10
  11 -   **R600\_DEBUG**
  12
  13     There are new flags:
  14
  15     -   **sb** - Enable optimization of graphics shaders
  16     -   **sbcl** - Enable optimization of compute shaders (experimental)
  17     -   **sbdry** - Dry run, optimize but use source bytecode -
  18         useful if you only want to check shader dumps
  19         without the risk of lockups and other problems
  20     -   **sbstat** - Print optimization statistics (only time so far)
  21     -   **sbdump** - Print IR after some passes.
  22
  23 ### Regression debugging
  24
  25 If there are any regressions as compared to the default backend
  26 (R600\_SB=0), it's possible to use the following environment variables
  27 to find the incorrectly optimized shader that causes the regression.
  28
  29 -   **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some
  30     shaders
  31     -   0 - disabled (default)
  32     -   1 - skip optimization for the shaders in the range
  33         [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is,
  34         optimize only the shaders that are not in this range
  35     -   2 - optimize only the shaders in the range
  36         [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END]
  37
  38 -   **R600\_SB\_DSKIP\_START** - start of the range (1-based)
  39
  40 -   **R600\_SB\_DSKIP\_END** - end of the range (1-based)
  41
  42 Example - optimize only the shaders 5, 6, and 7:
  43
  44     R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2
  45
  46 All shaders compiled by the application are numbered starting from 1,
  47 the number of shaders used by the application may be obtained by running
  48 it with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#"
  49 for each compiled shader.
  50
  51 After figuring out the total number of shaders used by the application,
  52 the variables above allow to use bisection to find the shader that is
  53 the cause of regression. E.g. if the application uses 100 shaders, we
  54 can divide the range [1; 100] and run the application with the
  55 optimization enabled only for the first half of the shaders:
  56
  57     R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app>
  58
  59 If the regression is reproduced with these parameters, then the failing
  60 shader is in the range [1; 50], if it's not reproduced - then it's in
  61 the range [51; 100]. Then we can divide the new range again and repeat
  62 the testing, until we'll reduce the range to a single failing shader.
  63
  64 *NOTE: This method relies on the assumption that the application
  65 produces the same sequence of the shaders on each run. It's not always
  66 true - some applications may produce different sequences of the shaders,
  67 in such cases the tools like apitrace may be used to record the trace
  68 with the application, then this method may be applied when replaying the
  69 trace - also this may be faster and/or more convenient than testing the
  70 application itself.*
  71
  72 * * * * *
  73
  74 Intermediate Representation
  75 ---------------------------
  76
  77 ### Values
  78
  79 All kinds of the operands (literal constants, references to kcache
  80 constants, references to GPRs, etc) are currently represented by the
  81 **value** class (possibly it makes sense to switch to hierarchy of
  82 classes derived from **value** instead, to save some memory).
  83
  84 All values (except some pseudo values like the exec\_mask or predicate
  85 register) represent 32bit scalar values - there are no vector values,
  86 CF/FETCH instructions use groups of 4 values for src and dst operands.
  87
  88 ### Nodes
  89
  90 Shader programs are represented using the tree data structure, some
  91 nodes contain a list of subnodes.
  92
  93 #### Control flow nodes
  94
  95 Control flow information is represented using four special node types
  96 (based on the ideas from [[1]](#references) )
  97
  98 -   **region\_node** - single-entry, single-exit region.
  99
 100     All loops and if's in the program are enclosed in region nodes.
 101     Region nodes have two containers for phi nodes -
 102     region\_node::loop\_phi contains the phi expressions to be executed
 103     at the region entry, region\_node::phi contains the phi expressions
 104     to be executed at the region exit. It's the only type of the node
 105     that contains associated phi expressions.
 106
 107 -   **depart\_node** - "depart region \$id after { ... }"
 108
 109     Depart target region (jump to exit point) after executing contained
 110     code.
 111
 112 -   **repeat\_node** - "repeat region \$id after { ... }"
 113
 114     Repeat target region (jump to entry point) after executing contained
 115     code.
 116
 117 -   **if\_node** - "if (cond) { ... }"
 118
 119     Execute contained code if condition is true. The difference from
 120     [[1]](#references) is that we don't have associated phi expressions
 121     for the **if\_node**, we enclose **if\_node** in the
 122     **region\_node** and store corresponding phi's in the
 123     **region\_node**, this allows more uniform handling.
 124
 125 The target region of depart and repeat nodes is always the region where
 126 they are located (possibly in the nested region), there are no arbitrary
 127 jumps/goto's - control flow in the program is always structured.
 128
 129 Typical control flow constructs can be represented as in the following
 130 examples:
 131
 132 GLSL:
 133
 134     if (cond) {
 135         < 1 >
 136     } else {
 137         < 2 >
 138     }
 139
 140 IR:
 141
 142     region #0 {
 143         depart region #0 after {
 144             if (cond) {
 145                 depart region #0 after {
 146                     < 1 >
 147                 }
 148             }
 149             < 2 >
 150         }
 151         <region #0 phi nodes >
 152     }
 153
 154 GLSL:
 155
 156     while (cond) {
 157         < 1 >
 158     }
 159
 160 IR:
 161
 162     region #0 {
 163         <region #0 loop_phi nodes>
 164         repeat region #0 after {
 165             region #1 {
 166                 depart region #1 after {
 167                     if (!cond) {
 168                         depart region #0
 169                     }
 170                 }
 171             }
 172             < 1 >
 173         }
 174         <region #0 phi nodes>
 175     }
 176
 177 'Break' and 'continue' inside the loops are directly translated to the
 178 depart and repeat nodes for the corresponding loop region.
 179
 180 This may look a bit too complicated, but in fact this allows more simple
 181 and uniform handling of the control flow.
 182
 183 All loop\_phi and phi nodes for some region always have the same number
 184 of source operands. The number of source operands for
 185 region\_node::loop\_phi nodes is 1 + number of repeat nodes that
 186 reference this region as a target. The number of source operands for
 187 region\_node::phi nodes is equal to the number of depart nodes that
 188 reference this region as a target. All depart/repeat nodes for the
 189 region have unique indices equal to the index of source operand for
 190 phi/loop\_phi nodes.
 191
 192 First source operand for region\_node::loop\_phi nodes (src[0]) is an
 193 incoming value that enters the region from the outside. Each remaining
 194 source operand comes from the corresponding repeat node.
 195
 196 More complex example:
 197
 198 GLSL:
 199
 200     a = 1;
 201     while (a < 5) {
 202         a = a * 2;
 203         if (b == 3) {
 204             continue;
 205         } else {
 206             a = 6;
 207         }
 208         if (c == 4)
 209             break;
 210         a = a + 1;
 211     }
 212
 213 IR with SSA form:
 214
 215     a.1 = 1;
 216     region #0 {
 217         // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2
 218         region#0 loop_phi: a.2 = phi a.1, a.6, a.3
 219
 220         repeat_1 region #0 after {
 221             a.3 = a.2 * 2;
 222             cond1 = (b == 3);
 223             region #1 {
 224                 depart_0 region #1 after {
 225                     if (cond1) {
 226                         repeat_2 region #0;
 227                     }
 228                 }
 229                 a.4 = 6;
 230
 231                 region #1 phi: a.5 = phi a.4; // src[0] - from depart_0
 232             }
 233             cond2 = (c == 4);
 234             region #2 {
 235                 depart_0 region #2 after {
 236                     if (cond2) {
 237                         depart_0 region #0;
 238                     }
 239                 }
 240             }
 241             a.6 = a.5 + 1;
 242         }
 243
 244         region #0 phi: a.7 = phi a.5 // src[0] from depart_0
 245     }
 246
 247 Phi nodes with single source operand are just copies, they are not
 248 really necessary, but this allows to handle all **depart\_node**s in the
 249 uniform way.
 250
 251 #### Instruction nodes
 252
 253 Instruction nodes represent different kinds of instructions -
 254 **alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains
 255 the "bc" structure where all fields of the bytecode are stored (the type
 256 is **bc\_alu** for **alu\_node**, etc). The operands are represented
 257 using the vectors of pointers to **value** class (node::src, node::dst)
 258
 259 #### SSA-specific nodes
 260
 261 Phi nodes currently don't have special node class, they are stored as
 262 **node**. Destination vector contains a single destination value, source
 263 vector contains 1 or more source values.
 264
 265 Psi nodes [[5], [6]](#references) also don't have a special node class
 266 and stored as **node**. Source vector contains 3 values for each source
 267 operand - the **value** of predicate, **value** of corresponding
 268 PRED\_SEL field, and the source **value** itself.
 269
 270 ### Indirect addressing
 271
 272 Special kind of values (VLK\_RELREG) is used to represent indirect
 273 operands. These values don't have SSA versions. The representation is
 274 mostly based on the [[2]](#references). Indirect operand contains the
 275 "offset/address" value (value::rel), (e.g. some SSA version of the AR
 276 register value, though after some passes it may be any value - constant,
 277 register, etc), also it contains the maydef and mayuse vectors of
 278 pointers to **value**s (similar to dst/src vectors in the **node**) to
 279 represent the effects of aliasing in the SSA form.
 280
 281 E.g. if we have the array R5.x ... R8.x and the following instruction :
 282
 283     MOV R0.x, R[5 + AR].x
 284
 285 then source indirect operand is represented with the VLK\_RELREG value,
 286 value::rel is AR, value::maydef is empty (in fact it always contain the
 287 same number of elements as mayuse to simplify the handling, but they are
 288 NULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the
 289 corresponding SSA versions after ssa\_rename).
 290
 291 Additional "virtual variables" as in [HSSA [2]](#references) are not
 292 used, also there is no special handling for "zero versions". Typical
 293 programs in our case are small, indirect addressing is rare, array sizes
 294 are limited by max gpr number, so we don't really need to use special
 295 tricks to avoid the explosion of value versions. Also this allows more
 296 precise liveness computation for array elements without modifications to
 297 the algorithms.
 298
 299 With the following instruction:
 300
 301     MOV R[5+AR].x, R0.x
 302
 303 we'll have both maydef and mayuse vectors for dst operand filled with
 304 array values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename
 305 pass mayuse will contain previous versions, maydef will contain new
 306 potentially-defined versions.
 307
 308 * * * * *
 309
 310 Passes
 311 ------
 312
 313 -   **bc\_parser** - creates the IR from the source bytecode,
 314     initializes src and dst value vectors for instruction nodes. Most
 315     ALU nodes have one dst operand and the number of source operands is
 316     equal to the number of source operands for the ISA instruction.
 317     Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst
 318     gpr as in the original instruction, other two are pseudo-operands
 319     that represent possibly updated predicate and exec\_mask. Predicate
 320     values are used in the predicated alu instructions (node::pred),
 321     exec\_mask values are used in the if\_nodes (if\_node::cond). Each
 322     vector operand in the CF/TEX/VTX instructions is represented with 4
 323     values - components of the vector.
 324
 325 -   **ssa\_prepare** - creates phi expressions.
 326
 327 -   **ssa\_rename** - renames the values (assigns versions).
 328
 329 -   **liveness** - liveness computation, sets 'dead' flag for unused
 330     nodes and values, optionally computes interference information for
 331     the values.
 332
 333 -   **dce\_cleanup** - eliminates 'dead' nodes, also removes some
 334     unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP
 335     instructions in the source, containers for ALU groups (they were
 336     only needed for the ssa\_rename pass)
 337
 338 -   **if\_conversion** - converts control flow with if\_nodes to the
 339     data flow in cases where it can improve performance (small alu-only
 340     branches). Both branches are executed speculatively and the phi
 341     expressions are replaced with conditional moves (CNDxx) to select
 342     the final value using the same condition predicate as was used by
 343     the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx
 344     instruction, CNDxx now uses dst[0] from the same PREDSETxx
 345     instruction.
 346
 347 -   **peephole** - peephole optimizations
 348
 349 -   **gvn** - Global Value Numbering [[2]](#references),
 350     [[3]](#references)
 351
 352 -   **gcm** - Global Code Motion [[3]](#references). Also performs
 353     grouping of the instructions of the same kind (CF/FETCH/ALU).
 354
 355 -   register allocation passes, some ideas are used from
 356     [[4]](#references), but implementation is simplified to make it more
 357     efficient in terms of the compilation speed (e.g. no recursive
 358     recoloring) while achieving good enough results.
 359
 360     -   **ra\_split** - prepares the program to register allocation.
 361         Splits live ranges for constrained values by inserting the
 362         copies to/from temporary values, so that the live range of the
 363         constrained values becomes minimal.
 364
 365     -   **ra\_coalesce** - performs global allocation on registers used
 366         in CF/FETCH instructions. It's performed first to make sure they
 367         end up in the same GPR. Also tries to allocate all values
 368         involved in copies (inserted by the ra\_split pass) to the same
 369         register, so that the copies may be eliminated.
 370
 371     -   **ra\_init** - allocates gpr arrays (if indirect addressing is
 372         used), and remaining values.
 373
 374 -   **post\_scheduler** - ALU scheduler, handles VLIW packing and
 375     performs the final register allocation for local values inside ALU
 376     clauses. Eliminates all coalesced copies (if src and dst of the copy
 377     are allocated to the same register).
 378
 379 -   **ra\_checker** - optional debugging pass that tries to catch basic
 380     errors of the scheduler or regalloc,
 381
 382 -   **bc\_finalize** - propagates the regalloc information from values
 383     in node::src and node::dst vectors to the bytecode fields, converts
 384     control flow structure (region/depart/repeat) to the target
 385     instructions (JUMP/ELSE/POP,
 386     LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK).
 387
 388 -   **bc\_builder** - builds final bytecode,
 389
 390 * * * * *
 391
 392 References
 393 ----------
 394
 395 [1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl
 396 McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf)
 397
 398 [2] ["Effective Representation of Aliases and Indirect Memory Operations
 399 in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark
 400 Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf)
 401
 402 [3] ["Global Code Motion. Global Value Numbering.", Cliff
 403 Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf)
 404
 405 [4] ["Register Allocation for Programs in SSA Form", Sebastian
 406 Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532)
 407
 408 [5] ["An extension to the SSA representation for predicated code",
 409 Francois de
 410 Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf)
 411
 412 [6] ["Improvements to the Psi-SSA Representation", F. de
 413 Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf)