src/gallium/drivers/r600/sb/notes.markdown

   1 r600-sb
   2 =======
   3
   4 * * * * *
   5
   6 Debugging
   7 ---------
   8
   9 ### Environment variables
  10
  11 -   **R600\_DEBUG**
  12
  13     There are new flags:
  14
  15     -   **nosb** - Disable sb backend for graphics shaders
  16     -   **sbcl** - Enable optimization of compute shaders (experimental)
  17     -   **sbdry** - Dry run, optimize but use source bytecode -
  18         useful if you only want to check shader dumps
  19         without the risk of lockups and other problems
  20     -   **sbstat** - Print optimization statistics (only time so far)
  21     -   **sbdump** - Print IR after some passes.
  22     -   **sbnofallback** - Abort on errors instead of fallback
  23     -   **sbdisasm** - Use sb disassembler for shader dumps
  24     -   **sbsafemath** - Disable unsafe math optimizations
  25
  26 ### Regression debugging
  27
  28 If there are any regressions as compared to the default backend
  29 (R600\_SB=0), it's possible to use the following environment variables
  30 to find the incorrectly optimized shader that causes the regression.
  31
  32 -   **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some
  33     shaders
  34     -   0 - disabled (default)
  35     -   1 - skip optimization for the shaders in the range
  36         [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is,
  37         optimize only the shaders that are not in this range
  38     -   2 - optimize only the shaders in the range
  39         [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END]
  40
  41 -   **R600\_SB\_DSKIP\_START** - start of the range (1-based)
  42
  43 -   **R600\_SB\_DSKIP\_END** - end of the range (1-based)
  44
  45 Example - optimize only the shaders 5, 6, and 7:
  46
  47     R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2
  48
  49 All shaders compiled by the application are numbered starting from 1,
  50 the number of shaders used by the application may be obtained by running
  51 it with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#"
  52 for each compiled shader.
  53
  54 After figuring out the total number of shaders used by the application,
  55 the variables above allow to use bisection to find the shader that is
  56 the cause of regression. E.g. if the application uses 100 shaders, we
  57 can divide the range [1; 100] and run the application with the
  58 optimization enabled only for the first half of the shaders:
  59
  60     R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app>
  61
  62 If the regression is reproduced with these parameters, then the failing
  63 shader is in the range [1; 50], if it's not reproduced - then it's in
  64 the range [51; 100]. Then we can divide the new range again and repeat
  65 the testing, until we'll reduce the range to a single failing shader.
  66
  67 *NOTE: This method relies on the assumption that the application
  68 produces the same sequence of the shaders on each run. It's not always
  69 true - some applications may produce different sequences of the shaders,
  70 in such cases the tools like apitrace may be used to record the trace
  71 with the application, then this method may be applied when replaying the
  72 trace - also this may be faster and/or more convenient than testing the
  73 application itself.*
  74
  75 * * * * *
  76
  77 Intermediate Representation
  78 ---------------------------
  79
  80 ### Values
  81
  82 All kinds of the operands (literal constants, references to kcache
  83 constants, references to GPRs, etc) are currently represented by the
  84 **value** class (possibly it makes sense to switch to hierarchy of
  85 classes derived from **value** instead, to save some memory).
  86
  87 All values (except some pseudo values like the exec\_mask or predicate
  88 register) represent 32bit scalar values - there are no vector values,
  89 CF/FETCH instructions use groups of 4 values for src and dst operands.
  90
  91 ### Nodes
  92
  93 Shader programs are represented using the tree data structure, some
  94 nodes contain a list of subnodes.
  95
  96 #### Control flow nodes
  97
  98 Control flow information is represented using four special node types
  99 (based on the ideas from [[1]](#references) )
 100
 101 -   **region\_node** - single-entry, single-exit region.
 102
 103     All loops and if's in the program are enclosed in region nodes.
 104     Region nodes have two containers for phi nodes -
 105     region\_node::loop\_phi contains the phi expressions to be executed
 106     at the region entry, region\_node::phi contains the phi expressions
 107     to be executed at the region exit. It's the only type of the node
 108     that contains associated phi expressions.
 109
 110 -   **depart\_node** - "depart region \$id after { ... }"
 111
 112     Depart target region (jump to exit point) after executing contained
 113     code.
 114
 115 -   **repeat\_node** - "repeat region \$id after { ... }"
 116
 117     Repeat target region (jump to entry point) after executing contained
 118     code.
 119
 120 -   **if\_node** - "if (cond) { ... }"
 121
 122     Execute contained code if condition is true. The difference from
 123     [[1]](#references) is that we don't have associated phi expressions
 124     for the **if\_node**, we enclose **if\_node** in the
 125     **region\_node** and store corresponding phi's in the
 126     **region\_node**, this allows more uniform handling.
 127
 128 The target region of depart and repeat nodes is always the region where
 129 they are located (possibly in the nested region), there are no arbitrary
 130 jumps/goto's - control flow in the program is always structured.
 131
 132 Typical control flow constructs can be represented as in the following
 133 examples:
 134
 135 GLSL:
 136
 137     if (cond) {
 138         < 1 >
 139     } else {
 140         < 2 >
 141     }
 142
 143 IR:
 144
 145     region #0 {
 146         depart region #0 after {
 147             if (cond) {
 148                 depart region #0 after {
 149                     < 1 >
 150                 }
 151             }
 152             < 2 >
 153         }
 154         <region #0 phi nodes >
 155     }
 156
 157 GLSL:
 158
 159     while (cond) {
 160         < 1 >
 161     }
 162
 163 IR:
 164
 165     region #0 {
 166         <region #0 loop_phi nodes>
 167         repeat region #0 after {
 168             region #1 {
 169                 depart region #1 after {
 170                     if (!cond) {
 171                         depart region #0
 172                     }
 173                 }
 174             }
 175             < 1 >
 176         }
 177         <region #0 phi nodes>
 178     }
 179
 180 'Break' and 'continue' inside the loops are directly translated to the
 181 depart and repeat nodes for the corresponding loop region.
 182
 183 This may look a bit too complicated, but in fact this allows more simple
 184 and uniform handling of the control flow.
 185
 186 All loop\_phi and phi nodes for some region always have the same number
 187 of source operands. The number of source operands for
 188 region\_node::loop\_phi nodes is 1 + number of repeat nodes that
 189 reference this region as a target. The number of source operands for
 190 region\_node::phi nodes is equal to the number of depart nodes that
 191 reference this region as a target. All depart/repeat nodes for the
 192 region have unique indices equal to the index of source operand for
 193 phi/loop\_phi nodes.
 194
 195 First source operand for region\_node::loop\_phi nodes (src[0]) is an
 196 incoming value that enters the region from the outside. Each remaining
 197 source operand comes from the corresponding repeat node.
 198
 199 More complex example:
 200
 201 GLSL:
 202
 203     a = 1;
 204     while (a < 5) {
 205         a = a * 2;
 206         if (b == 3) {
 207             continue;
 208         } else {
 209             a = 6;
 210         }
 211         if (c == 4)
 212             break;
 213         a = a + 1;
 214     }
 215
 216 IR with SSA form:
 217
 218     a.1 = 1;
 219     region #0 {
 220         // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2
 221         region#0 loop_phi: a.2 = phi a.1, a.6, a.3
 222
 223         repeat_1 region #0 after {
 224             a.3 = a.2 * 2;
 225             cond1 = (b == 3);
 226             region #1 {
 227                 depart_0 region #1 after {
 228                     if (cond1) {
 229                         repeat_2 region #0;
 230                     }
 231                 }
 232                 a.4 = 6;
 233
 234                 region #1 phi: a.5 = phi a.4; // src[0] - from depart_0
 235             }
 236             cond2 = (c == 4);
 237             region #2 {
 238                 depart_0 region #2 after {
 239                     if (cond2) {
 240                         depart_0 region #0;
 241                     }
 242                 }
 243             }
 244             a.6 = a.5 + 1;
 245         }
 246
 247         region #0 phi: a.7 = phi a.5 // src[0] from depart_0
 248     }
 249
 250 Phi nodes with single source operand are just copies, they are not
 251 really necessary, but this allows to handle all **depart\_node**s in the
 252 uniform way.
 253
 254 #### Instruction nodes
 255
 256 Instruction nodes represent different kinds of instructions -
 257 **alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains
 258 the "bc" structure where all fields of the bytecode are stored (the type
 259 is **bc\_alu** for **alu\_node**, etc). The operands are represented
 260 using the vectors of pointers to **value** class (node::src, node::dst)
 261
 262 #### SSA-specific nodes
 263
 264 Phi nodes currently don't have special node class, they are stored as
 265 **node**. Destination vector contains a single destination value, source
 266 vector contains 1 or more source values.
 267
 268 Psi nodes [[5], [6]](#references) also don't have a special node class
 269 and stored as **node**. Source vector contains 3 values for each source
 270 operand - the **value** of predicate, **value** of corresponding
 271 PRED\_SEL field, and the source **value** itself.
 272
 273 ### Indirect addressing
 274
 275 Special kind of values (VLK\_RELREG) is used to represent indirect
 276 operands. These values don't have SSA versions. The representation is
 277 mostly based on the [[2]](#references). Indirect operand contains the
 278 "offset/address" value (value::rel), (e.g. some SSA version of the AR
 279 register value, though after some passes it may be any value - constant,
 280 register, etc), also it contains the maydef and mayuse vectors of
 281 pointers to **value**s (similar to dst/src vectors in the **node**) to
 282 represent the effects of aliasing in the SSA form.
 283
 284 E.g. if we have the array R5.x ... R8.x and the following instruction :
 285
 286     MOV R0.x, R[5 + AR].x
 287
 288 then source indirect operand is represented with the VLK\_RELREG value,
 289 value::rel is AR, value::maydef is empty (in fact it always contain the
 290 same number of elements as mayuse to simplify the handling, but they are
 291 NULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the
 292 corresponding SSA versions after ssa\_rename).
 293
 294 Additional "virtual variables" as in [HSSA [2]](#references) are not
 295 used, also there is no special handling for "zero versions". Typical
 296 programs in our case are small, indirect addressing is rare, array sizes
 297 are limited by max gpr number, so we don't really need to use special
 298 tricks to avoid the explosion of value versions. Also this allows more
 299 precise liveness computation for array elements without modifications to
 300 the algorithms.
 301
 302 With the following instruction:
 303
 304     MOV R[5+AR].x, R0.x
 305
 306 we'll have both maydef and mayuse vectors for dst operand filled with
 307 array values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename
 308 pass mayuse will contain previous versions, maydef will contain new
 309 potentially-defined versions.
 310
 311 * * * * *
 312
 313 Passes
 314 ------
 315
 316 -   **bc\_parser** - creates the IR from the source bytecode,
 317     initializes src and dst value vectors for instruction nodes. Most
 318     ALU nodes have one dst operand and the number of source operands is
 319     equal to the number of source operands for the ISA instruction.
 320     Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst
 321     gpr as in the original instruction, other two are pseudo-operands
 322     that represent possibly updated predicate and exec\_mask. Predicate
 323     values are used in the predicated alu instructions (node::pred),
 324     exec\_mask values are used in the if\_nodes (if\_node::cond). Each
 325     vector operand in the CF/TEX/VTX instructions is represented with 4
 326     values - components of the vector.
 327
 328 -   **ssa\_prepare** - creates phi expressions.
 329
 330 -   **ssa\_rename** - renames the values (assigns versions).
 331
 332 -   **liveness** - liveness computation, sets 'dead' flag for unused
 333     nodes and values, optionally computes interference information for
 334     the values.
 335
 336 -   **dce\_cleanup** - eliminates 'dead' nodes, also removes some
 337     unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP
 338     instructions in the source, containers for ALU groups (they were
 339     only needed for the ssa\_rename pass)
 340
 341 -   **if\_conversion** - converts control flow with if\_nodes to the
 342     data flow in cases where it can improve performance (small alu-only
 343     branches). Both branches are executed speculatively and the phi
 344     expressions are replaced with conditional moves (CNDxx) to select
 345     the final value using the same condition predicate as was used by
 346     the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx
 347     instruction, CNDxx now uses dst[0] from the same PREDSETxx
 348     instruction.
 349
 350 -   **peephole** - peephole optimizations
 351
 352 -   **gvn** - Global Value Numbering [[2]](#references),
 353     [[3]](#references)
 354
 355 -   **gcm** - Global Code Motion [[3]](#references). Also performs
 356     grouping of the instructions of the same kind (CF/FETCH/ALU).
 357
 358 -   register allocation passes, some ideas are used from
 359     [[4]](#references), but implementation is simplified to make it more
 360     efficient in terms of the compilation speed (e.g. no recursive
 361     recoloring) while achieving good enough results.
 362
 363     -   **ra\_split** - prepares the program to register allocation.
 364         Splits live ranges for constrained values by inserting the
 365         copies to/from temporary values, so that the live range of the
 366         constrained values becomes minimal.
 367
 368     -   **ra\_coalesce** - performs global allocation on registers used
 369         in CF/FETCH instructions. It's performed first to make sure they
 370         end up in the same GPR. Also tries to allocate all values
 371         involved in copies (inserted by the ra\_split pass) to the same
 372         register, so that the copies may be eliminated.
 373
 374     -   **ra\_init** - allocates gpr arrays (if indirect addressing is
 375         used), and remaining values.
 376
 377 -   **post\_scheduler** - ALU scheduler, handles VLIW packing and
 378     performs the final register allocation for local values inside ALU
 379     clauses. Eliminates all coalesced copies (if src and dst of the copy
 380     are allocated to the same register).
 381
 382 -   **ra\_checker** - optional debugging pass that tries to catch basic
 383     errors of the scheduler or regalloc,
 384
 385 -   **bc\_finalize** - propagates the regalloc information from values
 386     in node::src and node::dst vectors to the bytecode fields, converts
 387     control flow structure (region/depart/repeat) to the target
 388     instructions (JUMP/ELSE/POP,
 389     LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK).
 390
 391 -   **bc\_builder** - builds final bytecode,
 392
 393 * * * * *
 394
 395 References
 396 ----------
 397
 398 [1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl
 399 McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf)
 400
 401 [2] ["Effective Representation of Aliases and Indirect Memory Operations
 402 in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark
 403 Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf)
 404
 405 [3] ["Global Code Motion. Global Value Numbering.", Cliff
 406 Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf)
 407
 408 [4] ["Register Allocation for Programs in SSA Form", Sebastian
 409 Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532)
 410
 411 [5] ["An extension to the SSA representation for predicated code",
 412 Francois de
 413 Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf)
 414
 415 [6] ["Improvements to the Psi-SSA Representation", F. de
 416 Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf)