## Mesa testing using gitlab-runner

The goal of the "test" stage of the .gitlab-ci.yml is to do pre-merge
testing of Mesa drivers on various platforms, so that we can ensure no
regressions are merged, as long as developers are merging code using
the "Merge when pipeline succeeds" button.

This document only covers the CI from .gitlab-ci.yml and this
directory. For other CI systems, see Intel's [Mesa
CI](https://gitlab.freedesktop.org/Mesa_CI) or panfrost's LAVA-based
CI (`src/gallium/drivers/panfrost/ci/`).

### Software architecture

For freedreno and llvmpipe CI, we're using gitlab-runner on the test
devices (DUTs), cached docker containers with VK-GL-CTS, and the
normal shared x86_64 runners to build the Mesa drivers to be run
inside of those containers on the DUTs.
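
As a rough sketch of that split (the job names, paths, and image
reference below are illustrative, not the actual definitions in
.gitlab-ci.yml), it looks something like:

```
# Hypothetical sketch of the build/test split; see .gitlab-ci.yml for
# the real job definitions.
build-mesa:
  stage: build
  # Runs on the normal shared x86_64 runners.
  script:
    - meson build/
    - ninja -C build/
  artifacts:
    paths:
      - build/   # built Mesa drivers, handed to the test jobs below

test-mesa-dut:
  stage: test
  # Cached container with VK-GL-CTS installed (illustrative image name).
  image: $CI_REGISTRY_IMAGE/debian-test:$DEBIAN_ARM64_TAG
  dependencies:
    - build-mesa
  script:
    - ./deqp-runner.sh
```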

The docker containers are rebuilt from the debian-install.sh script
when DEBIAN\_TAG is changed in .gitlab-ci.yml, and
debian-test-install.sh when DEBIAN\_ARM64\_TAG is changed in
.gitlab-ci.yml. The resulting images are around 500MB, and are
expected to change approximately weekly (though an individual
developer working on them may produce many more images while trying to
come up with a working MR!).
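
Bumping a tag is therefore just an edit to .gitlab-ci.yml. As a sketch
(the placement and values shown here are illustrative), that looks
roughly like:

```
variables:
  DEBIAN_TAG: "2019-11-22"        # rebuilds the container from debian-install.sh
  DEBIAN_ARM64_TAG: "2019-11-22"  # rebuilds the container from debian-test-install.sh
```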

gitlab-runner is a client that polls gitlab.freedesktop.org for
available jobs, with no inbound networking requirements. Jobs can
have tags, so we can have DUT-specific jobs that only run on runners
with that tag marked in the gitlab UI.
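
Concretely, a DUT-specific job just carries a tag that only that
device's registered runner advertises. The job and tag names below are
made up:

```
arm64-dut-test:
  tags:
    - my-dut-tag   # must match a tag on the DUT's gitlab-runner registration
  script:
    - ./deqp-runner.sh
```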

Since dEQP takes a long time to run, we mark the job as "parallel" at
some level, which spawns multiple jobs from one definition, and then
deqp-runner.sh takes the corresponding fraction of the test list for
that job.
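
In the job definition that is just the `parallel` keyword (the count
below is arbitrary); GitLab sets CI_NODE_INDEX and CI_NODE_TOTAL in
each spawned job, the same variables the example later in this
document sets by hand to pick a slice of the test list:

```
arm64-dut-test:
  parallel: 4   # spawns 4 copies of this job, CI_NODE_INDEX = 1..4
  script:
    - ./deqp-runner.sh
```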

To reduce dEQP runtime (or avoid tests with unreliable results), a
deqp-runner.sh invocation can provide a list of tests to skip. If
your driver is not yet conformant, you can pass a list of expected
failures, and the job will only fail on tests that aren't listed (look
at the job's log for which specific tests failed).
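
The exact interface is up to deqp-runner.sh; as a purely hypothetical
sketch (these variable and file names are invented for illustration,
check the script itself for what it actually reads), a job might wire
the lists up as:

```
arm64-dut-test:
  variables:
    # Hypothetical names -- not necessarily what deqp-runner.sh reads.
    DEQP_SKIPS: deqp-mydriver-skips.txt
    DEQP_EXPECTED_FAILS: deqp-mydriver-fails.txt
  script:
    - ./deqp-runner.sh
```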

### DUT requirements

#### DUTs must have a stable kernel and GPU reset.

If the system goes down during a test run, that job will eventually
time out and fail (default 1 hour). However, if the kernel can't
reliably reset the GPU on failure, bugs in one MR may leak into
spurious failures in another MR. This would be an unacceptable impact
on Mesa developers working on other drivers.

#### DUTs must be able to run docker

The Mesa gitlab-runner based test architecture is built around docker,
so that we can cache the debian package installation and CTS build
step across multiple test runs. Since the images are large and change
approximately weekly, the DUTs also need to be running some script to
prune stale docker images periodically in order to not run out of disk
space as we rev those containers (perhaps [this
script](https://gitlab.com/gitlab-org/gitlab-runner/issues/2980#note_169233611)).

Note that docker doesn't allow containers to be stored on NFS, and
doesn't allow multiple docker daemons to interact with the same
network block device, so you will probably need some sort of physical
storage on your DUTs.

#### DUTs must be public

By including your device in .gitlab-ci.yml, you're effectively letting
anyone on the internet run code on your device. docker containers may
provide some limited protection, but how much you trust that and what
you do to mitigate hostile access is up to you.

#### DUTs must expose the dri device nodes to the containers.

Obviously, to get access to the HW, we need to pass the render node
through. This is done by adding `devices = ["/dev/dri"]` to the
`runners.docker` section of /etc/gitlab-runner/config.toml.

### HW CI farm expectations

To make sure that testing of one vendor's drivers doesn't block
unrelated work by other vendors, we require that a given driver's test
farm produces a spurious failure no more than once a week. If every
driver had CI and failed once a week, we would be seeing someone's
code getting blocked on a spurious failure daily, which is an
unacceptable cost to the project.

Additionally, the test farm needs to be able to provide a short enough
turnaround time that people can regularly use the "Merge when pipeline
succeeds" button successfully (until we get
[marge-bot](https://github.com/smarkets/marge-bot) in place on
freedesktop.org). As a result, we require that the test farm be able
to handle a whole pipeline's worth of jobs in less than 5 minutes (for
comparison, the build stage takes about 10 minutes, assuming you can
get all your jobs scheduled on the shared runners in time).

If a test farm is short on the HW needed to provide these guarantees,
consider dropping tests to reduce runtime.
`VK-GL-CTS/scripts/log/bottleneck_report.py` can help you find which
tests were slow in a `results.qpa` file. Or, you can have a job with
no `parallel` field set and:

```
variables:
  CI_NODE_INDEX: 1
  CI_NODE_TOTAL: 10
```

to just run 1/10th of the test list.

If a HW CI farm goes offline (network dies and all CI pipelines end up
stalled) or its runners are consistently spuriously failing (disk
full?), and the maintainer is not immediately available to fix the
issue, please push through an MR disabling that farm's jobs by adding
'.' to the front of the job names until the maintainer can bring
things back up. If this happens, the farm maintainer should provide a
report to mesa-dev@lists.freedesktop.org after the fact explaining
what happened and what the mitigation plan is for that failure next
time.
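
In GitLab CI a leading '.' turns a job into a hidden (template) job
that never runs, so the disable is just a rename. The job name below
is made up:

```
# Before: the job runs on the vendor's farm.
arm64-vendor-test:
  tags:
    - vendor-farm-example
  script:
    - ./deqp-runner.sh

# After: the leading '.' hides the job, so the farm is skipped.
.arm64-vendor-test:
  tags:
    - vendor-farm-example
  script:
    - ./deqp-runner.sh
```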