# Mesa testing

The goal of the "test" stage of the .gitlab-ci.yml is to do pre-merge
testing of Mesa drivers on various platforms, so that we can ensure no
regressions are merged, as long as developers are merging code using
marge-bot.

There are currently 3 main automated testing systems deployed for
Mesa. LAVA and gitlab-runner on the DUTs are used in pre-merge
testing and are described in this document, while Intel has a
jenkins-based CI system with restricted access that isn't connected to
gitlab.

## Mesa testing using LAVA

[LAVA](https://lavasoftware.org/) is a system for functional testing
of boards including deploying custom bootloaders and kernels. This is
particularly relevant to testing Mesa because we often need to change
kernels for UAPI changes (and this lets us do full testing of a new
kernel during development), and our workloads can easily take down
boards when mistakes are made (kernel oopses, OOMs that take out
critical system services).

### Mesa-LAVA software architecture

The gitlab-runner will run on some host that has access to the LAVA
lab, with tags like "lava-mesa-boardname" to control only taking in
jobs for the hardware that the LAVA lab contains. The gitlab-runner
spawns a docker container with lavacli in it, and connects to the
LAVA lab using a predefined token to submit jobs under a specific
device type.

The LAVA instance manages scheduling those jobs to the boards present.
For a job, it will deploy the kernel, device tree, and the ramdisk
containing the CTS.
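
For reference, a LAVA job definition is a YAML file describing those
deploy, boot, and test phases. A heavily abbreviated sketch is below;
the device type, URLs, boot method, timeouts, and test steps are
placeholders standing in for whatever your lab and the Mesa scripts
actually use:

```
device_type: db410c
job_name: mesa-deqp-sketch
priority: medium
visibility: public
timeouts:
  job:
    minutes: 60

actions:
  - deploy:
      to: tftp
      kernel:
        url: https://example.invalid/Image
      ramdisk:
        url: https://example.invalid/rootfs.cpio.gz
        compression: gz
      dtb:
        url: https://example.invalid/board.dtb
  - boot:
      method: u-boot
      commands: ramdisk
      prompts:
        - 'root@'
  - test:
      definitions:
        - from: inline
          name: smoke
          path: inline/smoke.yaml
          repository:
            metadata:
              format: Lava-Test Test Definition 1.0
              name: smoke
            run:
              steps:
                - uname -a
```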

### Deploying a new Mesa-LAVA lab

You'll want to start with setting up your LAVA instance and getting
some boards booting using test jobs. Start with the stock QEMU
examples to make sure your instance works at all. Then, you'll need
to define your actual boards.

The device type in lava-gitlab-ci.yml is the device type you create in
your LAVA instance, which doesn't have to match the board's name in
`/etc/lava-dispatcher/device-types`. You create your boards under
that device type and the Mesa jobs will be scheduled to any of them.
Instantiate your boards by creating them in the UI or at the command
line attached to that device type, then populate their dictionary
(using an "extends" line probably referencing the board's template in
`/etc/lava-dispatcher/device-types`). Now, go find a relevant
healthcheck job for your board as a test job definition, or cobble
something together from a board that boots using the same boot_method
and some public images, and figure out how to get your boards booting.
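
As a concrete sketch of that "extends" step, a device dictionary is a
small jinja2 file that pulls in the stock template and fills in your
lab-specific control commands. Everything here is a placeholder: the
template name, console server, and `pdu-ctl` power script are examples
of what a lab might have, not something Mesa or LAVA provides:

```
{% extends 'dragonboard-410c.jinja2' %}

{% set connection_command = 'telnet my-console-server 7001' %}
{% set power_on_command = '/usr/local/bin/pdu-ctl db410c-01 on' %}
{% set power_off_command = '/usr/local/bin/pdu-ctl db410c-01 off' %}
{% set hard_reset_command = '/usr/local/bin/pdu-ctl db410c-01 cycle' %}
```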

Once you can boot your board using a custom job definition, it's time
to connect Mesa CI to it. Install gitlab-runner and register as a
shared runner (you'll need a gitlab admin for help with this). The
runner *must* have a tag (like "mesa-lava-db410c") to restrict the
jobs it takes or it will grab random jobs from tasks across fd.o, and
your runner isn't ready for that.
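
Registration looks roughly like the following; the registration token
comes from your gitlab admin, and the image and tag here are just
examples:

```
sudo gitlab-runner register \
  --non-interactive \
  --url https://gitlab.freedesktop.org/ \
  --registration-token <token from the admin> \
  --executor docker \
  --docker-image debian:buster \
  --tag-list mesa-lava-db410c
```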

The runner will be running an ARM docker image (we haven't done any
x86 LAVA yet, so that isn't documented). If your host for the
gitlab-runner is x86, then you'll need to install qemu-user-static and
the binfmt support.
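
On a Debian-ish host that's typically just (package names may vary by
distribution):

```
sudo apt-get install qemu-user-static binfmt-support
```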

The docker image will need access to the lava instance. If it's on a
public network it should be fine. If you're running the LAVA instance
on localhost, you'll need to set `network_mode="host"` in
`/etc/gitlab-runner/config.toml` so it can access localhost. Create a
gitlab-runner user in your LAVA instance, log in under that user on
the web interface, and create an API token. Copy that into a
`lavacli.yaml`:

```
default:
  token: <token contents>
  uri: <url to the instance>
  username: gitlab-runner
```

Add a volume mount of that `lavacli.yaml` to
`/etc/gitlab-runner/config.toml` so that the docker container can
access it. You probably already have a `volumes = ["/cache"]` line,
which would now become

```
volumes = ["/home/anholt/lava-config/lavacli.yaml:/root/.config/lavacli.yaml", "/cache"]
```
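
Putting those pieces together, the runner's entry in
`/etc/gitlab-runner/config.toml` ends up looking something like the
following (the name, image, and paths are examples, and the runner
token that registration writes out is omitted):

```
[[runners]]
  name = "mesa-lava-db410c"
  url = "https://gitlab.freedesktop.org/"
  executor = "docker"
  [runners.docker]
    image = "debian:buster"
    network_mode = "host"
    volumes = ["/home/anholt/lava-config/lavacli.yaml:/root/.config/lavacli.yaml", "/cache"]
```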

Note that this token is visible to anybody who can submit MRs to
Mesa! It is not an actual secret. We could just bake it into the
gitlab CI yml, but this way the current method of connecting to the
LAVA instance is separated from the Mesa branches (particularly
relevant as we have many stable branches all using CI).

Now it's time to define your test runner in
`.gitlab-ci/lava-gitlab-ci.yml`.
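
The details of that file evolve with the CI scripts, but each job in it
is ordinary gitlab-ci YAML: a test-stage job tagged for your runner
that submits a LAVA job for your device type. A purely illustrative
sketch (the job name, variable, and script path here are hypothetical,
not the real contents of the file):

```
lava-db410c-gles2:
  stage: test
  tags:
    - mesa-lava-db410c
  variables:
    DEVICE_TYPE: db410c
  script:
    - .gitlab-ci/lava-submit.sh
```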

## Mesa testing using gitlab-runner on DUTs

### Software architecture

For freedreno and llvmpipe CI, we're using gitlab-runner on the test
devices (DUTs), cached docker containers with VK-GL-CTS, and the
normal shared x86_64 runners to build the Mesa drivers to be run
inside of those containers on the DUTs.

The docker containers are rebuilt from the debian-install.sh script
when DEBIAN\_TAG is changed in .gitlab-ci.yml, and from
debian-test-install.sh when DEBIAN\_ARM64\_TAG is changed in
.gitlab-ci.yml. The resulting images are around 500MB, and are
expected to change approximately weekly (though an individual
developer working on them may produce many more images while trying to
come up with a working MR!).

gitlab-runner is a client that polls gitlab.freedesktop.org for
available jobs, with no inbound networking requirements. Jobs can
have tags, so we can have DUT-specific jobs that only run on runners
with that tag marked in the gitlab UI.

Since dEQP takes a long time to run, we mark the job as "parallel" at
some level, which spawns multiple jobs from one definition, and then
deqp-runner.sh takes the corresponding fraction of the test list for
that job.
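
In gitlab-ci terms that's just the `parallel` keyword; gitlab exports
`CI_NODE_INDEX` and `CI_NODE_TOTAL` to each spawned job, which is what
deqp-runner.sh uses to pick its slice of the test list. For example
(the job name, tag, and script path are illustrative):

```
arm64-a630-gles31:
  tags:
    - mesa-cheza
  parallel: 4
  script:
    - .gitlab-ci/deqp-runner.sh
```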

To reduce dEQP runtime (or avoid tests with unreliable results), a
deqp-runner.sh invocation can provide a list of tests to skip. If
your driver is not yet conformant, you can pass a list of expected
failures, and the job will only fail on tests that aren't listed (look
at the job's log for which specific tests failed).
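
Assuming the usual format of one dEQP test name or pattern per line,
such a skip list is just a text file along these lines (the entries
here are made up for illustration):

```
dEQP-GLES2.functional.flush_finish.wait
dEQP-GLES31.functional.ssbo.layout.random.all_shared_buffer.36
```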

### DUT requirements

#### DUTs must have a stable kernel and GPU reset.

If the system goes down during a test run, that job will eventually
time out and fail (default 1 hour). However, if the kernel can't
reliably reset the GPU on failure, bugs in one MR may leak into
spurious failures in another MR. This would be an unacceptable impact
on Mesa developers working on other drivers.

#### DUTs must be able to run docker

The Mesa gitlab-runner based test architecture is built around docker,
so that we can cache the debian package installation and CTS build
step across multiple test runs. Since the images are large and change
approximately weekly, the DUTs also need to be running some script to
prune stale docker images periodically in order to not run out of disk
space as we rev those containers (perhaps [this
script](https://gitlab.com/gitlab-org/gitlab-runner/issues/2980#note_169233611)).
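
If you don't want to pull in that script, even a blunt daily cron job
that drops images unused for a couple of weeks will keep the disk from
filling up; the retention period here is just an example:

```
#!/bin/sh
# e.g. /etc/cron.daily/docker-prune
docker image prune --all --force --filter "until=336h"
```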

Note that docker doesn't allow containers to be stored on NFS, and
doesn't allow multiple docker daemons to interact with the same
network block device, so you will probably need some sort of physical
storage on your DUTs.

#### DUTs must be public

By including your device in .gitlab-ci.yml, you're effectively letting
anyone on the internet run code on your device. docker containers may
provide some limited protection, but how much you trust that and what
you do to mitigate hostile access is up to you.

#### DUTs must expose the dri device nodes to the containers.

Obviously, to get access to the HW, we need to pass the render node
through. This is done by adding `devices = ["/dev/dri"]` to the
`runners.docker` section of /etc/gitlab-runner/config.toml.
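
That is, something like:

```
[runners.docker]
  devices = ["/dev/dri"]
```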

### HW CI farm expectations

To make sure that testing of one vendor's drivers doesn't block
unrelated work by other vendors, we require that a given driver's test
farm produces a spurious failure no more than once a week. If every
driver had CI and failed once a week, we would be seeing someone's
code getting blocked on a spurious failure daily, which is an
unacceptable cost to the project.

Additionally, the test farm needs to be able to provide a short enough
turnaround time that people can regularly use the "Merge when pipeline
succeeds" button successfully (until we get
[marge-bot](https://github.com/smarkets/marge-bot) in place on
freedesktop.org). As a result, we require that the test farm be able
to handle a whole pipeline's worth of jobs in less than 5 minutes (for
comparison, the build stage takes about 10 minutes, assuming you can
get all your jobs scheduled on the shared runners in time).

If a test farm is short on the HW to provide these guarantees, consider
dropping tests to reduce runtime.
`VK-GL-CTS/scripts/log/bottleneck_report.py` can help you find what
tests were slow in a `results.qpa` file. Or, you can have a job with
no `parallel` field set and:

```
variables:
  CI_NODE_INDEX: 1
  CI_NODE_TOTAL: 10
```

to just run 1/10th of the test list.

If a HW CI farm goes offline (network dies and all CI pipelines end up
stalled) or its runners are consistently spuriously failing (disk
full?), and the maintainer is not immediately available to fix the
issue, please push through an MR disabling that farm's jobs by adding
'.' to the front of the job names until the maintainer can bring
things back up. If this happens, the farm maintainer should provide a
report to mesa-dev@lists.freedesktop.org after the fact explaining
what happened and what the mitigation plan is for that failure next
time.
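
(In gitlab-ci a job whose name starts with '.' is treated as a hidden
template and never runs, so disabling a job is literally just renaming
it; the job and template names below are hypothetical.)

```
# was: arm64-a630-gles2:
.arm64-a630-gles2:
  extends: .arm64-test-gl
```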