# Mesa testing

The goal of the "test" stage of the .gitlab-ci.yml is to do pre-merge
testing of Mesa drivers on various platforms, so that we can ensure no
regressions are merged, as long as developers are merging code using
marge-bot.

There are currently 4 automated testing systems deployed for Mesa.
LAVA and gitlab-runner on the DUTs are used in pre-merge testing and
are described in this document. Managing bare metal using
gitlab-runner is described under [bare-metal/README.md](bare-metal/README.md). Intel also
has a jenkins-based CI system with restricted access that isn't
connected to gitlab.

## Mesa testing using LAVA

[LAVA](https://lavasoftware.org/) is a system for functional testing
of boards including deploying custom bootloaders and kernels. This is
particularly relevant to testing Mesa because we often need to change
kernels for UAPI changes (and this lets us do full testing of a new
kernel during development), and our workloads can easily take down
boards when mistakes are made (kernel oopses, OOMs that take out
critical system services).

### Mesa-LAVA software architecture

The gitlab-runner will run on some host that has access to the LAVA
lab, with tags like "lava-mesa-boardname" to control only taking in
jobs for the hardware that the LAVA lab contains. The gitlab-runner
spawns a docker container with lavacli in it, and connects to the
LAVA lab using a predefined token to submit jobs under a specific
device type.
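
For reference, submitting and following a job from inside that
container boils down to lavacli calls along these lines (a sketch;
`mesa-deqp-job.yaml` is a placeholder for the generated job
definition):

```
# Submit the generated job definition; this prints the job id.
lavacli jobs submit mesa-deqp-job.yaml
# Block until the job finishes, then pull its log.
lavacli jobs wait <job id>
lavacli jobs logs <job id>
```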

The LAVA instance manages scheduling those jobs to the boards present.
For a job, it will deploy the kernel, device tree, and the ramdisk
containing the CTS.

### Deploying a new Mesa-LAVA lab

You'll want to start with setting up your LAVA instance and getting
some boards booting using test jobs. Start with the stock QEMU
examples to make sure your instance works at all. Then, you'll need
to define your actual boards.

The device type in lava-gitlab-ci.yml is the device type you create in
your LAVA instance, which doesn't have to match the board's name in
`/etc/lava-dispatcher/device-types`. You create your boards under
that device type and the Mesa jobs will be scheduled to any of them.
Instantiate your boards by creating them in the UI or at the command
line attached to that device type, then populate their dictionary
(using an "extends" line probably referencing the board's template in
`/etc/lava-dispatcher/device-types`). Now, go find a relevant
healthcheck job for your board as a test job definition, or cobble
something together from a board that boots using the same boot_method
and some public images, and figure out how to get your boards booting.
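
For orientation, a LAVA v2 job definition has roughly this shape (a
heavily trimmed sketch with placeholder URLs, device type, and boot
method; your board's existing healthcheck is a better starting point):

```
device_type: my-boardname
job_name: mesa-boot-smoke-test
visibility: public
priority: medium
timeouts:
  job:
    minutes: 30
  action:
    minutes: 10

actions:
- deploy:
    to: tftp
    kernel:
      url: https://example.com/artifacts/Image
    dtb:
      url: https://example.com/artifacts/my-board.dtb
    ramdisk:
      url: https://example.com/artifacts/rootfs.cpio.gz
      compression: gz
- boot:
    method: u-boot
    commands: ramdisk
    prompts:
    - '#'
```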

Once you can boot your board using a custom job definition, it's time
to connect Mesa CI to it. Install gitlab-runner and register as a
shared runner (you'll need a gitlab admin for help with this). The
runner *must* have a tag (like "mesa-lava-db410c") to restrict the
jobs it takes or it will grab random jobs from tasks across fd.o, and
your runner isn't ready for that.
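
Registration looks roughly like this (a sketch; the registration token
comes from the gitlab admin, and the tag and image name are just
examples):

```
gitlab-runner register \
  --non-interactive \
  --url https://gitlab.freedesktop.org/ \
  --registration-token <token from the admin> \
  --executor docker \
  --docker-image alpine:latest \
  --tag-list mesa-lava-db410c \
  --run-untagged=false
```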

The runner will be running an ARM docker image (we haven't done any
x86 LAVA yet, so that isn't documented). If your host for the
gitlab-runner is x86, then you'll need to install qemu-user-static and
the binfmt support.
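
On a Debian-ish host that is roughly (package names may differ on
other distributions):

```
apt-get install qemu-user-static binfmt-support
```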

The docker image will need access to the lava instance. If it's on a
public network it should be fine. If you're running the LAVA instance
on localhost, you'll need to set `network_mode="host"` in
`/etc/gitlab-runner/config.toml` so it can access localhost. Create a
gitlab-runner user in your LAVA instance, log in under that user on
the web interface, and create an API token. Copy that into a
`lavacli.yaml`:

```
default:
  token: <token contents>
  uri: <url to the instance>
  username: gitlab-runner
```

Add a volume mount of that `lavacli.yaml` to
`/etc/gitlab-runner/config.toml` so that the docker container can
access it. You probably have a `volumes = ["/cache"]` already, so now
it would be

```
volumes = ["/home/anholt/lava-config/lavacli.yaml:/root/.config/lavacli.yaml", "/cache"]
```
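
Putting those pieces together, the `[runners.docker]` section of
`/etc/gitlab-runner/config.toml` ends up looking something like this
(a sketch; the paths are illustrative, and `network_mode` is only
needed if your LAVA instance is on localhost):

```
[[runners]]
  # name, url, token, etc. omitted
  executor = "docker"
  [runners.docker]
    network_mode = "host"
    volumes = ["/home/anholt/lava-config/lavacli.yaml:/root/.config/lavacli.yaml", "/cache"]
```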

Note that this token is visible to anybody that can submit MRs to
Mesa! It is not an actual secret. We could just bake it into the
gitlab CI yml, but this way the current method of connecting to the
LAVA instance is separated from the Mesa branches (particularly
relevant as we have many stable branches all using CI).

Now it's time to define your test runner in
`.gitlab-ci/lava-gitlab-ci.yml`.
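
The details live in that file, but the essential shape is a gitlab-ci
job carrying your runner's tag so that it only lands in your lab,
along the lines of this hypothetical sketch (the job name, variable,
and script path are illustrative, not the actual file contents):

```
boardname-gles2:
  tags:
    - lava-mesa-boardname
  variables:
    DEVICE_TYPE: my-boardname
  script:
    - ./artifacts/lava-test.sh
```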

## Mesa testing using gitlab-runner on DUTs

### Software architecture

For freedreno and llvmpipe CI, we're using gitlab-runner on the test
devices (DUTs), cached docker containers with VK-GL-CTS, and the
normal shared x86_64 runners to build the Mesa drivers to be run
inside of those containers on the DUTs.

The docker containers are rebuilt from the debian-install.sh script
when DEBIAN\_TAG is changed in .gitlab-ci.yml, and
debian-test-install.sh when DEBIAN\_ARM64\_TAG is changed in
.gitlab-ci.yml. The resulting images are around 500MB, and are
expected to change approximately weekly (though an individual
developer working on them may produce many more images while trying to
come up with a working MR!).
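
That bump is just an edit to the global `variables:` block of
.gitlab-ci.yml, something like the following (the values here are made
up; any new unique value forces a rebuild of the corresponding image):

```
variables:
  DEBIAN_TAG: "2019-11-22"
  DEBIAN_ARM64_TAG: "arm64-2019-11-22"
```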

gitlab-runner is a client that polls gitlab.freedesktop.org for
available jobs, with no inbound networking requirements. Jobs can
have tags, so we can have DUT-specific jobs that only run on runners
with that tag marked in the gitlab UI.

Since dEQP takes a long time to run, we mark the job as "parallel" at
some level, which spawns multiple jobs from one definition, and then
deqp-runner.sh takes the corresponding fraction of the test list for
that job.
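
In gitlab-ci terms that looks something like the following (the job
name and script path are illustrative); gitlab sets `CI_NODE_INDEX`
and `CI_NODE_TOTAL` in each of the spawned jobs, which is what
deqp-runner.sh uses to pick its slice of the test list:

```
arm64-boardname-gles31:
  parallel: 4
  script:
    - ./artifacts/deqp-runner.sh
```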

To reduce dEQP runtime (or avoid tests with unreliable results), a
deqp-runner.sh invocation can provide a list of tests to skip. If
your driver is not yet conformant, you can pass a list of expected
failures, and the job will only fail on tests that aren't listed (look
at the job's log for which specific tests failed).

### DUT requirements

#### DUTs must have a stable kernel and GPU reset

If the system goes down during a test run, that job will eventually
time out and fail (default 1 hour). However, if the kernel can't
reliably reset the GPU on failure, bugs in one MR may leak into
spurious failures in another MR. This would be an unacceptable impact
on Mesa developers working on other drivers.

#### DUTs must be able to run docker

The Mesa gitlab-runner based test architecture is built around docker,
so that we can cache the debian package installation and CTS build
step across multiple test runs. Since the images are large and change
approximately weekly, the DUTs also need to be running some script to
prune stale docker images periodically in order to not run out of disk
space as we rev those containers (perhaps [this
script](https://gitlab.com/gitlab-org/gitlab-runner/issues/2980#note_169233611)).
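
One low-tech option (an assumption about your setup, not something
Mesa ships) is a daily cron job that prunes unused images more than a
week old, for example:

```
#!/bin/sh
# e.g. installed as /etc/cron.daily/docker-prune
docker image prune --all --force --filter "until=168h"
```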

Note that docker doesn't allow containers to be stored on NFS, and
doesn't allow multiple docker daemons to interact with the same
network block device, so you will probably need some sort of physical
storage on your DUTs.

#### DUTs must be public

By including your device in .gitlab-ci.yml, you're effectively letting
anyone on the internet run code on your device. docker containers may
provide some limited protection, but how much you trust that and what
you do to mitigate hostile access is up to you.

#### DUTs must expose the dri device nodes to the containers

Obviously, to get access to the HW, we need to pass the render node
through. This is done by adding `devices = ["/dev/dri"]` to the
`runners.docker` section of /etc/gitlab-runner/config.toml.
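
In context that looks like:

```
[runners.docker]
  devices = ["/dev/dri"]
```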

### HW CI farm expectations

To make sure that testing of one vendor's drivers doesn't block
unrelated work by other vendors, we require that a given driver's test
farm produces a spurious failure no more than once a week. If every
driver had CI and failed once a week, we would be seeing someone's
code getting blocked on a spurious failure daily, which is an
unacceptable cost to the project.

Additionally, the test farm needs to be able to provide a short enough
turnaround time that people can regularly use the "Merge when pipeline
succeeds" button successfully (until we get
[marge-bot](https://github.com/smarkets/marge-bot) in place on
freedesktop.org). As a result, we require that the test farm be able
to handle a whole pipeline's worth of jobs in less than 5 minutes (for
comparison, the build stage takes about 10 minutes, assuming you can
get all your jobs scheduled on the shared runners in time).

If a test farm is short on the HW needed to provide these guarantees,
consider dropping tests to reduce runtime.
`VK-GL-CTS/scripts/log/bottleneck_report.py` can help you find what
tests were slow in a `results.qpa` file. Or, you can have a job with
no `parallel` field set and:

```
variables:
  CI_NODE_INDEX: 1
  CI_NODE_TOTAL: 10
```

to just run 1/10th of the test list.

If a HW CI farm goes offline (network dies and all CI pipelines end up
stalled) or its runners are consistently spuriously failing (disk
full?), and the maintainer is not immediately available to fix the
issue, please push through an MR disabling that farm's jobs by adding
'.' to the front of the job names until the maintainer can bring
things back up. If this happens, the farm maintainer should provide a
report to mesa-dev@lists.freedesktop.org after the fact explaining
what happened and what the mitigation plan is for that failure next
time.