# Mesa testing

The goal of the "test" stage of the .gitlab-ci.yml is to do pre-merge
testing of Mesa drivers on various platforms, so that we can ensure no
regressions are merged, as long as developers are merging code using
marge-bot.

There are currently 3 main automated testing systems deployed for
Mesa. LAVA and gitlab-runner on the DUTs are used in pre-merge
testing and are described in this document, while Intel has a
jenkins-based CI system with restricted access that isn't connected to
gitlab.

## Mesa testing using LAVA

[LAVA](https://lavasoftware.org/) is a system for functional testing
of boards including deploying custom bootloaders and kernels. This is
particularly relevant to testing Mesa because we often need to change
kernels for UAPI changes (and this lets us do full testing of a new
kernel during development), and our workloads can easily take down
boards when mistakes are made (kernel oopses, OOMs that take out
critical system services).

### Mesa-LAVA software architecture

The gitlab-runner will run on some host that has access to the LAVA
lab, with tags like "lava-mesa-boardname" to control only taking in
jobs for the hardware that the LAVA lab contains. The gitlab-runner
spawns a docker container with lavacli in it, and connects to the
LAVA lab using a predefined token to submit jobs under a specific
device type.

The LAVA instance manages scheduling those jobs to the boards present.
For a job, it will deploy the kernel, device tree, and the ramdisk
containing the CTS.
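
For reference, a LAVA job definition is a YAML file describing those
deploy, boot, and test phases. A heavily abbreviated sketch is below;
the device type, URLs, boot method, timeouts, and test steps are
placeholders standing in for whatever your lab and the Mesa scripts
actually use:

```
device_type: db410c
job_name: mesa-deqp-sketch
priority: medium
visibility: public
timeouts:
  job:
    minutes: 60

actions:
  - deploy:
      to: tftp
      kernel:
        url: https://example.invalid/Image
      ramdisk:
        url: https://example.invalid/rootfs.cpio.gz
        compression: gz
      dtb:
        url: https://example.invalid/board.dtb
  - boot:
      method: u-boot
      commands: ramdisk
      prompts:
        - 'root@'
  - test:
      definitions:
        - from: inline
          name: smoke
          path: inline/smoke.yaml
          repository:
            metadata:
              format: Lava-Test Test Definition 1.0
              name: smoke
            run:
              steps:
                - uname -a
```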

### Deploying a new Mesa-LAVA lab

You'll want to start with setting up your LAVA instance and getting
some boards booting using test jobs. Start with the stock QEMU
examples to make sure your instance works at all. Then, you'll need
to define your actual boards.

The device type in lava-gitlab-ci.yml is the device type you create in
your LAVA instance, which doesn't have to match the board's name in
`/etc/lava-dispatcher/device-types`. You create your boards under
that device type and the Mesa jobs will be scheduled to any of them.
Instantiate your boards by creating them in the UI or at the command
line attached to that device type, then populate their dictionary
(using an "extends" line probably referencing the board's template in
`/etc/lava-dispatcher/device-types`). Now, go find a relevant
healthcheck job for your board as a test job definition, or cobble
something together from a board that boots using the same boot_method
and some public images, and figure out how to get your boards booting.
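
As a concrete sketch of that "extends" step, a device dictionary is a
small jinja2 file that pulls in the stock template and fills in your
lab-specific control commands. Everything here is a placeholder: the
template name, console server, and `pdu-ctl` power script are examples
of what a lab might have, not something Mesa or LAVA provides:

```
{% extends 'dragonboard-410c.jinja2' %}

{% set connection_command = 'telnet my-console-server 7001' %}
{% set power_on_command = '/usr/local/bin/pdu-ctl db410c-01 on' %}
{% set power_off_command = '/usr/local/bin/pdu-ctl db410c-01 off' %}
{% set hard_reset_command = '/usr/local/bin/pdu-ctl db410c-01 cycle' %}
```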

Once you can boot your board using a custom job definition, it's time
to connect Mesa CI to it. Install gitlab-runner and register as a
shared runner (you'll need a gitlab admin for help with this). The
runner *must* have a tag (like "mesa-lava-db410c") to restrict the
jobs it takes or it will grab random jobs from tasks across fd.o, and
your runner isn't ready for that.
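
Registration looks roughly like the following; the registration token
comes from your gitlab admin, and the image and tag here are just
examples:

```
sudo gitlab-runner register \
  --non-interactive \
  --url https://gitlab.freedesktop.org/ \
  --registration-token <token from the admin> \
  --executor docker \
  --docker-image debian:buster \
  --tag-list mesa-lava-db410c
```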

The runner will be running an ARM docker image (we haven't done any
x86 LAVA yet, so that isn't documented). If your host for the
gitlab-runner is x86, then you'll need to install qemu-user-static and
the binfmt support.
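
On a Debian-ish host that's typically just (package names may vary by
distribution):

```
sudo apt-get install qemu-user-static binfmt-support
```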

The docker image will need access to the lava instance. If it's on a
public network it should be fine. If you're running the LAVA instance
on localhost, you'll need to set `network_mode="host"` in
`/etc/gitlab-runner/config.toml` so it can access localhost. Create a
gitlab-runner user in your LAVA instance, log in under that user on
the web interface, and create an API token. Copy that into a
`lavacli.yaml`:

```
default:
  token: <token contents>
  uri: <url to the instance>
  username: gitlab-runner
```

Add a volume mount of that `lavacli.yaml` to
`/etc/gitlab-runner/config.toml` so that the docker container can
access it. You probably already have a `volumes = ["/cache"]` line,
which would now become

```
volumes = ["/home/anholt/lava-config/lavacli.yaml:/root/.config/lavacli.yaml", "/cache"]
```
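
Putting those pieces together, the runner's entry in
`/etc/gitlab-runner/config.toml` ends up looking something like the
following (the name, image, and paths are examples, and the runner
token that registration writes out is omitted):

```
[[runners]]
  name = "mesa-lava-db410c"
  url = "https://gitlab.freedesktop.org/"
  executor = "docker"
  [runners.docker]
    image = "debian:buster"
    network_mode = "host"
    volumes = ["/home/anholt/lava-config/lavacli.yaml:/root/.config/lavacli.yaml", "/cache"]
```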

Note that this token is visible to anybody who can submit MRs to
Mesa! It is not an actual secret. We could just bake it into the
gitlab CI yml, but this way the current method of connecting to the
LAVA instance is separated from the Mesa branches (particularly
relevant as we have many stable branches all using CI).

Now it's time to define your test runner in
`.gitlab-ci/lava-gitlab-ci.yml`.
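
The details of that file evolve with the CI scripts, but each job in it
is ordinary gitlab-ci YAML: a test-stage job tagged for your runner
that submits a LAVA job for your device type. A purely illustrative
sketch (the job name, variable, and script path here are hypothetical,
not the real contents of the file):

```
lava-db410c-gles2:
  stage: test
  tags:
    - mesa-lava-db410c
  variables:
    DEVICE_TYPE: db410c
  script:
    - .gitlab-ci/lava-submit.sh
```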

## Mesa testing using gitlab-runner on DUTs

### Software architecture

For freedreno and llvmpipe CI, we're using gitlab-runner on the test
devices (DUTs), cached docker containers with VK-GL-CTS, and the
normal shared x86_64 runners to build the Mesa drivers to be run
inside of those containers on the DUTs.

The docker containers are rebuilt from the debian-install.sh script
when DEBIAN\_TAG is changed in .gitlab-ci.yml, and from
debian-test-install.sh when DEBIAN\_ARM64\_TAG is changed in
.gitlab-ci.yml. The resulting images are around 500MB, and are
expected to change approximately weekly (though an individual
developer working on them may produce many more images while trying to
come up with a working MR!).

gitlab-runner is a client that polls gitlab.freedesktop.org for
available jobs, with no inbound networking requirements. Jobs can
have tags, so we can have DUT-specific jobs that only run on runners
with that tag marked in the gitlab UI.

Since dEQP takes a long time to run, we mark the job as "parallel" at
some level, which spawns multiple jobs from one definition, and then
deqp-runner.sh takes the corresponding fraction of the test list for
that job.
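
In gitlab-ci terms that's just the `parallel` keyword; gitlab exports
`CI_NODE_INDEX` and `CI_NODE_TOTAL` to each spawned job, which is what
deqp-runner.sh uses to pick its slice of the test list. For example
(the job name, tag, and script path are illustrative):

```
arm64-a630-gles31:
  tags:
    - mesa-cheza
  parallel: 4
  script:
    - .gitlab-ci/deqp-runner.sh
```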

To reduce dEQP runtime (or avoid tests with unreliable results), a
deqp-runner.sh invocation can provide a list of tests to skip. If
your driver is not yet conformant, you can pass a list of expected
failures, and the job will only fail on tests that aren't listed (look
at the job's log for which specific tests failed).
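
Assuming the usual format of one dEQP test name or pattern per line,
such a skip list is just a text file along these lines (the entries
here are made up for illustration):

```
dEQP-GLES2.functional.flush_finish.wait
dEQP-GLES31.functional.ssbo.layout.random.all_shared_buffer.36
```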

### DUT requirements

#### DUTs must have a stable kernel and GPU reset.

If the system goes down during a test run, that job will eventually
time out and fail (default 1 hour). However, if the kernel can't
reliably reset the GPU on failure, bugs in one MR may leak into
spurious failures in another MR. This would be an unacceptable impact
on Mesa developers working on other drivers.

#### DUTs must be able to run docker

The Mesa gitlab-runner based test architecture is built around docker,
so that we can cache the debian package installation and CTS build
step across multiple test runs. Since the images are large and change
approximately weekly, the DUTs also need to be running some script to
prune stale docker images periodically in order to not run out of disk
space as we rev those containers (perhaps [this
script](https://gitlab.com/gitlab-org/gitlab-runner/issues/2980#note_169233611)).
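
If you don't want to pull in that script, even a blunt daily cron job
that drops images unused for a couple of weeks will keep the disk from
filling up; the retention period here is just an example:

```
#!/bin/sh
# e.g. /etc/cron.daily/docker-prune
docker image prune --all --force --filter "until=336h"
```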

Note that docker doesn't allow containers to be stored on NFS, and
doesn't allow multiple docker daemons to interact with the same
network block device, so you will probably need some sort of physical
storage on your DUTs.

#### DUTs must be public

By including your device in .gitlab-ci.yml, you're effectively letting
anyone on the internet run code on your device. docker containers may
provide some limited protection, but how much you trust that and what
you do to mitigate hostile access is up to you.

#### DUTs must expose the dri device nodes to the containers.

Obviously, to get access to the HW, we need to pass the render node
through. This is done by adding `devices = ["/dev/dri"]` to the
`runners.docker` section of /etc/gitlab-runner/config.toml.
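
That is, something like:

```
[runners.docker]
  devices = ["/dev/dri"]
```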

### HW CI farm expectations

To make sure that testing of one vendor's drivers doesn't block
unrelated work by other vendors, we require that a given driver's test
farm produces a spurious failure no more than once a week. If every
driver had CI and failed once a week, we would be seeing someone's
code getting blocked on a spurious failure daily, which is an
unacceptable cost to the project.

Additionally, the test farm needs to be able to provide a short enough
turnaround time that people can regularly use the "Merge when pipeline
succeeds" button successfully (until we get
[marge-bot](https://github.com/smarkets/marge-bot) in place on
freedesktop.org). As a result, we require that the test farm be able
to handle a whole pipeline's worth of jobs in less than 5 minutes (for
comparison, the build stage takes about 10 minutes, assuming you can
get all your jobs scheduled on the shared runners in time).

If a test farm is short on the HW to provide these guarantees, consider
dropping tests to reduce runtime.
`VK-GL-CTS/scripts/log/bottleneck_report.py` can help you find what
tests were slow in a `results.qpa` file. Or, you can have a job with
no `parallel` field set and:

```
variables:
  CI_NODE_INDEX: 1
  CI_NODE_TOTAL: 10
```

to just run 1/10th of the test list.

If a HW CI farm goes offline (network dies and all CI pipelines end up
stalled) or its runners are consistently spuriously failing (disk
full?), and the maintainer is not immediately available to fix the
issue, please push through an MR disabling that farm's jobs by adding
'.' to the front of the job names until the maintainer can bring
things back up. If this happens, the farm maintainer should provide a
report to mesa-dev@lists.freedesktop.org after the fact explaining
what happened and what the mitigation plan is for that failure next
time.
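
(In gitlab-ci a job whose name starts with '.' is treated as a hidden
template and never runs, so disabling a job is literally just renaming
it; the job and template names below are hypothetical.)

```
# was: arm64-a630-gles2:
.arm64-a630-gles2:
  extends: .arm64-test-gl
```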