## Mesa testing using gitlab-runner

The goal of the "test" stage of the .gitlab-ci.yml is to do pre-merge
testing of Mesa drivers on various platforms, so that we can ensure no
regressions are merged, as long as developers are merging code using
the "Merge when pipeline succeeds" button.

This document only covers the CI from .gitlab-ci.yml and this
directory. For other CI systems, see Intel's [Mesa
CI](https://gitlab.freedesktop.org/Mesa_CI) or panfrost's LAVA-based
CI (`src/gallium/drivers/panfrost/ci/`).

### Software architecture

For freedreno and llvmpipe CI, we're using gitlab-runner on the test
devices (DUTs), cached docker containers with VK-GL-CTS, and the
normal shared x86_64 runners to build the Mesa drivers to be run
inside of those containers on the DUTs.
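
As a rough sketch of that split (the job names, paths, and image
reference below are illustrative, not the actual definitions in
.gitlab-ci.yml), it looks something like:

```
# Hypothetical sketch of the build/test split; see .gitlab-ci.yml for
# the real job definitions.
build-mesa:
  stage: build
  # Runs on the normal shared x86_64 runners.
  script:
    - meson build/
    - ninja -C build/
  artifacts:
    paths:
      - build/   # built Mesa drivers, handed to the test jobs below

test-mesa-dut:
  stage: test
  # Cached container with VK-GL-CTS installed (illustrative image name).
  image: $CI_REGISTRY_IMAGE/debian-test:$DEBIAN_ARM64_TAG
  dependencies:
    - build-mesa
  script:
    - ./deqp-runner.sh
```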

The docker containers are rebuilt from the debian-install.sh script
when DEBIAN\_TAG is changed in .gitlab-ci.yml, and
debian-test-install.sh when DEBIAN\_ARM64\_TAG is changed in
.gitlab-ci.yml. The resulting images are around 500MB, and are
expected to change approximately weekly (though an individual
developer working on them may produce many more images while trying to
come up with a working MR!).
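
Bumping a tag is therefore just an edit to .gitlab-ci.yml. As a sketch
(the placement and values shown here are illustrative), that looks
roughly like:

```
variables:
  DEBIAN_TAG: "2019-11-22"        # rebuilds the container from debian-install.sh
  DEBIAN_ARM64_TAG: "2019-11-22"  # rebuilds the container from debian-test-install.sh
```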

gitlab-runner is a client that polls gitlab.freedesktop.org for
available jobs, with no inbound networking requirements. Jobs can
have tags, so we can have DUT-specific jobs that only run on runners
with that tag marked in the gitlab UI.
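
Concretely, a DUT-specific job just carries a tag that only that
device's registered runner advertises. The job and tag names below are
made up:

```
arm64-dut-test:
  tags:
    - my-dut-tag   # must match a tag on the DUT's gitlab-runner registration
  script:
    - ./deqp-runner.sh
```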

Since dEQP takes a long time to run, we mark the job as "parallel" at
some level, which spawns multiple jobs from one definition, and then
deqp-runner.sh takes the corresponding fraction of the test list for
that job.
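
In the job definition that is just the `parallel` keyword (the count
below is arbitrary); GitLab sets CI_NODE_INDEX and CI_NODE_TOTAL in
each spawned job, the same variables the example later in this
document sets by hand to pick a slice of the test list:

```
arm64-dut-test:
  parallel: 4   # spawns 4 copies of this job, CI_NODE_INDEX = 1..4
  script:
    - ./deqp-runner.sh
```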

To reduce dEQP runtime (or avoid tests with unreliable results), a
deqp-runner.sh invocation can provide a list of tests to skip. If
your driver is not yet conformant, you can pass a list of expected
failures, and the job will only fail on tests that aren't listed (look
at the job's log for which specific tests failed).
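
The exact interface is up to deqp-runner.sh; as a purely hypothetical
sketch (these variable and file names are invented for illustration,
check the script itself for what it actually reads), a job might wire
the lists up as:

```
arm64-dut-test:
  variables:
    # Hypothetical names -- not necessarily what deqp-runner.sh reads.
    DEQP_SKIPS: deqp-mydriver-skips.txt
    DEQP_EXPECTED_FAILS: deqp-mydriver-fails.txt
  script:
    - ./deqp-runner.sh
```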

### DUT requirements

#### DUTs must have a stable kernel and GPU reset.

If the system goes down during a test run, that job will eventually
time out and fail (default 1 hour). However, if the kernel can't
reliably reset the GPU on failure, bugs in one MR may leak into
spurious failures in another MR. This would be an unacceptable impact
on Mesa developers working on other drivers.

#### DUTs must be able to run docker

The Mesa gitlab-runner based test architecture is built around docker,
so that we can cache the debian package installation and CTS build
step across multiple test runs. Since the images are large and change
approximately weekly, the DUTs also need to be running some script to
prune stale docker images periodically in order to not run out of disk
space as we rev those containers (perhaps [this
script](https://gitlab.com/gitlab-org/gitlab-runner/issues/2980#note_169233611)).

Note that docker doesn't allow containers to be stored on NFS, and
doesn't allow multiple docker daemons to interact with the same
network block device, so you will probably need some sort of physical
storage on your DUTs.

#### DUTs must be public

By including your device in .gitlab-ci.yml, you're effectively letting
anyone on the internet run code on your device. docker containers may
provide some limited protection, but how much you trust that and what
you do to mitigate hostile access is up to you.

#### DUTs must expose the dri device nodes to the containers.

Obviously, to get access to the HW, we need to pass the render node
through. This is done by adding `devices = ["/dev/dri"]` to the
`runners.docker` section of /etc/gitlab-runner/config.toml.

### HW CI farm expectations

To make sure that testing of one vendor's drivers doesn't block
unrelated work by other vendors, we require that a given driver's test
farm produces a spurious failure no more than once a week. If every
driver had CI and failed once a week, we would be seeing someone's
code getting blocked on a spurious failure daily, which is an
unacceptable cost to the project.

Additionally, the test farm needs to be able to provide a short enough
turnaround time that people can regularly use the "Merge when pipeline
succeeds" button successfully (until we get
[marge-bot](https://github.com/smarkets/marge-bot) in place on
freedesktop.org). As a result, we require that the test farm be able
to handle a whole pipeline's worth of jobs in less than 5 minutes (for
comparison, the build stage takes about 10 minutes, assuming you can
get all your jobs scheduled on the shared runners in time).

If a test farm is short on the HW needed to provide these guarantees,
consider dropping tests to reduce runtime.
`VK-GL-CTS/scripts/log/bottleneck_report.py` can help you find which
tests were slow in a `results.qpa` file. Or, you can have a job with
no `parallel` field set and:

```
variables:
  CI_NODE_INDEX: 1
  CI_NODE_TOTAL: 10
```

to just run 1/10th of the test list.

If a HW CI farm goes offline (network dies and all CI pipelines end up
stalled) or its runners are consistently spuriously failing (disk
full?), and the maintainer is not immediately available to fix the
issue, please push through an MR disabling that farm's jobs by adding
'.' to the front of the job names until the maintainer can bring
things back up. If this happens, the farm maintainer should provide a
report to mesa-dev@lists.freedesktop.org after the fact explaining
what happened and what the mitigation plan is for that failure next
time.
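
In GitLab CI a leading '.' turns a job into a hidden (template) job
that never runs, so the disable is just a rename. The job name below
is made up:

```
# Before: the job runs on the vendor's farm.
arm64-vendor-test:
  tags:
    - vendor-farm-example
  script:
    - ./deqp-runner.sh

# After: the leading '.' hides the job, so the farm is skipped.
.arm64-vendor-test:
  tags:
    - vendor-farm-example
  script:
    - ./deqp-runner.sh
```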