From: Tim Rowley
Date: Thu, 25 Feb 2016 00:28:13 +0000 (-0600)
Subject: gallium/docs - add OpenSWR documentation
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=d003be2a303edfe93cde756e56ce31608d51fe7c;p=mesa.git

gallium/docs - add OpenSWR documentation

Acked-by: Jose Fonseca
---

diff --git a/src/gallium/docs/source/drivers/openswr.rst b/src/gallium/docs/source/drivers/openswr.rst
new file mode 100644
index 00000000000..84aa51f5d80
--- /dev/null
+++ b/src/gallium/docs/source/drivers/openswr.rst
@@ -0,0 +1,21 @@
+OpenSWR
+=======
+
+The Gallium OpenSWR driver is a high performance, highly scalable
+software renderer targeted towards visualization workloads. For such
+geometry heavy workloads there is a considerable speedup over llvmpipe,
+which is to be expected as the geometry frontend of llvmpipe is single
+threaded.
+
+This rasterizer is x86 specific and requires AVX or AVX2. The driver
+fits into the gallium framework, and reuses gallivm for doing the TGSI
+to vectorized llvm-IR conversion of the shader kernels.
+
+.. toctree::
+   :glob:
+
+   openswr/usage
+   openswr/faq
+   openswr/profiling
+   openswr/knobs
+
diff --git a/src/gallium/docs/source/drivers/openswr/faq.rst b/src/gallium/docs/source/drivers/openswr/faq.rst
new file mode 100644
index 00000000000..596d77f3780
--- /dev/null
+++ b/src/gallium/docs/source/drivers/openswr/faq.rst
@@ -0,0 +1,141 @@
+FAQ
+===
+
+Why another software rasterizer?
+--------------------------------
+
+Good question, given there are already three (swrast, softpipe,
+llvmpipe) in the Mesa3D tree. Two important reasons for this:
+
+ * Architecture - given our focus on scientific visualization, our
+   workloads are much different than the typical game; we have heavy
+   vertex load and relatively simple shaders. In addition, the core
+   counts of machines we run on are much higher. These parameters led
+   to design decisions much different than llvmpipe.
+
+ * Historical - Intel had developed a high performance software
+   graphics stack for internal purposes. Later we adapted this
+   graphics stack for use in visualization and decided to move forward
+   with Mesa3D to provide a high quality API layer while at the same
+   time benefiting from the excellent performance the software
+   rasterizer gives us.
+
+What's the architecture?
+------------------------
+
+SWR is a tile based immediate mode renderer with a sort-free threading
+model which is arranged as a ring of queues. Each entry in the ring
+represents a draw context that contains all of the draw state and work
+queues. An API thread sets up each draw context and worker threads
+will execute both the frontend (vertex/geometry processing) and
+backend (fragment) work as required. The ring allows for backend
+threads to pull work in order. Large draws are split into chunks to
+allow vertex processing to happen in parallel, with the backend work
+pickup preserving draw ordering.
+
+Our pipeline uses just-in-time compiled code for the fetch shader that
+does vertex attribute gathering and AOS to SOA conversions, the vertex
+shader and fragment shaders, streamout, and fragment blending. SWR
+core also supports geometry and compute shaders but we haven't exposed
+them through our driver yet. The fetch shader, streamout, and blend are
+built internally to swr core using LLVM directly, while for the vertex
+and pixel shaders we reuse bits of llvmpipe from
+``gallium/auxiliary/gallivm`` to build the kernels, which we wrap
+differently than llvmpipe's ``auxiliary/draw`` code.
+
+What's the performance?
+-----------------------
+
+For the types of high-geometry workloads we're interested in, we are
+significantly faster than llvmpipe. This is to be expected, as
+llvmpipe only threads the fragment processing and not the geometry
+frontend. The performance advantage over llvmpipe roughly scales
+linearly with the number of cores available.
+
+While our current performance is quite good, we know there is more
+potential in this architecture. When we switched from a prototype
+OpenGL driver to Mesa we regressed performance severely, some due to
+interface issues that need tuning, some due to differences in shader code
+generation, and some due to conformance and feature additions to the
+core swr. We are looking to recover most of this performance.
+
+What's the conformance?
+-----------------------
+
+The major applications we are targeting are all based on the
+Visualization Toolkit (VTK), and as such our development efforts have
+been focused on making sure these work as well as possible. Our
+current code passes VTK's rendering tests with their new "OpenGL2"
+(really OpenGL 3.2) backend at 99%.
+
+piglit testing shows a much lower pass rate, roughly 80% at the time
+of writing. Core SWR undergoes rigorous unit testing and we are quite
+confident in the rasterizer, and understand the areas where it
+currently has issues (example: line rendering is done with triangles,
+so doesn't match the strict line rendering rules). The majority of
+the piglit failures are errors in our driver layer interfacing Mesa
+and SWR. Fixing these issues is one of our major future development
+goals.
+
+Why are you open sourcing this?
+-------------------------------
+
+ * Our customers prefer open source, and allowing them to simply
+   download the Mesa source and enable our driver makes life much
+   easier for them.
+
+ * The internal gallium APIs are not stable, so we'd like our driver
+   to be visible for changes.
+
+ * It's easier to work with the Mesa community when the source we're
+   working with can be used as reference.
+
+What are your development plans?
+--------------------------------
+
+ * Performance - see the performance section earlier for details.
+
+ * Conformance - see the conformance section earlier for details.
+
+ * Features - core SWR has a lot of functionality we have yet to
+   expose through our driver, such as MSAA, geometry shaders, compute
+   shaders, and tessellation.
+
+ * AVX512 support
+
+What is the licensing of the code?
+----------------------------------
+
+ * All code is under the normal Mesa MIT license.
+
+Will this work on AMD?
+----------------------
+
+ * If using an AMD processor with AVX or AVX2, it should work though
+   we don't have that hardware around to test. Patches if needed
+   would be welcome.
+
+Will this work on ARM, MIPS, POWER, or other non-x86 architectures?
+-------------------------------------------------------------------------
+
+ * Not without a lot of work. We make extensive use of AVX and AVX2
+   intrinsics in our code and the in-tree JIT creation. It is not the
+   intention for this codebase to support non-x86 architectures.
+
+What hardware do I need?
+------------------------
+
+ * Any x86 processor with at least AVX (introduced in the Intel
+   SandyBridge and AMD Bulldozer microarchitectures in 2011) will
+   work.
+
+ * You don't need a fire-breathing Xeon machine to work on SWR - we do
+   day-to-day development with laptops and desktop CPUs.
+
+Does one build work on both AVX and AVX2?
+-----------------------------------------
+
+Yes. The build system creates two shared libraries, ``libswrAVX.so`` and
+``libswrAVX2.so``, and ``swr_create_screen()`` loads the appropriate one at
+runtime.
+
diff --git a/src/gallium/docs/source/drivers/openswr/knobs.rst b/src/gallium/docs/source/drivers/openswr/knobs.rst
new file mode 100644
index 00000000000..06f228a2e92
--- /dev/null
+++ b/src/gallium/docs/source/drivers/openswr/knobs.rst
@@ -0,0 +1,114 @@
+Knobs
+=====
+
+OpenSWR has a number of environment variables which control its
+operation, in addition to the normal Mesa and gallium controls.
+
+.. envvar:: KNOB_ENABLE_ASSERT_DIALOGS (true)
+
+Use dialogs when asserts fire. Asserts are only enabled in debug builds.
+
+.. envvar:: KNOB_SINGLE_THREADED (false)
+
+If enabled, all rendering is performed on the API thread. This is useful mainly for debugging purposes.
+
+.. envvar:: KNOB_DUMP_SHADER_IR (false)
+
+Dumps shader LLVM IR at various stages of jit compilation.
+
+.. envvar:: KNOB_USE_GENERIC_STORETILE (false)
+
+Always use the generic function for performing StoreTile. Will be slightly slower than using the optimized (jitted) path.
+
+.. envvar:: KNOB_FAST_CLEAR (true)
+
+Replace 3D primitive execute with a SWRClearRT operation and defer clear execution to first backend op on hottile, or hottile store.
+
+.. envvar:: KNOB_MAX_NUMA_NODES (0)
+
+Maximum # of NUMA-nodes per system used for worker threads. 0 == ALL NUMA-nodes in the system; N == Use at most N NUMA-nodes for rendering.
+
+.. envvar:: KNOB_MAX_CORES_PER_NUMA_NODE (0)
+
+Maximum # of cores per NUMA-node used for worker threads. 0 == ALL non-API thread cores per NUMA-node; N == Use at most N cores per NUMA-node.
+
+.. envvar:: KNOB_MAX_THREADS_PER_CORE (1)
+
+Maximum # of (hyper)threads per physical core used for worker threads. 0 == ALL hyper-threads per core; N == Use at most N hyper-threads per physical core.
+
+.. envvar:: KNOB_MAX_WORKER_THREADS (0)
+
+Maximum worker threads to spawn. IMPORTANT: If this is non-zero, no worker threads will be bound to specific HW threads. They will all be "floating" SW threads. In this case, the above 3 KNOBS will be ignored.
+
+.. envvar:: KNOB_BUCKETS_START_FRAME (1200)
+
+Frame at which to start saving buckets data. NOTE: KNOB_ENABLE_RDTSC must be enabled in core/knobs.h for this to have an effect.
+
+.. envvar:: KNOB_BUCKETS_END_FRAME (1400)
+
+Frame at which to stop saving buckets data. NOTE: KNOB_ENABLE_RDTSC must be enabled in core/knobs.h for this to have an effect.
+
+.. envvar:: KNOB_WORKER_SPIN_LOOP_COUNT (5000)
+
+Number of spin-loop iterations worker threads will perform before going to sleep when waiting for work.
+
+.. envvar:: KNOB_MAX_DRAWS_IN_FLIGHT (160)
+
+Maximum number of draws outstanding before API thread blocks.
+
+.. envvar:: KNOB_MAX_PRIMS_PER_DRAW (2040)
+
+Maximum primitives in a single Draw(). Larger draws are split into smaller Draw calls. Should be a multiple of (3 * vectorWidth).
+
+.. envvar:: KNOB_MAX_TESS_PRIMS_PER_DRAW (16)
+
+Maximum primitives in a single Draw() with tessellation enabled. Larger draws are split into smaller Draw calls. Should be a multiple of (vectorWidth).
+
+.. envvar:: KNOB_MAX_FRAC_ODD_TESS_FACTOR (63.0f)
+
+(DEBUG) Maximum tessellation factor for fractional-odd partitioning.
+
+.. envvar:: KNOB_MAX_FRAC_EVEN_TESS_FACTOR (64.0f)
+
+(DEBUG) Maximum tessellation factor for fractional-even partitioning.
+
+.. envvar:: KNOB_MAX_INTEGER_TESS_FACTOR (64)
+
+(DEBUG) Maximum tessellation factor for integer partitioning.
+
+.. envvar:: KNOB_BUCKETS_ENABLE_THREADVIZ (false)
+
+Enable threadviz output.
+
+.. envvar:: KNOB_TOSS_DRAW (false)
+
+Disable per-draw/dispatch execution.
+
+.. envvar:: KNOB_TOSS_QUEUE_FE (false)
+
+Stop per-draw execution at worker FE. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
+
+.. envvar:: KNOB_TOSS_FETCH (false)
+
+Stop per-draw execution at vertex fetch. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
+
+.. envvar:: KNOB_TOSS_IA (false)
+
+Stop per-draw execution at input assembler. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
+
+.. envvar:: KNOB_TOSS_VS (false)
+
+Stop per-draw execution at vertex shader. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
+
+.. envvar:: KNOB_TOSS_SETUP_TRIS (false)
+
+Stop per-draw execution at primitive setup. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
+
+.. envvar:: KNOB_TOSS_BIN_TRIS (false)
+
+Stop per-draw execution at primitive binning. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
+
+.. envvar:: KNOB_TOSS_RS (false)
+
+Stop per-draw execution at rasterizer. NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h
+
diff --git a/src/gallium/docs/source/drivers/openswr/profiling.rst b/src/gallium/docs/source/drivers/openswr/profiling.rst
new file mode 100644
index 00000000000..357754c3506
--- /dev/null
+++ b/src/gallium/docs/source/drivers/openswr/profiling.rst
@@ -0,0 +1,67 @@
+Profiling
+=========
+
+OpenSWR contains built-in profiling which can be enabled
+at build time to provide insight into performance tuning.
+
+To enable this, uncomment the following line in ``rasterizer/core/knobs.h`` and rebuild: ::
+
+  //#define KNOB_ENABLE_RDTSC
+
+Running an application will result in a ``rdtsc.txt`` file being
+created in the current working directory. This file contains profile
+information captured between the ``KNOB_BUCKETS_START_FRAME`` and
+``KNOB_BUCKETS_END_FRAME`` (see knobs section).
+
+The resulting file will contain sections for each thread with a
+hierarchical breakdown of the time spent in the various operations.
+For example: ::
+
+  Thread 0 (API)
+    %Tot   %Par       Cycles      CPE  NumEvent  CPE2  NumEvent2 Bucket
+    0.00   0.00        28370     2837        10     0          0 APIClearRenderTarget
+    0.00  41.23        11698     1169        10     0          0 |-> APIDrawWakeAllThreads
+    0.00  18.34         5202      520        10     0          0 |-> APIGetDrawContext
+   98.72  98.72  12413773688    29957    414380     0          0 APIDraw
+    0.36   0.36     44689364      107    414380     0          0 |-> APIDrawWakeAllThreads
+   96.36  97.62  12117951562     9747   1243140     0          0 |-> APIGetDrawContext
+    0.00   0.00        19904      995        20     0          0 APIStoreTiles
+    0.00   7.88         1568       78        20     0          0 |-> APIDrawWakeAllThreads
+    0.00  25.28         5032      251        20     0          0 |-> APIGetDrawContext
+    1.28   1.28    161344902       64   2486370     0          0 APIGetDrawContext
+    0.00   0.00        50368     2518        20     0          0 APISync
+    0.00   2.70         1360       68        20     0          0 |-> APIDrawWakeAllThreads
+    0.00  65.27        32876     1643        20     0          0 |-> APIGetDrawContext
+
+
+  Thread 1 (WORKER)
+    %Tot   %Par       Cycles      CPE  NumEvent  CPE2  NumEvent2 Bucket
+   83.92  83.92  13198987522    96411    136902     0          0 FEProcessDraw
+   24.91  29.69   3918184840      167  23410158     0          0 |-> FEFetchShader
+   11.17  13.31   1756972646       75  23410158     0          0 |-> FEVertexShader
+    8.89  10.59   1397902996       59  23410161     0          0 |-> FEPAAssemble
+   19.06  22.71   2997794710      384   7803387     0          0 |-> FEClipTriangles
+   11.67  61.21   1834958176      235   7803387     0          0 |-> FEBinTriangles
+    0.00   0.00            0        0    187258     0          0 |-> FECullZeroAreaAndBackface
+    0.00   0.00            0        0  60051033     0          0 |-> FECullBetweenCenters
+    0.11   0.11     17217556  2869592         6     0          0 FEProcessStoreTiles
+   15.97  15.97   2511392576    73665     34092     0          0 WorkerWorkOnFifoBE
+   14.04  87.95   2208687340     9187    240408     0          0 |-> WorkerFoundWork
+    0.06   0.43      9390536    13263       708     0          0 |-> BELoadTiles
+    0.00   0.01       293020      182      1609     0          0 |-> BEClear
+   12.63  89.94   1986508990      949   2093014     0          0 |-> BERasterizeTriangle
+    2.37  18.75    372374596      177   2093014     0          0 |-> BETriangleSetup
+    0.42   3.35     66539016       31   2093014     0          0 |-> BEStepSetup
+    0.00   0.00            0        0     21766     0          0 |-> BETrivialReject
+    1.05   8.33    165410662       79   2071248     0          0 |-> BERasterizePartial
+    6.06  48.02    953847796     1260    756783     0          0 |-> BEPixelBackend
+    0.20   3.30     31521202       41    756783     0          0 |-> BESetup
+    0.16   2.69     25624304       33    756783     0          0 |-> BEBarycentric
+    0.18   2.92     27884986       36    756783     0          0 |-> BEEarlyDepthTest
+    0.19   3.20     30564174       41    744058     0          0 |-> BEPixelShader
+    0.26   4.30     41058646       55    744058     0          0 |-> BEOutputMerger
+    1.27  20.94    199750822       32   6054264     0          0 |-> BEEndTile
+    0.33   2.34     51758160    23687      2185     0          0 |-> BEStoreTiles
+    0.20  60.22     31169500    28807      1082     0          0 |-> B8G8R8A8_UNORM
+    0.00   0.00       302752   302752         1     0          0 WorkerWaitForThreadEvent
+
diff --git a/src/gallium/docs/source/drivers/openswr/usage.rst b/src/gallium/docs/source/drivers/openswr/usage.rst
new file mode 100644
index 00000000000..e55b4211a54
--- /dev/null
+++ b/src/gallium/docs/source/drivers/openswr/usage.rst
@@ -0,0 +1,38 @@
+Usage
+=====
+
+Requirements
+^^^^^^^^^^^^
+
+* An x86 processor with AVX or AVX2
+* LLVM version 3.6 or later
+
+Building
+^^^^^^^^
+
+To build with GNU automake, select building the swr driver at
+configure time, for example: ::
+
+  configure --with-gallium-drivers=swrast,swr
+
+Using
+^^^^^
+
+On Linux, building will create a drop-in alternative for libGL.so into::
+
+  lib/gallium/libGL.so
+
+or::
+
+  build/foo/gallium/targets/libgl-xlib/libGL.so
+
+To use it set the LD_LIBRARY_PATH environment variable accordingly.
+
+**IMPORTANT:** Mesa will use llvmpipe or softpipe as the default software renderer. To select the OpenSWR driver, set the GALLIUM_DRIVER environment variable appropriately: ::
+
+  GALLIUM_DRIVER=swr
+
+To verify OpenSWR is being used, check to see if a message like the following is printed when the application is started: ::
+
+  SWR detected AVX2
+
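The two environment variables from the usage section can be combined in a small shell helper. This is only a sketch: the function name ``run_with_swr`` and the ``MESA_LIB_DIR`` variable are made up for illustration and are not part of Mesa, and the default path assumes the ``lib/gallium`` location mentioned in the usage section, relative to the current directory.

```shell
# Hypothetical helper, not part of Mesa: run a command with OpenSWR selected.
# MESA_LIB_DIR is an assumed variable naming the directory that holds the
# freshly built libGL.so; it defaults to ./lib/gallium.
run_with_swr() {
    swr_lib_dir="${MESA_LIB_DIR:-$PWD/lib/gallium}"
    # Prepend the Mesa build to the library search path and select the swr
    # driver in the environment of the command being launched.
    LD_LIBRARY_PATH="$swr_lib_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
    GALLIUM_DRIVER=swr \
        "$@"
}
```

For instance, ``run_with_swr glxinfo`` (assuming ``glxinfo`` is installed) would start the program against the new ``libGL.so``; a message like ``SWR detected AVX2`` at startup would then indicate that the driver was picked up.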