From: Alyssa Rosenzweig
Date: Sat, 7 Dec 2019 21:42:01 +0000 (-0500)
Subject: panfrost: Describe thread local storage sizing rules
X-Git-Url: https://git.libre-soc.org/?a=commitdiff_plain;h=8b290bb13d6806556f77fc3ff605ce9efe7a6b40;p=mesa.git

panfrost: Describe thread local storage sizing rules

Deeply nested powers-of-two, basically :-)

Signed-off-by: Alyssa Rosenzweig
---

diff --git a/src/panfrost/Makefile.sources b/src/panfrost/Makefile.sources
index 8a9bfc308a7..2dd2571b036 100644
--- a/src/panfrost/Makefile.sources
+++ b/src/panfrost/Makefile.sources
@@ -17,7 +17,8 @@ bifrost_FILES := \
 encoder_FILES := \
         encoder/pan_encoder.h \
         encoder/pan_invocation.c \
-        encoder/pan_tiler.c
+        encoder/pan_tiler.c \
+        encoder/pan_scratch.c
 
 midgard_FILES := \
         midgard/compiler.h \
diff --git a/src/panfrost/encoder/meson.build b/src/panfrost/encoder/meson.build
index 007785769af..310772d59c5 100644
--- a/src/panfrost/encoder/meson.build
+++ b/src/panfrost/encoder/meson.build
@@ -24,6 +24,7 @@ libpanfrost_encoder_files = files(
 
   'pan_invocation.c',
   'pan_tiler.c',
+  'pan_scratch.c',
 )
 
 libpanfrost_encoder = static_library(
diff --git a/src/panfrost/encoder/pan_scratch.c b/src/panfrost/encoder/pan_scratch.c
new file mode 100644
index 00000000000..4a0561c7383
--- /dev/null
+++ b/src/panfrost/encoder/pan_scratch.c
@@ -0,0 +1,79 @@
+/*
+ * Copyright (C) 2019 Collabora, Ltd.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Authors:
+ *   Alyssa Rosenzweig
+ */
+
+#include "util/u_math.h"
+#include "pan_encoder.h"
+
+/* Midgard has a small register file, so shaders with high register pressure
+ * need to spill from the register file onto the stack. In addition to
+ * spilling, it is desirable to allocate temporary arrays on the stack (for
+ * instance because the register file does not support indirect access but the
+ * stack does).
+ *
+ * The stack is located in "Thread Local Storage", sometimes abbreviated TLS in
+ * the kernel source code. Thread local storage is allocated per-thread,
+ * per-core, so threads executing concurrently do not interfere with each
+ * other's stacks.
+ * On modern kernels, we may query
+ * DRM_PANFROST_PARAM_THREAD_TLS_ALLOC for the number of threads per core we
+ * must allocate for, and DRM_PANFROST_PARAM_SHADER_PRESENT for a bitmask of
+ * shader cores (so take a popcount of that mask for the number of shader
+ * cores). On older kernels that do not support querying these values,
+ * following kbase, we may use the worst-case value of 1024 threads for
+ * THREAD_TLS_ALLOC, and the worst-case value of 16 cores for Midgard per the
+ * "shader core count" column of the implementations table in
+ * https://en.wikipedia.org/wiki/Mali_%28GPU% [citation needed]
+ *
+ * Within a particular thread, a stack may be allocated. If it is present, its
+ * size is a power-of-two, and it is at least 256 bytes. Stack is allocated
+ * with the framebuffer descriptor used for all shaders within a frame (note
+ * that they don't execute concurrently so it's fine). So, consider the maximum
+ * stack size used by any shader within a job, and then compute (where npot
+ * denotes the next power of two):
+ *
+ *      allocated = npot(max(size, 256)) * (# of threads/core) * (# of cores)
+ *
+ * The size of Thread Local Storage is signaled to the GPU in a dedicated
+ * log_stack_size field. Since stack sizes are powers of two, it follows that
+ * log_stack_size is logarithmic in the stack size. Consider some sample
+ * values:
+ *
+ *      stack size | log_stack_size
+ *      ---------------------------
+ *             256 | 4
+ *             512 | 5
+ *            1024 | 6
+ *
+ * Noting that log2(256) = 8 while the encoded value for 256 is 4, we have
+ * the relation:
+ *
+ *      stack_size <= 2^(log_stack_size + 4)
+ *
+ * Given the constraints about powers-of-two and the minimum of 256, we thus
+ * derive a formula for log_stack_size in terms of stack size (s):
+ *
+ *      log_stack_size = ceil(log2(max(s, 256))) - 4
+ *
+ * There are other valid characterisations of this formula, of course, but this
+ * is computationally simple, so good enough for our purposes.
+ */
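The two formulas in the comment above can be sketched in standalone C as follows. This is only an illustration under stated assumptions, not the code this patch adds: the names ceil_log2, popcount64, sketch_log_stack_size and sketch_total_stack_bytes are hypothetical, the helpers are written out by hand rather than taken from util/u_math.h, and returning 0 for a shader that needs no stack is an assumption of the sketch (the comment only says the stack may be absent).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: ceil(log2(x)) for 1 <= x <= 2^31. */
static unsigned
ceil_log2(uint32_t x)
{
        unsigned shift = 0;

        assert(x >= 1);

        /* Find the smallest shift such that 2^shift >= x */
        while (shift < 32 && ((uint64_t) 1 << shift) < x)
                shift++;

        return shift;
}

/* Hypothetical helper: number of set bits in a shader core bitmask. */
static unsigned
popcount64(uint64_t mask)
{
        unsigned count = 0;

        for (; mask; mask &= mask - 1)
                count++;

        return count;
}

/* log_stack_size = ceil(log2(max(s, 256))) - 4
 * Assumption of this sketch: a stack size of 0 means "no stack" and is
 * encoded as 0. */
static unsigned
sketch_log_stack_size(uint32_t stack_size)
{
        uint32_t clamped;

        if (stack_size == 0)
                return 0;

        clamped = stack_size < 256 ? 256 : stack_size;
        return ceil_log2(clamped) - 4;
}

/* allocated = npot(max(size, 256)) * (# of threads/core) * (# of cores)
 * The core count is taken as the popcount of the shader-present bitmask. */
static uint64_t
sketch_total_stack_bytes(uint32_t stack_size, uint32_t threads_per_core,
                         uint64_t shader_present_mask)
{
        uint64_t per_thread;

        if (stack_size == 0)
                return 0;

        /* Invert the encoding: per-thread size = 2^(log_stack_size + 4) */
        per_thread = (uint64_t) 1 << (sketch_log_stack_size(stack_size) + 4);

        return per_thread * threads_per_core * popcount64(shader_present_mask);
}
```

For instance, a 300-byte stack rounds up to 512 bytes per thread, so sketch_log_stack_size(300) yields 5. With the fallback values quoted in the comment (1024 threads per core, 16 cores, i.e. a mask of 0xFFFF), the total comes to 512 * 1024 * 16 = 8388608 bytes.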