/*
 * Copyright (C) 2019 Collabora, Ltd.
 *
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * the rights to use, copy, modify, merge, publish, distribute, sublicense,
 * and/or sell copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice (including the next
 * paragraph) shall be included in all copies or substantial portions of the
 * Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
 * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 *
 * Authors:
 *   Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
 */

/* Midgard has a small register file, so shaders with high register pressure
 * need to spill from the register file onto the stack. In addition to
 * spilling, it is desirable to allocate temporary arrays on the stack (for
 * instance because the register file does not support indirect access but the
 * stack does).
 *
 * The stack is located in "Thread Local Storage", sometimes abbreviated TLS in
 * the kernel source code. Thread local storage is allocated per-thread,
 * per-core, so threads executing concurrently do not interfere with each
 * other's stacks. On modern kernels, we may query
 * DRM_PANFROST_PARAM_THREAD_TLS_ALLOC for the number of threads per core we
 * must allocate for, and DRM_PANFROST_PARAM_SHADER_PRESENT for a bitmask of
 * shader cores (so take a popcount of that mask for the number of shader
 * cores). On older kernels that do not support querying these values,
 * following kbase, we may use the worst-case value of 256 threads for
 * THREAD_TLS_ALLOC, and the worst-case value of 16 cores for Midgard per the
 * "shader core count" column of the implementations table in
 * https://en.wikipedia.org/wiki/Mali_%28GPU% [citation needed]
 *
 * Each thread may have a stack allocated. If it is present, its size is a
 * power of two, and it is at least 16 bytes. The stack is allocated with the
 * shared memory descriptor used for all shaders within a frame (note that
 * they don't execute concurrently, so it's fine). So, consider the maximum
 * stack size used by any shader within a job, and then compute (where npot
 * denotes the next power of two):
 *
 *      bytes/thread = npot(max(size, 16))
 *      allocated = (# of bytes/thread) * (# of threads/core) * (# of cores)
 *
 * The size of Thread Local Storage is signaled to the GPU in a dedicated
 * log_stack_size field. Since stack sizes are powers of two, it follows that
 * log_stack_size is logarithmic in the stack size. Consider some sample
 * values:
 *
 *      stack size | log_stack_size
 *      ---------------------------
 *             256 | 4
 *             512 | 5
 *            1024 | 6
 *
 * Noting that log2(256) = 8 while the corresponding field value is 4, we
 * have the relation:
 *
 *      stack_size <= 2^(log_stack_size + 4)
 *
 * Given the constraints about powers of two and the minimum of 16, we thus
 * derive a formula for log_stack_size in terms of stack size (s), where s is
 * positive:
 *
 *      log_stack_size = ceil(log2(max(s, 16))) - 4
 *
 * There are other valid characterisations of this formula, of course, but this
 * is computationally simple, so good enough for our purposes. If s=0, since
 * there is no spilling used whatsoever, we may set log_stack_size to 0 to
 * disable the stack.
 */
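
/* A minimal sketch of the thread-count and core-count selection described
 * above, assuming the two kernel queries have already been attempted
 * elsewhere. The struct tls_query and the sketch_* functions are hypothetical
 * names for illustration and are not part of this file; only the fallbacks of
 * 256 threads and 16 cores come from the comment above. */

```c
#include <stdbool.h>
#include <stdint.h>

/* Worst-case fallbacks named in the comment above. */
#define SKETCH_FALLBACK_THREADS_PER_CORE 256u
#define SKETCH_FALLBACK_CORE_COUNT 16u

/* Hypothetical container for the results of the two kernel queries. */
struct tls_query {
        bool have_thread_tls_alloc; /* did the THREAD_TLS_ALLOC query succeed? */
        uint32_t thread_tls_alloc;  /* threads per core to allocate for */
        bool have_shader_present;   /* did the SHADER_PRESENT query succeed? */
        uint64_t shader_present;    /* bitmask of shader cores */
};

/* Counts the set bits in a mask by clearing the lowest set bit each pass. */
unsigned
sketch_popcount64(uint64_t mask)
{
        unsigned count = 0;

        while (mask) {
                mask &= mask - 1;
                count++;
        }

        return count;
}

/* Threads per core: the queried value, or the worst case on older kernels. */
unsigned
sketch_threads_per_core(const struct tls_query *q)
{
        if (q->have_thread_tls_alloc)
                return q->thread_tls_alloc;
        else
                return SKETCH_FALLBACK_THREADS_PER_CORE;
}

/* Core count: popcount of the core mask, or the worst case on older kernels. */
unsigned
sketch_core_count(const struct tls_query *q)
{
        if (q->have_shader_present)
                return sketch_popcount64(q->shader_present);
        else
                return SKETCH_FALLBACK_CORE_COUNT;
}
```

/* For instance, under this sketch a SHADER_PRESENT mask of 0xF yields 4
 * cores, while a failed query yields the worst case of 16. */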

/* Computes log_stack_size = ceil(log2(max(s, 16))) - 4 */

87 unsigned

89 {

92 else

94 }
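
/* A minimal sketch of the formula log_stack_size = ceil(log2(max(s, 16))) - 4,
 * assuming a 32-bit stack size. The names sketch_log2_ceil and
 * sketch_stack_shift are hypothetical, and the loop-based logarithm is a
 * stand-in chosen to keep the example self-contained; it is not necessarily
 * the helper this file uses. */

```c
#include <stdint.h>

/* Returns ceil(log2(n)) for n >= 1, i.e. the smallest k with 2^k >= n.
 * The bound on k avoids shifting a 32-bit value by 32. */
unsigned
sketch_log2_ceil(uint32_t n)
{
        unsigned k = 0;

        while (k < 32 && (UINT32_C(1) << k) < n)
                k++;

        return k;
}

/* Maps a stack size in bytes to the log_stack_size field value. */
unsigned
sketch_stack_shift(uint32_t stack_size)
{
        /* No spilling at all: 0 disables the stack. */
        if (stack_size == 0)
                return 0;

        /* Clamp to the 16-byte minimum before taking the logarithm. */
        uint32_t clamped = stack_size < 16 ? 16 : stack_size;

        return sketch_log2_ceil(clamped) - 4;
}
```

/* With this sketch, stack sizes of 256, 512 and 1024 bytes map to 4, 5 and 6,
 * matching the sample values listed earlier, and any size from 1 to 16 bytes
 * maps to 0. */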

/* Computes the aligned stack size given the shift and thread count. The blob
 * reserves an extra page, and since this is hardware-internal, we do too. */

99 unsigned

104 {

109 }
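
/* A minimal sketch of the size computation described above: the bytes per
 * thread are recovered from the shift, multiplied by the threads per core and
 * the core count, and one extra page is added. The name
 * sketch_total_stack_size, the 64-bit return type and the 4096-byte page size
 * are assumptions for illustration; the actual function may apply further
 * adjustments that are not visible here. */

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size for the extra reserved page. */
#define SKETCH_PAGE_SIZE 4096u

uint64_t
sketch_total_stack_size(unsigned stack_shift,
                        unsigned threads_per_core,
                        unsigned core_count)
{
        /* Guard against shifts far outside any realistic stack size. */
        assert(stack_shift <= 27);

        /* Invert log_stack_size = log2(bytes/thread) - 4. */
        uint64_t bytes_per_thread = UINT64_C(1) << (stack_shift + 4);

        /* allocated = (bytes/thread) * (threads/core) * (cores) */
        uint64_t size = bytes_per_thread * threads_per_core * core_count;

        /* Reserve one extra page on top. */
        return size + SKETCH_PAGE_SIZE;
}
```

/* For example, a shift of 4 (256 bytes per thread) with the worst-case 256
 * threads per core and 16 cores gives 256 * 256 * 16 + 4096 = 1052672 bytes
 * under this sketch. */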