radeonsi: reload PS inputs with direct indexing at each use (v2)
LLVM can CSE the interp intrinsics because they are marked with
LLVMReadNoneAttribute, so loads repeated at each use can be merged.
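
A minimal sketch of the idea in plain C, not the radeonsi code. It
assumes the GCC/Clang `const` function attribute as a C-level analogue
of LLVM's readnone; interp_input(), shade() and the plane equation are
hypothetical names and values made up for illustration.

```c
/* Hypothetical stand-in for a PS input interpolation. The result depends
 * only on the arguments and no memory is read or written, so the function
 * can be declared const (assumed here as the analogue of readnone). */
__attribute__((const))
static float interp_input(unsigned attr, unsigned chan, float i, float j)
{
    /* Made-up plane equation per attribute/channel, illustration only. */
    float p0 = (float)(attr * 4 + chan);
    return p0 + 0.5f * i + 0.25f * j;
}

/* Reloads input 0, channel 0 at each use instead of caching it once.
 * Since interp_input is const, the optimizer is allowed to merge the
 * two identical calls into one. */
static float shade(float i, float j)
{
    float a = interp_input(0, 0, i, j) * 2.0f;  /* first use */
    float b = interp_input(0, 0, i, j) + 1.0f;  /* second use, same args */
    return a + b;
}
```

Whether the two calls are actually merged depends on the compiler and
optimization level; the attribute only makes the merge legal.
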
26011 shaders in 14651 tests
Totals:
SGPRS: 1146340 -> 1132676 (-1.19 %)
VGPRS: 727371 -> 711730 (-2.15 %)
Spilled SGPRs: 2218 -> 2078 (-6.31 %)
Spilled VGPRs: 369 -> 369 (0.00 %)
Scratch VGPRs: 1344 -> 1344 (0.00 %) dwords per thread
Code Size: 35841268 -> 36009732 (0.47 %) bytes
LDS: 767 -> 767 (0.00 %) blocks
Max Waves: 222559 -> 224779 (1.00 %)
Wait states: 0 -> 0 (0.00 %)
v2: don't call load_input for fragment shaders in emit_declaration
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>