vc4: Don't pair up TLB scoreboard locking instructions early in QPU sched.
author Eric Anholt <eric@anholt.net>
Mon, 7 Nov 2016 18:52:32 +0000 (10:52 -0800)
committer Eric Anholt <eric@anholt.net>
Wed, 9 Nov 2016 23:33:56 +0000 (15:33 -0800)
commit e887341d3f4a3b13b2bf56b4a931afb78ca0526e
tree 44ed272174e196d6d56ace4f900aea06e5f03226
parent 695a2e2ffa5faa7b303ca819dcb2e2922dfce5ab

Jonas Pfeil noticed that we were putting passthrough tlb_z writes early in
the shader, despite QIR and QPU scheduling both trying to delay scoreboard
locking for as long as possible.

The problem was in QPU instruction pairing: at some point the passthrough
tlb_z would be the only candidate left to pair, so it would get paired,
even though the other half of the pair would have unblocked further
instructions and we could have paired tlb_z with something later in the
program.  Also, since passthrough z is just a mov, it pairs up really
easily.

The proper fix would probably be to flip the order of scheduling
instructions so we went from bottom to top (also relevant for branch delay
slot scheduling).

However, we can do a quick fix here: just don't schedule a TLB lock until
there's nothing but TLB left in the program.  This comes at a slight
instruction cost (an estimated 0.61% increase in cycle count in shader-db)
but is a major fragment shader parallelism win.
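
The rule described above can be illustrated with a minimal sketch.  This is
not the code in vc4_qpu_schedule.c: the struct, its fields, and the function
name sketch_choose() are all hypothetical, and real dependency tracking is
reduced to a single "ready" flag.  It only shows the idea of holding back
scoreboard-locking instructions until no other kind of instruction remains
unscheduled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified instruction record (not the driver's type). */
struct sketch_inst {
        const char *name;
        bool locks_scoreboard;  /* e.g. a TLB Z or color write */
        bool ready;             /* all dependencies satisfied */
        bool scheduled;         /* already emitted */
        int priority;           /* higher is preferred */
};

/*
 * Return the index of the next instruction to schedule, or -1 if no
 * instruction is eligible right now.
 *
 * Assumption for this sketch: a scoreboard-locking instruction is
 * eligible only once every unscheduled instruction in the program is
 * itself a scoreboard-locking instruction.
 */
static int
sketch_choose(const struct sketch_inst *insts, int count)
{
        assert(count >= 0);
        assert(insts != NULL || count == 0);

        /* First pass: is anything other than TLB lockers still pending? */
        bool only_lockers_left = true;
        for (int i = 0; i < count; i++) {
                if (!insts[i].scheduled && !insts[i].locks_scoreboard)
                        only_lockers_left = false;
        }

        /* Second pass: pick the highest-priority eligible instruction. */
        int best = -1;
        for (int i = 0; i < count; i++) {
                if (insts[i].scheduled || !insts[i].ready)
                        continue;

                /* Hold back the TLB lock while other work remains. */
                if (insts[i].locks_scoreboard && !only_lockers_left)
                        continue;

                if (best < 0 || insts[i].priority > insts[best].priority)
                        best = i;
        }

        return best;
}
```

In this sketch a return value of -1 while instructions are still pending
means the caller has nothing eligible to emit in that slot, even though a
ready TLB write may exist.  That is one way a rule like this can lengthen
the instruction stream slightly in exchange for taking the lock later.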

glmark2 results:
  texture:texture-filter=linear: +1.24481% +/- 0.626117% (n=15)
  bump:bump-render=height: 1.24991% +/- 0.154793% (n=136,133 -- screensaver
    outliers removed)
src/gallium/drivers/vc4/vc4_qpu_schedule.c