broadcom/vc4: Use the RA callback to improve register selection's choices.
We simply pick r4 if it is available (anything else would force a MOV),
then round-robin through the accumulators (which avoids the physical
regfile's read-after-write delay slots), then round-robin through the
physical regfile.
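The selection order above can be sketched roughly as follows. This is
a minimal, self-contained illustration and not the driver's code: the
register numbering, struct select_state, select_reg() and the
bool-array availability set are all assumptions made for the example,
and the real callback works on the register allocator's own data
structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical register numbering, for this sketch only:
 *   0..3   accumulators r0-r3
 *   4      accumulator r4
 *   5..68  physical regfile entries (files A and B together)
 */
enum {
    ACC_COUNT  = 4,
    R4_INDEX   = ACC_COUNT,
    PHYS_INDEX = R4_INDEX + 1,
    PHYS_COUNT = 64,
    REG_COUNT  = PHYS_INDEX + PHYS_COUNT,
};

/* Round-robin cursors that persist across calls (hypothetical). */
struct select_state {
    unsigned next_acc;
    unsigned next_phys;
};

/* Return the chosen register index, or -1 if none is available.
 * avail[i] is true when register i may be used for this node.
 */
int
select_reg(struct select_state *st, const bool avail[REG_COUNT])
{
    assert(st != NULL && avail != NULL);

    /* 1. Always take r4 when we can. */
    if (avail[R4_INDEX])
        return R4_INDEX;

    /* 2. Otherwise prefer an accumulator, starting from the cursor. */
    for (unsigned i = 0; i < ACC_COUNT; i++) {
        unsigned acc = (st->next_acc + i) % ACC_COUNT;
        if (avail[acc]) {
            st->next_acc = acc + 1;
            return (int)acc;
        }
    }

    /* 3. Fall back to the physical regfile, also round-robin. */
    for (unsigned i = 0; i < PHYS_COUNT; i++) {
        unsigned off = (st->next_phys + i) % PHYS_COUNT;
        if (avail[PHYS_INDEX + off]) {
            st->next_phys = off + 1;
            return (int)(PHYS_INDEX + off);
        }
    }

    return -1;
}
```

Keeping the two cursors in a state struct, rather than restarting
every scan at index 0, is what spreads consecutive choices across
different registers instead of reusing the lowest free one each time.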
The effect on instruction count is pretty impressive:

total instructions in shared programs: 76563 -> 74526 (-2.66%)
instructions in affected programs:     66463 -> 64426 (-3.06%)

We could probably do better still with a small heuristic: if we're going
to choose a physical reg, and the other operands of the instructions
using this value as a src are all in the same physical regfile, then
use the other regfile.