radeonsi/gfx10: implement NGG culling for 4x wave32 subgroups