aco: optimize packing of 16-bit subdword registers on GFX6/7