gallium: implement TGSI_OPCODE_DP2A, add sqrt to NRM3/NRM4