x86: improve match_template()'s diagnostics
At the example of
extractps $0, %xmm0, %xmm0
insertps $0, %xmm0, %eax
(both having respectively the same mistake of using the wrong kind of
destination register) it is easy to see that current behavior is far
from ideal: The former results in "unsupported instruction" for 32-bit
code simply because the 2nd template we have is a Cpu64 one. Instead we
should aim at emitting the "best" possible error, which will typically
be the one where we passed the largest number of checks. Generalize the
original "specific_error" approach by making it apply to the entire
matching loop, utilizing that line numbers increase as we pass further
checks.