opcodes/arm: add disassembler styling for arm
This commit adds disassembler styling for the ARM architecture.
The ARM disassembler is driven by several instruction tables,
e.g. cde_opcodes, coprocessor_opcodes, neon_opcodes, etc
The type for elements in each table can vary, but they all have one
thing in common, a 'const char *assembler' field. This field
contains a string that describes the assembler syntax of the
instruction.
Embedded within that assembler syntax are various escape characters,
prefixed with a '%'. Here's an example of a very simple instruction
from the arm_opcodes table:
"pld\t%a"
The '%a' indicates a particular type of operand, the function
print_insn_arm processes the arm_opcodes table, and includes a switch
statement that handles the '%a' operand, and takes care of printing
the correct value for that instruction operand.
It is worth noting that there are many print_* functions, each
function handles a single *_opcodes table, and includes its own switch
statement for operand handling. As a result, every *_opcodes table
uses a different mapping for the operand escape sequences. This means
that '%a' might print an address for one *_opcodes table, but in a
different *_opcodes table '%a' might print a register operand.
Notice as well that in our example above, the instruction mnemonic
'pld' is embedded within the assembler string. Some instructions also
include comments within the assembler string, for example, also from
the arm_opcodes table:
"nop\t\t\t@ (mov r0, r0)"
here, everything after the '@' is a comment that is displayed at the
end of the instruction disassembly.
The next complexity is that the meaning of some escape sequences is
not necessarily fixed. Consider these two examples from arm_opcodes:
"ldrex%c\tr%12-15d, [%16-19R]"
"setpan\t#%9-9d"
Here, the '%d' escape is used with a bitfield modifier, '%12-15d' in
the first instruction, and '%9-9d' in the second instruction, but,
both of these are the '%d' escape.
However, in the first instruction, the '%d' is used to print a
register number, notice the 'r' immediately before the '%d'. In the
second instruction the '%d' is used to print an immediate, notice the
'#' just before the '%d'.
We have two problems here, first, the '%d' needs to know if it should
use register style or immediate style, and secondly, the 'r' and '#'
characters also need to be styled appropriately.
The final thing we must consider is that some escape codes result in
more than just a single operand being printed, for example, the '%q'
operand as used in arm_opcodes ends up calling arm_decode_shift, which
can print a register name, a shift type, and a shift amount, this
could end up using register, sub-mnemonic, and immediate styles, as
well as the text style for things like ',' between the different
parts.
I propose a three layer approach to adding styling:
(1) Basic state machine:
When we start printing an instruction we should maintain the idea
of a 'base_style'. Every character from the assembler string will
be printed using the base_style.
The base_style will start as mnemonic, as each instruction starts
with an instruction mnemonic. When we encounter the first '\t'
character, the base_style will change to text. When we encounter
the first '@' the base_style will change to comment_start.
This simple state machine ensures that for simple instructions the
basic parts, except for the operands themselves, will be printed in
the correct style.
(2) Simple operand styling:
For operands that only have a single meaning, or which expand to
multiple parts, all of which have a consistent meaning, then I
will simply update the operand printing code to print the operand
with the correct style. This will cover a large number of the
operands, and is the most consistent with how styling has been
added to previous architectures.
(3) New styling syntax in assembler strings:
For cases like the '%d' that I describe above, I propose adding a
new extension to the assembler syntax. This extension will allow
me to temporarily change the base_style. Operands like '%d', will
then print using the base_style rather than using a fixed style.
Here are the two examples from above that use '%d', updated with
the new syntax extension:
"ldrex%c\t%{R:r%12-15d%}, [%16-19R]"
"setpan\t%{I:#%9-9d%}"
The syntax has the general form '%{X:....%}' where the 'X'
character changes to indicate a different style. In the first
instruction I use '%{R:...%}' to change base_style to the register
style, and in the second '%{I:...%}' changes base_style to
immediate style.
Notice that the 'r' and '#' characters are included within the new
style group, this ensures that these characters are printed with
the correct style rather than as text.
The function decode_base_style maps from character to style. I've
included a character for each style for completeness, though only
a small number of styles are currently used.
I have updated arm-dis.c to the above scheme, and checked all of the
tests in gas/testsuite/gas/arm/, and the styling looks reasonable.
There are no regressions on the ARM gas/binutils/ld tests that I can
see, so I don't believe I've changed the output layout at all. There
were two binutils tests for which I needed to force the disassembler
styling off.
I can't guarantee that I've not missed some untested corners of the
disassembler, or that I might have just missed some incorrectly styled
output when reviewing the test results, but I don't believe I've
introduced any changes that could break the disassembler - the worst
should be some aspect is not styled correctly.