gdb/python: handle non utf-8 characters when source highlighting
This commit adds support for source files that contain non utf-8
characters when performing source styling using the Python pygments
package. This does not change the behaviour of GDB when the GNU
Source Highlight library is used.
For the following problem description, assume that either GDB is built
without GNU Source Highlight support, of that this has been disabled
using 'maintenance set gnu-source-highlight enabled off'.
The initial problem reported was that a source file containing non
utf-8 characters would cause GDB to print a Python exception, and then
display the source without styling, e.g.:
Python Exception <class 'UnicodeDecodeError'>: 'utf-8' codec can't decode byte 0xc0 in position 142: invalid start byte
/* Source code here, without styling... */
Further, as the user steps through different source files, each time
the problematic source file was evicted from the source cache, and
then later reloaded, the exception would be printed again.
Finally, this problem is only present when using Python 3, this issue
is not present for Python 2.
What makes this especially frustrating is that GDB can clearly print
the source file contents, they're right there... If we disable
styling completely, or make use of the GNU Source Highlight library,
then everything is fine. So why is there an error when we try to
apply styling using Python?
The problem is the use of PyString_FromString (which is an alias for
PyUnicode_FromString in Python 3), this function converts a C string
into a either a Unicode object (Py3) or a str object (Py2). For
Python 2 there is no unicode encoding performed during this function
call, but for Python 3 the input is assumed to be a uft-8 encoding
string for the purpose of the conversion. And here of course, is the
problem, if the source file contains non utf-8 characters, then it
should not be treated as utf-8, but that's what we do, and that's why
we get an error.
My first thought when looking at this was to spot when the
PyString_FromString call failed with a UnicodeDecodeError and silently
ignore the error. This would mean that GDB would print the source
without styling, but would also avoid the annoying exception message.
However, I also make use of `pygmentize`, a command line wrapper
around the Python pygments module, which I use to apply syntax
highlighting in the output of `less`. And this command line wrapper
is quite happy to syntax highlight my source file that contains non
utf-8 characters, so it feels like the problem should be solvable.
It turns out that inside the pygments module there is already support
for guessing the encoding of the incoming file content, if the
incoming content is not already a Unicode string. This is what
happens for Python 2 where the incoming content is of `str` type.
We could try and make GDB smarter when it comes to converting C
strings into Python Unicode objects; this would probably require us to
just try a couple of different encoding schemes rather than just
giving up after utf-8.
However, I figure, why bother? The pygments module already does this
for us, and the colorize API is not part of the documented external
API of GDB. So, why not just change the colorize API, instead of the
content being a Unicode string (for Python 3), lets just make the
content be a bytes object. The pygments module can then take
responsibility for guessing the encoding.
So, currently, the colorize API receives a unicode object, and returns
a unicode object. I propose that the colorize API receive a bytes
object, and return a bytes object.