Posted by cayhorstmann
on April 10, 2012 at 9:50 PM PDT
I've been too busy to blog for quite some time, but today something happened that seemed strange enough to break my silence. A student came to me with a Java source file that the grading script rejected. We looked at it and couldn't figure out why. I unearthed the error message:
MergeSorter.java:1: error: illegal character: \65279
Huh? What's \65279? Why the backslash? I didn't even know what notation
that is. I looked at the file with Emacs hexl-mode and saw that the first
three bytes were hex EF BB BF. In all these years, I had never seen that,
but Google set me straight. It's the Unicode byte order mark or BOM. I
asked the student what editor he had used to produce this file. Sure
enough, it was Notepad. Of course. If I had the power to eradicate one
program from the face of the earth, it surely would be Notepad.
Just in case you haven't been down this particular rathole before,
here's a refresher on the BOM. At one point in time, Unicode fit into 16
bits, and it seemed attractive to encode it with fixed-width 16-bit
quantities. For example, an uppercase A is hexadecimal 0041, so you have
one byte of 00 and one byte of 41. Or do you? On a little-endian platform
such as Intel, it would be more convenient to have a byte of 41 followed
by a byte of 00. Rather than lamely settling on either little-endian or
big-endian encoding, Unicode gives a much more interesting choice. Your
file can start out with the byte order mark, hexadecimal FEFF. If it shows
up as FE FF when reading a byte at a time, the data is big-endian, and if
it shows up as FF FE, it's little-endian.
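Here is a minimal sketch of that check in Java. The class and method names are my own invention for illustration, and the sketch only looks at the two UTF-16 cases described above (it makes no attempt to recognize UTF-32 or UTF-8 marks).

```java
class ByteOrderSniffer {
    /** Guesses the UTF-16 byte order from the first two bytes of the data. */
    static String detect(byte[] data) {
        if (data.length >= 2) {
            // Mask with 0xFF because Java bytes are signed.
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "big-endian";
            if (b0 == 0xFF && b1 == 0xFE) return "little-endian";
        }
        return "unknown"; // no BOM: the reader has to be told the byte order
    }

    public static void main(String[] args) {
        // An uppercase A (0041) preceded by the BOM, in both byte orders.
        byte[] be = { (byte) 0xFE, (byte) 0xFF, 0x00, 0x41 };
        byte[] le = { (byte) 0xFF, (byte) 0xFE, 0x41, 0x00 };
        System.out.println(detect(be)); // prints big-endian
        System.out.println(detect(le)); // prints little-endian
    }
}
```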
But UTF-16 is so last millennium. Now Unicode has grown to 21 bits.
While one could theoretically encode it fixed-length with 3-byte or 4-byte
values, just about everyone uses the more economical UTF-8 instead. That's
a variable-length encoding. 7-bit ASCII is embedded as 0bbbbbbb, where
each b is a bit. Then we have a bunch of two-byte codes of the form
110bbbbb 10bbbbbb, followed by three-byte codes 1110bbbb 10bbbbbb
10bbbbbb, and so on. EF BB BF happens to be the three-byte encoding of the
BOM. Work it out for yourself as an exercise! And, by the way, the decimal
value is 65279.
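If you would rather let the computer do the exercise, here is a small sketch that packs a code point into the three-byte pattern 1110bbbb 10bbbbbb 10bbbbbb by hand. The class and method names are made up for this post, and the sketch deliberately handles only the three-byte range; it does not exclude surrogates as a real encoder would.

```java
class Utf8ThreeByte {
    /** Encodes a code point in U+0800..U+FFFF as three UTF-8 bytes. */
    static int[] encode(int codePoint) {
        if (codePoint < 0x800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not in the three-byte range");
        }
        return new int[] {
            0xE0 | (codePoint >> 12),         // 1110bbbb: top 4 bits
            0x80 | ((codePoint >> 6) & 0x3F), // 10bbbbbb: middle 6 bits
            0x80 | (codePoint & 0x3F)         // 10bbbbbb: low 6 bits
        };
    }

    /** Formats the encoded bytes as hex, separated by spaces. */
    static String hex(int codePoint) {
        int[] b = encode(codePoint);
        return String.format("%02X %02X %02X", b[0], b[1], b[2]);
    }

    public static void main(String[] args) {
        System.out.println(hex(0xFEFF)); // prints EF BB BF
        System.out.println(0xFEFF);      // prints 65279
    }
}
```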
But who needs a byte order mark for UTF-8? There are no two ways of
ordering the bytes. The first byte is always the one starting with
something other than 10, and the others always start with 10. Why would
Notepad put a BOM into a UTF-8 document? That's actual work. Usually,
Notepad is stupid, not evil. So I checked the Unicode spec. They say it's
perfectly ok to add a BOM in front of a file, and it might actually be
useful because it allows a guess that this is a UTF-8 encoded file. If you
open the file, knowing that it is UTF-8, you should ignore the BOM.
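To see why UTF-8 has no byte order problem, here is a small sketch that tells lead bytes from continuation bytes purely by their top two bits. The names are hypothetical, and the code assumes its input is well-formed UTF-8; it does no validation.

```java
class Utf8Bytes {
    /** A continuation byte has the bit pattern 10bbbbbb. */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /** Counts code points by counting the bytes that are not continuation bytes. */
    static int countCodePoints(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if (!isContinuation(b)) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The BOM (EF BB BF) followed by an uppercase A (41).
        byte[] data = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41 };
        System.out.println(countCodePoints(data)); // prints 2
    }
}
```

Since every byte announces its own role, a reader can never mistake the order of the bytes within a character.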
That's fair. So Java, which, as we all know, loves Unicode, will surely
do the right thing, read the BOM and ignore it in a file that's opened
with UTF-8 encoding. Umm, no. Check out this and this bug report.
The folks at Sun wrung their hands and wailed that fixing this bug
would break a whole bunch of "customer" tools. Which turned out to be the
Sun app server.
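You can watch the behavior yourself with a few lines of Java. This is only a sketch: stripBom is a helper I made up for illustration, not something in the Java library, and it is one possible workaround for your own code, not a fix for javac.

```java
import java.nio.charset.StandardCharsets;

class BomStripper {
    /** Hypothetical helper: drops a leading U+FEFF if there is one. */
    static String stripBom(String s) {
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        // EF BB BF followed by the ASCII text "class".
        byte[] raw = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                       'c', 'l', 'a', 's', 's' };
        String decoded = new String(raw, StandardCharsets.UTF_8);

        // The UTF-8 decoder hands the BOM back as an ordinary character.
        System.out.println((int) decoded.charAt(0));    // prints 65279
        System.out.println(decoded.length());           // prints 6
        System.out.println(stripBom(decoded).length()); // prints 5
    }
}
```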
Well, guess what. Not fixing the bug breaks javac, which
now rejects perfectly valid UTF-8 source files.
Why didn't I notice this earlier? I guess I have finally reached the
point where students configure Windows to use UTF-8 and not some archaic
Microsoft-specific 8-bit encoding. That's good. Now we just need
javac to read those UTF-8 files. If Notepad can, surely javac can too.