Posted by joconner
on March 24, 2010 at 12:13 AM PDT
UTF-8 can help you store multilingual text, but sometimes your applications need help to display them correctly. Sometimes that help is just a hint to tell the application which charset to use when decoding the text. Can the BOM help you?
Yesterday a coworker complained that Excel wasn't displaying a CSV (comma separated values) file correctly. Our application allows the user to send a report via email. The application provides the report as a CSV file. Because the report can contain multilingual text, we've decided to encode it in UTF-8. Unfortunately, when users click on the file to display it, usually in Excel, all of the multi-byte UTF-8 characters display incorrectly.
The problem was immediately clear to me...Excel was opening the UTF-8 encoded files, but it was incorrectly identifying them as Latin-1 encoded files. In the absence of any charset identification, Excel must guess about a file's content encoding. In our environment, many host PCs use en_US locales with Latin-1 as the typical charset. Excel uses that default to read and display CSV files.
My solution to the problem was to use the byte-order marker (BOM) to identify the CSV file as a Unicode file. I instructed my colleague to prepend the
FEFF character to the file. The Java application that writes the file uses a
FileWriter that encodes to UTF-8 to create the CSV file. It was simple to just output the BOM as the first character in the file.
Now when our customers double-click on these files, Excel opens the file, notices the BOM, and automatically selects UTF-8 as the file's charset encoding. Now Excel displays the previously mangled characters correctly. And I was able to help resolve a problem with an easy solution.
Maybe you can give your applications a hint about plain text files as well. Writing the BOM to your file can help Unicode-enabled applications know how to decode your Unicode files.