Posted by kohsuke
on January 11, 2008 at 7:10 PM PST
When writing a web app (or just viewing static HTML pages), sometimes you see "Â" where you expected whitespace. Why does this happen?
The "non-breaking space" character, which is Unicode code point 160 (written as U+00A0), a.k.a. "&nbsp;" in HTML, is often used to force browsers to render whitespace. It is needed because browsers collapse runs of ordinary "space" characters (U+0020).
When a non-breaking space character is sent to the browser, it is first encoded into a sequence of bytes for transmission. If the server chooses UTF-8 as the encoding, this character is converted into two bytes, "C2 A0" (if you are curious, check the UTF-8 encoding rules for yourself).
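To check the two-byte claim yourself, here is a minimal sketch in Java. The class and method names (NbspEncoding, utf8Hex) are made up for illustration; only the standard String.getBytes call does the real work.

```java
import java.nio.charset.StandardCharsets;

public class NbspEncoding {
    // Encodes the text as UTF-8 and formats each byte as an upper-case hex pair.
    static String utf8Hex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) hex.append(' ');
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00A0")); // non-breaking space: prints "C2 A0"
        System.out.println(utf8Hex(" "));      // ordinary space: prints "20"
    }
}
```

The ordinary space stays a single byte because U+0020 is in the ASCII range, while U+00A0 is above it and needs the two-byte form.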
Now, if the browser decodes this as UTF-8, everything is happy. But for various reasons it often fails to pick up the correct encoding and ends up using iso-8859-1 instead, as this is commonly the system default encoding, especially in the U.S.
When the byte sequence "C2 A0" is interpreted as iso-8859-1, it is decoded into two characters, "A circumflex" followed by "non-breaking space". That's why you see a strange "Â" (followed by a space, which you can't see).
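The mis-decoding is easy to reproduce. The following is a small sketch, again with illustrative names (NbspMojibake, codePoints), that decodes the same two bytes once with each charset and lists the resulting code points.

```java
import java.nio.charset.StandardCharsets;

public class NbspMojibake {
    // Lists the Unicode code points of the text, separated by spaces.
    static String codePoints(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (out.length() > 0) out.append(' ');
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The bytes that travel over the wire for one non-breaking space in UTF-8.
        byte[] wire = {(byte) 0xC2, (byte) 0xA0};
        String right = new String(wire, StandardCharsets.UTF_8);
        String wrong = new String(wire, StandardCharsets.ISO_8859_1);
        System.out.println(codePoints(right)); // prints "U+00A0"
        System.out.println(codePoints(wrong)); // prints "U+00C2 U+00A0"
    }
}
```

U+00C2 is the "A circumflex", so the wrong decoding yields exactly the visible character plus the invisible non-breaking space.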
When this happens, what you need to find out is why the browser is choosing the incorrect encoding. It's hard to list possible causes exhaustively, but the typical ones, along with a couple of remedies, are:
- You wrote a static HTML file in UTF-8, but your web server doesn't know that, so it doesn't send the HTTP Content-Type header with the proper charset. Thus the browser ends up guessing the encoding, and it guesses wrong.
- You wrote a web application, but it's not sending the Content-Type header. ServletResponse.setCharacterEncoding("UTF-8") is your friend.
- Putting a UTF-8 BOM at the beginning of the HTML document helps the browser detect the right encoding.
- An HTML meta tag can be used to reiterate the encoding, like <META http-equiv="Content-Type" content="text/html;charset=UTF-8">
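As a rough illustration of the server-side fix, the sketch below sends an explicit charset in the Content-Type header. It is not servlet code: to stay self-contained it uses the JDK's built-in com.sun.net.httpserver.HttpServer as a stand-in, and the class name, page body, and port are all hypothetical. In a servlet you would get the same effect with the setCharacterEncoding call mentioned above.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class Utf8PageServer {
    // Hypothetical page; \u00A0 is a non-breaking space between "a" and "b".
    static final String PAGE = "<html><body>a\u00A0b</body></html>";
    // The header value that tells the browser how to decode the body.
    static final String CONTENT_TYPE = "text/html; charset=UTF-8";

    // Encodes the page with the same charset the header declares.
    static byte[] body() {
        return PAGE.getBytes(StandardCharsets.UTF_8);
    }

    // Starts a server on the given port (0 picks any free port).
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/", exchange -> {
            byte[] bytes = body();
            exchange.getResponseHeaders().set("Content-Type", CONTENT_TYPE);
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(8080); // assumed port, change as needed
        System.out.println("Serving on port " + server.getAddress().getPort());
    }
}
```

The point of the design is that the declared charset and the charset actually used to produce the bytes come from the same place, so they cannot drift apart.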