wrong char in inputstream when reading utf-8 website

No replies
senderj
Joined: 2008-09-24

I have a program that reads web pages and extracts the information I need. It works well with sites served as "charset=iso-8859-1". Recently I used it to read a site served as "charset=utf-8" and found that some of the UTF-8 characters were corrupted: one byte of a multi-byte character had become 0x3f.

For example, one table cell contains the UTF-8 character 0xe4bf9d, and it comes out as 0xe4bf3f. So far I have found that the bytes 0x9d, 0x81, 0x8f and 0x90 are all changed to 0x3f.

Here is my code:
URL url = new URL(site);
URLConnection conn = url.openConnection();
InputStream is = conn.getInputStream();
BufferedReader r = new BufferedReader(new InputStreamReader(is));
I then read char by char with r.read() to parse the HTML.

I tried adding a check like if (r.read() == 129) doSomething; but 129 (0x81) was never read, so the problem does not come from the rest of my parsing code. I also tried new InputStreamReader(is, "utf-8"), but the result was even worse: none of the characters were interpreted correctly. The problem only happens when reading a UTF-8 site; the program has worked without problems for many years on ASCII sites. If anybody knows what is wrong, please help.
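
The symptom can be reproduced without a network connection. The sketch below is only an illustration under one assumption that the post does not state: that the JVM's default charset, which new InputStreamReader(is) uses when no charset is given, is windows-1252. In that charset the bytes 0x81, 0x8d, 0x8f, 0x90 and 0x9d have no character assigned, so Java decodes them to U+FFFD and writes that back out as '?' (0x3f). The class name CharsetDemo and its helper methods are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    // Decode raw bytes with the given charset, then encode the text back
    // with the same charset, as a read-then-write program would do.
    static byte[] roundTrip(byte[] raw, Charset cs) {
        String decoded = new String(raw, cs);
        return decoded.getBytes(cs);
    }

    // Format bytes as lowercase hex, e.g. {0xE4, 0xBF} -> "e4bf".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The three UTF-8 bytes of the character mentioned in the post.
        byte[] raw = {(byte) 0xE4, (byte) 0xBF, (byte) 0x9D};

        // ASSUMPTION: stands in for the platform default charset.
        Charset assumedDefault = Charset.forName("windows-1252");

        // 0x9d is unassigned in windows-1252, so it does not survive.
        System.out.println(hex(roundTrip(raw, assumedDefault)));          // e4bf3f

        // Decoded as UTF-8, the same bytes survive the round trip.
        System.out.println(hex(roundTrip(raw, StandardCharsets.UTF_8)));  // e4bf9d

        // As UTF-8 the three bytes form ONE char (U+4FDD), not three.
        System.out.println(new String(raw, StandardCharsets.UTF_8).length()); // 1
    }
}
```

If this assumption holds, it would also explain why r.read() never returns 129: a Reader returns decoded chars, not raw bytes, and 0x81 would arrive as 0xFFFD. With an explicit "utf-8" reader, each multi-byte sequence arrives as a single char, so any parsing or output code that still expects one byte per character would need to be adjusted as well.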