Any known issues with 4-byte UTF-8 characters and JAX-WS?

chriscorbell
Joined: 2007-07-19

joconner
Joined: 2003-06-17

Converting your UTF-8 to a Unicode code point value, I get U+267CC, definitely in the supplementary area. It is a completely valid Unicode character, supported nicely in Java SE 5 and higher. What version of Java are you using? Maybe you are using an older version of the Java platform, one that doesn't quite grok the character. Can you try a slightly less ambitious character, perhaps one at or below U+FFFF? Let's see how your app works then, and we'll re-evaluate the problem.
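
For example (a minimal sketch; nothing here depends on JAX-WS), you can build the character straight from its code point and see how Java SE 5 represents it:

    public class SupplementaryTest {
        public static void main(String[] args) throws Exception {
            // U+267CC needs a surrogate pair (two chars) in a Java String
            String s = new String(Character.toChars(0x267CC));
            System.out.println(s.length());                      // 2 UTF-16 code units
            System.out.println(s.codePointCount(0, s.length())); // 1 code point
            System.out.println(s.getBytes("UTF-8").length);      // 4 UTF-8 bytes
        }
    }

If that prints as expected but your web service call still corrupts the character, the problem is more likely in the transport encoding than in the character itself.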

Regards,
John O'Conner
http://joconner.com

justinlindh
Joined: 2007-03-23

I'm also having some problems that sound similar to this.

I'm submitting data from a web page, and when I use Japanese characters I'm seeing the following received in the debugger:
\u6F22\u5B57 (for: 漢字)

This is UTF-16, but I'm sending this data to a .NET application that is expecting UTF-8. How can I do this conversion? I've tried:
String utf8Body = new String(request.getBody().getBytes(), "UTF-8");

But this only serves to mangle the String once received. I'm new to internationalization issues, so any help is greatly appreciated.

joconner
Joined: 2003-06-17

Most UTF-8 encoded Japanese characters will encode in 3 bytes. For example, the character 漢 (KAN) encodes as the three UTF-8 code units E6 BC A2. If you have Japanese characters that encode as four UTF-8 code units, you must be using characters above the Basic Multilingual Plane (supplementary characters). Maybe you are encoding the characters incorrectly? Are you really using supplementary characters?
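
To answer your conversion question directly: a Java String is always UTF-16 internally, so there is nothing to convert until you serialize it. Here is a minimal sketch of the correct direction (the literal stands in for however you obtain the decoded String):

    import java.io.UnsupportedEncodingException;

    public class Utf8Out {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String body = "\u6F22\u5B57";          // 漢字, held as UTF-16 inside the JVM
            byte[] utf8 = body.getBytes("UTF-8");  // encode to UTF-8 on the way out
            for (byte b : utf8) {
                System.out.printf("%02X ", b & 0xFF); // prints E6 BC A2 E5 AD 97
            }
        }
    }

The line you tried does the opposite: getBytes() with no argument encodes using the platform default charset, and new String(..., "UTF-8") then decodes those bytes as UTF-8, which is what mangles the text. Send the byte[] from getBytes("UTF-8") to the .NET side instead.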

Regards,
John O'Conner