
Java 6 parsers not ignoring whitespace

5 replies
justin83
Offline
Joined: 2006-07-04
Points: 0

Hi,

I am loading an XML document using a DOM parser and an associated XML schema. I call the "setIgnoringElementContentWhitespace" method of the "DocumentBuilderFactory" instance with "true" before creating the "DocumentBuilder" instance. With the Xerces implementation that ships with Java 5, the document loads fine and the whitespace between element tags is ignored, but the implementation used in Java 6 still reports that whitespace as text nodes. Setting the feature "http://apache.org/xml/features/dom/include-ignorable-whitespace" to "true" directly on the factory instance does not help either. Are any additional steps needed to ignore whitespace in the Java 6 implementation? Here is the parser setup section of my code:

//-----------------------------------------------------------------------------------------------
private Document loadDocument() throws IOException
{
    //load the schema
    SchemaFactory jaxp = SchemaFactory.newInstance(W3C_XML_SCHEMA);
    InputStream schemaSource
        = TGImporter.class.getResourceAsStream("TGExporter.xsd");
    if(schemaSource == null)
    {
        throw new IOException("\"TGExporter.xsd\" not found");
    }
    Schema schema = null;
    try
    {
        schema = jaxp.newSchema(new StreamSource(schemaSource));
    }
    catch(SAXException e)
    {
        throw new IOException("Unable to parse \"TGExporter.xsd\": " +
            e.getMessage());
    }

    //configure the parser factory
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    factory.setIgnoringElementContentWhitespace(true);
    factory.setIgnoringComments(true);
    factory.setSchema(schema);

    try
    {
        DocumentBuilder b = factory.newDocumentBuilder();
        b.setErrorHandler(new ErrHandler());
        return b.parse(new File(filename));
    }
    catch(Exception e)
    {
        throw new IOException("Error parsing \"" + filename + "\": " +
            e.getMessage());
    }
}
//---------------------------------------------------------------------------------------------

Thanks,

Justin

ndw
Offline
Joined: 2004-03-04
Points: 0

I think I've got a fix that works, but I'm really concerned about the problem.

I accept that the behavior is different from JDK5, but I believe the JDK5 behavior is a bug.

There's nothing in XML Schema validity assessment or the resulting PSVI that identifies "ignorable whitespace" in the same way that DTD validation does. I fear that the behavior that users have grown to expect from JDK5 will not be interoperable.
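For contrast, here is a minimal sketch of the DTD case, where the parser does know which whitespace is ignorable. The class name, the inline document, and its DTD are made up for illustration; the sketch assumes the default JAXP parser bundled with the JDK and a document whose DTD declares element-only content for the root.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original poster's code.
class DtdWhitespaceDemo {
    // The internal DTD subset declares that <root> contains only <item>
    // elements, so whitespace between the items is "element content whitespace".
    static final String XML =
        "<?xml version=\"1.0\"?>\n" +
        "<!DOCTYPE root [\n" +
        "  <!ELEMENT root (item+)>\n" +
        "  <!ELEMENT item (#PCDATA)>\n" +
        "]>\n" +
        "<root>\n  <item>a</item>\n  <item>b</item>\n</root>";

    // Parses the document with DTD validation and returns the number of
    // child nodes directly under the document element.
    static int countRootChildren(boolean ignoreWhitespace) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // DTD validation
        factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(XML)));
        return doc.getDocumentElement().getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countRootChildren(false)); // whitespace text nodes kept
        System.out.println(countRootChildren(true));  // only the two <item> elements
    }
}
```

With the flag off, the whitespace between the items shows up as text nodes; with it on, only the two item elements remain. Whether the same flag has any effect under schema validation is exactly what differs between the JDK5 and JDK6 implementations discussed here.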

Hello rock (breaking users' existing code). Hello hard place (non-conformant, non-interoperable behavior).

justin83
Offline
Joined: 2006-07-04
Points: 0

I was looking at some of my previous posts, and I realized that I never posted my response to this. I found this in the XML chapter of "Core Java 2 : Volume II":


public static void removeWhitespaceNodes(Element e) {
    NodeList children = e.getChildNodes();
    // iterate backwards so removals do not shift the remaining indexes
    for (int i = children.getLength() - 1; i >= 0; i--) {
        Node child = children.item(i);
        if (child instanceof Text && ((Text) child).getData().trim().length() == 0) {
            e.removeChild(child);
        }
        else if (child instanceof Element) {
            removeWhitespaceNodes((Element) child);
        }
    }
}

Calling this on your document element will remove all whitespace-only text nodes in the DOM tree.
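As a usage sketch, the helper can be applied right after parsing. The class name, the parseAndStrip method, and the sample document below are made up for illustration; no schema or DTD is involved, so this works the same regardless of how the parser treats ignorable whitespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical demo class wrapping the helper shown above.
class WhitespaceStripDemo {
    // Recursively removes text nodes that contain only whitespace.
    static void removeWhitespaceNodes(Element e) {
        NodeList children = e.getChildNodes();
        // Iterate backwards: the NodeList is live, so removals shift later indexes.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child instanceof Text && ((Text) child).getData().trim().length() == 0) {
                e.removeChild(child);
            } else if (child instanceof Element) {
                removeWhitespaceNodes((Element) child);
            }
        }
    }

    // Parses the given XML string and strips whitespace-only text nodes
    // from the whole tree before returning the document element.
    static Element parseAndStrip(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();
        removeWhitespaceNodes(root);
        return root;
    }

    public static void main(String[] args) throws Exception {
        Element root = parseAndStrip(
            "<root>\n  <a>x</a>\n  <b>\n    <c/>\n  </b>\n</root>");
        // Only <a> and <b> remain as children of <root>.
        System.out.println(root.getChildNodes().getLength());
    }
}
```

Note that this strips every whitespace-only text node, including ones inside mixed content where the whitespace may be significant, so it is best suited to data-oriented documents.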

chellurpradeepkumar
Offline
Joined: 2009-05-07
Points: 0

Hi,

Thanks for posting the solution; it was quite handy.

Regards,
Pradeep.

joehw
Offline
Joined: 2004-12-15
Points: 0

For anyone interested in this issue, please refer to http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6564400. We are currently working on the issue.

joehw
Offline
Joined: 2004-12-15
Points: 0

Hi Justin,

Could you provide the xml file and schema you used? I'll try to reproduce the problem once I have your files.

Thanks.