
Parsing a file

7 replies [Last post]
cencio1980
Offline
Joined: 2008-01-31
Points: 0

Hi all,

I tried to parse a file and noticed that if there are spaces, line breaks or tabs between elements, the resulting DOM Document contains some empty Text nodes.

For example, I parse this:

SD0042
5

PC9008
2

(I lose the indentation in this forum.) The outer element ends up with 7 children (instead of 2) and each inner element with 5 children (instead of 2), and I'm going crazy trying to avoid the empty ones.

If I remove all spaces, tabs and line breaks, like this:
SD0....
the DOM Document is perfect, with no extra nodes.
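
To make the effect concrete, here is a minimal sketch that counts the root's children for an indented and a compact version of the same document. The element names (order, item, code, qty) are made up for illustration, since the forum stripped the original markup; only the values SD0042, 5, PC9008 and 2 come from the post above. It uses a plain non-validating parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class WhitespaceNodeCount {
    // Hypothetical sample: element names are assumptions, not the original schema.
    static final String INDENTED =
          "<order>\n"
        + "  <item>\n"
        + "    <code>SD0042</code>\n"
        + "    <qty>5</qty>\n"
        + "  </item>\n"
        + "  <item>\n"
        + "    <code>PC9008</code>\n"
        + "    <qty>2</qty>\n"
        + "  </item>\n"
        + "</order>";

    // Same document with every space and line break between elements removed.
    static final String COMPACT =
          "<order><item><code>SD0042</code><qty>5</qty></item>"
        + "<item><code>PC9008</code><qty>2</qty></item></order>";

    /** Parses the XML without validation and returns the number of child nodes of the root. */
    static int rootChildCount(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Indented: whitespace Text nodes sit before, between and after the two items.
        System.out.println("indented: " + rootChildCount(INDENTED)); // 5
        // Compact: only the two item elements remain.
        System.out.println("compact:  " + rootChildCount(COMPACT));  // 2
    }
}
```

The extra nodes are ordinary Text nodes holding the line breaks and indentation; without a DTD or schema the parser cannot know that they are insignificant.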

This is the code I use for parsing:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
factory.setAttribute(JAXP_SCHEMA_SOURCE, new File("/tmp/request.xsd"));
DocumentBuilder builder = factory.newDocumentBuilder();
builder.setErrorHandler(new MyErrorHandler());
Document document = builder.parse( req.getInputStream() );

I tried both JAXP 1.3 and 1.4.
Any help?

Message was edited by: cencio1980

joehw
Offline
Joined: 2004-12-15
Points: 0

Since you are using a validating parser, you may tell it to ignore whitespace by setting:
factory.setIgnoringElementContentWhitespace(true);

Refer to: http://java.sun.com/javase/6/docs/api/javax/xml/parsers/DocumentBuilderFactory.html#setIgnoringElementContentWhitespace(boolean)
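
As a rough illustration of how this flag interacts with validation, the sketch below uses an inline DTD instead of the XSD setup from the question, only to keep the example self-contained; it does not exercise the schema properties discussed in this thread. The element names (person, first, last) are assumptions for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class IgnoreWhitespaceWithDtd {
    // Hypothetical document: the DTD declares element-only content for <person>,
    // so whitespace between its children is "element content whitespace".
    static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE person [\n"
        + "  <!ELEMENT person (first, last)>\n"
        + "  <!ELEMENT first (#PCDATA)>\n"
        + "  <!ELEMENT last (#PCDATA)>\n"
        + "]>\n"
        + "<person>\n"
        + "  <first>Doofus</first>\n"
        + "  <last>McGee</last>\n"
        + "</person>";

    /** Parses XML with DTD validation and returns the child count of the root element. */
    static int childCount(boolean ignoreWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // DTD validation
            factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Turn validation errors into exceptions instead of console output.
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            Document doc = builder.parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("ignore=false: " + childCount(false)); // 5
        System.out.println("ignore=true:  " + childCount(true));  // 2
    }
}
```

The flag only has something to act on when the parser knows the content model, which is why it is tied to validation.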

Hope that helps.

Joe

cencio1980
Offline
Joined: 2008-01-31
Points: 0

Hi joehw,

I tried setting setIgnoringElementContentWhitespace to true, but nothing changes.

My parser still adds some extra nodes. Validation works perfectly and ignores spaces and line breaks.

Any ideas?

Thx,
Cencio

joehw
Offline
Joined: 2004-12-15
Points: 0

I created a test using your code and tried it on a simple xml file. It does work for me. I'm posting it here along with the sample xml file. Please try it with your files and see if it works for you.

Too bad this forum doesn't preserve file formatting. Like yours, my XML file was indented. With setIgnoringElementContentWhitespace set to false, the Person node had 6 children; with it set to true, 3 children.

XML:

Doofus

McGee

XSD:








Test:
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Entity;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.ErrorHandler;

import junit.framework.*;
import junit.textui.TestRunner;

/**
 * Test for the forum question: when a file is parsed with spaces, line breaks
 * or tabs between elements, the resulting DOM Document has some empty Text nodes.
 *
 * @author joehw@dev.java.net
 */
public class ID37658 extends TestCase {
    /** All output will use this encoding */
    static final String outputEncoding = "UTF-8";

    /** Output goes here */
    private PrintWriter out;

    /** Indent level */
    private int indent = 0;

    /** Indentation will be in multiples of basicIndent */
    private final String basicIndent = " ";

    /** Constants used for JAXP 1.2 */
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String W3C_XML_SCHEMA =
        "http://www.w3.org/2001/XMLSchema";
    static final String JAXP_SCHEMA_SOURCE =
        "http://java.sun.com/xml/jaxp/properties/schemaSource";

    public ID37658(String name) {
        super(name);
    }

    public void testParse() throws ParserConfigurationException, SAXException, IOException {
        boolean ignoreWhitespace = true;
        File xmlFile = new File(getClass().getResource("ID37658.xml").getFile());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true);
        factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
        factory.setAttribute(JAXP_SCHEMA_SOURCE, new File(getClass().getResource("ID37658.xsd").getFile()));
        factory.setIgnoringElementContentWhitespace(ignoreWhitespace);

        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new ErrorHandler() {
            public void error(SAXParseException e) throws SAXException {
                System.out.println("Error: " + e.getMessage());
            }

            public void fatalError(SAXParseException e) throws SAXException {
                System.out.println("Fatal error: " + e.getMessage());
            }

            public void warning(SAXParseException e) throws SAXException {
                System.out.println("Warning: " + e.getMessage());
            }
        });
        Document document = builder.parse(xmlFile);

        // Print out the DOM tree
        OutputStreamWriter outWriter =
            new OutputStreamWriter(System.out, outputEncoding);
        out = new PrintWriter(outWriter, true);
        echo(document);
    }

    public static void main(String[] args) {
        TestRunner.run(ID37658.class);
    }

    /**
     * Echo common attributes of a DOM2 Node and terminate output with an
     * EOL character.
     */
    private void printlnCommon(Node n) {
        out.print(" nodeName=\"" + n.getNodeName() + "\"");

        String val = n.getNamespaceURI();
        if (val != null) {
            out.print(" uri=\"" + val + "\"");
        }

        val = n.getPrefix();
        if (val != null) {
            out.print(" pre=\"" + val + "\"");
        }

        val = n.getLocalName();
        if (val != null) {
            out.print(" local=\"" + val + "\"");
        }

        val = n.getNodeValue();
        if (val != null) {
            out.print(" nodeValue=");
            if (val.trim().equals("")) {
                // Whitespace
                out.print("[WS]");
            } else {
                out.print("\"" + n.getNodeValue() + "\"");
            }
        }
        out.println();
    }

    /**
     * Indent to the current level in multiples of basicIndent
     */
    private void outputIndentation() {
        for (int i = 0; i < indent; i++) {
            out.print(basicIndent);
        }
    }

    /**
     * Recursive routine to print out DOM tree nodes
     */
    private void echo(Node n) {
        // Indent to the current level before printing anything
        outputIndentation();

        int type = n.getNodeType();
        switch (type) {
        case Node.ATTRIBUTE_NODE:
            out.print("ATTR:");
            printlnCommon(n);
            break;
        case Node.CDATA_SECTION_NODE:
            out.print("CDATA:");
            printlnCommon(n);
            break;
        case Node.COMMENT_NODE:
            out.print("COMM:");
            printlnCommon(n);
            break;
        case Node.DOCUMENT_FRAGMENT_NODE:
            out.print("DOC_FRAG:");
            printlnCommon(n);
            break;
        case Node.DOCUMENT_NODE:
            out.print("DOC:");
            printlnCommon(n);
            break;
        case Node.DOCUMENT_TYPE_NODE:
            out.print("DOC_TYPE:");
            printlnCommon(n);

            // Print entities if any
            NamedNodeMap nodeMap = ((DocumentType) n).getEntities();
            indent += 2;
            for (int i = 0; i < nodeMap.getLength(); i++) {
                Entity entity = (Entity) nodeMap.item(i);
                echo(entity);
            }
            indent -= 2;
            break;
        case Node.ELEMENT_NODE:
            out.print("ELEM:");
            printlnCommon(n);

            // Print attributes if any. Note: element attributes are not
            // children of ELEMENT_NODEs but are properties of their
            // associated ELEMENT_NODE. For this reason, they are printed
            // with 2x the indent level to indicate this.
            NamedNodeMap atts = n.getAttributes();
            indent += 2;
            for (int i = 0; i < atts.getLength(); i++) {
                Node att = atts.item(i);
                echo(att);
            }
            indent -= 2;
            break;
        case Node.ENTITY_NODE:
            out.print("ENT:");
            printlnCommon(n);
            break;
        case Node.ENTITY_REFERENCE_NODE:
            out.print("ENT_REF:");
            printlnCommon(n);
            break;
        case Node.NOTATION_NODE:
            out.print("NOTATION:");
            printlnCommon(n);
            break;
        case Node.PROCESSING_INSTRUCTION_NODE:
            out.print("PROC_INST:");
            printlnCommon(n);
            break;
        case Node.TEXT_NODE:
            out.print("TEXT:");
            printlnCommon(n);
            break;
        default:
            out.print("UNSUPPORTED NODE: " + type);
            printlnCommon(n);
            break;
        }

        // Print children if any
        indent++;
        out.print("Number of Children: " + n.getChildNodes().getLength());
        int count = 0;
        for (Node child = n.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            out.print("Child " + (++count));
            echo(child);
        }
        indent--;
    }
}

Message was edited by: joehw


cencio1980
Offline
Joined: 2008-01-31
Points: 0

Hmm... I do the same, but it seems that factory.setIgnoringElementContentWhitespace(true); is ignored for me...

Maybe some problem with the libs?

java version "1.5.0_11"
jaxp 1.4

The code you sent is incomplete.

I sent you a mail at joehw[at]dev.java.net.

Thanks for your help!
Lorenzo

joehw
Offline
Joined: 2004-12-15
Points: 0

Hi Lorenzo,

I saw your email. For the benefit of others who read this forum, I'll post my findings here.

I did some research and found this report, http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6564400, which states a regression in jdk6 from jdk5_11. Quoting the REPRODUCIBILITY section:

"Release Regression From : 5.0u11
The above release value was the last known release where this bug was not reproducible. Since then there has been a regression.
Posted Date : 2007-06-01 06:41:58.0"

Note that the bug was filed in June and fixed in July. So the bug exists in all jaxp 1.4 releases (e.g. 1.4.0, 1.4.1 and 1.4.2). Since you are using jaxp 1.4, you may be hitting this bug. I'm sorry I didn't realize this earlier; I was testing with the latest jaxp codebase.

Would it be feasible for you to use the latest build, e.g. the nightly build? The nightly-build contains recent fixes and passes all the SQE and TCK tests.

Thanks,
Joe

Message was edited by: joehw

cencio1980
Offline
Joined: 2008-01-31
Points: 0

Yes, that is the bug I found.

I tried the nightly build (jaxp-1_4-20080309/) with 1.5.0_11-b03 and also 1.6.0_02-b05, but I still hit it.

I can't tell whether it is fixed or not.

Anyway, thanks a lot for your help,
Lorenzo.

joehw
Offline
Joined: 2004-12-15
Points: 0

You're very welcome. I really hope it works for you, or that you have found another way around this issue (e.g. skipping empty text nodes).
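
For reference, the "skip empty text nodes" workaround could look roughly like the sketch below: after parsing, walk the tree and remove every Text node that contains only whitespace. The class and method names are made up for illustration, and the sample XML uses assumed element names. Note that this also empties elements whose entire content is whitespace, so it is only suitable when such whitespace carries no meaning.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class EmptyTextStripper {
    /** Recursively removes Text nodes that contain only whitespace. */
    static void stripWhitespaceText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the next sibling first, because removal detaches the child.
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().length() == 0) {
                node.removeChild(child);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                stripWhitespaceText(child);
            }
            child = next;
        }
    }

    /** Parses the XML without validation, strips whitespace-only text, returns the document. */
    static Document parseAndStrip(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            stripWhitespaceText(doc.getDocumentElement());
            return doc;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input; element names are assumptions.
        String xml = "<order>\n  <item>\n    <code>SD0042</code>\n    <qty>5</qty>\n  </item>\n</order>";
        Document doc = parseAndStrip(xml);
        Node root = doc.getDocumentElement();
        System.out.println("root children: " + root.getChildNodes().getLength());
        System.out.println("item children: " + root.getFirstChild().getChildNodes().getLength());
    }
}
```

Because the cleanup runs on the finished DOM, it does not depend on the parser honoring setIgnoringElementContentWhitespace.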

I will send my test files via email in case you still want to try it. Note that the test is a JUnit test, so you will need junit.jar on the classpath. To make sure the jaxp classes in the jars are used, you can prepend them to the boot class path, for example:
-Xbootclasspath/p:[pathtojaxpjars]/jaxp-api.jar:[pathtojaxpjars]/jaxp-ri.jar

You may also use the jdk endorsed mechanism for the purpose.
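
To check which parser implementation actually gets picked up, something like the following sketch may help. It prints the concrete DocumentBuilderFactory class and, when available, the jar it was loaded from; the class name WhichParser is made up for this example. Keep in mind that -Xbootclasspath/p and the endorsed mechanism are only available on older JDKs (they were removed in JDK 9).

```java
import java.security.CodeSource;
import javax.xml.parsers.DocumentBuilderFactory;

final class WhichParser {
    /** Describes the DocumentBuilderFactory implementation in use and where it came from. */
    static String describeFactory() {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Class<?> impl = factory.getClass();
        // The code source is typically null for classes loaded from the boot class path.
        CodeSource source = impl.getProtectionDomain().getCodeSource();
        String location = (source == null || source.getLocation() == null)
                ? "(no code source: boot class path or JDK built-in)"
                : source.getLocation().toString();
        return impl.getName() + " loaded from " + location;
    }

    public static void main(String[] args) {
        System.out.println(describeFactory());
    }
}
```

Running it once with and once without the extra JVM option shows whether the implementation class or its location changes.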

Thanks,
Joe