Validation and memory usage

7 replies [Last post]
prunge
Offline
Joined: 2004-05-06
Points: 0

When validating with the JAXP validation API, is the whole document loaded into memory and then validated? How do validators cope with very large (e.g. > 1 GB) XML files?

I'm using JDK 1.6.0_02, and I'm having memory issues validating very large XML files.

My input to the validator is a StreamSource built from an InputStream. The XML files are typically very large and contain very large elements (a single element can hold many megabytes of text; in fact, it is base64-encoded data).

I'm seeing OutOfMemoryErrors occurring when trying to validate, e.g.:
[pre]
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:515)
at java.lang.StringBuffer.append(StringBuffer.java:306)
at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.handleCharacters(XMLSchemaValidator.java:1566)
at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.characters(XMLSchemaValidator.java:738)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:461)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:807)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
at com.sun.org.apache.xerces.internal.jaxp.validation.StreamValidatorHelper.validate(StreamValidatorHelper.java:144)
at com.sun.org.apache.xerces.internal.jaxp.validation.ValidatorImpl.validate(ValidatorImpl.java:107)
at javax.xml.validation.Validator.validate(Validator.java:127)
at validationtester.ValidationTester.main(ValidationTester.java:45)
[/pre]

Some sample code:

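[pre]
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class ValidationTester
{
	public static void main(String... args)
	throws Exception
	{
		//File schemaFile = new File("one.xsd");
		URL schemaFile = ValidationTester.class.getResource("/one.xsd");

		// Stitch the document together from three streams so the test
		// itself never holds the large element body in memory.
		// The element tags in the two string literals below appear to have
		// been stripped when the post was rendered; only "one:name" survives.
		List<InputStream> streamList = new ArrayList<InputStream>();
		InputStream start = new ByteArrayInputStream("<?xml version=\"1.0\" encoding=\"UTF-8\"?>one:name".getBytes("UTF-8"));
		InputStream middle = new ConstantInputStream(100000000, (byte)'a');
		InputStream end = new ByteArrayInputStream("".getBytes("UTF-8"));

		streamList.add(start);
		streamList.add(middle);
		streamList.add(end);

		InputStream xmlIs = new SequenceInputStream(Collections.enumeration(streamList));
		Source source = new StreamSource(xmlIs);

		SchemaFactory sf =
			SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
		Schema schema = sf.newSchema(schemaFile);
		Validator validator = schema.newValidator();
		validator.validate(source);

		System.out.println("Validation successful");
	}
}
[/pre]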

where ConstantInputStream is an InputStream that repeats a single byte N times. In this case, it repeats the letter 'a' 100 million times to build a very large element without the test itself using much memory.
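The ConstantInputStream class itself was not posted. A minimal sketch of what it might look like, assuming only the constructor signature used in the sample code (everything else here is a guess at its shape):

```java
import java.util.Arrays;
import java.io.InputStream;

// Hypothetical reconstruction: emits the byte 'value' exactly 'count' times
// without ever allocating a buffer of that size.
class ConstantInputStream extends InputStream {
    private long remaining;    // bytes still to be produced
    private final byte value;  // the byte that is repeated

    ConstantInputStream(long count, byte value) {
        this.remaining = count;
        this.value = value;
    }

    @Override
    public int read() {
        if (remaining <= 0) {
            return -1; // end of stream
        }
        remaining--;
        return value & 0xFF;
    }

    @Override
    public int read(byte[] buf, int off, int len) {
        if (len == 0) {
            return 0;
        }
        if (remaining <= 0) {
            return -1; // end of stream
        }
        // Fill only as much of the caller's buffer as is still owed.
        int n = (int) Math.min(len, remaining);
        Arrays.fill(buf, off, off + n, value);
        remaining -= n;
        return n;
    }
}
```

Overriding the bulk read is not strictly required, but it avoids 100 million single-byte calls when the parser reads in blocks.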

Has anyone had any experience with validating very large files? Are there some Validator properties or features that can be set to reduce memory usage?

bvesely
Offline
Joined: 2010-04-07
Points: 0

Try increasing the maximum heap size, for example with -Xmx256m.
I still have a problem with the schema taking a long time to load. For a 180 KB file, it takes over four minutes to load using JAXP. Has anybody found a way to speed up the load time?

jesusjaviers
Offline
Joined: 2009-01-27
Points: 0

I tried in my environment (Solaris 10, Java 1.6) this validator:

https://msv.dev.java.net/

I ran it from the command line, but the documentation says it can also be used as an API. For a 1 GB XML file, it took around 4 minutes to validate against the schema and used less than 300 MB of memory (I think it uses SAX).

Hope it helps,

Javier

prunge
Offline
Joined: 2004-05-06
Points: 0

Thanks, I'll take a look at it (sorry for late reply).

anki30
Offline
Joined: 2010-02-18
Points: 0

Hi,

I am facing the same issue. The relevant part of my XSD looks like this:

And the corresponding XML:

The value is a huge block of text; when I saved it to a text file, the file was 140 MB.

Parsing and importing the elements threw an OutOfMemoryError:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3209)
at java.lang.String.<init>(String.java:216)
at java.lang.StringBuffer.toString(StringBuffer.java:585)
at org.apache.xerces.impl.dv.xs.XSSimpleTypeDecl.normalize(Unknown Source)
at org.apache.xerces.impl.dv.xs.XSSimpleTypeDecl.getActualValue(Unknown Source)
at org.apache.xerces.impl.dv.xs.XSSimpleTypeDecl.validate(Unknown Source)
at org.apache.xerces.impl.xs.XMLSchemaValidator.elementLocallyValidComplexType(Unknown Source)
at org.apache.xerces.impl.xs.XMLSchemaValidator.elementLocallyValidType(Unknown Source)
at org.apache.xerces.impl.xs.XMLSchemaValidator.processElementContent(Unknown Source)
at org.apache.xerces.impl.xs.XMLSchemaValidator.handleEndElement(Unknown Source)
at org.apache.xerces.impl.xs.XMLSchemaValidator.endElement(Unknown Source)
at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)

Since the XML we are trying to parse is generated by an internal component, we turned off schema validation as a workaround. I am wondering whether this is a limitation we have to live with or whether there is something we could do.

I am also wondering whether the MSV validator could help me.

prunge
Offline
Joined: 2004-05-06
Points: 0

A year later and I haven't worked out a solution to this. Some workarounds we tossed around:

- Do piecewise validation of the portions of the XML that do not contain huge amounts of text (through SAX or StAX perhaps to only validate on known small elements and ignore the rest?)
- Find a different implementation of a validator that does not load all element text into memory at once (still don't know of one)
- Don't use big elements in the first place and use other methods, e.g. MIME attachments (technical limitations meant this was not an option for us)
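The first workaround in the list above could be sketched roughly as follows. This is only an illustration under assumptions, not something tested on real data: `SkipTextFilter`, `PiecewiseValidation`, and the `doc`/`id`/`blob` schema are hypothetical names. The filter sits between the SAX parser and a JAXP `ValidatorHandler` and drops the character data of the known-huge elements, so the validator never buffers it.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Set;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

class PiecewiseValidation {
    // Parses 'in' and validates it against 'schema', hiding the text of
    // every element whose local name is listed in 'skip'.
    static void validate(Schema schema, InputSource in, Set<String> skip) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        // With no ErrorHandler set, the handler throws on the first error.
        ValidatorHandler validator = schema.newValidatorHandler();

        SkipTextFilter filter = new SkipTextFilter(skip);
        filter.setParent(reader);
        filter.setContentHandler(validator);
        filter.parse(in);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical schema: a small 'id' plus a potentially huge 'blob'.
        String xsd = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
            + "<xs:element name='doc'><xs:complexType><xs:sequence>"
            + "<xs:element name='id' type='xs:int'/>"
            + "<xs:element name='blob' type='xs:base64Binary'/>"
            + "</xs:sequence></xs:complexType></xs:element></xs:schema>";
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(xsd)));
        String xml = "<doc><id>1</id><blob>QUJD</blob></doc>";
        validate(schema, new InputSource(new StringReader(xml)), Collections.singleton("blob"));
        System.out.println("Validation successful");
    }
}

// Hypothetical filter (not part of JAXP): forwards every SAX event except
// character data that occurs inside the elements named in 'skip'.
class SkipTextFilter extends XMLFilterImpl {
    private final Set<String> skip;
    private int depth = 0; // greater than 0 while inside a skipped element

    SkipTextFilter(Set<String> skip) {
        this.skip = skip;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (depth > 0 || skip.contains(local)) {
            depth++;
        }
        super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (depth > 0) {
            depth--;
        }
        super.endElement(uri, local, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth == 0) {
            super.characters(ch, start, length); // text outside skipped elements
        }
    }
}
```

The trade-off is that a skipped element is validated as if it were empty, so its declared type has to accept empty content, and the large value itself goes unchecked.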

I'm not sure why a validator needs to load entire element text into memory - thinking of all the constraints that are possible (patterns, length, whitespace, etc.) - this probably could all be done using a streaming approach, though I can't imagine it would be too easy to change Xerces to do this.
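To make the streaming idea concrete, here is a rough sketch of checking a single large element chunk by chunk as SAX delivers it. It is not how Xerces works, and it is much weaker than real schema validation: `StreamingTextCheck` is a hypothetical class, the character test only covers the base64 alphabet (padding position is not checked), and the length limit counts characters, which is not the same as the XSD length facets on base64Binary.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: checks the text of one element incrementally,
// keeping only a running count instead of the text itself.
class StreamingTextCheck extends DefaultHandler {
    private final String target;  // local name of the large element
    private final long maxLength; // illustrative limit, in characters
    private boolean inside;       // true while within the target element
    long length;                  // non-whitespace characters seen so far

    StreamingTextCheck(String target, long maxLength) {
        this.target = target;
        this.maxLength = maxLength;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (local.equals(target)) {
            inside = true;
            length = 0;
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (local.equals(target)) {
            inside = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int len) throws SAXException {
        if (!inside) {
            return;
        }
        for (int i = start; i < start + len; i++) {
            char c = ch[i];
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
                continue; // whitespace is ignored
            }
            boolean base64 = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9') || c == '+' || c == '/' || c == '=';
            if (!base64) {
                throw new SAXException("Illegal character in " + target + ": " + c);
            }
            if (++length > maxLength) {
                throw new SAXException(target + " exceeds " + maxLength + " characters");
            }
        }
    }

    // Convenience entry point for trying the handler on a string.
    static void run(String xml, StreamingTextCheck handler) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        spf.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
    }
}
```

Memory use stays constant regardless of the element size, because each chunk is inspected and then discarded.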

joehw
Offline
Joined: 2004-12-15
Points: 0

Looking at the stack traces, it runs out of memory because it is loading a huge text string into the StringBuffer. Loading 100 million 'a' characters would cause a memory spike of more than 500 MB. You could use the -Xmx option to increase the heap size and see whether that solves your problem.

Good luck
Joe
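A back-of-the-envelope version of the estimate in the reply above, as a small sketch. The assumptions are mine and may not match what Xerces actually does: 2 bytes per char, a buffer that starts at 16 chars, and the JDK 6 `AbstractStringBuilder` growth rule of `(capacity + 1) * 2`, with the old and new arrays both alive while `Arrays.copyOf` runs. `HeapEstimate` is a hypothetical name.

```java
// Hypothetical helper: estimates the peak bytes held by a growing
// StringBuffer at the moment of its largest reallocation.
class HeapEstimate {
    static long peakBytes(long totalChars, long initialCapacity) {
        long cap = initialCapacity;
        long peak = cap * 2; // 2 bytes per char
        while (cap < totalChars) {
            long grown = (cap + 1) * 2; // assumed growth rule
            // During the copy, the old and the new array coexist.
            peak = Math.max(peak, (cap + grown) * 2);
            cap = grown;
        }
        return peak;
    }

    public static void main(String[] args) {
        long bytes = peakBytes(100000000L, 16);
        System.out.println(bytes / (1024 * 1024) + " MiB at the last reallocation");
    }
}
```

Under these assumptions the final reallocation alone holds several hundred megabytes for 100 million characters, before counting any extra copy made by `toString()`, which is in line with the figure quoted above.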

borgel
Offline
Joined: 2008-02-18
Points: 0

I am in the same situation. Did you find a solution?