
Transform BIG XML to SQL

1 reply [Last post]
pucko
Joined: 2008-11-17
Points: 0

Hi!

I'm trying to parse some really big XML files (up to 3-4 GB) and transform them to SQL with XSLT using Xalan.

(I downloaded the Xalan-Java .jars from http://xml.apache.org/xalan-j/. This seems to give me some features, like tFactory.setAttribute("http://xml.apache.org/xalan/features/incremental", java.lang.Boolean.TRUE);, which don't work without them. Even though the API seems identical to the standard Java API, I don't really know what the difference is.)

But anyway, even though I use the tFactory.setAttribute("http://xml.apache.org/xalan/features/incremental") feature, which allows the transformer to generate output while the document is being parsed, I still run out of Java heap space.

I use a SAX parser with a StreamSource and StreamResult (and an XSLT file to generate the transformer).

Smaller XML files work just fine, but when they get up to 200-250 MB I run out of Java heap space.
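The setup described above (a transformer built from an XSLT file, fed a StreamSource and writing to a StreamResult, with the Xalan incremental attribute set) can be sketched roughly like this. Only the incremental attribute URI comes from the post; the class name, stylesheet, and sample data are made-up stand-ins:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class StreamTransform {

    // Stand-in stylesheet: emit one INSERT statement per <row> element.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' "
        + "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='row'>"
        + "INSERT INTO t VALUES ('<xsl:value-of select='.'/>');\n"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    public static String transformRows(String xml) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        try {
            // Xalan-specific: produce output while the input is still
            // being parsed. Other factories (e.g. the JDK's built-in
            // XSLTC) reject unknown attributes, so guard the call.
            factory.setAttribute(
                "http://xml.apache.org/xalan/features/incremental",
                Boolean.TRUE);
        } catch (IllegalArgumentException e) {
            System.err.println("incremental feature not supported by "
                + factory.getClass().getName());
        }
        Transformer t = factory.newTransformer(
            new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        // Stream in, stream out -- the caller builds no DOM itself.
        t.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(transformRows(
            "<rows><row>a</row><row>b</row></rows>"));
    }
}
```

Note that even with stream-based sources and results, the transformer itself may still build an internal tree (Xalan's DTM) of the input, which is the likely reason the heap fills up regardless of the source/result types used.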

Before, when I only used SAX components (SAXTransformer, SAXSource, SAXResult), I got the following exception after (more or less exactly) 5 minutes:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.apache.xml.dtm.ref.DTMDefaultBase.ensureSizeOfIndex(DTMDefaultBase.java:300)
at org.apache.xml.dtm.ref.DTMDefaultBase.indexNode(DTMDefaultBase.java:326)
at org.apache.xml.dtm.ref.sax2dtm.SAX2DTM.startElement(SAX2DTM.java:1885)
at org.apache.xalan.transformer.TransformerHandlerImpl.startElement(TransformerHandlerImpl.java:498)
at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source)
at org.apache.xerces.impl.dtd.XMLDTDValidator.emptyElement(Unknown Source)
at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at test.micke.xalan.Test.main(Test.java:68)

and the result file with the SQL code would always be 212.9 MB (if I used a specific XML file with a specific XSLT file; the important thing is that a specific XML with a specific XSLT would generate the same result every time, same size).

Now, when I use StreamSource and StreamResult, it runs for 3 minutes until I get an exception, but this time it's a different one:

file:/home/micke/Desktop/Taric-filer/xslt_ptB(test).xsl; Line #0; Column #0; org.apache.xml.utils.WrappedRuntimeException

(The file is the XSLT file I'm using. The Line # and Column # are always different; it can be Line #34; Column #68 or something else. This time they are empty, which is why it reads #0 and #0, so we know it's not something wrong with the XSLT file.)

But the output behaves the same: a specific XML with a specific XSLT generates exactly the same amount of output before the exception. So the XML and XSLT that previously generated 212.9 MB before the exception still generate 212.9 MB. So I figure it's still the same problem.

Does anyone know how to fix this?

Just ask and I'll post the code; I can even give you the XSLT and a link to the server with the XML files.

joehw
Joined: 2004-12-15
Points: 0

First of all, the XSLT implementation in the JDK is actually from Xalan, although not the latest version.

Looking at the code where the OutOfMemoryError was thrown, it appears it's trying to grow the cache that holds all the nodes. Of course, when you get an OutOfMemoryError, it could happen anywhere. But you may try increasing the maximum heap size, and hopefully that gets you over a certain threshold, e.g. no more new namespace/local names to cache.

On the other hand, the resulting file was always about the same size when you ran out of memory, and 3-4 GB is pretty big, so I'm a little suspicious about whether that incremental feature actually worked. But it's worth trying :)
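The heap-size suggestion above is applied on the JVM command line with the -Xmx flag. The class name below is taken from the stack trace in the original post; the 2048 MB value is just an example to adjust for your machine:

```shell
# Raise the maximum heap size (example value) before running the transform
java -Xmx2048m test.micke.xalan.Test
```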

Good luck,
Joe