
Compression of XML?

panderson_007

I am not sure if this was exactly the intent of Fast Infoset (FI), but I have been tasked with using it to compress XML in order to send large documents over a low-bandwidth connection. In doing so I have noticed some issues and would like to find out what I am doing wrong. Any assistance is greatly appreciated.

First, I do NOT get the same XML document when I create an FI document (using the SAXDocumentSerializer) and then recreate the XML document from the generated FI document. My original XML document starts as follows:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

After compressing and recreating the XML document, I get:

<?xml version="1.0" encoding="UTF-8"?>
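
For context, the round trip is wired up roughly as follows (a simplified sketch of one way to do it, not my exact code; file names are placeholders and error handling is omitted):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import com.sun.xml.fastinfoset.sax.SAXDocumentParser;
import com.sun.xml.fastinfoset.sax.SAXDocumentSerializer;

public class RoundTrip {
    public static void main(String[] args) throws Exception {
        // XML -> FI: drive the FI serializer with SAX events from a parse.
        SAXDocumentSerializer serializer = new SAXDocumentSerializer();
        serializer.setOutputStream(new FileOutputStream("doc.finf"));
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(serializer);
        reader.parse(new InputSource(new FileInputStream("doc.xml")));

        // FI -> XML: parse the FI document and feed the SAX events to an
        // identity transformer that writes the XML back out.
        SAXDocumentParser fiParser = new SAXDocumentParser();
        TransformerHandler handler = ((SAXTransformerFactory)
                SAXTransformerFactory.newInstance()).newTransformerHandler();
        handler.setResult(new StreamResult(new FileOutputStream("roundtrip.xml")));
        fiParser.setContentHandler(handler);
        fiParser.parse(new InputSource(new FileInputStream("doc.finf")));
    }
}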

Second, I have an XML document that is formatted in a friendly, human-readable way (i.e., with CR/LF and indenting included). How do I get the whitespace to be compressed?

Third, the CR/LF characters get removed by the way I am reading the XML file. But if I include the CR/LF when I read in the XML file, I get an OutOfMemoryError: Java heap space. This error does NOT occur when the CR/LF are removed. The XML file is 2 MB and has almost 73,000 lines. Is there a way to correct this without increasing the heap space?

Fourth, would specifying the schema aid in the compression of the XML document? If so, what is the best way to do this?

Thank you,
Phillip

panderson_007

Thank you for the quick response.

I will post the issue and try another serializer.

I can't post the XML because it is for a DoD contract and I was asked not to post the information. I will try to create a similarly formatted XML document and get that to you.

I changed my implementation to strip out the WS/CR/LF when it reads in the XML file.

The data that I am compressing could contain both base64 and floating-point data. Could you elaborate on how to specify a schema when doing the compression?

sandoz

> Thank you for the quick response.
>
> I will post the issue and try another serializer.
>

Thanks!

> I can't post the XML because it is for a DoD contract
> and I was asked not to post the information. I will
> try to create a similarly formatted XML document and
> get that to you.
>

OK.

> I changed my implementation to strip out the WS/CR/LF
> when it reads in the XML file.
>
> The data that I am compressing could contain both
> base64 and floating-point data. Could you elaborate
> on how to specify a schema when doing the compression?

Do you already have a schema? Or do you want to generate one from a set of documents?

There is an example in the fi/samples directory:

fi/samples/src/samples/typed/ConvertLexicalValues.java

which shows how, if you have a schema, to obtain the data types associated with elements/attributes and then convert an XML document to an FI document in which the lexical values are converted to binary form. In this case base64 and floats are converted, but it is easy to extend to other types.

There is a sample schema and document, so you should be able to execute the following:

java -cp <...> ConvertLexicalValues \
data/schema/Content.xsd data/content.xml

Note that you will require the FastInfosetUtilities.jar (which is JDK 5.0 dependent).

This example is only dependent on the schema to obtain the mapping of attributes/elements to data types. If you do not have a schema but know which attributes/elements will contain lexical values that can be converted, it should be easy to generate the mappings by other means (or even by creating a 'fake' schema, if you do not want to write any code and would rather reuse the utility code!).
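
For example, a minimal 'fake' schema might look something like the following (the element names here are hypothetical; only the element-to-type mappings matter for the conversion):

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <!-- Map the elements that carry base64 and floating point
         content to the corresponding built-in types. -->
    <xs:element name="payload" type="xs:base64Binary"/>
    <xs:element name="reading" type="xs:float"/>
</xs:schema>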

Hope this helps,
Paul.

sandoz

One: This is a bug in the preservation of the standalone property. The FI encoding can support this, but the SAX serializer and parser do not currently. Would you like to log an issue (a bug) for this?

Two: By default the serializer will only try to compress (index) character content that is less than seven characters long. If the indenting WS/CR/LF is longer than this, you can increase the limit by invoking the following method:

FastInfosetSerializer.setCharacterContentChunkSizeLimit(int)

with a suitable value. Note that increasing this value may cause the serializer/deserializer to use more memory.
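
For example (a sketch; the value 32 is arbitrary and should be chosen to cover your longest runs of indentation):

import com.sun.xml.fastinfoset.sax.SAXDocumentSerializer;

SAXDocumentSerializer serializer = new SAXDocumentSerializer();
// Consider character content chunks of up to 32 characters for
// indexing, so repeated indentation strings can be compressed.
serializer.setCharacterContentChunkSizeLimit(32);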

Three: Would it be possible for me to get access to the XML document? Otherwise I cannot tell why the FI SAX parsing is using too much memory when WS/CR/LF is present. Note that the parsers/serializers have been tested with large documents (that contain WS/CR/LF) and I have not seen this error.

Four: How many unique elements/attributes are there in the document? My experience is that utilizing a schema with FI is more effective on smaller documents, where the ratio of unique markup to content is high; for larger documents the ratio is often lower.
However, there is one area that may make a difference, and that is the encoding of content in binary form. Does the document contain base64 or floating-point data? If so, it may be possible to encode such data in a more efficient binary form.

Are you also comparing with GZIP? This can be an effective solution as well, but the compression has a CPU cost. BTW, you can always GZIP the FI document too, and since the compression operates on less data it should be faster than GZIPping the XML document.
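
For example (a rough sketch; the file name is a placeholder):

import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;
import com.sun.xml.fastinfoset.sax.SAXDocumentSerializer;

OutputStream out = new GZIPOutputStream(new FileOutputStream("doc.finf.gz"));
SAXDocumentSerializer serializer = new SAXDocumentSerializer();
serializer.setOutputStream(out);
// ... feed the SAX events to the serializer, then:
out.close(); // closing flushes the stream and writes the GZIP trailer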

Also, if the WS/CR/LF is just there for human readability and can easily be regenerated when decoding, it might be best to remove it when serializing. Who reads a 2 MB XML document?? :-)
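
If you want to strip it when serializing, a simple SAX filter placed between the XML parser and the FI serializer would do it. A rough sketch (it drops any text chunk that is entirely whitespace, so it is only safe where such whitespace is indentation rather than mixed content):

import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

public class StripWhitespaceFilter extends XMLFilterImpl {
    public StripWhitespaceFilter(XMLReader parent) {
        super(parent);
    }

    public void characters(char[] ch, int start, int length)
            throws SAXException {
        for (int i = start; i < start + length; i++) {
            if (!Character.isWhitespace(ch[i])) {
                // The chunk contains real content; pass it through.
                super.characters(ch, start, length);
                return;
            }
        }
        // Whitespace-only chunk: drop it.
    }
}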

Paul.