
Entity Reference Conversion to Special Character

ipodee
Joined: 2008-10-23

Does anyone know how to disable the automatic conversion of entity references by SAXParser?

For example, when I feed a source XML file containing an entity reference like "&lt;" to SAXParser.parse(...), it is converted to the < character in the target XML.

I don't want this to happen. How can I prevent it?

Thanks,

Kevin

Message was edited by: ipodee

prunge
Joined: 2004-05-06

Hi Kevin,

If you are using Xerces, you can turn on the

http://apache.org/xml/features/scanner/notify-builtin-refs

feature (see http://xerces.apache.org/xerces2-j/features.html#scanner.notify-builtin-...) to be notified of these entities. The parsed characters are still delivered to the characters() method, but because they are now surrounded by startEntity() and endEntity() events, you can write additional logic to handle them.

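[code]
// Imports needed by this snippet and the handler class
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;

SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setNamespaceAware(true);
// Ask the scanner to report built-in entity references (&lt;, &gt;, ...)
spf.setFeature("http://apache.org/xml/features/scanner/notify-builtin-refs", true);

SAXParser parser = spf.newSAXParser();
XmlHandler handler = new XmlHandler();

// Register the handler as the LexicalHandler too; otherwise it is treated
// as a plain DefaultHandler and startEntity()/endEntity() are never called
parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);

// The input must be well-formed, so the text sits inside a root element
// (the element name is arbitrary)
String xml = "<root>This is a &lt;test&gt;</root>";
StringReader reader = new StringReader(xml);

parser.parse(new InputSource(reader), handler);
[/code]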

[code]
private static class XmlHandler extends DefaultHandler2
{
    @Override
    public void characters(char[] ch, int start, int length)
        throws SAXException
    {
        System.out.println("Characters: " + new String(ch, start, length));
    }

    @Override
    public void endEntity(String name) throws SAXException
    {
        System.out.println("end entity: " + name);
    }

    @Override
    public void startEntity(String name) throws SAXException
    {
        System.out.println("start entity: " + name);
    }
}
[/code]

And the output is:

Characters: This is a
start entity: lt
Characters: <
end entity: lt
Characters: test
start entity: gt
Characters: >
end entity: gt
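
As a rough sketch of the "additional logic" mentioned above, the handler below rebuilds the text with each built-in reference written back as &name; instead of the expanded character. The class name EntityPreservingHandler and its text() accessor are made up for illustration, and the sketch assumes the parser honours the notify-builtin-refs feature and delivers the events in the order shown in the output above. It does not re-escape anything else (for example, numeric character references).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler: collects character data, but writes entity
// references back in their unexpanded form (e.g. "&lt;" instead of "<").
class EntityPreservingHandler extends DefaultHandler2 {
    private final StringBuilder text = new StringBuilder();
    private int entityDepth = 0; // > 0 while between startEntity/endEntity

    // Skip the "[dtd]" pseudo-entity and parameter entities ("%name").
    private static boolean isGeneralEntity(String name) {
        return !name.startsWith("[") && !name.startsWith("%");
    }

    @Override
    public void startEntity(String name) throws SAXException {
        if (!isGeneralEntity(name)) return;
        if (entityDepth == 0) {
            // Re-emit only the outermost reference, not nested ones.
            text.append('&').append(name).append(';');
        }
        entityDepth++;
    }

    @Override
    public void endEntity(String name) throws SAXException {
        if (isGeneralEntity(name)) entityDepth--;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Drop the expanded replacement text reported inside an entity.
        if (entityDepth == 0) text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        // Xerces-specific feature; other parsers may reject it.
        spf.setFeature("http://apache.org/xml/features/scanner/notify-builtin-refs", true);
        SAXParser parser = spf.newSAXParser();

        EntityPreservingHandler handler = new EntityPreservingHandler();
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        parser.parse(new InputSource(new StringReader("<root>This is a &lt;test&gt;</root>")), handler);
        System.out.println(handler.text());
    }
}
```

The depth counter is there so that an entity whose replacement text contains further references is written out only once, as the outermost reference.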

Hope this helps,

Peter

(fixed message to have properly escaped entities)

Message was edited by: prunge

joehw
Joined: 2004-12-15

While it's impossible to turn that off, since entities like &lt; are built into XML, you can escape them by putting the text in a CDATA section.
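
To illustrate the CDATA approach, here is a minimal sketch that parses the same text twice, once as ordinary element content and once wrapped in a CDATA section. The class name CdataDemo, the helper parseText, and the root element name are invented for the example; it uses only the standard JAXP SAX API.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: shows that "&lt;" inside CDATA is not expanded.
class CdataDemo {
    // Parses the XML and returns all character data concatenated.
    static String parseText(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // characters() may be called several times; accumulate.
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Ordinary content: the references are expanded by the parser.
        System.out.println(parseText("<root>This is a &lt;test&gt;</root>"));
        // prints: This is a <test>

        // CDATA content: "&lt;" is plain text and is passed through as-is.
        System.out.println(parseText("<root><![CDATA[This is a &lt;test&gt;]]></root>"));
        // prints: This is a &lt;test&gt;
    }
}
```

Note that this requires changing the source document, so it only helps if you control how the input XML is produced.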

Joe