
HTTP Error 503 in xpath.evaluate() (in DOMParser.parse() in fact)

3 replies [Last post]
masternag
Offline
Joined: 2006-05-21
Points: 0

I want to parse an XHTML page with an XPath expression, and I get the following error:

java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1305)
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:677)
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1315)
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1282)
at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:283)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1193)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1090)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:1003)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:807)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:107)
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:225)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:283)
at com.sun.org.apache.xpath.internal.jaxp.XPathImpl.evaluate(XPathImpl.java:468)
at com.sun.org.apache.xpath.internal.jaxp.XPathImpl.evaluate(XPathImpl.java:515)

Here is the code I'm using:
String result = "";
try {
    XPath xpath = XPathFactory.newInstance().newXPath();
    String expression = "//div[@class=\"box\"]/img[@src]";
    InputSource inputSource = new InputSource("test.xhtml");
    result = xpath.evaluate(expression, inputSource);
} catch (XPathExpressionException ex) {
    Logger.getLogger(MazeWebPage.class.getName()).log(Level.SEVERE, null, ex);
}

The file "test.xhtml" looks like this:

<div class="text">Something uninteresting</div>
<div class="box"><img src="value-i-want-to-retrieve"/></div>

According to this page (http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic), the problem seems to be that the parser automatically tries to download the DTD "xhtml1-strict.dtd" from the W3C website because of the !DOCTYPE declaration in the page.

Unfortunately, I haven't been able to find a way around this. Does anyone have a solution to this problem? Thanks in advance.

nigelren
Offline
Joined: 2010-01-18
Points: 0

The solution I found (although it uses JDOM) is as follows:

SAXBuilder builder = new SAXBuilder();
builder.setEntityResolver(new EntityResolver()); // the key line: use the custom resolver shown below
builder.setIgnoringElementContentWhitespace(true);
StringReader reader = new StringReader(this.content);
this.JDOMContent = builder.build(reader);

With the EntityResolver class being:

import java.io.IOException;

import org.apache.log4j.Logger; // assumption: getLogger(Class) and error() suggest log4j 1.x
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class EntityResolver implements org.xml.sax.EntityResolver {
    private static final Logger LOG = Logger.getLogger(EntityResolver.class);

    // Local directory that holds the downloaded DTD files.
    private static final String BASE = "file:///XArchitect/config/w3c/";

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        String newFile = null;
        if (systemId.endsWith(".dtd")) {
            // Keep only the file name, e.g. "xhtml1-strict.dtd".
            newFile = systemId.substring(systemId.lastIndexOf('/') + 1);
        } else if (!systemId.startsWith("file:")) {
            // Not a DTD and not a local file: log it so it can be investigated.
            LOG.error("Resolve:" + systemId);
        }
        // Returning null tells the parser to fall back to its default resolution.
        return newFile != null ? new InputSource(BASE + newFile) : null;
    }
}

What happens is that, as the document is parsed, any request for an external entity is passed to the EntityResolver class, which can fetch the entity from wherever it wants. So I downloaded the DTD and put it in a directory on my own machine (the BASE location).
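Since the original question uses plain JAXP rather than JDOM, here is a minimal sketch of the same idea for that case. The class name LocalDtdResolver and the baseDir parameter are made up for illustration; it assumes the DTD files have already been saved into that directory, and it simply defers to the parser's default behaviour when no local copy exists.

```java
import java.nio.file.Files;
import java.nio.file.Path;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical class (not from the thread): maps requests for *.dtd files
// to copies stored in a local directory.
final class LocalDtdResolver implements EntityResolver {
    private final Path baseDir;

    LocalDtdResolver(Path baseDir) {
        this.baseDir = baseDir;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId == null || !systemId.endsWith(".dtd")) {
            return null; // not a DTD: let the parser resolve it itself
        }
        // Keep only the file name, e.g. "xhtml1-strict.dtd".
        String fileName = systemId.substring(systemId.lastIndexOf('/') + 1);
        Path local = baseDir.resolve(fileName);
        if (!Files.isRegularFile(local)) {
            return null; // no local copy: fall back to default resolution
        }
        // Hand the parser a file: URI so it reads the local copy instead.
        return new InputSource(local.toUri().toString());
    }
}
```

It would be installed with something like builder.setEntityResolver(new LocalDtdResolver(Paths.get("/XArchitect/config/w3c"))) on a javax.xml.parsers.DocumentBuilder, and the Document that builder produces can then be passed to xpath.evaluate(expression, document).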

joehw
Offline
Joined: 2004-12-15
Points: 0

Since the W3C blocks that URL, you could keep a local copy of xhtml1-strict.dtd and point to it instead. Unfortunately, since they are blocking it, I'm not sure where to get it now. The other option would be to simply remove the DTD reference from your XML file.

If you can't change the XML file, you can parse it with your own parser instead of XPath's default one and pass the result to XPath. That is, xpath.evaluate(expression, document);
where document is a DOM Document.

To create the document without reading the DTD, set this feature on the DocumentBuilderFactory so that it does not load the external DTD:
documentBuilderFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
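Put together, the two steps might look like the sketch below. The class and method names (NoDtdXPath, evaluate) are invented for illustration, and the feature URI is specific to Xerces-based parsers such as the one bundled with the JDK, so another parser may reject it.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; the name is not from the thread.
final class NoDtdXPath {

    /** Parses the XML text without loading its external DTD, then evaluates the expression as a string. */
    static String evaluate(String xml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Xerces-specific feature: skip the external DTD named in the DOCTYPE.
        factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));

        // Evaluate against the already-built DOM instead of letting XPath parse the input.
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, document);
    }
}
```

One thing to watch: evaluated as a string, //div[@class="box"]/img[@src] returns the text content of the img element, which is empty. To get the attribute value itself, the expression probably needs to end in /@src, for example NoDtdXPath.evaluate(xml, "//div[@class='box']/img/@src").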

Hope that works for you.

reonarudo
Offline
Joined: 2008-12-17
Points: 0

I encountered the same problem. Have you found a solution?

Thank you.