[Question title]: Way to ignore entity reference resolving?
[Posted]: 2012-04-02 18:29:23
[Question description]:

I'm using Java 6 and the latest version of Xerces. I'm trying to parse an HTML document that begins like this...

<!DOCTYPE html> 

and then references the entity "&raquo;". Parsing aborts with this exception...

org.xml.sax.SAXParseException: The entity "raquo" was referenced, but not declared. 
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:249) 
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284) 
    at com.myco.myproject.util.XmlUtilities.getStringAsDocument(XmlUtilities.java:147) 
    at com.myco.myproject.util.NetUtilities.getUrlAsDocument(NetUtilities.java:65) 
    at com.myco.myproject.parsers.impl.AbstractMetromixParser.parsePage(AbstractMetromixParser.java:107) 
    at com.myco.myproject.parsers.impl.AbstractMetromixParser.getEvents(AbstractMetromixParser.java:76) 
    at com.myco.myproject.domain.EventFeed.refresh(EventFeed.java:81) 
    at com.myco.myproject.domain.EventFeed.getEvents(EventFeed.java:72) 
    at com.myco.myproject.parsers.impl.MetromixParserTest.testParser(MetromixParserTest.java:21) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 
    at java.lang.reflect.Method.invoke(Method.java:597) 
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44) 
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) 
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41) 
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) 
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) 
    at org.springframework.test.context.junit4.statements.RunBeforeTestMethodCallbacks.evaluate(RunBeforeTestMethodCallbacks.java:74) 
    at org.springframework.test.context.junit4.statements.RunAfterTestMethodCallbacks.evaluate(RunAfterTestMethodCallbacks.java:83) 
    at org.springframework.test.context.junit4.statements.SpringRepeat.evaluate(SpringRepeat.java:72) 
    at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:231) 
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) 
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193) 
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52) 
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191) 
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42) 
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184) 
    at org.springframework.test.context.junit4.statements.RunBeforeTestClassCallbacks.evaluate(RunBeforeTestClassCallbacks.java:61) 
    at org.springframework.test.context.junit4.statements.RunAfterTestClassCallbacks.evaluate(RunAfterTestClassCallbacks.java:71) 
    at org.junit.runners.ParentRunner.run(ParentRunner.java:236) 
    at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.run(SpringJUnit4ClassRunner.java:174) 
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) 
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) 
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) 
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) 
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) 
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) 

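Is there a way to tell the parser to ignore entities it can't resolve? If not, which parser do I need to plug in instead?

Edit: Here is how I parse the HTML, which is actually XHTML. Before attempting the following, I pass the string through JSoup to clean it up...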

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setValidating(false);
    factory.setExpandEntityReferences(false);
    final DocumentBuilder builder = factory.newDocumentBuilder();
    final InputSource s = new InputSource(new StringReader(str));
    org.w3c.dom.Document result = builder.parse(s);

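[Question comments]:

  • Use an HTML parser, not an XML parser. HTML is not XML.
  • Ah, I wrongly assumed people would know what JSoup is. The JSoup parser I'm using converts messy HTML into well-formed XHTML.
  • I didn't see any reference to JSoup before your edit; apologies if I missed it.
  • Can you provide a sample snippet of the HTML that is being passed to the code?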

Tags: java parsing xerces


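[Solution 1]:

As of version 1.10.3, JSoup provides the W3CDom helper class, which lets you convert an org.jsoup.nodes.Document directly into an org.w3c.dom.Document. JSoup decodes HTML named entities such as &raquo; while it parses, so the XML parser never has to resolve them.

Consider the following example: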

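import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.jsoup.nodes.Document;

String str =
        "<!DOCTYPE html>" +
        "<html>" +
        "<body>" +
        "<div>&raquo; example</div>" +
        "</body>" +
        "</html>";

// Parse leniently with JSoup, then convert the result to a W3C DOM document.
Document document = Jsoup.parse(str);
W3CDom w3cDom = new W3CDom();
org.w3c.dom.Document result = w3cDom.fromJsoup(document);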
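
If you have to keep feeding the string to the JAXP parser instead, a different workaround (not the JSoup approach above) is to declare the missing entity yourself in an internal DTD subset. Note that setExpandEntityReferences(false) does not help with this error: it only affects how declared entities show up in the DOM and does not make an undeclared one legal. The following is a minimal sketch using only the JDK. The class name DeclaredEntitySketch and its parse method are hypothetical, and it assumes the input contains exactly the `<!DOCTYPE html>` shown in the question and that &raquo; is the only named entity involved.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of JSoup or Xerces.
class DeclaredEntitySketch {

    // Assumption: the input uses exactly this DOCTYPE, as in the question.
    private static final String BARE_DOCTYPE = "<!DOCTYPE html>";

    // Same DOCTYPE with an internal subset that declares raquo as U+00BB.
    private static final String PATCHED_DOCTYPE =
            "<!DOCTYPE html [<!ENTITY raquo \"&#187;\">]>";

    static Document parse(String xhtml) throws Exception {
        // String.replace works on literal text, so no regex escaping is needed.
        String patched = xhtml.replace(BARE_DOCTYPE, PATCHED_DOCTYPE);

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(patched)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<!DOCTYPE html><html><body><div>&raquo; example</div></body></html>");
        // Print the text of the first div, with the entity already expanded.
        System.out.println(doc.getElementsByTagName("div").item(0).getTextContent());
    }
}
```

This only scales to a handful of known entities. For arbitrary HTML entities, letting JSoup decode them as shown above is likely the simpler route.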

[Comments]:
