Dom4j parsing - How to declare HTML entities programmatically? "The entity "nbsp" was referenced, but not declared." - Answers

【Question title】: Dom4j parsing - How to declare HTML entities programmatically? "The entity "nbsp" was referenced, but not declared."
【Posted】: 2012-11-03 06:47:36
【Question description】:

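I'm using Dom4j to parse HTML documents. Dom4j expects XML, so the HTML entities are not declared. They could be declared in the document's DTD, but I'm parsing external input, so that is not an option. I'd rather declare them programmatically in the parser.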

Here is my code:

    // Read.
    final DocumentFactory df = DOMDocumentFactory.getInstance();
    SAXReader reader = new SAXReader();
    Document doc, outDoc;
    try {
        doc = reader.read( new StringReader(htmlStr) );
    }
    catch( Exception ex ){
        throw new RuntimeException("Error parsing the HTML:\n       " + ex.toString() );
    }

I see that SAXReader has reader.setEntityResolver( ??? ); but that does not seem to be the solution, because the method to override looks like this:

public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException

What I'm looking for is something like this:

// Wished-for API; setTrueEntityResolver does not exist in dom4j.
reader.setTrueEntityResolver( new EntityResolver(){
    public InputStream resolve( String name ){ ... }
});

【Question discussion】:

Tags: java sax dom4j

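【Solution 1】:

I found a possible solution at http://evc-cit.info/dom4j/dom4j_groovy.html, which suggests adding the XML Commons Catalog resolver.

However, this seems like overkill, since no doctype is specified anyway and I only intend to parse the common HTML 4 entities.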

Update: It turns out that without an explicit DOCTYPE declaration this has no effect - the EntityResolver is never called.
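
Since the resolver only comes into play when a DOCTYPE is present, one possible workaround is to prepend a DOCTYPE whose internal subset declares the entities you need. The following is only a minimal sketch: it uses the JDK's built-in parser so that it is self-contained, the class and method names are made up for illustration, and it assumes the input carries no XML declaration or DOCTYPE of its own. The string returned by withEntities could presumably be passed to reader.read( new StringReader(...) ) in the same way.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class InlineEntityDeclarations {
    // Hypothetical prefix: declares only a handful of HTML 4 entities.
    static final String DOCTYPE =
        "<!DOCTYPE html [\n"
      + "  <!ENTITY nbsp \"&#160;\">\n"
      + "  <!ENTITY copy \"&#169;\">\n"
      + "  <!ENTITY eacute \"&#233;\">\n"
      + "]>\n";

    // Assumption: the markup has no XML declaration or DOCTYPE of its own.
    static String withEntities(String markup) {
        return DOCTYPE + markup;
    }

    // Parses the markup with the entity declarations prepended.
    static Document parse(String markup) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(withEntities(markup))));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<p>a&nbsp;b &copy; caf&eacute;</p>");
        // The entity references arrive as ordinary characters in the text.
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Only the three entities listed are covered here; any other HTML entity in the input would still fail with the "referenced, but not declared" error until it is added to the prefix.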

Maven dependency:

    <dependency>
        <groupId>xml-resolver</groupId>
        <artifactId>xml-resolver</artifactId>
        <version>1.2</version>
        <scope>test</scope>
    </dependency>

Configure it in /CatalogManager.properties on the classpath:

# allow location to be relative to this file's directory
relative-catalogs=yes

# A semicolon-delimited list of catalog files.
# In this instance, we have a single catalog file, and it's a relative path name
catalogs=sgml-lib/xml.soc

# no debugging messages, please
verbosity=0

# Use the SYSTEM identifier 
prefer=system

Tell the parser to use the catalog resolver when it encounters a DTD:

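// Groovy. Assumed setup: a default CatalogManager, which reads CatalogManager.properties from the classpath.
cMgr = new CatalogManager()
cResolver = new CatalogResolver( cMgr )
reader = new SAXReader( )
reader.setEntityResolver( cResolver )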
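
For comparison, a hand-written EntityResolver can do the same job without the catalog files. The sketch below is an illustration under assumptions, not the setup from the linked page: it uses the JDK's built-in parser to stay self-contained, the class name and the tiny entity list are made up, and the resolver answers every request with the same declarations. It also shows that resolveEntity receives the DTD's public and system identifiers rather than an entity name, and it still requires the input to carry a DOCTYPE with an external identifier. SAXReader.setEntityResolver takes the same org.xml.sax.EntityResolver type.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class LocalDtdResolver {
    // Stand-in for a real DTD: only the entity declarations we care about.
    static final String ENTITIES =
        "<!ENTITY nbsp \"&#160;\">\n<!ENTITY copy \"&#169;\">\n";

    // Records the system identifiers the parser asked for.
    static final List<String> SEEN = new ArrayList<>();

    // Serves the local declarations instead of fetching the DTD over the network.
    static final EntityResolver RESOLVER = (publicId, systemId) -> {
        SEEN.add(systemId);
        return new InputSource(new StringReader(ENTITIES));
    };

    static Document parse(String xml, EntityResolver resolver) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
          + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
          + "<html><body>a&nbsp;b</body></html>";
        Document doc = parse(xml, RESOLVER);
        System.out.println(doc.getDocumentElement().getTextContent());
        System.out.println("Resolver was asked for: " + SEEN);
    }
}
```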

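【Discussion】:

【Solution 2】:

Well, as you said, DOM4J is not meant for parsing HTML. I would rather use something like TagSoup or HtmlCleaner. It is not just the entities; HTML is not XML.

【Discussion】: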