How to parse a string into an HTML DOM in Java: answers

【Question title】: How do I parse a string to an HTML DOM in Java?
【Posted】: 2015-01-28 02:14:40
【Question description】:

My Java program stores the content of a web page in the string sb, and I want to parse that string into an HTML DOM. How can I do that?

import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.net.*;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class Scraper {
    public static void main(String[] args) throws IOException, SAXException {
        URL u;
        try {
            u = new URL("https://twitter.com/ssjsatish");
            URLConnection cn = u.openConnection();
            System.out.println("content type:  "+cn.getContentType());
            InputStream is = cn.getInputStream();
            long l = cn.getContentLengthLong();
            StringBuilder sb = new StringBuilder();
            if (l!=0) {
                int c;
                while ((c = is.read()) != -1) {
                   sb.append((char)c);
                }
                is.close();
                System.out.println(sb);
                DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
                InputSource i = new InputSource();
                i.setCharacterStream(new StringReader(sb.toString()));
                Document doc = db.parse(i);
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        }
    }
}

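【Question discussion】:

It is important to use a consistent style when writing code. I have edited your code to use one of the several styles that were present. I also moved is.close() earlier, so that your connection is not open any longer than absolutely necessary.

Tags: html parsing dom html-parsing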

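【Solution 1】:

You don't want to use an XML parser to parse HTML, because not all valid HTML is valid XML. I suggest using a library designed specifically for parsing "real-world" HTML. For example, I have had good results with jsoup, but there are others. Another advantage of such libraries is that their APIs were designed with web scraping in mind and offer simpler ways to get at the data in an HTML document.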
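
To illustrate the point about XML parsers, here is a minimal sketch that uses only the JDK. The class name `XmlVsHtml`, the helper `parsesAsXml`, and both sample strings are made up for this illustration and do not come from the question. It feeds the same kind of `DocumentBuilder` used in the question a small HTML5 page that omits `</p>` and uses a void `<br>`, which HTML allows but XML does not.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class XmlVsHtml {
    /** Returns true if the JDK's XML parser accepts the markup (hypothetical helper). */
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            db.setErrorHandler(new DefaultHandler());
            db.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The markup is not well-formed XML.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML5 permits the missing </p> and the unclosed <br>; XML does not.
        String html = "<!DOCTYPE html><html><head><title>t</title></head>"
                + "<body><p>Hello<br>world</body></html>";
        // The same content written so that every element is explicitly closed.
        String xhtmlStyle = "<html><head><title>t</title></head>"
                + "<body><p>Hello<br/>world</p></body></html>";

        System.out.println(parsesAsXml(html));       // false
        System.out.println(parsesAsXml(xhtmlStyle)); // true
    }
}
```

With jsoup on the classpath, `Jsoup.parse(html)` takes the same HTML string and returns an `org.jsoup.nodes.Document`, which can then be queried with `select(...)`.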

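【Discussion】:

That is what I want to build, but I don't want to use a library like jsoup.org or jaunt-api.com. I have tried jsoup.
@SatishPatel Why not? If you don't use one, you will probably have to write it yourself.
@ColinvH Yes, I want to write it myself. I want to understand how they do it.
@SatishPatel Then you should look at their source code: github.com/jhy/jsoup. If you want to learn how to build a parser, you should start with something simple like JSON.
Yes, parsing HTML involves a huge number of special cases. If you are interested in the String -> Tree part (parsing), you may want to start with something simpler. Parsing is a really neat topic, but HTML is not a good way to learn it. If you want to see what they do with the Tree, such as searching for elements that match a selector, then the jsoup source code is a good place to start, especially because it is self-contained.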
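
As a rough idea of what "start with something simpler" could look like, here is a toy String -> Tree parser. It is only a sketch under strong assumptions: the grammar is invented for this example (strictly nested `<name>...</name>` tags and plain text, with no attributes, comments, entities, or void elements), and the names `TinyTagParser` and `Node` are hypothetical. It is not how jsoup works and it does not parse real HTML.

```java
import java.util.ArrayList;
import java.util.List;

class TinyTagParser {
    /** A tree node: an element (name set, text null) or a text leaf (text set, name null). */
    static final class Node {
        final String name;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String name, String text) {
            this.name = name;
            this.text = text;
        }

        @Override
        public String toString() {
            // Text leaves print in quotes; elements print as name[child, child, ...].
            return name == null ? "\"" + text + "\"" : name + children;
        }
    }

    private final String src;
    private int pos = 0;

    private TinyTagParser(String src) {
        this.src = src;
    }

    /** Parses exactly one element, for example "<a>hi<b>x</b></a>". */
    static Node parse(String src) {
        TinyTagParser p = new TinyTagParser(src);
        Node root = p.element();
        if (p.pos != src.length()) {
            throw new IllegalArgumentException("trailing input at " + p.pos);
        }
        return root;
    }

    // element := '<' name '>' (text | element)* '</' name '>'
    private Node element() {
        expect("<");
        String name = readUntil('>');
        expect(">");
        Node node = new Node(name, null);
        while (!src.startsWith("</", pos)) {
            if (pos >= src.length()) {
                throw new IllegalArgumentException("unclosed <" + name + ">");
            }
            if (src.charAt(pos) == '<') {
                node.children.add(element());                        // nested element
            } else {
                node.children.add(new Node(null, readUntil('<')));   // text run
            }
        }
        expect("</" + name + ">");
        return node;
    }

    // Returns the text from pos up to (not including) the next 'stop' character.
    private String readUntil(char stop) {
        int end = src.indexOf(stop, pos);
        if (end < 0) {
            throw new IllegalArgumentException("expected '" + stop + "' after " + pos);
        }
        String s = src.substring(pos, end);
        pos = end;
        return s;
    }

    // Consumes the given token at pos or fails.
    private void expect(String token) {
        if (!src.startsWith(token, pos)) {
            throw new IllegalArgumentException("expected " + token + " at " + pos);
        }
        pos += token.length();
    }
}
```

Calling `TinyTagParser.parse("<a>hi<b>x</b></a>")` yields a tree whose `toString()` is `a["hi", b["x"]]`. An input such as `<p>a<br>b</p>` is rejected because the toy grammar expects a matching `</br>`; handling void elements like this is one of the many special cases a real HTML parser has to deal with.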