【Posted】: 2015-01-28 02:14:40
【Question】:
My Java program stores the content of a web page in the string sb, and I want to parse that string into an HTML DOM. How can I do that?
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.net.*;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class Scraper {
    public static void main(String[] args) throws IOException, SAXException {
        URL u;
        try {
            u = new URL("https://twitter.com/ssjsatish");
            URLConnection cn = u.openConnection();
            System.out.println("content type: " + cn.getContentType());
            InputStream is = cn.getInputStream();
            long l = cn.getContentLengthLong();
            StringBuilder sb = new StringBuilder();
            if (l != 0) {
                int c;
                while ((c = is.read()) != -1) {
                    sb.append((char) c);
                }
                is.close();
                System.out.println(sb);
                DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
                InputSource i = new InputSource();
                i.setCharacterStream(new StringReader(sb.toString()));
                Document doc = db.parse(i);
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        }
    }
}
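For reference, here is a minimal sketch of what the DocumentBuilder approach above does with a string. It assumes the markup is well-formed XML (XHTML-style), which pages on the open web often are not; the class name StringToDom, the helper parse, and both sample strings are made up for illustration. With tag-soup input the XML parser is expected to throw a SAXException, which is why a lenient HTML parser from a third-party library (jsoup is one commonly used option) is usually suggested for real pages.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class StringToDom {
    // Hypothetical helper: parse a string with the JDK's XML parser.
    // This only succeeds when the markup is well-formed XML.
    static Document parse(String markup) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        // Well-formed sample (every tag is closed): parsing succeeds.
        String wellFormed =
            "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>";
        Document doc = parse(wellFormed);
        System.out.println(doc.getElementsByTagName("title").item(0).getTextContent());

        // Typical HTML with unclosed <p> and <br>: the XML parser rejects it.
        String tagSoup = "<html><body><p>Hello<br></body></html>";
        try {
            parse(tagSoup);
        } catch (SAXException e) {
            System.out.println("Not well-formed XML: " + e.getMessage());
        }
    }
}
```

If the page really is XHTML, the resulting org.w3c.dom.Document can be queried with the usual DOM methods such as getElementsByTagName.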
【Comments】:
-
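It is important to use a consistent style when writing code. I have edited your code to use one of the several styles that were present. I also moved is.close() earlier, so that your connection does not stay open any longer than absolutely necessary.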
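Building on the point about closing the stream promptly, here is a rough sketch of reading a stream into a string with try-with-resources, so the stream is closed even if reading fails. The class name ReadAll and the method readAll are hypothetical, and UTF-8 is an assumption: a real program would take the charset from the Content-Type header. Decoding through a Reader also avoids the cast from byte to char, which garbles any multi-byte character.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReadAll {
    // Hypothetical helper: drain the stream into a String.
    // Assumes the bytes are UTF-8 encoded.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader (and the wrapped stream)
        // as soon as the block exits, normally or through an exception.
        try (Reader r = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for cn.getInputStream() here.
        byte[] bytes = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(bytes)));
    }
}
```

In the program above, the same helper could be called with the stream returned by cn.getInputStream().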
Tags: html parsing dom html-parsing