What is the optimal way for reading the contents of a webpage into a string in Java? Answers

【Question title】: What is the optimal way for reading the contents of a webpage into a string in Java?
【Posted】: 2010-11-14 00:13:27
【Question description】:

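I have the following Java code to fetch the entire contents of an HTML page at a given URL. Can this be done in a more efficient way? Any improvements are welcome.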

public static String getHTML(final String url) throws IOException {
    if (url == null || url.length() == 0) {
        throw new IllegalArgumentException("url cannot be null or empty");
    }

    final HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    final BufferedReader buf = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    final StringBuilder page = new StringBuilder();
    final String lineEnd = System.getProperty("line.separator");
    String line;
    try {
        while (true) {
            line = buf.readLine();
            if (line == null) {
                break;
            }
            page.append(line).append(lineEnd);
        }
    } finally {
        buf.close();
    }

    return page.toString();
}

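I can't help feeling that reading line by line is less than optimal. I know that I'm possibly masking a MalformedURLException caused by the openConnection call, and I'm okay with that.

My function also has the side effect of giving the HTML string the correct line terminators for the current system. This isn't a requirement.

I realize that network IO will probably dwarf the time it takes to read in the HTML, but I'd still like to know whether this is optimal.

On a side note: it would be great if StringBuilder had a constructor that took an open InputStream and simply read all of its contents into the StringBuilder.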

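【Question comments】:

You may want to try java.util.Scanner as a solution to your side note. Check out java-tips.org/java-se-tips/java.util/… and look for the example that reads from the network using java.net.URL.
In another part of my application I use regular expressions to extract some values from lines, and Scanner may come in handy there. Here, however, I can't help feeling that it would add some overhead compared to other, more direct solutions.
See stackoverflow.com/questions/4185665/… for how to use Guava's CharStreams.toString method to convert an InputStream to a String, taking the character set into account.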
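
As a rough illustration of the Scanner suggestion in the comments above, the sketch below reads a whole stream as a single token. The class name `ScannerReadAll` and the helper `readAll` are hypothetical, and an in-memory stream stands in for `conn.getInputStream()`; the charset is passed explicitly on the assumption that the caller knows the page encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical helper class, not part of the original question.
class ScannerReadAll {

    // Reads everything left in the stream as one token.
    // "\\A" matches only at the beginning of the input, so the
    // delimiter never occurs again and next() returns all of it.
    static String readAll(InputStream in, Charset charset) {
        try (Scanner scanner = new Scanner(in, charset.name())) {
            scanner.useDelimiter("\\A");
            // An empty stream has no token at all.
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for conn.getInputStream().
        byte[] html = "<p>first</p>\n<p>second</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(html), StandardCharsets.UTF_8));
    }
}
```

Unlike the readLine loop in the question, this keeps the page's own line terminators. Whether it is faster is not established here; Scanner does regex matching internally, which is the overhead the asker is worried about.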

Tags: java string optimization inputstream micro-optimization

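【Solution 1】:

As seen in the other answers, any robust solution has to account for many different edge cases (HTTP peculiarities, encoding, chunking, etc.). I therefore suggest that, in anything other than a toy program, you use the de facto standard Java HTTP library: Apache HTTP Components HTTP Client.

They provide many samples; "just" getting the response contents for a request looks like this: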

HttpClient httpclient = new DefaultHttpClient();
try {
    HttpGet httpget = new HttpGet("http://www.google.com/");
    ResponseHandler<String> responseHandler = new BasicResponseHandler();
    String responseBody = httpclient.execute(httpget, responseHandler);
    // responseBody now contains the contents of the page
    System.out.println(responseBody);
} finally {
    // release the connection even if the request throws
    httpclient.getConnectionManager().shutdown();
}

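【Comments】:

This sounds like the best idea. I'm going to start using this. The fact that the java.net HTTP classes behave differently on different versions of Java is another reason to avoid them.
I was recently given the Apache commons suggestion, so I'm going to give it a try. Thanks for the suggestion.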

【Solution 2】:

OK, edited once more. Be sure to put your try-finally blocks around it, or catch IOException.

 ...
 final static int BUFZ = 4096;
 StringBuilder page = new StringBuilder();
 HttpURLConnection conn =
    (HttpURLConnection) new URL(url).openConnection();
 InputStream is = conn.getInputStream();
 // perhaps allocate this one time and reuse if you
 // call this method a lot.
 byte[] buf = new byte[BUFZ];
 int nRead = 0;

 while ((nRead = is.read(buf, 0, BUFZ)) > 0) {
    // convert only the nRead bytes actually read, not the whole buffer
    page.append(new String(buf, 0, nRead /* , Charset charset */));
    // uses local default char encoding for now; a multi-byte character
    // split across two reads would still be decoded incorrectly
 }

Here, try this:

 ...
 final static int MAX_SIZE = 10000000;
 HttpURLConnection conn =
    (HttpURLConnection) new URL(url).openConnection();
 InputStream is = conn.getInputStream();
 // perhaps allocate this one time and reuse if you
 // call this method a lot.
 byte[] buf = new byte[MAX_SIZE];
 int nRead = 0;
 int total = 0;
 // you could also use ArrayList so that you could dynamically
 // resize or there are other ways to resize an array also
 // read at offset total so that earlier data is not overwritten
 while (total < MAX_SIZE
        && (nRead = is.read(buf, total, MAX_SIZE - total)) > 0) {
      total += nRead;
 }
 ...
 // do something with buf array of length total

OK, the code below won't work for you: because of HTTP/1.1 "chunking", the Content-Length header line is not sent at the start of the response.

 ...
 HttpURLConnection conn =
    (HttpURLConnection) new URL(url).openConnection();
 InputStream is = conn.getInputStream();
 // getContentLength() returns -1 when the length is unknown (e.g. chunked)
 int cLen = conn.getContentLength();
 byte[] buf = new byte[cLen];
 int nRead = 0;

 while (nRead < cLen) {
      int n = is.read(buf, nRead, cLen - nRead);
      if (n < 0) {
          break; // stream ended before cLen bytes arrived
      }
      nRead += n;
 }
 ...
 // do something with buf array

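【Comments】:

conn.getContentLength() returns -1 for all the pages I'm connecting to, because the length is not known.
Are you getting chunked HTTP 1.1 content?
Maybe? I'm not sure what you mean by chunked HTTP 1.1 content. The response I get from the URL is just an HTML fragment. Typically, the URLs I connect to are used to fetch HTML for AJAX requests. I'm not sure whether that has any bearing on this.
From a Firebug session connected to the URL I'm using: "Transfer-Encoding: chunked"
Your latest solution does not improve speed. You are creating a useless String instance for every read. You also don't close anything, which could lead to leaks. Allocating an oversized buffer wastes time and space, and since my response sizes can vary widely, this is not a suitable solution.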
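
One way to address the objections in the last comment (a String per read, nothing closed, an oversized fixed buffer) might look like the sketch below. It collects the raw bytes in a growing ByteArrayOutputStream and decodes them once at the end. The class name `StreamBytes` and the helper `readFully` are hypothetical, the charset is assumed to be known to the caller, and an in-memory stream stands in for `conn.getInputStream()`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not taken from any of the answers.
class StreamBytes {

    // Reads the stream to the end without knowing its length up front,
    // then decodes all bytes in one step and closes the stream.
    static String readFully(InputStream in, Charset charset) throws IOException {
        try (InputStream stream = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            // read() returns -1 at end of stream, which also covers
            // chunked responses where Content-Length is absent.
            while ((n = stream.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            // Decoding once avoids corrupting multi-byte characters
            // that happen to be split across two reads.
            return new String(out.toByteArray(), charset);
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for conn.getInputStream().
        byte[] html = "<div>fragment</div>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readFully(new ByteArrayInputStream(html), StandardCharsets.UTF_8));
    }
}
```

The trade-off is that the whole response is held in memory twice for a moment (as bytes and as a String), so this is only a reasonable sketch for pages of moderate size.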

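【Solution 3】:

You can do your own buffering on top of the InputStreamReader by reading bigger chunks into a character array and appending the array contents to the StringBuilder.

But this would make your code slightly harder to understand, and I doubt it would be worth it.

Note that the proposal by Sean A.O. Harney reads raw bytes, so you would need to convert them to text on top of that.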
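
A minimal sketch of the character-array buffering described in this answer could look as follows. The class name `CharBufferRead`, the helper `readAll`, and the 8192-character buffer size are assumptions made for illustration, and an in-memory stream stands in for `conn.getInputStream()`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class illustrating the idea in this answer.
class CharBufferRead {

    // Reads decoded characters in chunks and appends each chunk
    // to a StringBuilder; no intermediate String objects are created.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder page = new StringBuilder();
        char[] buf = new char[8192]; // assumed size, tune as needed
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(buf)) != -1) {
                // append only the n characters actually read
                page.append(buf, 0, n);
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for conn.getInputStream().
        byte[] html = "<ul>\r\n<li>item</li>\r\n</ul>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(html), StandardCharsets.UTF_8));
    }
}
```

Because the InputStreamReader does the decoding, characters that span a read boundary are handled correctly, and the original line terminators are left untouched.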

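【Comments】:

Thanks, I updated my proposed answer. I am using the String constructor to convert the byte[] to a String and appending it to the StringBuilder, as you mentioned. Unfortunately StringBuilder.append() is only defined for char[] or String arguments, not byte[].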