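Reading an entire HTML file into a String? Answers

【Question title】: Reading entire html file to String?
【Posted】: 2012-08-20 09:37:19
【Question description】:

Is there a better way to read an entire HTML file into a single String variable than this: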

    String content = "";
    try {
        BufferedReader in = new BufferedReader(new FileReader("mypage.html"));
        String str;
        while ((str = in.readLine()) != null) {
            content +=str;
        }
        in.close();
    } catch (IOException e) {
    }

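【Question comments】:

Tags: java file-io

【Solution 1】:

There is the IOUtils.toString(..) utility from Apache Commons.

If you are using Guava, there are also Files.readLines(..) and Files.toString(..).

【Comments】:

The first link is dead.
Both links are dead now.

【Solution 2】:

You should use a StringBuilder: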

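StringBuilder contentBuilder = new StringBuilder();
// try-with-resources closes the reader even if reading fails
try (BufferedReader in = new BufferedReader(new FileReader("mypage.html"))) {
    String str;
    while ((str = in.readLine()) != null) {
        // readLine() strips the line terminator, so add one back
        contentBuilder.append(str).append('\n');
    }
} catch (IOException e) {
    e.printStackTrace(); // report the failure instead of silently ignoring it
}
String content = contentBuilder.toString();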
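
For comparison with the line-by-line loop, the standard library (Java 7 and later) can also read the file in a single call. The following is a minimal sketch, not part of the original answer: the class name HtmlFileReader and the method readHtml are hypothetical, and it assumes the file is encoded as UTF-8.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class HtmlFileReader {

    /** Reads the whole file into one String, decoding the bytes as UTF-8 (an assumption). */
    static String readHtml(Path file) {
        try {
            // readAllBytes keeps every byte, so line breaks are preserved as they are.
            byte[] bytes = Files.readAllBytes(file);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Surface the failure instead of returning a silently empty string.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "mypage.html" is the file name used in the question.
        String content = readHtml(Paths.get("mypage.html"));
        System.out.println(content.length() + " characters read");
    }
}
```

On Java 11 and later, Files.readString(path) does the same thing in one call and decodes as UTF-8 by default.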

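【Comments】:

This did not work for me: the entire HTML content is returned as one single string, and in.readLine() reads the whole content on the first call.
How does it know where mypage.html is located?
@CraZyDroiD You need to pass a path relative to the project root folder. For example, if mypage.html sits in the root folder, right next to /src, you can just use "mypage.html"; if you put it in a subfolder, you have to reference that folder as well, as in "myfolder/mypage.html" (without a leading slash, which would make the path absolute).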
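
To see where a relative name such as "mypage.html" actually points, it can help to print the absolute path it resolves to. This is only a sketch: the class name RelativePathDemo and the method resolve are hypothetical. Relative paths are resolved against the JVM's working directory, which is often (but not always) the project root when the program is launched from an IDE.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class RelativePathDemo {

    /** Returns the absolute location that a relative file name resolves to. */
    static Path resolve(String relativeName) {
        // Relative paths are resolved against the JVM's working directory.
        return Paths.get(relativeName).toAbsolutePath();
    }

    public static void main(String[] args) {
        System.out.println("Working directory: " + System.getProperty("user.dir"));
        System.out.println("mypage.html -> " + resolve("mypage.html"));
        System.out.println("myfolder/mypage.html -> " + resolve("myfolder/mypage.html"));
    }
}
```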

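【Solution 3】:

You can use JSoup.
It is a very robust HTML parser for Java.

【Comments】:

【Solution 4】:

As Jean said, using a StringBuilder instead of += would be better. But if you are looking for something simpler, Guava, IOUtils, and Jsoup are all good options.

Guava example: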

String content = Files.asCharSource(new File("/path/to/mypage.html"), StandardCharsets.UTF_8).read();

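IOUtils example:

String content;
// new URL("/path/to/mypage.html") has no protocol and would throw
// MalformedURLException, so open the local file directly instead.
try (InputStream in = new FileInputStream("/path/to/mypage.html")) {
    content = IOUtils.toString(in, StandardCharsets.UTF_8);
}

Jsoup example: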

String content = Jsoup.parse(new File("/path/to/mypage.html"), "UTF-8").toString();

or

String content = Jsoup.parse(new File("/path/to/mypage.html"), "UTF-8").outerHtml();

Notes:

Files.readLines() and Files.toString()

These have been deprecated as of Guava release 22.0 (May 22, 2017). Files.asCharSource() should be used instead, as in the example above. (version 22.0 release diffs)

IOUtils.toString(InputStream) and Charsets.UTF_8

Deprecated as of Apache Commons-IO version 2.5 (May 6, 2016). IOUtils.toString should now be passed both the InputStream and a Charset, as in the example above. Java 7's StandardCharsets should be used instead of Charsets, as in the example above. (deprecated Charsets.UTF_8)

【Comments】:

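【Solution 5】:

I prefer using Guava:

import com.google.common.base.Charsets;
import com.google.common.io.Files;
import java.io.File;

File file = new File("/path/to/file");
// The charset is an argument of Files.toString(..), not of the File constructor
String content = Files.toString(file, Charsets.UTF_8);

【Comments】:

Note: the ) belongs right after the file path; the charset is passed to Files.toString(..).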

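【Solution 6】:

For string manipulation, use the StringBuilder or StringBuffer class to accumulate chunks of string data. Do not use the += operator on String objects. The String class is immutable, so += creates a new String object on every iteration, and the large number of objects produced at runtime hurts performance.

Use the .append() method of a StringBuilder/StringBuffer instance instead.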
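
The two approaches can be put side by side in a small sketch. The class name ConcatDemo and both method names are hypothetical, chosen only for illustration. Both methods return the same text; the difference is that += copies everything accumulated so far on each iteration, while StringBuilder appends into one growing buffer.

```java
final class ConcatDemo {

    /** Joins the lines with +=; each iteration allocates a new String. */
    static String joinWithPlus(String[] lines) {
        String result = "";
        for (String line : lines) {
            result += line; // copies everything accumulated so far
        }
        return result;
    }

    /** Joins the lines with StringBuilder; appends go into one growing buffer. */
    static String joinWithBuilder(String[] lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] lines = {"<html>", "<body>", "</body>", "</html>"};
        // Both approaches produce the same text; only the cost differs.
        System.out.println(joinWithPlus(lines).equals(joinWithBuilder(lines)));
    }
}
```

StringBuffer offers the same append() API with synchronized methods; for a local variable used by a single thread, StringBuilder is usually the lighter choice.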

【Comments】:

【Solution 7】:

Here is a solution that retrieves a web page's HTML using only the standard Java library:

import java.io.*;
import java.net.*;
import java.nio.charset.StandardCharsets;

String urlToRead = "https://google.com";
StringBuilder result = new StringBuilder(); // Accumulates all the HTML
try {
    URL url = new URL(urlToRead); // The URL to read
    HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // The actual connection to the web page
    conn.setRequestMethod("GET");
    // Assumes the page is UTF-8; try-with-resources closes the reader
    try (BufferedReader rd = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
        String line; // An individual line of the web page HTML
        while ((line = rd.readLine()) != null) {
            result.append(line).append('\n'); // readLine() drops the line terminator
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}

System.out.println(result);
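
Because readLine() strips line terminators, a line-based loop either has to add them back or lose them. As an alternative, the characters can be copied in fixed-size chunks, which keeps the text exactly as it was received. This is a minimal sketch, not part of the original answer: the class name ReaderUtil and the method readAll are hypothetical, and a StringReader stands in for the InputStreamReader over the connection.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class ReaderUtil {

    /** Drains the reader into a String, keeping line breaks exactly as they were. */
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[8192];
        // try-with-resources closes the reader when the copy is done or fails.
        try (Reader in = reader) {
            int n;
            // read() returns -1 at end of stream
            while ((n = in.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A StringReader stands in for the InputStreamReader over a connection.
        String html = readAll(new StringReader("<html>\r\n<body></body>\n</html>"));
        System.out.println(html.length() + " characters read");
    }
}
```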


【Comments】:

【Solution 8】:

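import org.apache.commons.io.IOUtils;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

try {
    // Decode explicitly as UTF-8 rather than with the platform default charset
    var content = new String(
            IOUtils.toByteArray(this.getClass().getResource("/index.html")),
            StandardCharsets.UTF_8);
} catch (IOException e) {
    e.printStackTrace();
}

// The code above is Java 10; it assumes index.html is available in the resources folder.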

【Comments】: