Obtain the raw html of pages fetched by Nutch 2.3.1

【Question title】: Obtain the raw html of pages fetched by Nutch 2.3.1
【Posted】: 2016-11-01 14:55:09
【Question description】:

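I want to train an NLP model on a large number of web pages to get good accuracy. Since I don't have the web pages, I'm considering running a web crawler on Amazon EMR. I want a distributed, scalable and extensible open-source solution that respects robots.txt rules. After some research, I decided to go with Apache Nutch.

I found this video by Julien Nioche, one of Nutch's main contributors, particularly useful for getting started. Even though I used the latest available versions of Hadoop (Amazon 2.7.3) and Nutch (2.3.1), I managed to complete a small example job successfully.

Unfortunately, I couldn't find an easy way to retrieve the raw html files from Nutch's output. While looking for a solution to this problem, I found a few other useful resources (in addition to Nutch's own wiki and tutorial pages).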

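Some of them (like this answer or this page) suggest implementing a new plugin (or modifying an existing one): the general idea is to add a few lines of code that save the content of every fetched html page to a file before it is sent to a segment.

Others (like this answer) suggest implementing a simple post-processing tool that accesses the segments, goes through all the records they contain, and saves the content of anything that looks like an html page to a file.

These resources all contain (more or less precise) instructions and code examples, but I had no luck when I tried to run them, because they refer to very old versions of Nutch. Moreover, all my attempts to adapt them to Nutch 2.3.1 failed due to the lack of resources/documentation.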

For example, I appended the following code to the end of HtmlParser (the core of the parse-html plugin), but all the files saved in the specified folder are empty:

String html = root.toString();
if (html == null) {
    byte[] bytes = content.getContent();
    try {
      html = new String(bytes, encoding);
    } catch (UnsupportedEncodingException e) {
        LOG.trace(e.getMessage(), e);
    }
}
if (html != null) {
    html = html.trim();
    if (!html.isEmpty()) {
        if (dumpFolder == null) {
            String currentUsersHomeFolder = System.getProperty("user.home");
            currentUsersHomeFolder = "/Users/stefano";
            dumpFolder = currentUsersHomeFolder + File.separator + "nutch_dump";
            new File(dumpFolder).mkdir();
        }
        try {
            String filename = base.toString().replaceAll("\\P{LD}", "_");
            if (!filename.toLowerCase().endsWith(".htm") && !filename.toLowerCase().endsWith(".html")) {
                filename += ".html";
            }
            System.out.println(">> " + dumpFolder+ File.separator +filename);
            PrintWriter writer = new PrintWriter(dumpFolder + File.separator + filename, encoding);
            writer.write(html);
            writer.close();
        } catch (Exception e) {
            LOG.trace(e.getMessage(), e);
        }
    }
}

In another case, I got the following error (which I liked because it mentions a prolog, but which also confused me):

[Fatal Error] data:1:1: Content is not allowed in prolog.

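So, before considering downgrading my setup to Nutch 1.x, my question is: has any of you run into this problem with a recent version of Nutch and solved it successfully?

If so, could you share your solution with the community, or at least provide some useful pointers?

Thanks a lot!

PS: If you are wondering how to open the Nutch sources properly in IntelliJ, this answer might point you in the right direction.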

【Question comments】:

Possible duplicate of stackoverflow.com/questions/10098169/….

Tags: html web-scraping web-crawler nutch hadoop2

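【Solution 1】:

You can save the raw HTML by editing the Nutch code. First, run Nutch in Eclipse by following https://wiki.apache.org/nutch/RunNutchInEclipse

Once Nutch is running in Eclipse, edit the file FetcherReducer.java, add this code to the output method, and run ant eclipse again to rebuild the classes.

Finally, the raw html will be added to the reprUrl column in your database.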

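// Needs java.io.ByteArrayInputStream, java.nio.ByteBuffer and java.util.Scanner.
if (content != null) {
    ByteBuffer raw = fit.page.getContent();
    if (raw != null) {
        // Wrap only the readable region of the buffer's backing array.
        ByteArrayInputStream arrayInputStream = new ByteArrayInputStream(
                raw.array(), raw.arrayOffset() + raw.position(), raw.remaining());
        // try-with-resources closes the scanner even if an exception is thrown.
        try (Scanner scanner = new Scanner(arrayInputStream)) {
            scanner.useDelimiter("\\Z"); // read all scanner content as one String
            String data = scanner.hasNext() ? scanner.next() : "";
            // Store the raw HTML in the page's reprUrl field.
            fit.page.setReprUrl(StringUtil.cleanField(data));
        }
    }
}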
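
Note that the Scanner in the snippet above decodes the bytes with the platform default charset. If that turns out to be a problem, a minimal sketch of an alternative is to decode the buffer with an explicit charset. This uses only the JDK; `RawContentDecoder` is a hypothetical helper name, not a Nutch class, and UTF-8 is an assumption (in practice the charset would have to come from the HTTP headers or the page itself).

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Nutch): turns the readable region of a
// ByteBuffer into a String using an explicit charset.
final class RawContentDecoder {

    private RawContentDecoder() {}

    static String decode(ByteBuffer raw, Charset charset) {
        if (raw == null) {
            return "";
        }
        // duplicate() shares the bytes but has its own position and limit,
        // so decoding does not move the caller's buffer.
        return charset.decode(raw.duplicate()).toString();
    }

    public static void main(String[] args) {
        // Assumption for the demo: the page bytes are UTF-8.
        ByteBuffer page = ByteBuffer.wrap(
                "<html><body>caf\u00e9</body></html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(decode(page, StandardCharsets.UTF_8));
    }
}
```

Inside the output method, the result of `decode(fit.page.getContent(), StandardCharsets.UTF_8)` could then be passed on in place of the Scanner output.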

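【Comments】:

【Solution 2】:

Glad you found the video useful. If you just need web pages to train an NLP model, why not use the CommonCrawl dataset? It contains billions of pages, is free, and would save you the hassle of large-scale web crawling.

Now, to answer your question: you could write a custom IndexWriter and write the page content wherever you want. I don't use Nutch 2.x, as I prefer 1.x: it is faster, has more functionality and is easier to use (to be honest, I actually prefer StormCrawler, but I am biased). Nutch 1.x has a WARCExporter class that can generate a data dump in the same WARC format used by CommonCrawl; there is also another class for exporting in various formats.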
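
As a rough illustration of the "write the page content wherever you want" part, here is a minimal sketch that uses only the JDK. `RawHtmlDumper`, `fileNameFor` and `dump` are hypothetical names, not Nutch APIs; the custom writer (or a post-processing job) is assumed to supply the URL and the raw bytes itself. Writing the bytes untouched sidesteps any charset guessing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of Nutch): saves the raw bytes of a fetched
// page to one file per URL inside a dump folder.
final class RawHtmlDumper {

    private final Path dumpFolder;

    RawHtmlDumper(Path dumpFolder) throws IOException {
        // Creates the folder (and any missing parents) if it does not exist yet.
        this.dumpFolder = Files.createDirectories(dumpFolder);
    }

    // Replaces everything except ASCII letters and digits with '_'.
    // Distinct URLs can collide after this mapping; acceptable for a sketch.
    static String fileNameFor(String url) {
        return url.replaceAll("[^A-Za-z0-9]", "_") + ".html";
    }

    // Writes the bytes exactly as fetched, without decoding them.
    Path dump(String url, byte[] rawContent) throws IOException {
        Path target = dumpFolder.resolve(fileNameFor(url));
        return Files.write(target, rawContent);
    }

    public static void main(String[] args) throws IOException {
        RawHtmlDumper dumper = new RawHtmlDumper(Files.createTempDirectory("nutch_dump"));
        Path written = dumper.dump("http://example.com/index.html",
                "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8));
        System.out.println("wrote " + written);
    }
}
```

On a cluster the local file system of a worker node is usually not where you want the output, so the same idea would more likely target HDFS or S3; that part is left out here.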

【Comments】:

Very nice! Thanks for all the useful tips, which may well have saved me some time!