Where is the crawled data stored when running the Nutch crawler?

【Question Title】: Where is the crawled data stored when running nutch crawler?
【Posted】: 2015-03-30 09:43:18
【Question】:

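I am new to Nutch. I need to crawl the web (say, a few hundred web pages), read the crawled data and do some analysis on it.

I followed the tutorial at https://wiki.apache.org/nutch/NutchTutorial (and integrated Solr, since I may need to search the text in the future) and ran a crawl using a few URLs as seeds.

Now I can't find the text/html data on my local machine. Where can I find the data, and what is the best way to read it in text format?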

Versions

apache-nutch-1.9
solr-4.10.4

【Comments】:

Tags: web-crawler nutch

【Answer 1】:

After the crawl has finished, you can use the bin/nutch dump command to dump all the fetched URLs as plain HTML.

The usage is as follows:

$ bin/nutch dump [-h] [-mimetype <mimetype>] [-outputDir <outputDir>]
   [-segment <segment>]
 -h,--help                show this help message
 -mimetype <mimetype>     an optional list of mimetypes to dump, excluding
                      all others. Defaults to all.
 -outputDir <outputDir>   output directory (which will be created) to host
                      the raw data
 -segment <segment>       the segment(s) to use

So, for example, you could do something like:

$ bin/nutch dump -segment crawl/segments -outputDir crawl/dump/

This creates a new directory at the -outputDir location and dumps all the crawled pages into it in HTML format.
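
If the goal is then to read the dumped pages as text, something along these lines might be a starting point. It is only a minimal sketch using the Java standard library, not part of Nutch: the class name DumpToText and both helper methods are made up for illustration, it assumes every file under the output directory is UTF-8 encoded HTML, and the regex-based tag stripping is crude (a real HTML parser would be more robust).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not a Nutch class: walks a dump directory and prints plain text.
public class DumpToText {

    // Crude tag stripper: drops <script>/<style> blocks, removes remaining tags,
    // and collapses runs of whitespace. Entities such as &amp; are left untouched.
    public static String htmlToText(String html) {
        String s = html.replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ");
        s = s.replaceAll("(?s)<[^>]*>", " ");
        return s.replaceAll("\\s+", " ").trim();
    }

    // Lists every regular file under the dump directory, sorted for stable output.
    public static List<Path> listDumpedFiles(Path dumpDir) throws IOException {
        try (Stream<Path> paths = Files.walk(dumpDir)) {
            return paths.filter(Files::isRegularFile)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: defaults to the -outputDir used in the example command.
        Path dumpDir = Paths.get(args.length > 0 ? args[0] : "crawl/dump");
        for (Path file : listDumpedFiles(dumpDir)) {
            // Assumption: the file is UTF-8 HTML; malformed bytes are replaced, not rejected.
            String html = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            System.out.println("== " + file);
            System.out.println(htmlToText(html));
        }
    }
}
```

Since the dump may also contain images and other non-HTML files, restricting it with the -mimetype option shown in the usage above would probably make the output cleaner.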

There are more ways to export specific data from Nutch; have a look at https://wiki.apache.org/nutch/CommandLineOptions

【Discussion】:

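Thanks for the info. I did it a different way. In the "segments/2******************/content/part-00000" folder there is a file named "data", which is a sequence file. I wrote a Java program to convert it to text. Your answer is very straightforward and informative.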
How to retrieve the data in Nutch 2