Crawl PDF documents using Nutch: answers

【Question title】: Crawl PDF documents using nutch
【Posted】: 2016-01-28 12:51:34
【Question】:

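I also need to crawl PDF documents from a given URL. Please suggest any tool or API for crawling PDF documents. I am currently using Nutch for crawling, but I cannot crawl PDFs from the given URL. Do I need a plugin to crawl PDFs with Nutch?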

seed.txt --> http://nutch.apache.org
regex-urlfilter.txt --> +^http://([a-z0-9]*.)*nutch.apache.org/

Thanks in advance.

【Comments on the question】:

See amac4.blogspot.com/2013/07/configuring-nutch-to-crawl-urls.html

Tags: pdf nutch

【Solution 1】:

Edit regex-urlfilter.txt and remove any occurrence of "pdf".
Edit suffix-urlfilter.txt and remove any occurrence of "pdf".
Edit nutch-site.xml and add "parse-tika" and "parse-html" to the plugin.includes property. It should look like this:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>
    ...
  </description>
</property>

This answer comes from another source. I tested it while working with Nutch.
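
To illustrate why removing "pdf" from the filter files matters, here is a minimal Python sketch of first-match-wins URL filtering in the style of regex-urlfilter.txt. It is not Nutch code: the two rule lists, the `accepts` helper, and the sample URL are hypothetical, and the exclusion line in a real installation may list different suffixes.

```python
import re

# Hypothetical rules in the style of regex-urlfilter.txt:
# a leading "+" accepts a URL, a leading "-" rejects it.
RULES_WITH_PDF = [
    r"-\.(gif|jpg|png|css|zip|pdf)$",
    r"+^http://([a-z0-9]*\.)*nutch\.apache\.org/",
]
RULES_WITHOUT_PDF = [
    r"-\.(gif|jpg|png|css|zip)$",
    r"+^http://([a-z0-9]*\.)*nutch\.apache\.org/",
]


def accepts(rules, url):
    """Return True if the first rule whose pattern matches is a "+" rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the URL is dropped


url = "http://nutch.apache.org/docs/example.pdf"  # hypothetical URL
print(accepts(RULES_WITH_PDF, url))     # rejected by the suffix rule
print(accepts(RULES_WITHOUT_PDF, url))  # reaches the "+" rule and is accepted
```

As long as "pdf" appears in an exclusion rule that is evaluated before the accept rule, the PDF URL never reaches the fetcher, whichever parse plugins are enabled.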

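【Comments】:

After doing this, will it return the PDF file itself or only the text of the PDF file?
Nutch will contain both the raw file and the parsed text of the file (if it was parsed). With the bin/nutch readseg and bin/nutch dump commands you can access both (wiki.apache.org/nutch/CommandLineOptions).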

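【Solution 2】:

I found that even with the Tika plugin enabled, Nutch still could not get PDF or MS Office files into the crawldb. You need to add the host you want to crawl to the whitelist in nutch-site.xml to fetch PDF and MS Office files: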

<property>
  <name>http.robot.rules.whitelist</name>
  <value>xxx.xxx.xxx.xxx</value>
  <description>Comma separated list of hostnames or IP addresses to ignore 
  robot rules parsing for. Use with care and only if you are explicitly
  allowed by the site owner to ignore the site's robots.txt!
  </description>
</property>
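
Before whitelisting a host, it may be worth checking whether robots.txt is really what blocks the PDFs. The sketch below uses Python's standard `urllib.robotparser` for a quick check. The robots.txt content, the agent name, and the URLs are hypothetical, and Nutch's own robots handling may differ in details from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; paste the real file from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /docs/
"""


def allowed(robots_txt, agent, url):
    """Return True if robots_txt lets the given agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


agent = "my-nutch-crawler"  # hypothetical agent name
print(allowed(ROBOTS_TXT, agent, "http://example.com/docs/report.pdf"))
print(allowed(ROBOTS_TXT, agent, "http://example.com/index.html"))
```

If the PDF URL is disallowed while ordinary pages are allowed, robots.txt explains the missing documents, and the whitelist should only be used with the site owner's permission, as the property description says.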

【Comments】:

【Solution 3】:

Use Nutch's parse-tika plugin. Plain text, XML, OpenDocument (OpenOffice.org), Microsoft Office (Word, Excel, PowerPoint), PDF, RTF, and MP3 (ID3 tags) are all parsed by the Tika plugin.
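
The plugin.includes value is a regular expression over plugin names, so a quick way to confirm that parse-tika is enabled is to test the name against it. The sketch below does this in Python with the value from Solution 1. It assumes the whole plugin name must match the expression, and the `is_included` helper and the "parse-pdf" name are hypothetical.

```python
import re

# plugin.includes value copied from Solution 1.
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-(html|tika|text)"
    r"|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin name matches the includes expression."""
    return re.fullmatch(includes, plugin_id) is not None


for name in ["parse-tika", "parse-html", "parse-pdf", "urlfilter-suffix"]:
    print(name, is_included(name))
```

With this value parse-tika matches, while a name such as urlfilter-suffix does not, so any plugin you rely on has to be covered by the expression.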

【Comments】: