【Title】: Nutch fetching timeout
【Posted】: 2017-02-27 18:32:14
【Description】:

I am trying to crawl a few websites with nutch-1.12, but the fetch fails for some of the sites in my seed list:

http://www.nature.com/ (1)
https://www.theguardian.com/international (2)
http://www.geomar.de (3)

As you can see in the log below, (2) and (3) work fine, while fetching (1) results in a timeout, even though the link opens fine in a browser. Since I don't want to drastically increase the wait time and the number of retries, I would like to know whether there is another way to find out why this timeout occurs and how to fix it.


Log

Injector: starting at 2017-02-27 18:33:38
Injector: crawlDb: nature_crawl/crawldb
Injector: urlDir: urls-2
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 3
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 3
Injector: finished at 2017-02-27 18:33:42, elapsed: 00:00:03
Generator: starting at 2017-02-27 18:33:45
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: running in local mode, generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: nature_crawl/segments/20170227183349
Generator: finished at 2017-02-27 18:33:51, elapsed: 00:00:05
Fetcher: starting at 2017-02-27 18:33:53
Fetcher: segment: nature_crawl/segments/20170227183349
Fetcher: threads: 3
Fetcher: time-out divisor: 2
QueueFeeder finished: total 3 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching https://www.theguardian.com/international (queue crawl delay=1000ms)
Using queue mode : byHost
fetching http://www.nature.com/ (queue crawl delay=1000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://www.geomar.de/ (queue crawl delay=1000ms)
robots.txt whitelist not configured.
robots.txt whitelist not configured.
robots.txt whitelist not configured.
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=2
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
.
.
.
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
fetch of http://www.nature.com/ failed with: java.net.SocketTimeoutException: Read timed out
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
-activeThreads=0
Fetcher: finished at 2017-02-27 18:34:18, elapsed: 00:00:24
ParseSegment: starting at 2017-02-27 18:34:21
ParseSegment: segment: nature_crawl/segments/20170227183349
Parsed (507ms):http://www.geomar.de/
Parsed (344ms):https://www.theguardian.com/international
ParseSegment: finished at 2017-02-27 18:34:24, elapsed: 00:00:03
CrawlDb update: starting at 2017-02-27 18:34:26
CrawlDb update: db: nature_crawl/crawldb
CrawlDb update: segments: [nature_crawl/segments/20170227183349]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2017-02-27 18:34:30, elapsed: 00:00:03

【Discussion】:

    Tags: web-crawler nutch


    【Solution 1】:

    You can try increasing the HTTP timeout setting in nutch-site.xml:

    <property>
      <name>http.timeout</name>
      <value>30000</value>
      <description>The default network timeout, in milliseconds.</description>
    </property>
    

    Otherwise, check whether the site's robots.txt allows crawling its pages.
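
    A quick way to verify this (a sketch, assuming curl is available) is to fetch the robots.txt directly and look for Disallow rules that match your paths:

    # show any Disallow rules in the site's robots.txt
    curl -s http://www.nature.com/robots.txt | grep -i disallow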

    【Comments】:

    • If I understand the concept of robots.txt correctly, then a regular page such as nature.com/nature/index.html should be allowed according to nature.com/robots.txt. If there is no other solution, I will try increasing the timeout value.
    • Yes, since Nutch respects robots.txt, it will not crawl a path that is disallowed. Another thing worth trying is to change the crawler's user agent (e.g. "http.agent.name") and remove the word "Nutch" from it. Some sites block bots based on their name.
    • Setting http.agent.name to "nature" did not work either.
    • "http.agent.name" is only one part of the user agent. You can see this in nutch-default.xml: "HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization. NOTE: You should also check other related properties: http.robots.agents http.agent.description http.agent.url http.agent.email http.agent.version and set their values appropriately." See the sketch after this list.
    • I changed the value of http.agent.version and it worked for me. Thanks!
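
    As a concrete illustration of the agent properties discussed above, a minimal nutch-site.xml sketch (the values are placeholders, not settings verified against www.nature.com):

    <!-- nutch-site.xml: override the agent-related defaults -->
    <property>
      <name>http.agent.name</name>
      <value>mycrawler</value>
      <description>Single word identifying the crawler; must not be empty.</description>
    </property>
    <property>
      <name>http.agent.version</name>
      <value>1.0</value>
      <description>Replaces the default "Nutch-1.12" that would otherwise appear in the user agent.</description>
    </property>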
    【Solution 2】:

    Not sure why, but www.nature.com seems to leave the connection hanging if the user agent string contains "Nutch". This can also be reproduced with wget:

    wget -U 'my-test-crawler/Nutch-1.13-SNAPSHOT (mydotmailatexampledotcom)' -d http://www.nature.com/
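
    For comparison (an expectation based on this answer's claim, not a captured log), the same request should complete once "Nutch" is removed from the agent string:

    wget -U 'my-test-crawler/1.0 (mydotmailatexampledotcom)' -d http://www.nature.com/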

    【Comments】:

    • Is this user agent string the same as http.agent.name in nutch-default.xml? I changed it to "nature" but still get the same error.
    • That alone won't help, because the part "/Nutch-1.13-SNAPSHOT" (or whatever the version is) is always appended. You have to override the entire agent string; the easiest way is via http.agent.rotate (don't forget to add the accepted user agent strings to your agents.txt; see the sketch below).
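
    A sketch of that rotation setup, using the property names from nutch-default.xml (the agent string is a placeholder):

    <!-- nutch-site.xml: rotate through agent strings listed in conf/agents.txt -->
    <property>
      <name>http.agent.rotate</name>
      <value>true</value>
      <description>If true, agent names are chosen from the file named by http.agent.rotate.file (agents.txt by default) instead of http.agent.name.</description>
    </property>

    conf/agents.txt (one complete user agent string per line):

    my-test-crawler/1.0 (mydotmailatexampledotcom)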