How to speed up crawling in Nutch: answers

【Question title】: How to speed up crawling in Nutch
【Posted】: 2011-02-02 07:54:36
【Question description】:

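I am trying to develop an application in which I give Nutch a restricted set of URLs in the urls file. I can crawl these URLs and get their contents by reading the data from the segments.

I crawled with a depth of 1, because I don't care about the outlinks or inlinks in the pages. I only need the contents of the pages listed in the urls file.

But this crawl takes a long time. Please suggest a way to reduce the crawl time and increase the crawling speed. I don't need indexing either, since I'm not concerned with the search part.

Does anyone have suggestions on how to speed up the crawl?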

【Question comments】:

Arjun, it's my website you are crawling! Stop!

Tags: nutch web-crawler

【Solution 1】:

The main way to gain speed is to configure nutch-site.xml:

<property>
<name>fetcher.threads.per.queue</name>
   <value>50</value>
   <description></description>
</property>

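【Discussion】:

【Solution 2】:

You can scale up the threads in nutch-site.xml. Increasing both fetcher.threads.per.host and fetcher.threads.fetch will increase your crawl speed. I noticed a drastic improvement. Be careful when increasing these, though: if you don't have the hardware or connection to support the extra traffic, the number of errors during the crawl can increase significantly.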
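
As a rough sketch of the settings mentioned above, a nutch-site.xml override might look like the following. The values are illustrative assumptions, not tuned recommendations. The property names also depend on the Nutch version: as far as I know, later 1.x releases use fetcher.threads.per.queue where older ones used fetcher.threads.per.host, so check the nutch-default.xml shipped with your release.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Total number of fetcher threads across all hosts.
       50 is an assumed example value; size it to your bandwidth and CPU. -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>

  <!-- Threads allowed to hit one host (queue) at the same time.
       Named fetcher.threads.per.host in older releases (assumption: verify
       against your nutch-default.xml). 2 is an assumed example value. -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
  </property>
</configuration>
```

Raising the total thread count mostly helps when the URL list spans many hosts. Raising the per-host value increases the load on each individual site, so it is worth increasing in small steps while watching the error counts in the fetcher log.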

【Discussion】:

【Solution 3】:

For me, this property helped a lot, because a single slow domain can slow down the whole fetch phase:

 <property>
  <name>generate.max.count</name>
  <value>50</value>
  <description>The maximum number of urls in a single
  fetchlist.  -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
 </property>

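For example, if you respect robots.txt (the default behaviour) and a domain is very slow to crawl, the delay between requests to it can be as long as fetcher.max.crawl.delay. Having many URLs from such a domain in a queue slows down the whole fetch phase, so it is better to limit generate.max.count.

In the same way, you can add this property to limit the duration of the fetch phase: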
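
A hedged sketch of how these two knobs could be combined in nutch-site.xml is shown below. The values are assumptions chosen for illustration. The generate.count.mode property name and the semantics of fetcher.max.crawl.delay are taken from my reading of nutch-default.xml and should be confirmed against your version.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Cap on URLs per host or domain in one fetchlist (example value). -->
  <property>
    <name>generate.max.count</name>
    <value>50</value>
  </property>

  <!-- How generate.max.count groups URLs: "host" or "domain"
       (assumed to be the accepted values; check nutch-default.xml). -->
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>

  <!-- Upper bound, in seconds, on a robots.txt Crawl-Delay that the fetcher
       will tolerate. 10 is an assumed example value. -->
  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>10</value>
  </property>
</configuration>
```

With a cap per host, one slow site can contribute at most that many URLs to a fetch round, which bounds how long the other queues have to wait for it.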

<property>
  <name>fetcher.throughput.threshold.pages</name>
  <value>1</value>
  <description>The threshold of minimum pages per second. If the fetcher downloads fewer
  pages per second than the configured threshold, the fetcher stops, preventing slow queues
  from stalling the throughput. This threshold must be an integer. This can be useful when
  fetcher.timelimit.mins is hard to determine. The default value of -1 disables this check.
  </description>
</property>

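But please don't touch the fetcher.threads.per.queue property: you will end up on blacklists. It is not a good way to improve crawl speed.

【Discussion】:

【Solution 4】:

Hi, I'm also new to crawling, but I tried a few things and got good results, and perhaps you will too. I changed my nutch-site.xml with these properties: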
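
The description quoted above mentions fetcher.timelimit.mins as the other way to bound the fetch phase. A minimal sketch, assuming a 30-minute budget (an arbitrary example value), might be:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Stop fetching after this many minutes; URLs still queued at that point
       are skipped for this round. 30 is an assumed example value;
       -1 (the default) disables the limit. -->
  <property>
    <name>fetcher.timelimit.mins</name>
    <value>30</value>
  </property>
</configuration>
```

A time limit is easier to reason about when the size of the URL list is stable, while the throughput threshold adapts better when the list size varies between runs.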

<property>
  <name>fetcher.server.delay</name>
  <value>0.5</value>
 <description>The number of seconds the fetcher will delay between 
   successive requests to the same server. Note that this might get
   overridden by a Crawl-Delay from a robots.txt and is used ONLY if 
   fetcher.threads.per.queue is set to 1.
 </description>

</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>400</value>
  <description>The number of FetcherThreads the fetcher should use.
    This also determines the maximum number of requests that are
    made at once (each FetcherThread handles one connection).</description>
</property>


<property>
  <name>fetcher.threads.per.host</name>
  <value>25</value>
  <description>This number is the maximum number of threads that
    should be allowed to access a host at one time.</description>
</property>

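Please suggest more options. Thanks.

【Discussion】:

【Solution 5】:

I had a similar problem and was able to improve things with the help of https://wiki.apache.org/nutch/OptimizingCrawls

It has useful information about what can slow down your crawl and what you can do to address each of those issues.

Unfortunately, in my case the queues are very unbalanced and I can't send requests to the larger ones too quickly, or I get blocked. So I will probably need a cluster solution or TOR before I can speed up the threads any further.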

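【Discussion】:

【Solution 6】:

If you don't need to follow links, I see no reason to use Nutch. You can simply take your list of URLs and fetch them with an HTTP client library or a simple script using curl.

【Discussion】:

Yes, thanks for your input. I previously did the scraping with PHP multi-curl and got the results successfully. The problem I faced was that fetching the page contents took a long time. So, with scalability and speed in mind, I am considering moving to Nutch.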