Regarding crawling of short URLs using Nutch: answers

【Question title】: Regarding crawling of short URLs using Nutch
【Posted】: 2011-01-25 16:35:42
【Question】:

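I am using the Nutch crawler for my application. It needs to crawl a set of URLs that I place in the urls directory and fetch only the content of those URLs. I am not interested in the content of internal or external links, so I run the crawl command with the depth set to 1.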

bin/nutch crawl urls -dir crawl -depth 1

Nutch crawls the URLs and gives me the content of the given URLs.

I read the content using the readseg utility.

bin/nutch readseg -dump crawl/segments/* arjun -nocontent -nofetch -nogenerate -noparse -noparsedata

This gives me the content of the web pages.

The problem I am facing is that if I give direct URLs such as

http://isoc.org/wp/worldipv6day/
http://openhackindia.eventbrite.com
http://www.urlesque.com/2010/06/11/last-shot-ye-olde-twitter/
http://www.readwriteweb.com/archives/place_your_tweets_with_twitter_locations.php
http://bangalore.yahoo.com/labs/summerschool.html
http://riadevcamp.eventbrite.com
http://www.sleepingtime.org/

then I am able to get the content of the web pages. But when I give the set of URLs as short URLs, such as

http://is.gd/jOoAa9
http://is.gd/ubHRAF
http://is.gd/GiFqj9
http://is.gd/H5rUhg
http://is.gd/wvKINL
http://is.gd/K6jTNl
http://is.gd/mpa6fr
http://is.gd/fmobvj
http://is.gd/s7uZf***

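I am not able to fetch the content.

When I read the segments, no content is shown. Please find below the content of the dump file read from the segments.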

Recno:: 0
URL:: http://is.gd/0yKjO6

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1295969171407

Content::
Version: -1
url: http://is.gd/0yKjO6
base: http://is.gd/0yKjO6
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:

Recno:: 1
URL:: http://is.gd/1tpKaN

Content::
Version: -1
url: http://is.gd/1tpKaN
base: http://is.gd/1tpKaN
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/fighting-for-women-who-dont-want-a-voice?tu3=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0

I also tried setting the max.redirects property in nutch-default.xml to 4, but that made no difference. Please suggest a solution to this problem.

Thanks and regards, Arjun Kumar Reddy

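【Comments on the question】:

A link shortened with is.gd does not contain the actual page you want to crawl; it is only a redirect. That is why Nutch cannot fetch the content from it.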

Tags: nutch web-crawler short-url

【Solution 1】:

With Nutch 1.2, try editing the file conf/nutch-default.xml.
Find http.redirect.max and change the value to at least 1 instead of the default 0.

<property>
  <name>http.redirect.max</name>
  <value>2</value><!-- instead of 0 -->
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>

Good luck.
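
Rather than editing nutch-default.xml in place, the same setting can usually go in conf/nutch-site.xml, which Nutch loads on top of the defaults. A minimal sketch, assuming a stock Nutch 1.x conf directory and reusing the value 2 from the property shown above:

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>http.redirect.max</name>
    <!-- follow up to 2 redirects immediately; the value is illustrative -->
    <value>2</value>
  </property>
</configuration>
```

Keeping the override in nutch-site.xml leaves the shipped defaults untouched, which makes it easier to see what was changed.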

【Comments】:

【Solution 2】:

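You have to set the depth to 2 or more, because the first fetch returns a 301 (or 302) code. The redirect is followed in the next iteration, so you have to allow more depth.

Also, make sure that regex-urlfilter.txt allows every URL that will be followed.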

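【Comments】:

I have already tried this with a depth of 3, but I still cannot get the content of the web pages. Can you tell me what I should change in regex-urlfilter.txt?
regex-urlfilter.txt lets you set which URLs Nutch may or may not follow. If you allow "is.gd" to be crawled, you must also add to that file all the other URLs that your initial URLs redirect to (such as "holykaw.alltop.com").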
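
As a rough sketch of what such entries might look like: each rule in regex-urlfilter.txt is a `+` (accept) or `-` (reject) followed by a regular expression, and the first rule that matches a URL decides its fate. The two hosts below are taken from the question's dump; any other redirect targets are an assumption you would have to fill in yourself.

```
# accept the is.gd short links from the seed list
+^http://is\.gd/

# accept the host the short links redirect to
# (from the Location header in the segment dump)
+^http://holykaw\.alltop\.com/

# reject everything else
-.
```

Note that the stock file typically ships with a rule such as `-[?*!@=]`, which rejects URLs containing query characters. The redirect targets in the dump end in `?tu4=1`, so the accept rules would need to appear before that rule (or the rule would need to be relaxed) for those targets to pass the filter.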