【Question Title】:Nutch says No URLs to fetch - check your seed list and URL filters
【Post Date】:2014-07-18 02:38:37
【Question Description】:

~/runtime/local/bin/urls/seed.txt >>

http://nutch.apache.org/

~/runtime/local/conf/nutch-site.xml >>

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
            <name>http.agent.name</name>
            <value>My Nutch Spider</value>
    </property>

    <property>
            <name>http.timeout</name>
            <value>99999999</value>
            <description></description>
    </property>

    <property>
            <name>plugin.includes</name>
            <value>protocol-file|protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|
            scoring-opic|urlnormalizer-(pass|regex|basic)|index-more
            </value>
            <description>Regular expression naming plugin directory names to
            include.  Any plugin not matching this expression is excluded.
            In any case you need at least include the nutch-extensionpoints plugin.
            </description>
    </property>
</configuration>

~/runtime/local/conf/regex-urlfilter.txt >>

# skip http: ftp: and mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org/([a-z0-9\-A-Z]*\/)*
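
As written, the first rule in this file matches any `http:` URL, so the seed would be rejected before the final accept rule is ever reached. A minimal Python sketch of urlfilter-regex's first-match-wins semantics (patterns copied from the file above, image-suffix list abbreviated) illustrates the effect:

```python
import re

# Ordered rules transcribed from the regex-urlfilter.txt shown above
# (image-suffix list abbreviated for brevity).
rules = [
    ('-', r'^(http|ftp|mailto):'),
    ('-', r'\.(gif|GIF|jpg|JPG|png|PNG|css|CSS|js|JS)$'),
    ('-', r'[?*!@=]'),
    ('-', r'.*(/[^/]+)/[^/]+\1/[^/]+\1/'),
    ('+', r'^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org/([a-z0-9\-A-Z]*\/)*'),
]

def accepts(url):
    # Like Nutch's urlfilter-regex: rules are tried top to bottom,
    # the first matching rule decides, and no match means reject.
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == '+'
    return False

print(accepts('http://nutch.apache.org/'))  # False: rule 1 rejects every http: URL
```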

When I crawl, it says this:

/home/apache-nutch-1.4-bin/runtime/local/bin
$ ./nutch crawl urls -dir newCrawl/ -depth 3 -topN 3
cygpath: can't convert empty path
solrUrl is not set, indexing will be skipped...
crawl started in: newCrawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 3
Injector: starting at 2014-07-18 11:35:36
Injector: crawlDb: newCrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2014-07-18 11:35:39, elapsed: 00:00:02
Generator: starting at 2014-07-18 11:35:39
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 3
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: newCrawl

No matter what the URL is, it always says there are no URLs to fetch. I have been struggling with this for 3 days. Please help!!!!

【Question Discussion】:

  • Add a few Wikipedia pages, or another site containing many links, to seed.txt and try again.

Tags: nutch


【Solution 1】:

While looking over your regex filters I spotted a few glitches you might want to try fixing. Since this doesn't fit well into a comment, I'm posting it here anyway, even though it may not be a complete answer.

  1. Your custom regex +^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org/([a-z0-9\-A-Z]*\/)* may be the problem. Nutch's regex-urlfilter can get very messy at times, and I strongly suggest starting from something known to work for everyone, perhaps the +^http://([a-z0-9]*\.)*nutch.apache.org/ from the wiki, just to get going.
  2. Once the step above confirms that Nutch is working, you can then tune the regex.

To test the regex, I have found two approaches:

  1. Provide a list of URLs as seeds, inject them into a fresh crawldb, and check which ones were injected or rejected. This involves no coding at all.
  2. Set up Nutch in Eclipse and call the corresponding classes to test.
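
The inject test in option 1 can also be approximated outside of Nutch with a rough sketch (a hypothetical helper, not part of Nutch) that parses a regex-urlfilter.txt-style rule list and reports which URLs pass:

```python
import re

def parse_filters(text):
    """Parse regex-urlfilter.txt-style lines into (sign, pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        rules.append((line[0], re.compile(line[1:])))
    return rules

def accepts(rules, url):
    # First matching rule decides; a URL matching no rule is rejected.
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == '+'
    return False

# The wiki-style accept rule suggested above, plus the default skip rule.
rules = parse_filters(r"""
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
+^http://([a-z0-9]*\.)*nutch.apache.org/
""")
for url in ['http://nutch.apache.org/', 'ftp://nutch.apache.org/']:
    print(url, accepts(rules, url))
```

Feeding each seed URL through such a checker shows immediately which rule accepts or rejects it, before running a full inject cycle.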

【Discussion】:

  • Thanks, but the first one is a typo. T.T
  • Hi. I would also go with the option suggested by the user above. I too suspect the problem lies in the regex, so remove it first and try running the program; if that works, you can then build the correct regex for the Nutch site. technical-fundas.blogspot.in/2014/06/…