【Question title】: Nutch crawl is not working
【Posted】: 2017-01-05 13:06:55
【Question】:

I want to use Apache Nutch 1.12 to crawl a website and index the data into Apache Solr. I followed this tutorial.

My seed.txt file contains this URL: http://nutch.apache.org/

My regex URL filter contains this rule: +^http://([a-z0-9]*.)*nutch.apache.org/
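
To see which URLs a rule like this lets through, the pattern can be tried outside Nutch. The sketch below is only an approximation: it uses `grep -E` as a stand-in for the Java regex matching that Nutch performs, drops the leading `+` (which marks the rule as an include rule and is not part of the regex), and `url_passes` is a hypothetical helper name, not a Nutch command.

```shell
# The rule from the question, without the leading "+" include marker.
pattern='^http://([a-z0-9]*.)*nutch.apache.org/'

# Hypothetical helper: succeeds (exit status 0) if the URL matches the pattern.
# grep -E stands in for Nutch's Java regex engine here (an assumption).
url_passes() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Print the verdict for a few sample URLs.
for url in \
  'http://nutch.apache.org/' \
  'http://www.nutch.apache.org/docs' \
  'http://example.com/' \
  'https://nutch.apache.org/'
do
  if url_passes "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

Under these assumptions the first two URLs are accepted and the last two are rejected, so links to other hosts, and `https://` links, would be filtered out even when they appear on the seed page.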

When I run the fetch, only the URL from my seed.txt file is fetched.

Fetcher: starting at 2017-01-03 09:56:23
Fetcher: segment: crawl/segments/20170103095613
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 2 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://nutch.apache.org/ (queue crawl delay=5000ms)
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
robots.txt whitelist not configured.
robots.txt whitelist not configured.
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=2
Thread FetcherThread has no more work available
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
-activeThreads=0

What am I missing here?

【Comments】:

  • Run the cycle repeatedly: Generate > Fetch > Parse > Updatedb. Check your log entries for more details.

Tags: solr nutch


【Solution 1】:

I ran the fetch step one more time and got the expected result.
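
The reason a second round helps: the first round only fetches the seed URLs, and the links discovered while parsing them are added to the crawldb by the updatedb step, so they become eligible for fetching only when generate runs again. A minimal sketch of one such round follows, assuming a Nutch 1.x installation in the current directory. The `crawl/segments` path comes from the log in the question; the `crawl/crawldb` path and the `crawl_round` function name are assumptions for illustration.

```shell
# Sketch of one Generate > Fetch > Parse > Updatedb round for Nutch 1.x.
# All three values below are assumptions and can be overridden.
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
SEGMENTS="${SEGMENTS:-crawl/segments}"

crawl_round() {
  # Select the URLs that are due for fetching and create a new segment.
  "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" || return 1

  # Segment directories are named by timestamp, so the last one is the newest.
  segment="$(ls -d "$SEGMENTS"/2* | tail -1)"

  # Fetch the pages, extract their outlinks, then merge the results
  # (including newly discovered URLs) back into the crawldb.
  "$NUTCH" fetch "$segment" &&
    "$NUTCH" parse "$segment" &&
    "$NUTCH" updatedb "$CRAWLDB" "$segment"
}
```

After the seeds have been injected, calling `crawl_round` two or three times in a row (for example `crawl_round && crawl_round`) should let each round pick up the links found by the previous one. Nutch 1.x also ships a `bin/crawl` script that wraps these rounds.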
