【Question Title】: Nutch Crawl Script
【Posted】: 2015-09-09 19:27:41
【Question】:

I'm running Nutch 1.10 and I'm having trouble with the crawl script provided by the Nutch developers:

Usage: crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds>
    -i|--index      Indexes crawl results into a configured indexer
    -D              A Java property to pass to Nutch calls
    Seed Dir        Directory in which to look for a seeds file
    Crawl Dir       Directory where the crawl/link/segments dirs are saved
    Num Rounds      The number of rounds to run this crawl for
 Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/  2

I was wondering if someone could give me some insight into how to read this. For example:

    -i|--index      **What is the configured indexer? Is this part of Nutch, or is it another program like Solr? When I put in -i, what am I doing?**
    -D              **Not sure how these get used in the crawl, but the instruction is pretty self-explanatory.**
    Seed Dir        **Self-explanatory, but where do I put the directory within Nutch? I created a urls directory (per the instructions) in the apache-nutch-1.10 directory. I've also tried putting it in the apache-nutch-1.10/bin directory because that is where the crawl starts from.**
    Crawl Dir       **Is this where the results of the crawl go, or is this where the data for the injection into the crawldb goes? If it's the latter, where do I get said data? The directory starts out empty and never gets filled. Confusing!**
    Num Rounds      **Self-explanatory**

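Other questions: Where do the results of the crawl go? Do they have to go into a Solr core (or some other software)? Can they go straight into a directory so I can look at them? What format do they come out in?

Thanks!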

【Comments】:

Tags: solr cygwin nutch


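【Solution 1】:

-i : The configured indexer is a separate program such as Solr or Elasticsearch. When you specify the -i option, the crawl script runs the indexing job; otherwise it skips it.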
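To make the difference concrete, here is a minimal sketch of the two invocations, built only from the usage text in the question. The `run` helper and the variable names are my own additions for illustration, not part of Nutch; the helper just prints each command unless `RUN=1` is set. I'm assuming you start from the apache-nutch-1.10 directory, so the relative seed and crawl directories resolve against that working directory.

```shell
#!/bin/sh
# Values taken from the usage example in the question.
SEED_DIR="urls"                          # directory containing the seed file(s)
CRAWL_DIR="TestCrawl"                    # where crawldb/linkdb/segments are written
ROUNDS=2                                 # number of crawl rounds
SOLR_URL="http://localhost:8983/solr/"   # indexer URL from the usage example

# Hypothetical helper (not part of Nutch): dry run by default.
# Prints the command; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# Crawl only: without -i the indexing job is skipped.
run bin/crawl "$SEED_DIR" "$CRAWL_DIR" "$ROUNDS"

# Crawl and index: -i runs the indexing job, -D passes the Solr URL as a property.
run bin/crawl -i -D "solr.server.url=$SOLR_URL" "$SEED_DIR" "$CRAWL_DIR" "$ROUNDS"
```

Either way the crawl data lands in the crawl directory; -i only decides whether it is additionally pushed to the indexer.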

Crawl Dir : This is the directory where the crawl data is stored. It contains the crawldb, the segments, and the linkdb, so essentially all crawl-related data ends up here.
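If the directory seems to stay empty, a quick check along these lines can show which of the three parts were actually created. This is only a sketch: `check_crawl_dir` is a hypothetical helper name, and the subdirectory names are the ones listed in the usage text (crawl/link/segments dirs).

```shell
#!/bin/sh
# Hypothetical helper: report which crawl subdirectories exist under a crawl dir.
check_crawl_dir() {
  crawl_dir="$1"
  for sub in crawldb linkdb segments; do
    if [ -d "$crawl_dir/$sub" ]; then
      echo "$sub: present"
    else
      echo "$sub: missing"
    fi
  done
}

# Example: inspect the crawl directory used in the question.
check_crawl_dir "TestCrawl"
```

If all three are reported missing after a run, the script most likely failed before the inject step, or the crawl directory path resolved somewhere other than where you are looking.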

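The results of the crawl go into the crawl directory you specify. They are stored as sequence files, and there are commands for viewing the data.

You can find them at https://wiki.apache.org/nutch/CommandLineOptions.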
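As a rough sketch of what those commands look like in Nutch 1.x, the following reads back each part of the crawl directory. The `run` helper, `DUMP_DIR`, and the variable names are assumptions of mine for illustration; the helper prints each command unless `RUN=1` is set, so you can review the commands before running them from the Nutch directory.

```shell
#!/bin/sh
CRAWL_DIR="TestCrawl"   # crawl directory from the question's example
DUMP_DIR="dump"         # hypothetical output directory for plain-text dumps

# Hypothetical helper (not part of Nutch): dry run by default.
# Prints the command; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# Summary statistics for the crawldb (URL counts by status).
run bin/nutch readdb "$CRAWL_DIR/crawldb" -stats

# Dump the crawldb to plain text so it can be read directly.
run bin/nutch readdb "$CRAWL_DIR/crawldb" -dump "$DUMP_DIR/crawldb"

# List the segments created by each crawl round.
run bin/nutch readseg -list -dir "$CRAWL_DIR/segments"

# Dump the linkdb (inverted links) to plain text.
run bin/nutch readlinkdb "$CRAWL_DIR/linkdb" -dump "$DUMP_DIR/linkdb"
```

The dumps are ordinary text files, so no Solr core is needed just to look at what was crawled.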

【Comments】:

  • Just an off-topic question. Can you answer this one: stackoverflow.com/questions/39853492/…
  • bin/crawl -s urls -i http://127.0.0.1:8983/solr/nutch/ crawl 2 just returns the command-line options for me. Is something wrong? @sujen-shah