【Posted】:2015-09-09 19:27:41
【Question】:
I'm running Nutch 1.10 and having trouble with the Crawl script provided by the Nutch developers:
Usage: crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds>
-i|--index Indexes crawl results into a configured indexer
-D A Java property to pass to Nutch calls
Seed Dir Directory in which to look for a seeds file
Crawl Dir Directory where the crawl/link/segments dirs are saved
Num Rounds The number of rounds to run this crawl for
Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2
I was wondering if anyone could give me some insight into reading this. For example:
-i|--index **What is the configured indexer? Is this part of Nutch? Or is it another program like Solr? When I put in -i, what am I doing?**
-D **Not sure how these get used in the crawl but the instruction is pretty self-explanatory.**
Seed Dir **Self-explanatory, but where do I put the directory within Nutch? I created a urls directory (per the instructions) in the apache-nutch-1.10 directory. I've also tried putting it in the apache-nutch-1.10/bin directory because that is where the crawl starts from.**
Crawl Dir **Is this where the results of the crawl go, or is this where the data for the injection to the crawldb goes? If it's the latter, where do I get said data? The directory starts out empty and never gets filled. Confusing!**
Num Rounds **Self-explanatory**
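Other questions: Where do the crawl results go? Do they have to go into a Solr core (or some other software)? Can they go straight into a directory so I can look at them? What format do they come out in?
Thanks!
【Comments】: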
-
Read the Nutch tutorial and get familiar with the terms crawldb, segments, etc.: wiki.apache.org/nutch/….
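To make those terms concrete, here is a minimal sketch of a crawl without indexing, run from the Nutch install directory. The `NUTCH_HOME` variable, the `urls/seed.txt` file name, the seed URL, and the `crawldb_dump` output directory are assumptions for illustration; it also assumes local (non-Hadoop) mode, where relative paths are resolved against the directory the script is started from.

```shell
#!/bin/sh
# Assumption: NUTCH_HOME points at the apache-nutch-1.10 directory (defaults to the current directory).
NUTCH_HOME="${NUTCH_HOME:-.}"
SEED_DIR="urls"        # <Seed Dir>: holds one or more text files of start URLs
CRAWL_DIR="TestCrawl"  # <Crawl Dir>: the script writes crawldb/, linkdb/ and segments/ here

# Create the seed directory with a single seed file, one URL per line.
mkdir -p "$SEED_DIR"
printf '%s\n' 'http://nutch.apache.org/' > "$SEED_DIR/seed.txt"

if [ -x "$NUTCH_HOME/bin/crawl" ]; then
  # No -i flag: results stay on disk under $CRAWL_DIR and nothing is sent to an indexer.
  "$NUTCH_HOME/bin/crawl" "$SEED_DIR" "$CRAWL_DIR" 2

  # Print summary statistics for the crawl database.
  "$NUTCH_HOME/bin/nutch" readdb "$CRAWL_DIR/crawldb" -stats

  # Dump the crawl database to plain text for inspection (hypothetical output directory name).
  "$NUTCH_HOME/bin/nutch" readdb "$CRAWL_DIR/crawldb" -dump crawldb_dump
else
  echo "bin/crawl not found under $NUTCH_HOME; skipping the crawl"
fi
```

The data under the crawl directory is stored in Hadoop file formats rather than plain text, which is why the sketch goes through `readdb` to look at it; `bin/nutch readseg -dump` plays the same role for an individual segment.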