【Posted】: 2011-07-13 18:30:56
【Question】:
This code is taken from http://code.google.com/p/crawler4j/; the file is named MyCrawler.java:
public class MyCrawler extends WebCrawler {

    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    /*
     * You should implement this function to specify
     * whether the given URL should be visited or not.
     */
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        if (filters.matcher(href).matches()) {
            return false;
        }
        if (href.startsWith("http://www.xyz.us.edu/")) {
            return true;
        }
        return false;
    }

    /*
     * This function is called when a page is fetched
     * and ready to be processed by your program.
     */
    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        String text = page.getText();
        List<WebURL> links = page.getURLs();
    }
}
This is the code of Controller.java, which invokes MyCrawler:
public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlController controller = new CrawlController("/data/crawl/root");
        controller.addSeed("http://www.xyz.us.edu/");
        controller.start(MyCrawler.class, 10);
    }
}
So I just want to confirm what this line in Controller.java means:
controller.start(MyCrawler.class, 10);
What does the 10 mean here? If we increase it from 10 to 20, what effect will that have? Any suggestions would be appreciated.
【Discussion】:
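As far as I know, the second argument of controller.start(MyCrawler.class, 10) is the number of concurrent crawler threads (crawler4j calls it numberOfCrawlers), so raising it to 20 would start 20 threads instead of 10. The sketch below is not crawler4j code. It is a plain java.util.concurrent stand-in, with a hypothetical class name, method name, and page URLs, that only illustrates the idea: N worker threads drain one shared queue of URLs, and changing N changes how many threads share the work, not how many pages get visited.

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CrawlerThreadsSketch {

    /** Outcome of the simulated crawl. */
    static final class Result {
        final int pagesVisited;   // URLs taken off the shared queue
        final int workerThreads;  // distinct threads that took part

        Result(int pagesVisited, int workerThreads) {
            this.pagesVisited = pagesVisited;
            this.workerThreads = workerThreads;
        }
    }

    /**
     * Hypothetical stand-in for start(MyCrawler.class, numberOfCrawlers):
     * numberOfCrawlers threads pull URLs from one shared frontier until it is empty.
     */
    static Result crawl(int numberOfCrawlers, int pageCount) throws InterruptedException {
        Queue<String> frontier = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < pageCount; i++) {
            frontier.add("http://www.xyz.us.edu/page" + i); // made-up URLs
        }

        AtomicInteger visited = new AtomicInteger();
        Set<String> threadNames = ConcurrentHashMap.newKeySet();

        ExecutorService pool = Executors.newFixedThreadPool(numberOfCrawlers);
        for (int i = 0; i < numberOfCrawlers; i++) {
            pool.execute(() -> {
                threadNames.add(Thread.currentThread().getName());
                // A real crawler would fetch and parse the page here.
                while (frontier.poll() != null) {
                    visited.incrementAndGet();
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return new Result(visited.get(), threadNames.size());
    }

    public static void main(String[] args) throws InterruptedException {
        Result ten = crawl(10, 200);
        Result twenty = crawl(20, 200);
        System.out.println(ten.pagesVisited + " pages, " + ten.workerThreads + " threads");
        System.out.println(twenty.pagesVisited + " pages, " + twenty.workerThreads + " threads");
    }
}
```

Whether 20 threads actually finish a real crawl faster than 10 depends on the network, the target server, and any delay the crawler enforces between requests; more threads mainly means more pages can be fetched in parallel.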
Tags: java web-crawler