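【Question title】: Exclude some URLs from getting crawled
【Posted】: 2025-11-29 16:40:02
【Question】:

I am writing a crawler, and there are some pages I do not want it to crawl (I want to exclude certain links). So I wrote an exclusion for that page. What is wrong with this code? Despite the exclusion, the URL http://www.host.com/technology/ still gets visited. I do not want any URL that starts with http://www.host.com/technology/ to be crawled.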

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class MyCrawler extends WebCrawler {

    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4" + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    List<String> exclusions;

    public MyCrawler() {

        exclusions = new ArrayList<String>();
        // Add here all your exclusions
        // I do not want this url to get crawled..
        exclusions.add("http://www.host.com/technology/");

    }

    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        System.out.println(href);
        if (filters.matcher(href).matches()) {
            System.out.println("noooo");
            return false;
        }

        if (exclusions.contains(href)) { // why this loop is not working??
            System.out.println("Yes2");
            return false;
        }

        if (href.startsWith("http://www.host.com/")) {
            System.out.println("Yes1");
            return true;
        }



        System.out.println("No");
        return false;
    }

    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();         
        String text = page.getText();
        List<WebURL> links = page.getURLs();
        int parentDocid = page.getWebURL().getParentDocid();

        System.out.println("=============");
        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Text length: " + text.length());
        System.out.println("Number of links: " + links.size());
        System.out.println("Docid of parent page: " + parentDocid);
        System.out.println("=============");
    }   
}

【Discussion】:

    Tags: java web-crawler


    【Solution 1】:

    If you do not want to crawl any URL that starts with one of the exclusions, you have to check each exclusion as a prefix:

    for(String exclusion : exclusions){
        if(href.startsWith(exclusion)){
            return false;
        }
    }
    

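    Also, an if statement is not a loop.

    【Comments】:

    • Your code checks whether the entire URL is in the exclusion list (exclusions.contains(href)), rather than whether the URL starts with any of the exclusions (my example).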
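
    The difference between the two checks can be seen in isolation with a minimal sketch that needs no crawler library. The class name UrlExclusions and the helper isExcluded are hypothetical names introduced here for illustration, and the sample URLs under www.host.com are assumed. The helper lowercases the URL as the question's shouldVisit does, so the exclusion prefixes are assumed to be stored in lowercase.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of any crawler library.
final class UrlExclusions {

    // Returns true when href starts with any of the excluded prefixes.
    // Assumes the prefixes in 'exclusions' are already lowercase.
    static boolean isExcluded(String href, List<String> exclusions) {
        String normalized = href.toLowerCase(Locale.ROOT);
        for (String exclusion : exclusions) {
            if (normalized.startsWith(exclusion)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> exclusions = Arrays.asList("http://www.host.com/technology/");
        String href = "http://www.host.com/technology/gadgets.html";

        // contains() only matches when the whole URL equals a list entry.
        System.out.println(exclusions.contains(href));    // false
        // The prefix check matches every URL under the excluded path.
        System.out.println(isExcluded(href, exclusions)); // true
        // URLs outside the excluded path are not affected.
        System.out.println(isExcluded("http://www.host.com/sports/", exclusions)); // false
    }
}
```

    In shouldVisit, the same idea would mean returning false as soon as the prefix check matches, before the general href.startsWith("http://www.host.com/") test is reached.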