[Question title]: How to skip parent directories while scraping a file-type website?
[Posted]: 2018-05-30 18:37:07
[Question]:

When scraping a basic folder-style website that stores its files in directories,

yield scrapy.Request(url1, callback=self.parse)

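follows the links and scrapes everything behind them. However, I often run into cases where the crawler passes through the parent-directory link and then fetches all the same files again under different URLs, because the parent directory sits in the middle of the path: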

http://example.com/root/sub/file
http://example.com/root/sub/../sub/file

Any help would be appreciated.

Here is a snippet of the code:

class fileSpider(Spider):
    name = 'filespider'
    def __init__(self, filename=None):
        if filename:
            with open(filename, 'r') as f:
                self.start_urls =  [url.strip() for url in f.readlines()]

    def parse(self, response):
        item = Item()
        for url in response.xpath('//a/@href').extract():
            url1 = response.url + url
            if(url1[-4::] in videoext):
                item['name'] = url
                item['url'] = url1
                item['depth'] = response.meta["depth"]
                yield item
            elif(url1[-1]=='/'):
                yield scrapy.Request(url1, callback=self.parse)   
        pass

[Comments]:

    Tags: scrapy web-crawler scrapy-spider google-crawlers scrapyd


    [Solution 1]:

    You can use os.path.normpath to normalize every path, so that the same file no longer shows up under more than one URL:

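    import posixpath
    from urllib.parse import urlparse, urlunparse  # Python 3 home of the old urlparse module
    ...
    
        def parse(self, response):
            for url in response.xpath('//a/@href').extract():
                url1 = response.url + url
    
                # =======================
                # Collapse "." and ".." segments in the URL path. posixpath.normpath is
                # what os.path.normpath resolves to on POSIX; it keeps "/" on Windows too.
                url_parts = list(urlparse(url1))
                path = posixpath.normpath(url_parts[2] or '/')
                # normpath drops the trailing slash that the directory check below relies on
                if url_parts[2].endswith('/') and not path.endswith('/'):
                    path += '/'
                url_parts[2] = path
                url1 = urlunparse(url_parts)
                # =======================
    
                if url1[-4:] in videoext:
                    item = Item()  # fresh item for every file found
                    item['name'] = url
                    item['url'] = url1
                    item['depth'] = response.meta["depth"]
                    yield item
                elif url1.endswith('/'):
                    yield scrapy.Request(url1, callback=self.parse)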
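
To check the normalization step outside of a spider, it can be pulled out into a small standalone function. The following is only a sketch under a few assumptions: the name `normalize_url` is hypothetical (it is not part of Scrapy or of the original answer), it targets Python 3, and it uses `posixpath.normpath` rather than `os.path.normpath` so the result does not depend on the operating system.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Collapse "." and ".." segments in the path of a URL (hypothetical helper)."""
    parts = urlsplit(url)
    path = parts.path
    if not path:
        # Nothing to normalize, e.g. "http://example.com"
        return url
    normalized = posixpath.normpath(path)
    # normpath strips a trailing slash; put it back so directory URLs
    # still end in "/" and can be told apart from file URLs.
    if path.endswith("/") and not normalized.endswith("/"):
        normalized += "/"
    return urlunsplit(parts._replace(path=normalized))


if __name__ == "__main__":
    # The two URLs from the question collapse to the same string.
    print(normalize_url("http://example.com/root/sub/file"))
    print(normalize_url("http://example.com/root/sub/../sub/file"))
    # A parent-directory link resolves to the parent listing itself.
    print(normalize_url("http://example.com/root/sub/../"))
```

Once both forms of a URL normalize to the same string, Scrapy's default duplicate filter should see the second request as already visited and drop it, so the parent directory is not crawled again through the `..` link.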
    

    [Comments]:
