【Title】: Scrapy does not consume all start_urls
【Posted】: 2022-03-20 15:30:24
【Question】:

I've been struggling with this for a while and haven't been able to solve it. The problem is that I have a start_urls list containing a few hundred URLs, but only a fraction of those URLs are consumed by my spider's start_requests().

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    
    #SETTINGS
    name = 'example'
    allowed_domains = []
    start_urls = []
                
    #set rules for links to follow
    link_follow_extractor = LinkExtractor(allow=allowed_domains, unique=True)
    rules = (Rule(link_follow_extractor, callback='parse', process_request='process_request', follow=True),)

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        
        #urls to scrape
        self.start_urls = ['https://example1.com','https://example2.com']
        self.allowed_domains = ['example1.com','example2.com']          

    def start_requests(self):
                
        #create initial requests for urls in start_urls        
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse, priority=1000, meta={'priority': 100, 'start': True})
    
    def parse(self, response):
        print("parse")

I've read multiple posts about this problem on StackOverflow, as well as some threads on Github (going back as far as 2015), but haven't been able to make it work.

As I understand it, the problem is that while I'm still creating the initial requests, the requests already issued have produced responses which get parsed and spawn new requests that fill up the queue. I confirmed this is my problem, because when I used a middleware to limit the number of pages downloaded per domain to 2, the issue appeared to be resolved. That makes sense: the first requests created would only generate a few new requests, leaving room in the queue for the rest of the start_urls list.
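The per-domain cap described above can be sketched roughly as follows. This is my own minimal sketch, not the asker's actual middleware: the class name and the cap of 2 are assumptions, and the local `DomainLimitError` is a stand-in for `scrapy.exceptions.IgnoreRequest`, which is what a real Scrapy downloader middleware would raise from `process_request`.

```python
from urllib.parse import urlparse


class DomainLimitError(Exception):
    """Stand-in for scrapy.exceptions.IgnoreRequest in this sketch."""


class MaxPagesPerDomainMiddleware:
    """Hypothetical downloader middleware capping requests per domain."""

    def __init__(self, max_pages=2):
        self.max_pages = max_pages
        self.counts = {}

    def process_request(self, request, spider=None):
        # Count requests per domain; drop any request beyond the cap.
        domain = urlparse(request.url).netloc
        self.counts[domain] = self.counts.get(domain, 0) + 1
        if self.counts[domain] > self.max_pages:
            raise DomainLimitError(f"page limit reached for {domain}")
        return None  # let the request through
```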

I also noticed that when I reduced the number of concurrent requests from 32 to 2, an even smaller portion of the start_urls list was consumed. Increasing the concurrent requests to several hundred is not an option, as that leads to timeouts.

It remains unclear to me why the spider behaves this way and simply stops consuming the start_urls. I'd appreciate any pointers toward potential solutions to this problem.
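One direction that may help, based on Scrapy's documented settings (not something the asker tried): Scrapy schedules requests depth-first (LIFO) by default, so follow-up links can jump ahead of still-queued start requests. Switching the scheduler to breadth-first order lets depth-0 requests, i.e. the start_urls, drain before deeper links. A minimal settings sketch:

```python
# Breadth-first crawl order: a positive DEPTH_PRIORITY combined with FIFO
# scheduler queues means shallower requests (start_urls are depth 0) are
# dequeued before the deeper follow-up links that would otherwise crowd
# them out. These are Scrapy's documented BFO settings.
custom_settings = {
    "DEPTH_PRIORITY": 1,
    "SCHEDULER_DISK_QUEUE": "scrapy.squeues.PickleFifoDiskQueue",
    "SCHEDULER_MEMORY_QUEUE": "scrapy.squeues.FifoMemoryQueue",
}
```

These can also go in the project's settings.py instead of the spider's `custom_settings`.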

【Comments】:

  • Did you ever solve this?

标签: python asynchronous web-scraping scrapy


【Solution 1】:

I struggled with the same problem: my crawler would never get past page 1 of any of the start_urls I defined.

Beyond the documentation saying that the CrawlSpider class uses its own parse() internally on every response, so you should never override parse or risk the spider no longer working, what the documentation doesn't mention is that the parser used by CrawlSpider does not parse the start_urls (even though it needs to). So the spider works at first, then fails with a "no parse in callback" style error when it tries to crawl to the next page/start_url.

Long story short, try this (it worked for me): add a parse function for the start_urls. Like mine, it doesn't need to actually do anything.

def parse(self, start_urls):
    for i in range(1, len(start_urls)):
        print('Starting to scrape page: ' + str(i))
    self.start_urls = start_urls

Below is my entire code (the user agent is defined in the project settings):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class PSSpider(CrawlSpider):
    name = 'jogos'
    allowed_domains = ['meugameusado.com.br']
    start_urls = [
        'https://www.meugameusado.com.br/playstation/playstation-3/jogos?pagina=1',
        'https://www.meugameusado.com.br/playstation/playstation-4/jogos?pagina=1',
        'https://www.meugameusado.com.br/playstation/playstation-2/jogos?pagina=1',
        'https://www.meugameusado.com.br/playstation/playstation-5/jogos?pagina=1',
        'https://www.meugameusado.com.br/playstation/playstation-vita/jogos?pagina=1',
    ]
    
    def parse(self, start_urls):
        for i in range(1, len(start_urls)):
            print('Starting to scrape page: ' + str(i))
        self.start_urls = start_urls

    rules = (
        Rule(
            LinkExtractor(
                allow=[r'/playstation/playstation-2/jogos?pagina=[1-999]', r'/playstation/playstation-3/jogos?pagina=[1-999]',
                       r'/playstation/playstation-4/jogos?pagina=[1-999]', r'/playstation/playstation-5/jogos?pagina=[1-999]',
                       r'/playstation/playstation-vita/jogos?pagina=[1-999]', 'jogo-'],
                deny=('/jogos-de-', '/jogos?sort=', '/jogo-de-', 'buscar?', '-mega-drive', '-sega-cd', '-game-gear',
                      '-xbox', '-x360', '-xbox-360', '-xbox-series', '-nes', '-gc', '-gbc', '-snes', '-n64', '-3ds',
                      '-wii', 'switch', '-gamecube', '-xbox-one', '-gba', '-ds',
                      r'/nintendo*', r'/xbox*', r'/classicos*', r'/raridades*', r'/outros*')),
            callback='parse_item',
            follow=True),
    )

    def parse_item(self, response):
        yield {
            'title': response.css('h1.nome-produto::text').get(),
            'price': response.css('span.desconto-a-vista strong::text').get(),
            'images': response.css('span > img::attr(data-largeimg)').getall(),
            'video': response.css('#playerVideo::attr("src")').get(),
            'descricao': response.xpath('//*[@id="descricao"]/h3[contains(text(),"ESPECIFICAÇÕES")]/preceding-sibling::p/text()').getall(),
            'especificacao1': response.xpath('//*[@id="descricao"]/h3[contains(text(),"GARANTIA")]/preceding-sibling::ul/li/strong/text()').getall(),
            'especificacao2': response.xpath('//*[@id="descricao"]/h3[contains(text(),"GARANTIA")]/preceding-sibling::ul/li/text()').getall(),
            'tags': response.xpath('//*[@id="descricao"]/h3[contains(text(),"TAGS")]/following-sibling::ul/li/a/text()').getall(),
            'url': response.url,
        }

【Discussion】:
