【Title】: Scrapy doesn't follow new requests
【Posted】: 2021-11-15 11:05:33
【Question】:

I have written this code:

curl_command = "curl blah blah"

class MySpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['some_domain', ]
    start_urls = ['someurl', ]

    postal_codes = ['some_postal_code', ]

    def start_requests(self):
        for postal_code in self.postal_codes:
            curl_req = scrapy.Request.from_curl(curl_command=curl_command)
            curl_req._cb_kwargs = {'page': 0}

            yield curl_req

    def parse(self, response, **kwargs):
        cur_page = kwargs.get('page', 1)

        logging.info("Doing some logic")
        num_pages = do_some_logic()
        yield mySpiderItem

        if cur_page < num_pages:
            logging.info("New Request")
            curl_req = scrapy.Request.from_curl(curl_command=curl_command)
            curl_req._cb_kwargs = {'page': cur_page + 1}

            yield curl_req
            yield scrapy.Request(url="https://jsonplaceholder.typicode.com/posts")

Now the problem is that the parse method is only called once. In other words, the logs look like this:

Doing some logic
New Request
Spider closing

I have no idea what happened to the new request. Logically, the new request should also produce a "Doing some logic" log line, but for some reason it doesn't.

Am I missing something here? Is there another way to yield new requests?
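
(A quick way to confirm whether a yielded request is being silently dropped, rather than never yielded at all, is Scrapy's DUPEFILTER_DEBUG setting; a minimal sketch, assuming default project settings:)

    # settings.py (or custom_settings on the spider)
    # With this enabled, every request discarded by the built-in duplicate
    # filter is logged instead of vanishing silently.
    DUPEFILTER_DEBUG = True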

【Comments】:

    Tags: python web-scraping scrapy scrapy-pipeline


    【Solution 1】:

    It's hard to tell exactly where the problem lies from the code sample alone, but my guess is that you are not using the page number in the request.

    For example, here is your code adapted to a different website:

    import scrapy
    import logging


    curl_command = 'curl "https://scrapingclub.com/exercise/list_basic/"'


    class MySpider(scrapy.Spider):
        name = 'myspider'
        allowed_domains = ['scrapingclub.com']
        #start_urls = ['someurl', ]

        postal_codes = ['some_postal_code', ]

        def start_requests(self):
            for postal_code in self.postal_codes:
                # dont_filter=True lets this request through even if an
                # identical one has already been seen
                curl_req = scrapy.Request.from_curl(curl_command=curl_command, dont_filter=True)
                curl_req._cb_kwargs = {'page': 1}

                yield curl_req

        def parse(self, response, **kwargs):
            cur_page = kwargs.get('page', 1)

            logging.info("Doing some logic")
            #num_pages = do_some_logic()
            #yield mySpiderItem
            num_pages = 4
            if cur_page < num_pages:
                logging.info("New Request")
                # the page number is now part of the URL, so each new request
                # has a unique fingerprint
                curl_req = scrapy.Request.from_curl(curl_command=f'{curl_command}?page={cur_page + 1}', dont_filter=True)
                curl_req._cb_kwargs = {'page': cur_page + 1}
                yield curl_req
                yield scrapy.Request(url="https://jsonplaceholder.typicode.com/posts")
    

    Output:

    [scrapy.core.engine] DEBUG: Crawled (200) <GET https://scrapingclub.com/exercise/list_basic/> (referer: None)
    [root] INFO: Doing some logic
    [root] INFO: New Request
    [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'jsonplaceholder.typicode.com': <GET https://jsonplaceholder.typicode.com/posts>
    [scrapy.core.engine] DEBUG: Crawled (200) <GET https://scrapingclub.com/exercise/list_basic/?page=2> (referer: https://scrapingclub.com/exercise/list_basic/)
    [root] INFO: Doing some logic
    [root] INFO: New Request
    [scrapy.core.engine] DEBUG: Crawled (200) <GET https://scrapingclub.com/exercise/list_basic/?page=3> (referer: https://scrapingclub.com/exercise/list_basic/?page=2)
    [root] INFO: Doing some logic
    [root] INFO: New Request
    [scrapy.core.engine] DEBUG: Crawled (200) <GET https://scrapingclub.com/exercise/list_basic/?page=4> (referer: https://scrapingclub.com/exercise/list_basic/?page=3)
    

    Scrapy has a built-in duplicate filter that is enabled by default. Your original code yields the exact same request each time, so after the first one every duplicate is silently dropped. If you don't want this behavior, you can set dont_filter=True on the request so that duplicate requests are not ignored.
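
    For reference, a minimal sketch of the same pagination without touching the private _cb_kwargs attribute: Request.from_curl forwards extra keyword arguments to the Request constructor, so cb_kwargs and dont_filter can be passed directly (the spider name here is just a placeholder):

        import scrapy


        curl_command = 'curl "https://scrapingclub.com/exercise/list_basic/"'


        class PaginatedSpider(scrapy.Spider):
            name = 'paginated'
            allowed_domains = ['scrapingclub.com']

            def start_requests(self):
                # cb_kwargs and dont_filter are forwarded by from_curl()
                # to the Request constructor.
                yield scrapy.Request.from_curl(
                    curl_command=curl_command,
                    dont_filter=True,
                    cb_kwargs={'page': 1},
                )

            def parse(self, response, page):
                # cb_kwargs entries arrive as keyword arguments here.
                num_pages = 4
                if page < num_pages:
                    # The page number is part of the URL, so every request
                    # has a unique fingerprint and nothing gets filtered.
                    yield scrapy.Request.from_curl(
                        curl_command=f'{curl_command}?page={page + 1}',
                        cb_kwargs={'page': page + 1},
                    )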

    【Discussion】:

    • That's strange! Your code runs fine. I changed the curl_req in the start_requests and parse methods to a different scrapy.Request object (with a different url). Now the code behaves exactly the way I described in the question.
    • Indeed, I wasn't using the page number in the request. Can you explain why that causes the problem?
    • Oh, I see what you mean: scrapy by default filters out duplicate requests to the same url. If we don't want this behavior, we should include dont_filter=True in the request parameters. Did that, and now everything works.
    • If you could edit your answer and add the explanation about dont_filter to it, I'll mark it as the accepted answer. Thank you very much!
    • @Danialz I added the explanation about dont_filter. Glad your program works now.
    【Solution 2】:

    I think you forgot the callback part of the request. Check this code, which I took from the documentation. In your case it should be callback=self.parse.

    import scrapy


    class MySpider(scrapy.Spider):
        name = 'myspider'

        def start_requests(self):
            return [scrapy.FormRequest("http://www.example.com/login",
                                       formdata={'user': 'john', 'pass': 'secret'},
                                       callback=self.logged_in)]

        def logged_in(self, response):
            # here you would extract links to follow and return Requests for
            # each of them, with another callback
            pass
    

    【Discussion】:

    • parse is the default callback.
    • Since the parse method is the default callback, I omitted it. I tried passing the callback explicitly anyway, but still no luck.
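
    As the discussion notes, parse is already the default callback, so passing it explicitly changes nothing. A minimal illustration of that equivalence (hypothetical spider and URLs):

        import scrapy


        class ExampleSpider(scrapy.Spider):
            name = 'example'
            start_urls = ['https://example.com/']

            def parse(self, response):
                # These two requests are handled identically: with no
                # callback argument, Scrapy falls back to parse().
                yield scrapy.Request("https://example.com/a")
                yield scrapy.Request("https://example.com/b", callback=self.parse)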