【Posted】:2021-11-15 11:05:33
【Question】:
I have written the following code:
import logging

import scrapy

curl_command = "curl blah blah"


class MySpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['some_domain', ]
    start_urls = ['someurl', ]
    postal_codes = ['some_postal_code', ]

    def start_requests(self):
        for postal_code in self.postal_codes:
            curl_req = scrapy.Request.from_curl(curl_command=curl_command)
            curl_req._cb_kwargs = {'page': 0}
            yield curl_req

    def parse(self, response, **kwargs):
        cur_page = kwargs.get('page', 1)
        logging.info("Doing some logic")
        num_pages = do_some_logic()  # placeholder for the real page count
        yield mySpiderItem  # placeholder for the real item
        if cur_page < num_pages:
            logging.info("New Request")
            curl_req = scrapy.Request.from_curl(curl_command=curl_command)
            curl_req._cb_kwargs = {'page': cur_page + 1}
            yield curl_req
            yield scrapy.Request(url="https://jsonplaceholder.typicode.com/posts")
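As an aside on the snippet above: _cb_kwargs is a private Request attribute. If I read the Scrapy docs correctly, Request.from_curl() forwards extra keyword arguments to the Request constructor, so the page counter could be attached directly instead. A minimal sketch of start_requests written that way (the curl command is still a placeholder):

    def start_requests(self):
        for postal_code in self.postal_codes:
            # from_curl() accepts the same keyword arguments as Request(),
            # so cb_kwargs can be passed without touching _cb_kwargs.
            yield scrapy.Request.from_curl(
                curl_command=curl_command,
                cb_kwargs={'page': 0},
            )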
Now the problem is that the parse method is only ever called once. In other words, the log looks like this:
Doing some logic
New Request
Spider closing
I don't know what happened to the new request. Logically, the new request should also produce a Doing some logic log entry, but for some reason it never does.
Am I missing something here? Is there another way to yield a new request?
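For reference, one variant worth sketching: by default, Scrapy's scheduler drops requests it considers duplicates, and the second from_curl() request targets the same URL with the same body as the first. Below is a minimal, untested sketch that opts that one request out of the duplicate filter, assuming from_curl() forwards keyword arguments such as cb_kwargs and dont_filter to the Request constructor:

    if cur_page < num_pages:
        logging.info("New Request")
        # dont_filter=True asks the scheduler not to discard this request
        # as a duplicate of the identical one from start_requests().
        yield scrapy.Request.from_curl(
            curl_command=curl_command,
            cb_kwargs={'page': cur_page + 1},
            dont_filter=True,
        )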
【Discussion】:
Tags: python web-scraping scrapy scrapy-pipeline