【Posted】: 2017-09-14 19:34:33
【Question】:
I'm new to scrapy and python, and I'm having a hard time understanding the flow. I don't understand where to put the "crawl to the next page" logic. I'm not sure whether it belongs after the callback to parse_data, or inside the parse_data function itself.
Script logic: for each category in the categories, scrape every page in that category.
Option 1:
import scrapy

class Amazon01Spider(scrapy.Spider):
    name = 'amazon0.1'
    allowed_domains = ['amazon.com']
    start_urls = ['https://amazon.com/Books/s?ie=UTF8&page=1&rh=n%3A283155&srs=9187220011']

    def parse(self, response):
        # Collect the category links from the left-hand navigation.
        cats = response.xpath('//*[@id="leftNavContainer"]//*[@class="a-unordered-list a-nostyle a-vertical s-ref-indent-two"]//li//@href').extract()
        for cat in cats:
            yield scrapy.Request(response.urljoin(cat), callback=self.parse_data)

    def parse_data(self, response):
        items = response.xpath('//*[@class="a-fixed-left-grid-col a-col-right"]')
        for item in items:
            name = item.xpath('.//*[@class="a-row a-spacing-small"]/div/a/h2/text()').extract_first()
            yield {'Name': name}
        # Pagination handled inside parse_data itself.
        next_page_url = response.xpath('//*[@class="pagnLink"]/a/@href').extract_first()
        if next_page_url:  # extract_first() returns None on the last page
            yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse_data)
Option 2:
import scrapy

class Amazon01Spider(scrapy.Spider):
    name = 'amazon0.1'
    allowed_domains = ['amazon.com']
    start_urls = ['https://amazon.com/Books/s?ie=UTF8&page=1&rh=n%3A283155&srs=9187220011']

    def parse(self, response):
        cats = response.xpath('//*[@id="leftNavContainer"]//*[@class="a-unordered-list a-nostyle a-vertical s-ref-indent-two"]//li//@href').extract()
        for cat in cats:
            yield scrapy.Request(response.urljoin(cat), callback=self.parse_data)
        # Pagination handled in parse; with no callback given, the
        # request goes back to self.parse by default.
        next_page_url = response.xpath('//*[@class="pagnLink"]/a/@href').extract_first()
        if next_page_url:  # extract_first() returns None on the last page
            yield scrapy.Request(response.urljoin(next_page_url))

    def parse_data(self, response):
        items = response.xpath('//*[@class="a-fixed-left-grid-col a-col-right"]')
        for item in items:
            name = item.xpath('.//*[@class="a-row a-spacing-small"]/div/a/h2/text()').extract_first()
            yield {'Name': name}
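A side note on URL building in both options: concatenating "https://amazon.com/" with an extracted href breaks when the href is already absolute, and raises a TypeError when extract_first() returns None on the last page. Scrapy's response.urljoin() wraps the standard library's urllib.parse.urljoin, which handles relative and absolute hrefs uniformly. A minimal standalone sketch of that joining behavior, using the spider's start URL as the base:

```python
from urllib.parse import urljoin

# Base page the hrefs were extracted from (the spider's start URL).
base = 'https://amazon.com/Books/s?ie=UTF8&page=1&rh=n%3A283155&srs=9187220011'

# A root-relative href, as Amazon's navigation links typically are,
# is resolved against the scheme and host of the base URL.
print(urljoin(base, '/Books/s?ie=UTF8&page=2'))
# -> https://amazon.com/Books/s?ie=UTF8&page=2

# An already-absolute href passes through unchanged.
print(urljoin(base, 'https://amazon.com/gp/help'))
# -> https://amazon.com/gp/help
```

This is why the corrected snippets above call response.urljoin(...) instead of string concatenation.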
【Discussion】:
Tags: python asynchronous web-scraping scrapy web-crawler