【Posted】: 2014-10-30 08:18:59
【Question】:
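I'm new to Scrapy and am trying to crawl a website with CrawlSpider, following the "Next" button recursively. It doesn't work: the spider scrapes only the landing page and never moves on to the next page. I suspect the regular expression in the rule, but I've checked it many times and can't find the error.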
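# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
# BedbugsItem comes from the project's items module, which the post does not show.

class BedbugsSpider(CrawlSpider):  # class name and `name` are assumed; the post omits them
    name = 'bedbugs'
    start_urls = ['https://shopping.yahoo.com/merchantrating/?mid=13652']
    rules = (
        # Follow "Next" links matching this pattern (regex unchanged from the post, now a raw string)
        Rule(LinkExtractor(allow=r"/merchantrating/;_ylt=Anf3hF19R8MGFPwuYuJUny4cEb0F\?mid=13652&sort=1&start=\d+"), callback='parse_start_url', follow=True),
    )
    items = []  # assumed; the post uses self.items without showing where it is defined

    def parse_start_url(self, response):
        sel = Selector(response)
        contents = sel.xpath('//p')
        for content in contents:
            # Store the text nodes of each <p> element as one item
            item = BedbugsItem()
            item['pageContent'] = content.xpath('text()').extract()
            self.items.append(item)
        return self.items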
【Discussion】:
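One possible cause, offered as a guess rather than a confirmed diagnosis: the `allow` pattern hard-codes a single `_ylt` value. If the "Next" link the site actually serves carries a different token, the pattern never matches and the spider stays on the landing page. The stdlib-only sketch below mimics that check with `re.search`. The `next_url` value, the `relaxed` pattern, and the `is_allowed` helper are all hypothetical and do not come from the original post.

```python
import re

# Pattern copied from the rule in the question: it pins one specific _ylt value.
strict = re.compile(
    r"/merchantrating/;_ylt=Anf3hF19R8MGFPwuYuJUny4cEb0F\?mid=13652&sort=1&start=\d+"
)

# Relaxed pattern (an assumption): accept any _ylt token, or none at all.
relaxed = re.compile(
    r"/merchantrating/(;_ylt=[^?]*)?\?mid=13652&sort=1&start=\d+"
)

# Hypothetical "next page" URL whose token differs from the hard-coded one.
next_url = (
    "https://shopping.yahoo.com/merchantrating/"
    ";_ylt=HYPOTHETICALTOKEN?mid=13652&sort=1&start=10"
)


def is_allowed(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


print(is_allowed(strict, next_url))   # False
print(is_allowed(relaxed, next_url))  # True
```

If the relaxed pattern is the right idea, it could replace the `allow` string in the rule. Checking the real `href` of the "Next" button in the page source would confirm or rule out this guess.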