[Posted]: 2016-07-13 21:21:00
[Question]:
I am trying to collect data on houses for sale in Amsterdam from http://www.funda.nl/koop/amsterdam/. The main page shows only a limited number of houses, and at the bottom there is a pager (screenshot omitted):
("Volgende" is Dutch for "next".) From it I infer that there are 255 pages in total. Each of those pages has a URL of the form http://www.funda.nl/koop/amsterdam/p2/, http://www.funda.nl/koop/amsterdam/p3/, and so on. To get the data for all the houses, I would like to "loop" over all the subpages p1, p2, ..., p255.
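For concreteness, the URL scheme just described could be written out as a minimal sketch like this (plain Python, no Scrapy; base_url and page_urls are names I made up, and the count of 255 is the one inferred from the pager):

base_url = "http://www.funda.nl/koop/amsterdam/"
# Page 1 is the base URL itself; pages 2..255 carry a pN/ suffix.
page_urls = [base_url] + ["%sp%d/" % (base_url, n) for n in range(2, 256)]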
I am trying to figure out how to "set this up". So far I have written the following code:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Funda.items import FundaItem
# from scrapy.shell import inspect_response


class FundaSpider(CrawlSpider):
    name = "Funda"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    # Links to the page of an individual house, such as
    # http://www.funda.nl/koop/amsterdam/huis-49805292-nieuwendammerdijk-21/
    le1 = LinkExtractor(allow=r'%s+huis-\d{8}' % start_urls[0])
    # Links to a page containing thumbnails of several houses, such as
    # http://www.funda.nl/koop/amsterdam/p10/
    le2 = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])

    rules = (
        Rule(le1, callback='parse_item'),
        Rule(le2, callback='get_max_page_number'),
    )

    def parse_item(self, response):
        # Yield one item per link that points directly at a house page.
        links = self.le1.extract_links(response)
        for link in links:
            if link.url.count('/') == 6 and link.url.endswith('/'):
                item = FundaItem()
                item['url'] = link.url
                yield item

    def get_max_page_number(self, response):
        # Find the highest page number among the pager links.
        links = self.le2.extract_links(response)
        max_page_number = 0
        for link in links:
            if link.url.count('/') == 6 and link.url.endswith('/'):
                page_number = int(link.url.split("/")[-2].strip('p'))
                if page_number > max_page_number:
                    max_page_number = page_number
        return max_page_number
The LinkExtractor le2 has the callback get_max_page_number, which simply returns the number 255. I would then like to use that number to "synthesize" different start_urls to which LinkExtractor le1 is applied, so that I get the links to the individual houses on every page.
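In other words, what I am after is roughly this self-contained sketch, where PageLoopSpider is a made-up name and the page count is hard-coded instead of coming from get_max_page_number (which is precisely the part I do not know how to wire up):

import scrapy

class PageLoopSpider(scrapy.Spider):
    # Hypothetical sketch: the page count is known up front here, whereas
    # in my real spider it would have to come from get_max_page_number.
    name = "Funda_pageloop"
    allowed_domains = ["funda.nl"]
    base_url = "http://www.funda.nl/koop/amsterdam/"
    max_page_number = 255  # the value get_max_page_number should supply

    def start_requests(self):
        # Synthesize one request per result page; p1 is the base URL itself.
        yield scrapy.Request(self.base_url, callback=self.parse)
        for n in range(2, self.max_page_number + 1):
            yield scrapy.Request('%sp%d/' % (self.base_url, n),
                                 callback=self.parse)

    def parse(self, response):
        # Here the individual house links would be extracted, as in parse_item.
        pass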
The problem is that, as far as I understand, Scrapy handles these requests asynchronously, so I cannot be sure that it will first obtain the number 255 and only then use it to generate the other requests. If that is the case, I would need to run two spiders one after the other from a script, and in the second spider the start_urls would have to be passed in as a variable.
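If two spiders are really needed, I imagine the driver script would look roughly like the sequential-run pattern from the Scrapy documentation. This is only a sketch: PageCountSpider is a made-up name for a first spider that finds the page count, and how its result (hard-coded as 255 below) actually reaches the second run is exactly the part I am unsure about:

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

# PageCountSpider (hypothetical) and FundaSpider are assumed importable here.
configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl_sequentially():
    # First spider: determine the max page number (and store it somewhere).
    yield runner.crawl(PageCountSpider)
    max_page = 255  # placeholder for the value produced by the first spider
    urls = ['http://www.funda.nl/koop/amsterdam/'] + \
           ['http://www.funda.nl/koop/amsterdam/p%d/' % n
            for n in range(2, max_page + 1)]
    # Second spider: receives the synthesized start URLs as an argument.
    yield runner.crawl(FundaSpider, start_urls=urls)
    reactor.stop()

crawl_sequentially()
reactor.run()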
Any pointers on how to "set this up"?
[Comments]: