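【Posted】: 2016-09-01 02:04:49
【Question】:
Disclaimer: I am new to both Python and Scrapy.
I am trying to get my spider to collect URLs from the start URL, follow those collected URLs, and then do both of the following:
- scrape specific items from the next page (and eventually return them)
- collect more specific URLs from that page and follow those as well.
I would like to keep repeating this process of yielding items and issuing requests with callbacks, but I am not sure how to do it. Currently my code only returns URLs and no items, so I am clearly doing something wrong. Any feedback would be greatly appreciated.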
import scrapy
from scrapy.selector import Selector

# LegislatorsItems is the project's Item class; its import was not shown in the original post.


class VSSpider(scrapy.Spider):
    name = "vs5"
    allowed_domains = ["votesmart.org"]
    start_urls = [
        "https://votesmart.org/officials/WA/L/washington-state-legislative#.V8M4p5MrKRv",
    ]

    def parse(self, response):
        sel = Selector(response)
        # this gathers links to the individual legislator pages; it works
        for href in response.xpath('//h5/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse1)

    def parse1(self, response):
        sel = Selector(response)
        items = []
        # these xpaths are on the next page that the spider should follow,
        # when it first visits an individual legislator page
        for sel in response.xpath('//*[@id="main"]/section/div/div/div'):
            item = LegislatorsItems()
            item['current_office'] = sel.xpath('//tr[1]/td/text()').extract()
            item['running_for'] = sel.xpath('//tr[2]/td/text()').extract()
            items.append(item)
        # this is the xpath to the biography of the legislator,
        # which it should follow and scrape next
        for href in response.xpath('//*[@id="folder-bio"]/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse2, meta={'items': items})

    def parse2(self, response):
        sel = Selector(response)
        items = response.meta['items']
        # this is an xpath on the biography page
        for sel in response.xpath('//*[@id="main"]/section/div[2]/div/div[3]/div/'):
            item = LegislatorsItems()
            item['tester'] = sel.xpath('//div[2]/div[2]/ul/li[3]').extract()
            items.append(item)
            return items
Thanks!
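【Comments】:
- After a quick look at your code, I guess the last line, `return items`, should be at a different indentation level.
- Besides what starrify mentioned, is `parse2` ever reached? Can you post the crawl log?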
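To illustrate the indentation point raised above, here is a small plain-Python sketch (no Scrapy involved; the function names are made up for the example). A `return` placed inside the loop body runs during the first iteration and ends the function there, and if the loop never runs at all the function returns `None`.

```python
def collect_returning_inside_loop(rows):
    items = []
    for row in rows:
        items.append(row.upper())
        return items  # runs during the first iteration, so the loop stops here


def collect_returning_after_loop(rows):
    items = []
    for row in rows:
        items.append(row.upper())
    return items  # runs once, after every row has been processed


def collect_with_yield(rows):
    # Generator version: each result is handed back as soon as it is ready.
    for row in rows:
        yield row.upper()


print(collect_returning_inside_loop(["a", "b", "c"]))
print(collect_returning_after_loop(["a", "b", "c"]))
print(list(collect_with_yield(["a", "b", "c"])))
```

As for whether `parse2` is reached, one way to check is to add a `self.logger.info(...)` call at the top of the callback and look for that message in the crawl log. An exception raised inside the callback, for example from an invalid XPath expression, would also be reported there.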