【Posted】: 2016-01-11 21:09:58
【Question】:
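I'm currently trying to use Scrapy to crawl the Elite Dangerous subreddit and collect each post's title, URL, and vote count. I have the first two working, but I'm not sure how to write an XPath expression that gets the votes.
selector.xpath('//div[@class="score unvoted"]').extract() works, but it returns the vote counts for every post on the current page rather than for each individual post. response.css('div.score.unvoted').extract() works for each individual post, but it returns [u'<div class="score unvoted">1</div>'] instead of just 1. (I'd also love to know how to do this with XPath! :))
The code is as follows: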
class redditSpider(CrawlSpider):  # http://doc.scrapy.org/en/1.0/topics/spiders.html#scrapy.spiders.CrawlSpider
    name = "reddits"
    allowed_domains = ["reddit.com"]
    start_urls = [
        "https://www.reddit.com/r/elitedangerous",
    ]
    rules = [
        Rule(LinkExtractor(
            allow=['/r/EliteDangerous/\?count=\d*&after=\w*']),  # Looks for next page with RE
            callback='parse_item',  # What do I do with this? --- pass to self.parse_item
            follow=True),  # Tells spider to continue after callback
    ]

    def parse_item(self, response):
        selector_list = response.css('div.thing')  # Each individual little "box" with content
        for selector in selector_list:
            item = RedditItem()
            item['title'] = selector.xpath('div/p/a/text()').extract()
            item['url'] = selector.xpath('a/@href').extract()
            # item['votes'] = selector.xpath('//div[@class="score unvoted"]')
            item['votes'] = selector.css('div.score.unvoted').extract()
            yield item
【Discussion】:
Tags: python css python-2.7 xpath scrapy