Not sure how to XPath to specific website element

【Question title】: Not sure how to XPath to specific website element
【Posted】: 2016-01-11 21:09:58
【Question】:

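I'm currently trying to use Scrapy to crawl the Elite Dangerous subreddit and collect post titles, URLs, and vote counts. The first two work fine, but I'm not sure how to write an XPath expression to get at the votes.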

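selector.xpath('//div[@class="score unvoted"]').extract() works, but it returns the vote counts for every post on the current page (rather than for each individual post). response.css('div.score.unvoted').extract() works per post, but returns [u'<div class="score unvoted">1</div>'] instead of just 1. (I'd also love to know how to do this with XPath! :))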

Here is the code:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# RedditItem is the project's own Item class; its import is not shown in the question.


class redditSpider(CrawlSpider):  # http://doc.scrapy.org/en/1.0/topics/spiders.html#scrapy.spiders.CrawlSpider
    name = "reddits"
    allowed_domains = ["reddit.com"]
    start_urls = [
        "https://www.reddit.com/r/elitedangerous",
    ]

    rules = [
        Rule(LinkExtractor(
            allow=[r'/r/EliteDangerous/\?count=\d*&after=\w*']),  # Looks for next page with RE
            callback='parse_item',  # What do I do with this? --- pass to self.parse_item
            follow=True),  # Tells spider to continue after callback
    ]

    def parse_item(self, response):
        selector_list = response.css('div.thing')  # Each individual little "box" with content

        for selector in selector_list:
            item = RedditItem()
            item['title'] = selector.xpath('div/p/a/text()').extract()
            item['url'] = selector.xpath('a/@href').extract()
            # item['votes'] = selector.xpath('//div[@class="score unvoted"]')
            item['votes'] = selector.css('div.score.unvoted').extract()
            yield item

【Question comments】:

Tags: python css python-2.7 xpath scrapy

【Solution 1】:

You're on the right track. The first approach needs just two things:

a leading dot, to make the expression context-specific
text() at the end

Fixed version:

selector.xpath('.//div[@class="score unvoted"]/text()').extract()

And, FYI, you can also make the second option work by using the ::text pseudo-element:

response.css('div.score.unvoted::text').extract()
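
To see why the leading dot matters, here is a minimal sketch that runs without Scrapy. It uses the standard library's xml.etree.ElementTree, which only supports a subset of XPath, so treat it as an illustration of the idea rather than of Scrapy's own behavior. The markup and the names PAGE, SCORE_PATH and rows are made up for the example. ElementTree rejects absolute paths on an element, so the page-wide lookup that a leading // gives you in Scrapy is imitated here by searching from the document root.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for a listing page.
PAGE = """
<html><body>
  <div class="thing"><div class="score unvoted">1</div><p>First post</p></div>
  <div class="thing"><div class="score unvoted">7</div><p>Second post</p></div>
  <div class="thing"><div class="score unvoted">42</div><p>Third post</p></div>
</body></html>
""".strip()

# Relative path: the leading dot means "start from the current element".
SCORE_PATH = './/div[@class="score unvoted"]'

root = ET.fromstring(PAGE)

rows = []
for thing in root.findall('.//div[@class="thing"]'):
    # Searching from the document root ignores which post we are in,
    # so every iteration sees every score on the page.
    page_wide = [div.text for div in root.findall(SCORE_PATH)]
    # Searching from the current post only sees that post's score;
    # .text plays the role of text() in the XPath version.
    scoped = [div.text for div in thing.findall(SCORE_PATH)]
    rows.append((page_wide, scoped))

for page_wide, scoped in rows:
    print(page_wide, scoped)
```

The second column is the per-post value the question is after. The first column is the same full list on every iteration, which matches the symptom described in the question.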

【Comments】:

Thanks, the context-specific dot made everything much easier :). Thanks again for the ::text pseudo-element.

【Solution 2】:

This should work:

selector.xpath('//div[contains(@class, "score unvoted")]/text()').extract()

【Comments】:

Same problem: it returns the votes for the whole page, not just those under the current div.
selector.xpath('//div[contains(@class, "score unvoted")]/text()')[0].extract()