Scrapy: iterating over an XPath result set (answers)

【Question title】: Scrapy iterating over xpath resultset
【Posted】: 2021-09-29 00:03:41
【Question】:

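I am trying to scrape information about UFC events from the following site: http://www.ufcstats.com/statistics/events/completed?page=all

First, I use response.xpath('//table[@class="b-statistics__table-events"]/tbody/tr[@class="b-statistics__table-row"]') to get all rows of the table.

Next, I want to iterate over these <tr> elements and extract further information from each one. Inside the for loop, when I use extract_first(), I always get the same record (the one from the first table row). When I use extract()[0], I get the correct result.

Does anyone know the reason for this behavior?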

import json

import scrapy


class EventsInfoSpider(scrapy.Spider):
    name = "events_info"

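    def __init__(self, *args, **kwargs):
        # Let the base Spider finish its own setup before adding state.
        super().__init__(*args, **kwargs)
        self.events = {}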

    def start_requests(self):
        url = 'http://www.ufcstats.com/statistics/events/completed?page=all'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):

        for event in response.xpath('//table[@class="b-statistics__table-events"]/tbody/tr[@class="b-statistics__table-row"]'):
            event_info = {'event_url': event.xpath('td[@class="b-statistics__table-col"]/i[@class="b-statistics__table-content"]/a/@href').extract_first().strip(),
                          'event_name': event.xpath('td[@class="b-statistics__table-col"]/i[@class="b-statistics__table-content"]/a/text()').extract_first().strip(),
                          'event_date': event.xpath('td[@class="b-statistics__table-col"]/i[@class="b-statistics__table-content"]/span[@class="b-statistics__date"]/text()').extract_first().strip(),
                          'event_location': event.xpath('td[@class="b-statistics__table-col b-statistics__table-col_style_big-top-padding"]/text()').extract_first().strip()
                          }

            self.events[event_info['event_name']] = event_info

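        # Assumes the events_data/ directory already exists.
        # The with block closes the file, so no explicit close() is needed.
        with open('events_data/events.json', 'w') as json_file:
            json.dump(self.events, json_file, indent=6)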

        self.log(f'Collected data for {len(self.events)} events')

【Comments】:

extract_first() gives the same result as extract()[0]; please check again.

Tags: python html web-scraping xpath scrapy

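【Solution 1】:

The XPath inside the for loop was not evaluated relative to the previously selected row, which is why it only returned values from the first <tr> element. I fixed it by adding .// at the start of each XPath inside the for loop. For example: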

event.xpath('.//td[@class="b-statistics__table-col"]...

instead of

event.xpath('td[@class="b-statistics__table-col"]...

【Comments】: