Python Scrapy-Selenium pagination problem: answers

【Question title】: Python Scrapy-Selenium Pagination problem
【Posted】: 2021-01-04 20:04:25
【Question description】:

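I can't figure out how to follow the pagination on this site (see start_urls). The spider opens the webdriver, successfully scrapes the data from the first page, then closes the webdriver while the second page is loading, and that's all.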

import scrapy
from lxml.html import fromstring
from ..items import PontsItems
from selenium import webdriver


class Names(scrapy.Spider):
    name = 'enseafr'

    download_delay = 5.0

    start_urls = ['https://www.ponts.org/fr/annuaire/recherche?result=1&annuaire_mode=standard&annuaire_as_no=&keyword=&PersonneNom=&PersonnePrenom=&DiplomePromo%5B%5D=2023&DiplomePromo%5B%5D=2022&DiplomePromo%5B%5D=2021&DiplomePromo%5B%5D=2020&DiplomePromo%5B%5D=2019&DiplomePromo%5B%5D=2018&DiplomePromo%5B%5D=2017&DiplomePromo%5B%5D=2016&DiplomePromo%5B%5D=2015&DiplomePromo%5B%5D=2014&DiplomePromo%5B%5D=2013&DiplomePromo%5B%5D=2012&DiplomePromo%5B%5D=2011&DiplomePromo%5B%5D=2010']

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        items = PontsItems()
        self.driver.get(response.url)

        next = self.driver.find_element_by_xpath('//a[@class="next"]')
        #'//*[@id="zoneAnnuaire_layout"]/div[3]/div[2]/div[3]/div[11]/a[4]'
        while True:

            try:
                next.click()

                for item in response.xpath('//div[@class="single_desc"]'):
                    name = item.xpath('./div[@class="single_libel"]/a/text()').get().strip()
                    description = item.xpath('./div[@class="single_details"]/div/text()').get()
                    description = fromstring(description).text_content().strip()
                    year = item.xpath('./div[@class="single_details"]/div/b/text()').get()

                    items['name'] = name
                    items['description'] = description
                    items['year'] = year
                    yield items

            except:
                break

        self.driver.close()

I have been really stuck on this for the past few days.

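【Comments】:

Hi, what exactly is the problem?
I get this: selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document (Session info: chrome=85.0.4183.102)
If you use Chrome to click items, then you should search in self.driver.page_source instead of response, or simply use self.driver.find_element_by_xpath instead of response.xpath to look up the values.
When I use self.driver.find_element_by_xpath I get: TypeError: 'WebElement' object is not iterable. Not to mention something brand new like [984:16892:0918/012923.411:ERROR:device_event_log_impl.cc(208)] [01:29:23.411] Bluetooth: bluetooth_adapter_winrt.cc:1074 Getting Default Adapter failed. O_o Could you explain this magic to me, or better yet, rewrite this script so that it actually works to some extent?
First, move the lookup of next inside the try/except.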

Tags: python selenium pagination scrapy

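【Solution 1】:

I don't know how PontsItems() is meant to be used, but I can use an empty list to show how I would return the data with the following. The loop stops when an error occurs and returns whatever has been collected so far; rows are appended to the list each time you click through to the next page. find_element returns a single element, so use find_elements to get a list you can loop over.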

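# Fragment for the body of parse(), after self.driver.get(response.url).
# Needs: from selenium.webdriver.common.by import By
#        from selenium.common.exceptions import NoSuchElementException
items = []
while True:
    # WebElement has no .xpath(); query the live DOM with Selenium locators instead.
    descs = self.driver.find_elements(By.XPATH, '//div[@class="single_desc"]')
    for item in descs:
        name = item.find_element(By.XPATH, './div[@class="single_libel"]/a').text.strip()
        # .text is the element's visible text, so it also includes nested <b> text.
        description = item.find_element(By.XPATH, './div[@class="single_details"]/div').text.strip()
        year = item.find_element(By.XPATH, './div[@class="single_details"]/div/b').text
        items.append({'name': name, 'description': description, 'year': year})
    try:
        # Re-locate the link on every page; a reference from an earlier page goes stale.
        next_link = self.driver.find_element(By.XPATH, '//a[@class="next"]')
        next_link.click()
    except NoSuchElementException:
        break  # no "next" link: last page reached
yield from items  # hand the rows to Scrapy one dict at a time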
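
The pagination loop can also be sketched as a standalone generator. This is a minimal sketch, assuming Selenium's generic find_elements(by, value) and find_element(by, value) calls, where the string "xpath" is the value of By.XPATH. The function name iter_rows and the max_pages guard are made up for illustration, and the XPath expressions are taken from the question without being verified against the live site. It reads the current page first, then looks up the next link again on every iteration so that no reference from an earlier page is reused.

```python
ROW_XPATH = '//div[@class="single_desc"]'
NEXT_XPATH = '//a[@class="next"]'


def iter_rows(driver, max_pages=1000):
    """Yield one dict per result row, following the "next" link page by page.

    driver is expected to be a Selenium WebDriver that has already loaded
    the first results page. max_pages is a hypothetical safety limit that
    keeps the loop from running forever if the link never disappears.
    """
    for _ in range(max_pages):
        # Read the rows from the live browser DOM, not from the Scrapy response.
        for row in driver.find_elements("xpath", ROW_XPATH):
            yield {
                "name": row.find_element(
                    "xpath", './div[@class="single_libel"]/a').text.strip(),
                "description": row.find_element(
                    "xpath", './div[@class="single_details"]/div').text.strip(),
                "year": row.find_element(
                    "xpath", './div[@class="single_details"]/div/b').text.strip(),
            }
        # find_elements returns an empty list instead of raising when nothing
        # matches, so the last page needs no exception handling.
        next_links = driver.find_elements("xpath", NEXT_XPATH)
        if not next_links:
            break
        next_links[0].click()
```

In the spider, parse() could call self.driver.get(response.url) and then yield from iter_rows(self.driver). If the site swaps the results in with JavaScript rather than loading a new page, an explicit wait after the click would probably be needed as well, for example WebDriverWait together with expected_conditions.staleness_of on a row from the old page.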

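【Comments】:

Did you test this against the site? For me, the only thing it changed is that it now grabs the last item of data, and it still closes before switching to the second page.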