Scrapy and Selenium: How to loop over an XPath and perform a click

【Question title】: Scrapy & Selenium: How to loop over an XPath and perform a click
【Posted】: 2017-08-17 08:34:55
【Question】:

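I have been working on scraping this website with Selenium and Scrapy. I want my code to click each company link, extract the data, and then repeat the process for the next company. However, I can't figure out how to move from one company link to the next.

Any help would be greatly appreciated.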

from scrapy.http import TextResponse
from selenium import webdriver
import scrapy
import time


class ExampleSpider(scrapy.Spider):
    name = 'comp'
    allowed_domains = ['site']
    start_urls = ["site"]

    def __init__(self, **kwargs):
        super(ExampleSpider, self).__init__(**kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.implicitly_wait(10)
        index = 0
        while True:
            companies = self.driver.find_elements_by_xpath('//*[@id="company-list"]/ul/li')
            try:
                companies[index].click()
                time.sleep(6)
            except IndexError:
                break
            resp = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
            for com in resp.xpath('body'):
                yield {
                    # DO Something
                }

                self.driver.back()
                index += 1
            self.driver.quit()

It only extracts data from the first link and then stops. Please help.

【Comments】:

Are you set on using Selenium? This page appears to be using an API. Try looking for the XHR requests in your browser's developer tools.

Tags: python loops selenium xpath scrapy

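【Solution 1】:

The main problem is that driver.quit() is inside your while loop, so the browser is closed at the end of the first pass. Move it out of the loop.

If you are using this to extract the company names, also prefer a more exact XPath, such as: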

//*[@id="company-list"]/ul/li/div[2]/h4
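
The loop structure this answer describes could be sketched roughly as below. This is only an illustration, not a confirmed fix: the asker reports in the comments that moving driver.quit() alone did not solve the problem. The function name visit_each_company and the pause parameter are hypothetical, the driver is passed in as an argument so that no browser is started by the sketch itself, and the locator strategy is given as the plain string "xpath" (the value of Selenium's By.XPATH). The sketch also assumes that navigation and index bookkeeping belong at the level of the while loop rather than inside an inner for loop.

```python
import time

# XPath of the company entries, taken from the question.
COMPANY_XPATH = '//*[@id="company-list"]/ul/li'


def visit_each_company(driver, start_url, pause=6.0):
    """Click each company entry in turn and yield (url, html) per detail page.

    `driver` is expected to behave like a Selenium WebDriver.
    """
    try:
        driver.get(start_url)
        index = 0
        while True:
            # Look the elements up again on every pass: references found
            # before navigating away are no longer valid after driver.back().
            companies = driver.find_elements("xpath", COMPANY_XPATH)
            if index >= len(companies):
                break
            companies[index].click()
            time.sleep(pause)  # crude wait, mirroring the question's sleep
            yield driver.current_url, driver.page_source
            driver.back()
            index += 1
    finally:
        # Quit exactly once, after the loop has ended (or has failed).
        driver.quit()
```

Inside the spider's parse method, one would iterate over visit_each_company(self.driver, response.url) and build a TextResponse from each (url, html) pair, as the question already does.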

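【Comments】:

I moved driver.quit() out of the loop, but that did not solve the problem. I also don't want to scrape only the company names; I want their websites and board advisors as well.
What is the problem now: does it still run only once, or does it raise an error?
Yes, it still runs only once.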

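【Solution 2】:

As mentioned, try using their API; then you won't have to bother with page rendering, clicking elements, and so on. If you look at the XHR requests in the developer tools, you can see the following: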

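To get the list of companies, call https://www.investiere.ch/proxy/api2/v1/companies?extra%5Bimagecache%5D=company_logo_70&fields=companyType,lifecycle&page=0&parameters%5Binclude_skipped%5D=yes. Clicking Load more... simply changes the page parameter in the URL.
From the result above, you can fetch each company's details by following the link in the records[X].uri property. For the first company, CombaGroup, for example, it is https://www.investiere.ch/api2/v1/companies/10211.
To get the list of people (managers, for example), follow the link https://www.investiere.ch/proxy/api2/v1/companies/10211/people.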
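
The URL handling described above might be sketched as follows, using only the standard library and making no network requests. The helper names (company_list_url, detail_uris, people_url) are hypothetical. The response shape is an assumption based on the records[X].uri property mentioned in this answer: a JSON object with a records list whose items carry a uri field. Deriving the people URL from the trailing number of the detail URI is likewise an assumption drawn from the single example given.

```python
from urllib.parse import urlencode

# Base of the list endpoint quoted in the answer.
API_BASE = "https://www.investiere.ch/proxy/api2/v1/companies"


def company_list_url(page):
    """Build the company-list URL for a given page number."""
    params = {
        "extra[imagecache]": "company_logo_70",
        "fields": "companyType,lifecycle",
        "page": page,
        "parameters[include_skipped]": "yes",
    }
    # safe="," keeps the comma in `fields` unescaped, as in the original URL.
    return API_BASE + "?" + urlencode(params, safe=",")


def detail_uris(payload):
    """Collect records[X].uri from an already parsed list response.

    Assumes the payload is a dict with a "records" list.
    """
    return [rec["uri"] for rec in payload.get("records", []) if "uri" in rec]


def people_url(detail_uri):
    """Derive the people endpoint from a company detail URI.

    Assumes the company id is the last path segment of the URI.
    """
    company_id = detail_uri.rstrip("/").rsplit("/", 1)[-1]
    return "{}/{}/people".format(API_BASE, company_id)
```

In a Scrapy spider, one would request company_list_url(0), parse the body with json.loads, follow each URI returned by detail_uris, and increase the page number until a page comes back with no records. That stopping condition is also an assumption, since the answer does not say how the last page is signalled.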
