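【Question Title】: Scrapy: Same XPath for different fields
【Posted】: 2021-08-30 06:57:12
【Question】:

I am trying to scrape products from different categories on www.galaxus.ch using Scrapy. To render the HTML I use Splash with a Lua script, and to read the Excel file I use pandas. So far my script works well. Here is my code: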

read_excel

import pandas as pd

def read_xlsx():
    df = pd.read_excel('externe_festplatte.xlsx')
    return df['Gtin'].dropna().astype('int64').tolist()

Spider

import scrapy
from scrapy_splash import SplashRequest
from galaxus.spiders.read_files import read_xlsx


base_url = "https://www.galaxus.ch/search?q={}"


class GtinSpider(scrapy.Spider):
    name = 'gtin'
    allowed_domains = ['www.galaxus.ch']

    script = '''
        function main(splash, args)
            splash:set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36")
            splash.private_mode_enabled = false
            assert(splash:go(args.url))
            assert(splash:wait(5))
            
            item_select = assert(splash:select("div.panelLayout_mainContainer__11Jh_"))
            item_select:mouse_click()
            assert(splash:wait(5))
            
            see_more = assert(splash:select("[data-test='showMoreButton-specifications'] span"))
            see_more:mouse_click()
            assert(splash:wait(5))
            
            
            splash:set_viewport_full()
            return splash:html()
        end
    '''

    def start_requests(self):
        for value in read_xlsx():
            yield SplashRequest(
                url=base_url.format(value),
                callback=self.parse,
                endpoint='execute',
                args={'lua_source': self.script},
            )

    def parse(self, response):
        yield {
            'Titel': response.xpath(".//span[@class='jqo5ci-1 goteOY']/text()").get(),
            'Untertitel': response.xpath(".//span[@class='jqo5ci-2 beeFWi']/text()").get(),
            'Beschreibung': response.xpath("//div[@class='sc-1op7ol6-0 hYPLAr']/span/text()").get(),
            'Kategorie': response.xpath("(.//div[@class='breadcrumbView_withIcon__3mWwP']/a)[4]/text()").get(),
            'Produktetyp': response.xpath(".//span[@class='yip624-0 dpAcNY']/text()").get(),
            'Hersteller': response.xpath(".//h1[@class='jqo5ci-0 czhxQj']/strong/text()").get(),
        }

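The problem is that I also want to scrape the Spezifikationen (Specifications) fields from the same page. These fields differ for each product category, but they all share the same XPath: //td[@class='sc-18g78bs-4 sxRfA']. For example:

In both of these product categories, the Spezifikationen section contains fields with different names but the same XPath. For an SSD the field is "Formfaktor" and for RAM it is "Arbeitsspeichertyp", yet the XPath is identical for both. How can I solve this? I would also like to export the results to the same Excel file.

*I hope I have made my point clear. I am new to Stack Overflow and still getting used to it. I look forward to your suggestions and guidance.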

【Comments】:

Tags: python pandas xpath scrapy scrapy-splash

【Solution 1】:

I think you need to use the cell text as the basis of your XPath expressions:

memory_type = response.xpath('normalize-space(//td[.="Arbeitsspeichertyp"]/following-sibling::td[1])').get()
form_factor = response.xpath('normalize-space(//td[.="Formfaktor"]/following-sibling::td[1])').get()

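【Comments】:

Thanks for the suggestion, but with more than 80 categories, matching on the text is very difficult.
@RaisulIslam You can use a slightly modified version of the XPath expression from my answer to get both the field names and the field values.
Sorry, I didn't quite follow. Could you explain it again?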