[Question title]: How to follow links in Scrapy if there is no href?
[Posted]: 2020-09-06 05:25:48
[Question description]:

I am trying to follow links in Scrapy after I have already parsed one page and extracted information from it. The problem is that the web page has no href, so I can't simply follow it. I managed to extend my XPath query with @data-param and ended up with something like: page=2

The problem is that I'm not sure how to visit this link, since I want to pass listName["listLinkMaker"] to my URL generator or composer.

Should I make another "def", say def parse_pagination, and use it to follow the links?

The JSON used in the code is really simple:

[
{"storeName": "Interspar", "storeLinkMaker": "https://popusti.njuskalo.hr/trgovina/Interspar"}
]
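
For context, building start_urls from that JSON is plain stdlib json work. A minimal sketch with the file contents inlined as a string (in the question the same data lives in test_one_url.json):

```python
import json

# Inline copy of the JSON shown above (the question loads it from test_one_url.json)
raw = '[{"storeName": "Interspar", "storeLinkMaker": "https://popusti.njuskalo.hr/trgovina/Interspar"}]'

# One start URL per store entry, taken from the "storeLinkMaker" field
start_urls = [store["storeLinkMaker"] for store in json.loads(raw)]
print(start_urls)
```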

The code is below:

# -*- coding: utf-8 -*-
import scrapy
import json


class LoclocatorTestSpider(scrapy.Spider):
    name = "loclocator_test"
    start_urls = []

    # Build start_urls from the JSON file at class-definition time
    with open("test_one_url.json", encoding="utf-8") as json_file:
        data = json.load(json_file)
        for store in data:
            storeName = store["storeName"]
            storeLinkUrl = store["storeLinkMaker"]
            start_urls.append(storeLinkUrl)

    def parse(self, response):
        selector = "//div[@class='mainContentWrapInner cf']"

        store_name_selector = ".//h1[@class='title']/text()"
        store_branches_selector = ".//li/a[@class='xiti']/@href"

        for basic_info in response.xpath(selector):
            store_branches = {}

            store_branches["storeName"] = basic_info.xpath(store_name_selector).extract_first()
            # This specific XPath extracts 1st part of link needed to crawl all of store branches
            store_branches["storeBranchesLink"] = basic_info.xpath(store_branches_selector).extract_first() + "?"

            store_branches_url = basic_info.xpath(store_branches_selector).extract_first()
            yield response.follow(store_branches_url, self.parse_branches, meta={"store_branches": store_branches})


    def parse_branches(self, response):
        store_branches_name_selector = "//li[@class='xiti']"
        store_branches = response.meta["store_branches"]

        for store_branch in response.xpath(store_branches_name_selector):
            store_branches["storeBranchName"] = store_branch.xpath(".//span[@class='title']/text()").extract_first()

            yield store_branches

        # This specific XPath extracts 2nd part of link needed to crawl all of store branches
        # URL should look like: https://popusti.njuskalo.hr/trgovina/Interspar?page=n where n>0
        links = response.selector.xpath("//li[@class='next']/button[@class='nBtn link xiti']/@data-param").extract()
        for link in links:
            absolute_url = None  # TODO: list from first parse (i.e. store_branches["storeBranchesLink"]) + link
            yield scrapy.Request(absolute_url, callback=self.parse_branches)

Thanks.

[Question discussion]:

    Tags: python xpath web-scraping scrapy href


    [Solution 1]:

    I managed to find the solution myself, and I was relatively close to it.

    The lower part becomes:

        # This specific XPath extracts 2nd part of link needed to crawl all of store branches
        # URL should look like: https://popusti.njuskalo.hr/trgovina/Interspar?page=n where n>0
        links = response.selector.xpath("//@data-param").extract()
        store_branches = response.meta["store_branches"]
        for link in links:
            absolute_url = store_branches["storeBranchesLink"] + link
            yield scrapy.Request(absolute_url, callback=self.parse_branches, meta={"store_branches": store_branches})
    

    I believe the fix was to read store_branches back from the response meta, since that makes it possible to reach every page (?page=n where n>0). If anyone can add more technical detail, since my understanding of the code is rather basic, please do answer.
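
    To illustrate why the concatenation works: the first parse stores the base link with a trailing "?", and @data-param contributes strings like page=2, so joining the two yields the paginated URL. A minimal sketch with the values taken from the question; urllib.parse.urljoin is shown as a slightly more robust alternative when the base URL may or may not already carry a query string:

```python
from urllib.parse import urljoin

# Values as they appear in the question
store_branches_link = "https://popusti.njuskalo.hr/trgovina/Interspar?"  # base link + "?"
data_param = "page=2"                                                    # from @data-param

# Plain concatenation, as in the accepted solution
absolute_url = store_branches_link + data_param

# urljoin equivalent: a "?query" reference replaces the query part of the base URL
alternative = urljoin("https://popusti.njuskalo.hr/trgovina/Interspar", "?" + data_param)

print(absolute_url)
print(alternative)
```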

    [Discussion]:
