【Question title】: Scrapy not following next page
【Posted】: 2021-06-28 17:11:53
【Question】:

I have been stuck on this problem for a long time, and nothing I try works. My goal is simply to extract data from a job board. Each page lists 20 job offers. I'm using a Scrapy callback to extract the data for each offer, and that works more or less. The problem is that no matter what I try, Scrapy never moves on to the next page. I first tried Scrapy together with Selenium, without success. Now I'm using Scrapy alone, following tutorials, but it still only extracts data for the first 20 offers on page 1.

Important note: the "next" button changes with every page, meaning its XPath/CSS selector changes. I tried CSS `:nth-last-child` and XPath `last()-1`, but without satisfactory results. What makes it harder is that the link itself sits in an `a` tag nested below the element matched by that variable XPath.

Here is the code:

import scrapy
from random import randint
from time import sleep


class WorkpoolJobsSpider(scrapy.Spider):
    name = 'getdata'
    allowed_domains = ['workpool-jobs.ch']
    start_urls = ['https://www.workpool-jobs.ch/recht-jobs']

    def parse(self, response):
        SET_SELECTOR = "//p[@class='inserattitel h2 mt-0']/a/@href"
        for joboffer in response.xpath(SET_SELECTOR):
            url1 = response.urljoin(joboffer.get())
            yield scrapy.Request(url1, callback=self.parse_dir_contents)

        next_page = response.xpath(".//li[@class='page-item'][last()-1]/../@href").get()
        sleep(randint(5, 10))
        if next_page:
            yield response.follow(url=next_page, callback=self.parse)

    def parse_dir_contents(self, response):
        single_info = response.xpath(".//*[@class='col-12 col-md mr-md-3 mr-xl-5']")

        for info in single_info:
            info_Titel = info.xpath(".//article/h1[@class='inserattitel']/text()").extract_first()
            info_Berufsfelder = info.xpath(".//article/div[@class='border-top-grau']/p/text()").extract()
            info_Arbeitspensum = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[1]/text()").extract_first()
            info_Anstellungsverhältnis = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[2]/text()").extract_first()
            info_Arbeitsort = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[4]/a/text()").extract()
            info_VerfügbarAb = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[5]/text()").extract()
            info_Kompetenzenqualifikation = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-7']/dl[2]/dd/text()").extract_first()
            info_Aufgabengebiet = info.xpath(".//article/div[@class='border-bottom-grau'][1]//*[self::p or self::li]").extract()
            info_Erwartungen = info.xpath(".//article/div[@class='border-bottom-grau'][2]/ul/li[descendant-or-self::text()]").extract()
            info_WirBietenIhnen = info.xpath(".//article/div[@class='border-bottom-grau'][3]/ul/li[descendant-or-self::text()]").extract()
            info_Publikationsdatum = info.xpath(".//article/footer[@class='inseratfooter']/p[1]/strong/text()").extract_first()

            yield {'Titel': info_Titel,
                   'Berufsfelder': info_Berufsfelder,
                   'Arbeitspensum': info_Arbeitspensum,
                   'Anstellungsverhältnis': info_Anstellungsverhältnis,
                   'Arbeitsort': info_Arbeitsort,
                   'VerfügbarAb': info_VerfügbarAb,
                   'Kompetenzenqualifikation': info_Kompetenzenqualifikation,
                   'Aufgabengebiet': info_Aufgabengebiet,
                   'Erwartungen': info_Erwartungen,
                   'WirBietenIhnen': info_WirBietenIhnen,
                   'Publikationsdatum': info_Publikationsdatum}

Any help is much appreciated!

【Comments】:

  • Maybe you should search by the text in the link instead — i.e. `.//a[text()="nächste"]/@href`, `.//a[contains(text(), "nächste")]/@href`, or `.//a[@title="nächste Seite anzeigen"]/@href`. Alternatively you can generate the links manually — they look like `/recht-jobs?seite=2`, so you can use `"/recht-jobs?seite=" + str(number)` with `number += 1`.
  • Thanks, I'll look into it. Hoping this finally solves it :)
  • It finally works!! I tried all your suggestions, and with the help of some YouTube videos and "manual" link generation it works. Thank you very much.
  • By the way: I used `response.xpath(".//a[contains(text(), 'nächste')]/@href").get()` to get the link to the next page, but the manual version gives you more control over how many pages to scrape.
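The "manual link generation" the comments describe is plain string building over a page counter. A minimal sketch (the base URL and `?seite=N` pattern are from the comments; the helper name `page_url` is mine):

```python
def page_url(base: str, page: int) -> str:
    # Hypothetical helper: builds the "?seite=N" pagination URL
    # described in the comments above.
    return f"{base}?seite={page}"

# Pages 2..26 of the listing from the question:
urls = [page_url("https://www.workpool-jobs.ch/recht-jobs", n) for n in range(2, 27)]
```

Each of these URLs can then be fed to `scrapy.Request` (or `response.follow`) with the same `parse` callback.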

标签: python scrapy next


【Solution 1】:

With some hints from furas I finally managed to get my code working. If anyone runs into the same problem later, maybe my code below can help you too:

import scrapy
from random import randint
from time import sleep


class WorkpoolJobsSpider(scrapy.Spider):
    name = "getdata"
    page_number = 2
    allowed_domains = ["workpool-jobs.ch"]
    start_urls = ["https://www.workpool-jobs.ch/recht-jobs"]

    def parse(self, response):
        SET_SELECTOR = "//p[@class='inserattitel h2 mt-0']/a/@href"
        for joboffer in response.xpath(SET_SELECTOR):
            url1 = response.urljoin(joboffer.get())
            yield scrapy.Request(url1, callback=self.parse_dir_contents)

        next_page = "https://www.workpool-jobs.ch/recht-jobs?seite=" + str(WorkpoolJobsSpider.page_number)
        sleep(randint(5, 10))
        if WorkpoolJobsSpider.page_number < 27:
            WorkpoolJobsSpider.page_number += 1
            yield response.follow(next_page, callback=self.parse)

    def parse_dir_contents(self, response):
        single_info = response.xpath(".//*[@class='col-12 col-md mr-md-3 mr-xl-5']")

        for info in single_info:
            info_Titel = info.xpath(".//article/h1[@class='inserattitel']/text()").extract_first()
            info_Berufsfelder = info.xpath(".//article/div[@class='border-top-grau']/p/text()").extract()
            info_Arbeitspensum = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[1]/text()").extract_first()
            info_Anstellungsverhältnis = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[2]/text()").extract_first()
            info_Arbeitsort = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[4]/a/text()").extract()
            info_VerfügbarAb = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[5]/text()").extract()
            info_Kompetenzenqualifikation = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-7']/dl[2]/dd/text()").extract_first()
            info_Aufgabengebiet = info.xpath(".//article/div[@class='border-bottom-grau'][1]//*[self::p or self::li]").extract()
            info_Erwartungen = info.xpath(".//article/div[@class='border-bottom-grau'][2]/ul/li[descendant-or-self::text()]").extract()
            info_WirBietenIhnen = info.xpath(".//article/div[@class='border-bottom-grau'][3]/ul/li[descendant-or-self::text()]").extract()
            info_Publikationsdatum = info.xpath(".//article/footer[@class='inseratfooter']/p[1]/strong/text()").extract_first()

            yield {'Titel': info_Titel,
                   'Berufsfelder': info_Berufsfelder,
                   'Arbeitspensum': info_Arbeitspensum,
                   'Anstellungsverhältnis': info_Anstellungsverhältnis,
                   'Arbeitsort': info_Arbeitsort,
                   'VerfügbarAb': info_VerfügbarAb,
                   'Kompetenzenqualifikation': info_Kompetenzenqualifikation,
                   'Aufgabengebiet': info_Aufgabengebiet,
                   'Erwartungen': info_Erwartungen,
                   'WirBietenIhnen': info_WirBietenIhnen,
                   'Publikationsdatum': info_Publikationsdatum}
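One caveat about the `sleep(randint(5, 10))` call: `time.sleep()` blocks Scrapy's event loop, pausing all concurrent requests, not just the pagination one. Scrapy has built-in settings for throttling; a sketch of the equivalent configuration (the delay value here is illustrative, not from the answer):

```python
# In the spider class (or settings.py) — Scrapy-native throttling
# instead of blocking the whole process with time.sleep():
custom_settings = {
    "DOWNLOAD_DELAY": 7.5,             # base delay between requests, in seconds
    "RANDOMIZE_DOWNLOAD_DELAY": True,  # actual wait is 0.5x-1.5x of the base
}
```

With this in place the `sleep(randint(5, 10))` line can simply be removed.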

【Discussion】:
