How to make the Scrapy web crawling framework keep following links?

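【Question title】: How to make the scrapy web crawling framework keep following links?
【Posted】: 2021-10-09 22:05:12
【Question】:

I am trying to build a spider that scrapes information from the SCP wiki, follows the link to the next SCP, and keeps going from there.

With my current code, the spider extracts data from the first followed link and then stops following links to the next page.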

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "scp"
    start_urls = [
        'https://scp-wiki.wikidot.com/scp-002',
    ]

    def parse(self, response):
        for scp in response.xpath('//*[@id="main-content"]'):
            yield {
                'title': scp.xpath('//*[@id="page-content"]/p[1]').get(),
                'tags': scp.xpath('//*[@id="main-content"]/div[4]').get(),
                'class': scp.xpath('//*[@id="page-content"]/p[2]').get(),
                'scp': scp.xpath('//*[@id="page-content"]/p[3]').get(),
                'desc': scp.xpath('//*[@id="page-content"]/p[6]').get(),
            }
        next_page = response.xpath('//*[@id="page-content"]/div[3]/div/p/a[2]/@href').get()
        next_page = 'https://scp-wiki.wikidot.com'+next_page
        print(next_page)
        next_page = response.urljoin(next_page)
        print(next_page)
        yield response.follow(next_page, callback=self.parse)

When I run this spider, I get the following error:

next_page = 'https://scp-wiki.wikidot.com'+next_page
TypeError: can only concatenate str (not "NoneType") to str

【Comments on the question】:

Evidently response.xpath('//*[@id="page-content"]/div[3]/div/p/a[2]/@href').get() returns None. Have you tried to find out why?

Tags: python scrapy web-crawler

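【Solution 1】:

As the error message states, a NoneType value cannot be concatenated to a str.

This means the next_page variable did not receive a value from the XPath passed to response.xpath().get() on the previous line.

The XPath matched nothing, so get() returned None.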
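
The failure described above can be guarded against in plain Python. The following is a minimal sketch, assuming a hypothetical helper named build_next_url (it is not part of Scrapy) that receives whatever get() returned. It uses the standard-library urljoin and returns None instead of raising when the selector found nothing.

```python
from urllib.parse import urljoin


def build_next_url(base_url, href):
    """Return an absolute URL for href, or None if the selector matched nothing.

    build_next_url is a hypothetical helper for illustration only.
    href is expected to be the value returned by response.xpath(...).get(),
    which is None when the XPath has no match.
    """
    if href is None:
        # Concatenating here would raise the TypeError from the question.
        return None
    # urljoin handles both relative ("/scp-003") and absolute hrefs.
    return urljoin(base_url, href)


if __name__ == "__main__":
    page = "https://scp-wiki.wikidot.com/scp-002"
    print(build_next_url(page, "/scp-003"))
    print(build_next_url(page, None))
```

Returning None lets the caller decide whether to stop crawling or try a different selector, rather than crashing in the middle of parse.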

For details, see the Scrapy documentation.

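【Comments】:

How do I get the link?
As far as I can see, on the page you are visiting the next-page button has the XPath //*[@id="page-content"]/div[3]/div/p/a[2]. The XPath you used ends with /@href, which may be the problem. If it still does not work, use a different selector or the full XPath. Hope this helps. :)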