【Question Title】: Scrapy Request url must be str or unicode, got NoneType
【Posted】: 2021-06-08 23:58:31
【Description】:

I'm trying to create my first spider/scraper with Scrapy, using Dmoz as a test. I get the error TypeError: Request url must be str or unicode, got NoneType, yet in the debug output I can see the correct URL.

Code:

import scrapy
import urlparse


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = ["http://www.dmoz.org/search?q=france&all=no&t=regional&cat=all"]

    def parse(self, response):
        sites = response.css('#site-list-content > div.site-item > div.title-and-desc')
        
        for site in sites:
            yield {
                'name': site.css('a > div.site-title::text').extract_first().strip(),
                'url': site.xpath('a/@href').extract_first().strip(),
                'description': site.css('div.site-descr::text').extract_first().strip(),
            }

        nxt = response.css('#subcategories-div > div.previous-next > div.next-page')
        next_page = nxt.css('a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)          

        yield scrapy.Request(next_page, callback=self.parse)

Log:

2016-10-18 11:17:03 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/search?q=france&start=20&type=next&all=no&t=regional&cat=all> (referer: http://www.dmoz.org/search?q=france&all=no&t=regional&cat=all)
2016-10-18 11:17:03 [scrapy] ERROR: Spider error processing <GET http://www.dmoz.org/search?q=france&start=20&type=next&all=no&t=regional&cat=all> (referer: http://www.dmoz.org/search?q=france&all=no&t=regional&cat=all)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/ENV/bin/tutorial/dirbot/spiders/dmoz.py", line 25, in parse
    yield scrapy.Request(next_page, callback=self.parse)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 51, in _set_url
    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got NoneType:
2016-10-18 11:17:03 [scrapy] INFO: Closing spider (finished)
2016-10-18 11:17:03 [scrapy] INFO: Stored json feed (20 items) in: test.json
2016-10-18 11:17:03 [scrapy] INFO: Dumping Scrapy stats:

【Comments】:

  • yield scrapy.Request(next_page, callback=self.parse) needs to be inside the if: you check if next_page is not None but still issue the request either way
  • Hi, I moved it inside the if. I no longer get the error, but it only exports the first page and the callback doesn't seem to run: 2016-10-18 11:45:44 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/search?q=france&start=20&type=next&all=no&t=regional&cat=all> (referer: http://www.dmoz.org/search?q=france&all=no&t=regional&cat=all) 2016-10-18 11:45:44 [scrapy] INFO: Closing spider (finished)
  • Then next_page probably never finds anything; have you verified that it ever matches?
  • In an earlier debug run the correct URL was printed, so I assumed it found it (&start=20)

Tags: python web-scraping scrapy


【Solution 1】:

The error is in your code:

if next_page is not None:
    next_page = response.urljoin(next_page)          

yield scrapy.Request(next_page, callback=self.parse)

As Padraic Cunningham pointed out in his comment: the Request is yielded no matter whether next_page is None or holds a URL.

You can fix the problem by changing the code to the following:

if next_page is not None:
    next_page = response.urljoin(next_page)          
    yield scrapy.Request(next_page, callback=self.parse)

That is, by moving the yield inside the if block.

Incidentally, you can shorten the if to:

if next_page:

because of Python's truthiness rules.
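For illustration, a quick check of which extract_first() results the shorter condition filters out (the sample values here are ours):

```python
# extract_first() returns None when the selector matches nothing; an
# empty href would be equally useless. Both are falsy, so `if next_page:`
# skips both while letting a real URL through.
for value in (None, "", "/search?q=france&start=20"):
    print(repr(value), "->", bool(value))
# None and "" print False; only the real href prints True
```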

Since your spider then stops after the first page, try debugging your application with scrapy shell, where you can check whether your CSS query actually returns a value. You could also add an else branch to that if block which logs/prints a message when no next_page is found, so you know whether the site or your CSS query is the problem.
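The pagination decision can be sketched as a plain function outside Scrapy, which makes it easy to test in isolation; urllib.parse.urljoin stands in for response.urljoin, and the function name next_request_url is ours:

```python
from urllib.parse import urljoin

def next_request_url(page_url, next_href):
    """Return the absolute URL to follow, or None when there is no next page.

    next_href is what extract_first() returned: a relative href, or None
    when the CSS query matched nothing.
    """
    if next_href:  # filters out both None and an empty string
        return urljoin(page_url, next_href)
    return None  # caller should log this case instead of yielding a Request

base = "http://www.dmoz.org/search?q=france&all=no&t=regional&cat=all"
print(next_request_url(base, "/search?q=france&start=20"))
# http://www.dmoz.org/search?q=france&start=20
print(next_request_url(base, None))
# None
```

In scrapy shell you would instead fetch the page and evaluate the selector from the question, response.css('#subcategories-div > div.previous-next > div.next-page a::attr(href)').extract_first(), to see whether it matches at all.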

【Discussion】:
