【Question Title】: Error 403 in Scrapy while crawling
【Posted】: 2018-06-10 18:39:46
【Question Description】:

This is the code I wrote to crawl the "blablacar" website.

# -*- coding: utf-8 -*-
import scrapy


class BlablaSpider(scrapy.Spider):
    name = 'blabla'

    allowed_domains = ['blablacar.in']
    start_urls = ['http://www.blablacar.in/ride-sharing/new-delhi/chandigarh']

    def parse(self, response):
        print(response.text)

When I run the above, I get this error:

2018-06-11 00:07:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-06-11 00:07:06 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.blablacar.in/robots.txt> (referer: None)
2018-06-11 00:07:06 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.blablacar.in/ride-sharing/new-delhi/chandigarh> (referer: None)
2018-06-11 00:07:06 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://www.blablacar.in/ride-sharing/new-delhi/chandigarh>: HTTP status code is not handled or not allowed
2018-06-11 00:07:06 [scrapy.core.engine] INFO: Closing spider (finished)

【Question Discussion】:

    Tags: python-3.x web-scraping scrapy web-crawler data-extraction


    【Solution 1】:

    You need to configure a user agent. I ran your code on my machine with a user agent configured and got status code 200.

    1. Put a new file named utils.py next to settings.py:

    import random
    
    user_agent_list = [
        # Chrome
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
        'Mozilla/5.0 (Windows NT 5.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
        # IE / Firefox
        'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
        'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (Windows NT 6.2; WOW64; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)',
        'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
        'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)',
        'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0'
    ]
    
    
    def get_random_agent():
        return random.choice(user_agent_list)
    

    2. Add this to your settings.py file:

    from <SCRAPY_PROJECT>.utils import get_random_agent
    
    USER_AGENT = get_random_agent()
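    Note that `USER_AGENT` in settings.py is evaluated once at startup, so every request in that run shares the same agent. If you want a fresh agent per request, one common approach is a small downloader middleware. The sketch below is an assumption: the middleware name and the shortened agent list are illustrative (in practice you would reuse `user_agent_list` from utils.py above).

```python
import random

# Illustrative, shortened list; reuse user_agent_list from utils.py in practice.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0',
]

class RandomUserAgentMiddleware:
    """Scrapy downloader middleware: set a random User-Agent on each request."""

    def process_request(self, request, spider):
        # Overwrite the User-Agent header before the request is downloaded.
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None  # returning None lets Scrapy continue processing the request
```

    To enable it, register the class under `DOWNLOADER_MIDDLEWARES` in settings.py (the module path depends on where you place the file in your project).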
    

    【Discussion】:

      【Solution 2】:

      According to the Scrapy documentation, you can use the handle_httpstatus_list spider attribute.

      In your case:

      class BlablaSpider(scrapy.Spider):
          name = 'blabla'
      
          allowed_domains = ['blablacar.in']
          start_urls = ['http://www.blablacar.in/ride-sharing/new-delhi/chandigarh']
          handle_httpstatus_list = [403]
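
      Note that handle_httpstatus_list does not make the 403 go away; it only tells Scrapy's HttpErrorMiddleware to deliver the 403 response to parse() instead of dropping it. A minimal sketch of what parse() could then do (the branching logic is my illustration, not part of the answer):

```python
class BlablaSpiderSketch:
    handle_httpstatus_list = [403]

    def parse(self, response):
        # With 403 whitelisted, the response reaches parse() even on failure,
        # so we can branch on the status code instead of losing the response.
        if response.status == 403:
            return {'blocked': response.url}  # e.g. log it, or retry with other headers
        return {'body_length': len(response.text)}
```

      Returning a dict from a callback is valid in Scrapy; it is treated as a scraped item.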
      

      【Discussion】:

        【Solution 3】:

        In HTTP, a 403 error generally means you are not authorized to access that page.

        Try another website; if the same error does not occur there, the problem is likely in how this particular site responds to your crawler.

        【Discussion】:

        • Then how do I scrape a website that returns a 403 error while crawling?