【Question Title】: Scrapy ignores requests for a specific domain
【Posted】: 2017-06-17 00:38:08
【Problem Description】:

I am trying to crawl the forum categories of craigslist.org (https://forums.craigslist.org/). My spider:

import scrapy
from scrapy import Request

class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["forums.craigslist.org"]
    start_urls = ['http://geo.craigslist.org/iso/us/']

    def error_handler(self, failure):
        print failure

    def parse(self, response):
        yield Request('https://forums.craigslist.org/',
                  self.getForumPage,
                  dont_filter=True,
                  errback=self.error_handler)

    def getForumPage(self, response):
        print "forum page"

I get this message via the error callback:

[Failure instance: Traceback:
  /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:455: callback
  /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:563: _startRunCallbacks
  /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:649: _runCallbacks
  /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:1316: gotResult
  --- ---
  /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:1258: _inlineCallbacks
  /usr/local/lib/python2.7/site-packages/twisted/python/failure.py:389: throwExceptionIntoGenerator
  /usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py:37: process_request
  /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:649: _runCallbacks
  /usr/local/lib/python2.7/site-packages/scrapy/downloadermiddlewares/robotstxt.py:46: process_request_2
]

But I only have this problem with the forums section of Craigslist. Maybe it is because the forums section uses https, unlike the rest of the site. So it is impossible to get a response...

Any ideas?

【Comments】:

    Tags: python python-2.7 request scrapy


    【Solution 1】:

    I am posting a workaround that solved my problem.

    I used the urllib2 library. See:

    import urllib2
    import scrapy
    from scrapy.http import HtmlResponse

    class CraigslistSpider(scrapy.Spider):
        name = "craigslist"
        allowed_domains = ["forums.craigslist.org"]
        start_urls = ['http://geo.craigslist.org/iso/us/']

        def error_handler(self, failure):
            print failure

        def parse(self, response):
            # Build the request with urllib2, bypassing Scrapy's downloader
            req = urllib2.Request('https://forums.craigslist.org/')
            # Get the content of this request
            pageContent = urllib2.urlopen(req).read()
            # Wrap the content in an HtmlResponse so Scrapy selectors work on it
            response = HtmlResponse(url=response.url, body=pageContent)
            print response.css(".forumlistcolumns li").extract()
    

    With this solution, you can turn a plain request into a valid Scrapy response and use it normally. There may be a better way, but this one works.

    【Discussion】:

      【Solution 2】:

      I think you are running into robots.txt. Try running your spider with

      custom_settings = {
          "ROBOTSTXT_OBEY": False
      }
      

      You can also test it with a command-line setting: scrapy crawl craigslist -s ROBOTSTXT_OBEY=False
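      This diagnosis can be checked without Scrapy at all: before downloading a page, Scrapy's robots.txt middleware simply asks a robots.txt parser whether the URL is allowed, and a disallowed URL makes the request fail over to the errback. A minimal sketch with the standard-library parser (the `Disallow: /` rule set below is hypothetical, not Craigslist's actual robots.txt; the import fallback covers Python 2, where the module was called `robotparser`):

      ```python
      try:
          from urllib.robotparser import RobotFileParser  # Python 3
      except ImportError:
          from robotparser import RobotFileParser         # Python 2

      rp = RobotFileParser()
      # Feed the parser a hypothetical rule set directly instead of
      # fetching a real robots.txt over HTTP
      rp.parse([
          "User-agent: *",
          "Disallow: /",
      ])

      # With a blanket Disallow, the URL is not fetchable -- this is the
      # situation in which a robots.txt-obeying crawler drops the request
      print(rp.can_fetch("*", "https://forums.craigslist.org/"))  # prints False
      ```

      Setting ROBOTSTXT_OBEY to False skips this check entirely.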

      【Discussion】:
