Scrapy spider returns 200 response after a few minutes

【Question title】: Scrapy spider returns 200 response after a few minutes
【Posted】: 2017-02-23 18:41:19
【Question】:

I am running into a dynamic-content problem while trying to scrape a website. I just added Splash to my Scrapy setup using Docker:

https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/

Unfortunately, I am still not capturing the content, possibly because of dynamic content.

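My code runs and captures content, but after scraping about 4000 pages it extracts nothing and only logs the following line for the next 6000 pages, most of which do have data: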

[scrapy.core.engine] DEBUG: Crawled (200) <GET http://www...> (referer: None)

Here is my spider code:

import scrapy
from scrapy_splash import SplashRequest


class PeopleSpider(scrapy.Spider):
    name = "people"
    start_urls = [
        'http://www.canada411.ca/res/%s/' % page for page in xrange(5192080000, 5192090000)
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 2},
            )

    def parse(self, response):
        for people in response.css('div#contact'):
            yield {
                'name': people.css('h1.vcard__name::text').extract_first().strip().title(),
                'address': people.css('div.vcard__address::text').extract_first().strip().split(',')[0].strip(),
                'city': people.css('div.vcard__address::text').extract_first().strip().split(',')[1].strip().split(' ')[0].strip(),
                'province': people.css('div.vcard__address::text').extract_first().strip().split(',')[1].strip().split(' ')[1].strip(),
                'postal code': people.css('div.vcard__address::text').extract_first().split(',')[2].strip().replace(' ',''),
                'phone': people.css('span.vcard__label::text').extract_first().replace('(','').replace(')','').replace('-','').replace(' ',''),
            }

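【Comments】:

Maybe the website you are scraping has started showing a CAPTCHA.
Interesting, is there any solution for that?
I can't post code or a solution, but I suggest saving the response HTML to a file whenever you get no data, then opening that file in a browser to see why the name, address, etc. are missing from the page.
I did: if not response.meta.get('solve_captcha', False): print "CAPTCHA", and you are right, it is a CAPTCHA problem.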

Tags: scrapy scrapy-spider splash-screen

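【Solution 1】:

When you are not getting data, save the response HTML to a file, then open that file in a browser to see why the name, address, etc. are not present on the page.

I suspect they are showing a CAPTCHA because of the continuous requests coming from the same IP.

If they are showing a CAPTCHA, you can use a proxy service to avoid it.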
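As a rough illustration of the proxy suggestion, the sketch below rotates through a list of proxies. It assumes the pages are fetched through Splash, whose render endpoints accept a `proxy` argument; the proxy URLs and the `splash_args` helper are placeholders invented for this example, not part of the original code.

```python
from itertools import cycle

# Placeholder proxy URLs (assumption): replace with endpoints from your provider.
PROXIES = cycle([
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
])


def splash_args(wait=2):
    """Build the args dict for a SplashRequest, taking the next proxy in turn."""
    return {"wait": wait, "proxy": next(PROXIES)}
```

In the spider this could be used as `SplashRequest(url, self.parse, endpoint='render.html', args=splash_args())`. Without Splash, setting `request.meta['proxy']` on a plain Scrapy request has a similar effect.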

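Also create a downloader middleware and, inside its process_response method (the hook that receives the downloaded page), check whether there is a CAPTCHA; if so, crawl that link again using the dont_filter=True parameter.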
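A minimal sketch of such a middleware follows. A CAPTCHA check needs the downloaded page, so the check lives in `process_response`. The class name, the `captcha` marker, the `captcha_retries` meta key and the retry limit are all assumptions made for illustration; the real marker should come from inspecting a saved CAPTCHA page, and the interaction with scrapy-splash's own middleware has not been verified here.

```python
class CaptchaRetryMiddleware(object):
    """Hypothetical downloader middleware: re-queue pages that look like a CAPTCHA."""

    # Assumed marker; check a saved CAPTCHA page to find a reliable one.
    CAPTCHA_MARKERS = (b"captcha",)
    # Give up after this many retries so a blocked URL cannot loop forever.
    MAX_RETRIES = 3

    def process_response(self, request, response, spider):
        body = response.body.lower()
        if not any(marker in body for marker in self.CAPTCHA_MARKERS):
            return response  # normal page: pass it on to the spider

        retries = request.meta.get("captcha_retries", 0)
        if retries >= self.MAX_RETRIES:
            return response  # stop retrying and let the spider see the page

        # dont_filter=True keeps the duplicate filter from dropping the retry.
        retry = request.replace(dont_filter=True)
        retry.meta["captcha_retries"] = retries + 1
        return retry
```

To enable it, the class would be listed in the project's `DOWNLOADER_MIDDLEWARES` setting under whatever module path it is saved at. Retrying from the same IP is unlikely to help on its own, so this is best combined with a proxy change or a delay.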

Edit

You can use this code to write to a file. By the way, just google it and you will find many ways to write to a file in Python.

# response.body is raw bytes, so open the file in binary write mode
with open('response.html', 'wb') as the_file:
     the_file.write(response.body)
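The snippet above always writes to the same response.html, so during a crawl each failed page replaces the previous one. A small sketch that keeps one file per URL is shown below; the `save_response_body` helper and the `failed_pages` directory name are assumptions made for this example.

```python
import hashlib
import os


def save_response_body(url, body, directory="failed_pages"):
    """Write raw response bytes to a file named after a hash of the URL."""
    if not os.path.isdir(directory):
        os.makedirs(directory)
    # Hash the URL so the file name is short and safe on any filesystem.
    name = hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"
    path = os.path.join(directory, name)
    with open(path, "wb") as the_file:
        the_file.write(body)
    return path
```

Inside `parse`, it could be called when the selector finds nothing, for example `if not response.css('div#contact'): save_response_body(response.url, response.body)`.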

【Comments】: