Scrapy and captcha

【Question title】: Scrapy & captcha
【Posted】: 2015-01-14 16:54:59
【Question】:

I am using scrapy to submit a form on the site https://www.barefootstudent.com/jobs (on any job page it links to, e.g. http://www.barefootstudent.com/los_angeles/jobs/full_time/full_time_nanny_needed_in_venice_217021).

My scrapy bot logs in successfully, but I can't get past the captcha. To submit the form I use scrapy.FormRequest.from_response:

frq = scrapy.FormRequest.from_response(response, formdata={
    'message': 'itttttttt', 'security': captcha, 'name': 'fx',
    'category_id': '2', 'email': 'ololo%40gmail.com', 'item_id': '216640_2',
    'location': '18', 'send_message': 'Send%20Message',
}, callback=self.afterForm)

yield frq

I want to download the captcha image from this page and type it in by hand while the script is running, e.g.

captcha = raw_input("put captcha in manually>")

I tried

 urllib.urlretrieve(captcha, "./captcha.jpg")

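But this loads the wrong captcha (the site rejects my input). I tried calling urllib.urlretrieve repeatedly within a single run of the script, and it returns a different captcha every time :(

After that I tried using ImagesPipeline. My problem there is that the item is only returned (and the image downloaded) after the function has finished executing, even though I use yield.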

item = BfsItem()
item['image_urls'] = [captcha]
yield item
captcha = raw_input("put captcha in manually>")
frq = scrapy.FormRequest.from_response(response, formdata={
    'message': 'itttttttt', 'security': captcha, 'name': 'fx',
    'category_id': '2', 'email': 'ololo%40gmail.com', 'item_id': '216640_2',
    'location': '18', 'send_message': 'Send%20Message',
}, callback=self.afterForm)
yield frq

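At the moment my script asks for input, the image has not been downloaded yet!

How can I modify my script so that the FormRequest is only issued after I have entered the captcha manually?

Thanks a lot!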

【Comments】:

Tags: python scrapy captcha

【Answer 1】:

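The approach I use, which usually works well, looks like this (just a gist; you need to add your own specifics):

Step 1 - get the captcha URL (and keep the form's response for later use)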

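import scrapy

def parse_page_with_captcha(self, response):
    # the xpath is elided in the original; it has to select the captcha image's src
    captcha_url = response.urljoin(response.xpath(...).get())
    data_for_later = {'captcha_form': response}  # store the response for later use
    return scrapy.Request(captcha_url, callback=self.parse_captcha_download, meta=data_for_later)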

Step 2 - scrapy now downloads the image, and we have to handle it properly in a scrapy callback

from io import BytesIO
from PIL import Image

def parse_captcha_download(self, response):
    captcha_target_filename = 'filename.png'
    # save the image for processing; response.body is bytes, hence BytesIO
    i = Image.open(BytesIO(response.body))
    i.save(captcha_target_filename)

    # process the captcha (OCR, or sending it to a decaptcha service, etc ...)
    captcha_text = solve_captcha(captcha_target_filename)

    # and now we have all the data we need for building the form request
    captcha_form = response.meta['captcha_form']

    # FormRequest URL-encodes formdata itself, so the values are passed unencoded
    return scrapy.FormRequest.from_response(captcha_form, formdata={
        'message': 'itttttttt', 'security': captcha_text, 'name': 'fx',
        'category_id': '2', 'email': 'ololo@gmail.com', 'item_id': '216640_2',
        'location': '18', 'send_message': 'Send Message',
    }, callback=self.afterForm)
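
The solve_captcha helper is not defined in this answer. Since the question asks for typing the captcha in by hand, one possible version is sketched below. It is only an illustration: the `prompt` parameter is an assumption added here so the function can be exercised without a terminal, and on Python 2 the default would be raw_input rather than input.

```python
def solve_captcha(image_path, prompt=input):
    """Ask a human to read the saved captcha image and type its text.

    `prompt` is a hypothetical hook; it defaults to the built-in input()
    and is injectable only so the function can run non-interactively.
    """
    answer = prompt("Open %s and type the captcha text> " % image_path)
    # Surrounding whitespace and a trailing newline are not part of the answer.
    return answer.strip()


# Non-interactive usage: a stand-in prompt that "types" the answer.
typed = solve_captcha("filename.png", prompt=lambda message: "  x7Kq2 \n")
print(typed)
```

Note that waiting for keyboard input inside a callback blocks the whole crawl until the user answers, which is usually acceptable for a small one-off script like this one.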

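Important details

A captcha-protected form needs some way to link the captcha image to the particular user/client who saw and answered that captcha. This is usually done with a cookie-based session, or with special parameters/image tokens hidden in the captcha form.

The scraper code must be careful not to break this link, otherwise it will answer some other captcha rather than the one it has to answer.

Why do the two examples posted by Verz1Lka not work?

The urllib.urlretrieve approach works entirely outside of scrapy. While that is generally a bad idea (it does not benefit from scrapy's scheduling etc.), the main problem here is this: the request runs completely outside whatever the target site uses (session cookies, URL parameters, etc.) to keep track of which captcha was sent to which browser.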
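
To make the link between captcha and session concrete, here is a toy in-memory model. It is purely an illustration under assumptions: ToyCaptchaSite, its methods and the session names are hypothetical, and the real site's mechanism is not known. It shows a server that remembers, per session, which captcha it last handed out.

```python
import itertools


class ToyCaptchaSite:
    """Hypothetical stand-in for a site that binds each captcha to a session."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._expected = {}  # session id -> captcha text last shown to it

    def get_captcha(self, session_id):
        # Every image request produces a fresh captcha for that session only.
        text = "text%d" % next(self._counter)
        self._expected[session_id] = text
        return text

    def submit(self, session_id, answer):
        # The answer is checked against what this particular session was shown.
        return self._expected.get(session_id) == answer


site = ToyCaptchaSite()

# Image downloaded and form submitted within the same session: accepted.
shown = site.get_captcha("scrapy-session")
same_session_ok = site.submit("scrapy-session", shown)

# Image downloaded by a second client that does not share the cookie: rejected.
site.get_captcha("scrapy-session")
other = site.get_captcha("urllib-session")
cross_session_ok = site.submit("scrapy-session", other)

print(same_session_ok, cross_session_ok)
```

In this model a captcha fetched by another client is always a different captcha from the one the form session is expected to answer, which matches the symptom described in the question.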

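The image pipeline approach, on the other hand, plays well by Scrapy's rules, but those image downloads are scheduled to happen later, so the captcha download is not available when it is needed.

【Comments】:

Hey @Will, fix your scraper, can you help me with this question: stackoverflow.com/questions/57236421/…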
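
The timing problem can be illustrated with a toy model. This is a deliberately simplified sketch and not Scrapy's actual engine: toy_engine and spider_callback are hypothetical names. Work yielded by a callback is only queued, and it is carried out after the callback has given control back.

```python
from collections import deque


def spider_callback(log):
    """Stands in for the callback from the question."""
    yield "download captcha.jpg"  # like `yield item` with image_urls set
    log.append("callback: asking for captcha text")  # stands in for raw_input()
    yield "submit form"


def toy_engine(cb):
    """Hypothetical engine: queue the yielded work first, perform it afterwards."""
    log = []
    pending = deque()
    for work in cb(log):
        pending.append(work)
        log.append("engine: queued " + work)
    while pending:
        log.append("engine: finished " + pending.popleft())
    return log


log = toy_engine(spider_callback)
for line in log:
    print(line)
```

In the resulting log the prompt for the captcha text appears before the download has finished, which is the behaviour the question describes. Chaining callbacks, as in steps 1 and 2 above, avoids this because the form request is only built once the image response has arrived.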

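【Answer 2】:

You are downloading different captcha images because you are not using the same cookies you received when you requested the form URL. Scrapy manages cookies by itself, so it is better to download the image with scrapy as well. https://doc.scrapy.org/en/latest/topics/media-pipeline.html

【Comments】:

You are right about the cookies (see my answer), but not about using the media pipeline. That does not work because the scraper cannot access the downloaded captcha image at the moment it needs it.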