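【Posted】: 2019-05-03 23:01:21
【Question】:
I have a Scrapy spider that can successfully log in to ancestry.com. I then use that authenticated session to request a new link, and I can successfully scrape the first page of that link. The problem occurs when I try to go to the second page. I get a 302 redirect debug message and end up at this URL: https://secure.ancestry.com/error/reqvalidation.aspx?aspxerrorpath=http%3a%2f%2fsearch.ancestry.com%2ferror%2fPageNotFound&msg=&ti=0
I followed the documentation and some suggestions from this site to get this far. Does each page need a session token? If so, how do I supply it?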
import logging

import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http import FormRequest
from scrapy.loader import ItemLoader
from ..items import AncItem


class AncestrySpider(CrawlSpider):
    name = 'ancestry'

    def start_requests(self):
        # Log in first; after_login runs on the sign-in response.
        return [
            FormRequest(
                'https://www.ancestry.com/account/signin?returnUrl=https%3A%2F%2Fwww.ancestry.com',
                formdata={"username": "foo", "password": "bar"},
                callback=self.after_login
            )
        ]

    def after_login(self, response):
        if "authentication failed".encode() in response.body:
            self.log("Login failed", level=logging.ERROR)
            return
        else:
            # Request the first page of search results with the logged-in session.
            return Request(url='https://www.ancestry.com/search/collections/nypl/?name=_Wang&count=50&name_x=_1',
                           callback=self.parse)

    def parse(self, response):
        # One item per result row.
        all_products = response.xpath("//tr[@class='tblrow record']")
        for product in all_products:
            loader = ItemLoader(item=AncItem(), selector=product, response=response)
            loader.add_css('Name', '.srchHit')
            loader.add_css('Arrival_Date', 'td:nth-child(3)')
            loader.add_css('Birth_Year', 'td:nth-child(4)')
            loader.add_css('Port_of_Departure', 'td:nth-child(5)')
            loader.add_css('Ethnicity_Nationality', 'td:nth-child(6)')
            loader.add_css('Ship_Name', 'td:nth-child(7)')
            yield loader.load_item()
        # Follow the "next page" button, if there is one.
        next_page = response.xpath('//a[@class="ancBtn sml green icon iconArrowRight"]').extract_first()
        if next_page is not None:
            next_page_link = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback=self.parse)
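I tried adding some request header information. I tried adding the cookie information to the request headers, but that did not work. I also tried using only the User-Agent listed in the POST packet.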
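For context on the cookie attempts: Scrapy's built-in cookie middleware is enabled by default and re-sends cookies set by earlier responses on later requests from the same spider, so copying cookies into headers by hand is normally not needed. A minimal settings sketch, assuming a standard project `settings.py`, that makes the cookie exchange visible in the log:

```python
# settings.py (sketch): both names are standard Scrapy settings.
COOKIES_ENABLED = True  # the default; keeps the session cookie handling active
COOKIES_DEBUG = True    # log cookies sent in requests and received in responses
```

If the log shows the session cookies being sent with the page-two request, a missing session token is less likely to be the cause of the redirect.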
Right now I only get 50 results. After crawling all the pages I should get several hundred.
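One thing that may be worth checking is the `next_page` expression: the XPath selects the `<a>` element itself, so `extract_first()` returns the serialized tag rather than its `href`. The sketch below uses only the standard library to show the difference when such a string is joined to the page URL. The markup and the `?page=2` href are hypothetical, since the real page's link is not shown in the question, and `HrefCollector` is an illustrative helper, not part of Scrapy.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes and attrs.get("href"):
            self.hrefs.append(attrs["href"])


page_url = "https://www.ancestry.com/search/collections/nypl/?name=_Wang&count=50&name_x=_1"
# Hypothetical markup for the "next page" button.
anchor_html = '<a class="ancBtn sml green icon iconArrowRight" href="?page=2"></a>'

# Joining the whole serialized element leaves the markup inside the URL.
bad_link = urljoin(page_url, anchor_html)

# Joining only the href value gives a clean absolute URL.
collector = HrefCollector("iconArrowRight")
collector.feed(anchor_html)
good_link = urljoin(page_url, collector.hrefs[0])

print(bad_link)
print(good_link)
```

In the spider itself, the equivalent change would presumably be to select the attribute, for example by appending `/@href` to the XPath before calling `extract_first()`. Whether that alone resolves the redirect depends on the site's actual markup.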
【Comments】:
- Crawl results:
- DEBUG: Redirecting (302) from ancestry.com/search/collections/nypl/… 3E> to search.ancestry.com/search/collections/nypl/… 50&fsk=MDs0OTs1MA-61--61-%22%3E%3C/a%3E>
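The tail of the redirect target in the debug line is worth decoding, since percent-escapes like these usually stand for characters that were never meant to be part of a URL. A small standard-library sketch, using only the fragment visible in that line:

```python
from urllib.parse import unquote

# Tail of the redirect target quoted in the debug line.
tail = "%22%3E%3C/a%3E"
decoded_tail = unquote(tail)
print(decoded_tail)
```

The decoded text is a quote mark followed by `></a>`, the closing part of an anchor tag, which suggests that HTML markup rather than a bare link ended up in the requested URL.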
Tags: python-3.x authentication scrapy session-cookies