[Posted]: 2021-02-05 22:25:43
[Problem description]:
I want to scrape all the links on this site: https://www.formatic-centre.fr/formation/
Apparently the next pages are loaded dynamically with AJAX, so I need to simulate those requests with Scrapy's FormRequest.
That's what I did: I looked up the parameters with the dev tools: ajax1
I put those parameters into FormRequest, but apparently that alone doesn't work and I need to include the headers too, so here is what I did: ajax2
But it still doesn't work. I guess I'm doing something wrong, but what?
Here is my script, if you want it (sorry, it's long, because I put in all the parameters and headers):
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from lxml import html
from scrapy.http import FormRequest


class LinkSpider(scrapy.Spider):
    name = "link"
    # allow_domains = ['https://www.formatic-centre.fr/']
    start_urls = ['https://www.formatic-centre.fr/formation/']
    rules = (Rule(LinkExtractor(allow=r'formation'), callback="parse", follow=True),)

    def parse(self, response):
        card = response.xpath('//a[@class="title"]')
        for a in card:
            yield {'links': a.xpath('@href').get()}
        return [FormRequest(
            url="https://www.formatic-centre.fr/formation/",
            formdata={
                'action': "swlabscore",
                'module[0]': "top.Top_Controller",
                'module[1]': "ajax_get_course_pagination",
                'page': "2",
                'layout': "course",
                'limit_post': "",
                'offset_post': "0",
                'sort_by': "",
                'pagination': "yes",
                'location_slug': "",
                'columns': "2",
                'paged': "",
                'cur_limit': "",
                'rows': "0",
                'btn_content': "En+savoir+plus",
                'uniq_id': "block-13759488265f916bca45c89",
                'ZmfUNQ': "63y[Jt",
                'PmhpIuZ_cTnUxqg': "7v@IahmJNMplbCu",
                'cZWVDbSPzTXRe': "n9oa2k5u4GHWm",
                'eOBITfdGRuriQ': "hBPN5nObe.ktH",
                "Accept": "*/*",
                "Accept-Encoding": "gzip, deflate, br",
                "Accept-Language": "fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3",
                "Connection": "keep-alive",
                "Content-Length": "1010",
                "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
                "Cookie": "_ga=GA1.2.815964309.1603392091; _gid=GA1.2.1686929506.1603392091; jlFYkafUWiyJe=LGAWcXg_wUjFo; z-byDgTnkdcQJSNH=03d1yiqH%40h8uZNtw; YeAhrFumyo-HQwpn=5uOhD6viWy%5BYeq3o",
                "Host": "www.formatic-centre.fr",
                "Origin": "https://www.formatic-centre.fr",
                "Referer": "https://www.formatic-centre.fr/formation/",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0",
                "X-Requested-With": "XMLHttpRequest",
                "access-control-allow-credentials": "true",
                "access-control-allow-origin": "https://www.formatic-centre.fr",
                "cache-control": "no-cache, must-revalidate, max-age=0",
                "content-encoding": "gzip",
                "content-length": "2497",
                "content-type": "text/html; charset=UTF-8",
                "date": "Thu, 22 Oct 2020 18:42:54 GMT",
                "expires": "Wed, 11 Jan 1984 05:00:00 GMT",
                "referrer-policy": "strict-origin-when-cross-origin",
                "server": "Apache",
                "set-cookie": "jlFYkafUWiyJe=LGAWcXg_wUjFo; expires=Fri, 23-Oct-2020 18:42:54 GMT; Max-Age=86400; path=/; secure",
                "set-cookie": "z-byDgTnkdcQJSNH=03d1yiqH%40h8uZNtw; expires=Fri, 23-Oct-2020 18:42:54 GMT; Max-Age=86400; path=/; secure",
                "set-cookie": "YeAhrFumyo-HQwpn=5uOhD6viWy%5BYeq3o; expires=Fri, 23-Oct-2020 18:42:54 GMT; Max-Age=86400; path=/; secure",
                "strict-transport-security": "max-age=15552001; preload",
                "vary": "Accept-Encoding",
                "x-content-type-options": "nosniff",
                "X-Firefox-Spdy": "h2",
                "x-frame-options": "SAMEORIGIN",
                "x-robots-tag": "noindex"})]
The script works for the first page and I get its links, but when it comes to the FormRequest, nothing happens and I can't get the links from the next pages.
Any ideas?
EDIT: I hadn't noticed it, but the terminal shows me this error:
2020-10-23 03:51:30 [scrapy.core.engine] DEBUG: Crawled (400) <POST https://www.formatic-centre.fr/formation/> (referer: https://www.formatic-centre.fr/formation/) ['partial']
2020-10-23 03:51:30 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://www.formatic-centre.fr/formation/>: HTTP status code is not handled or not allowed
Maybe that helps?
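A likely cause of the 400, for anyone reading: everything passed in `formdata` is URL-encoded into the POST body, so the script above is sending the HTTP headers (and even copied *response* headers like `set-cookie`) as form fields, and the hard-coded `Content-Length` no longer matches. A minimal sketch of the split, assuming only the AJAX parameters belong in the body (the parameter subset is illustrative, taken from the question):

```python
from urllib.parse import urlencode

# Form fields: these go into the POST body (FormRequest's formdata=).
formdata = {
    'action': 'swlabscore',
    'module[0]': 'top.Top_Controller',
    'module[1]': 'ajax_get_course_pagination',
    'page': '2',
}

# HTTP headers: these go into FormRequest's headers= argument, never formdata.
headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
}

# This is the body FormRequest would build from formdata
body = urlencode(formdata)
print(body)
```

In the spider this would become `yield FormRequest(url=..., formdata=formdata, headers=headers, callback=self.parse)`. Scrapy computes `Content-Length` and manages cookies itself, so those should not be set by hand, and response headers (`server`, `date`, `set-cookie`, ...) should not be sent at all.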
[Discussion]:
Tags: python-3.x web-scraping scrapy