Question: Crawling all pages with scrapy and FormRequest
Posted: 2021-02-05 22:25:43
Description:

I want to scrape all the links from this site: https://www.formatic-centre.fr/formation/

Apparently the next pages are loaded dynamically with AJAX, so I need to use Scrapy's FormRequest to emulate those requests.

Here's what I did: I looked up the parameters with the dev tools: ajax1

I put those parameters into FormRequest, but apparently that wasn't enough: it doesn't work unless I also include the headers, so that's what I did next: ajax2

But that doesn't work either... I guess I'm doing something wrong, but what?

Here's my script if you want to look at it (sorry, it's long because I put in all the parameters and headers):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from lxml import html
from scrapy.http import FormRequest

class LinkSpider(scrapy.Spider):
    name = "link"
    #allow_domains = ['https://www.formatic-centre.fr/']
    start_urls = ['https://www.formatic-centre.fr/formation/']

    rules = (Rule(LinkExtractor(allow=r'formation'), callback="parse", follow= True),)

    def parse(self, response):
        card = response.xpath('//a[@class="title"]')
        for a in card:
            yield {'links': a.xpath('@href').get()}

        return [FormRequest(url="https://www.formatic-centre.fr/formation/",
            formdata={'action' :    "swlabscore",
                        'module[0]' : "top.Top_Controller",
                        'module[1]' : "ajax_get_course_pagination",
                        'page' :    "2",
                        'layout' :  "course",
                        'limit_post' :  "",
                        'offset_post' : "0",
                        'sort_by' : "",
                        'pagination' :  "yes",
                        'location_slug' :   "",
                        'columns' : "2",
                        'paged' :   "",
                        'cur_limit' :   "",
                        'rows': "0",
                        'btn_content' : "En+savoir+plus",
                        'uniq_id' : "block-13759488265f916bca45c89",
                        'ZmfUNQ': "63y[Jt",
                        'PmhpIuZ_cTnUxqg' : "7v@IahmJNMplbCu",
                        'cZWVDbSPzTXRe' : "n9oa2k5u4GHWm",
                        'eOBITfdGRuriQ' :   "hBPN5nObe.ktH",
                        "Accept" : "*/*",
                        "Accept-Encoding" : "gzip, deflate, br",
                        "Accept-Language" : "fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3",
                        "Connection" : "keep-alive",
                        "Content-Length" : "1010",
                        "Content-Type" : "application/x-www-form-urlencoded; charset=UTF-8",
                        "Cookie" : "_ga=GA1.2.815964309.1603392091; _gid=GA1.2.1686929506.1603392091; jlFYkafUWiyJe=LGAWcXg_wUjFo; z-byDgTnkdcQJSNH=03d1yiqH%40h8uZNtw; YeAhrFumyo-HQwpn=5uOhD6viWy%5BYeq3o",
                        "Host" : "www.formatic-centre.fr",
                        "Origin" : "https://www.formatic-centre.fr",
                        "Referer" : "https://www.formatic-centre.fr/formation/",
                        "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0",
                        "X-Requested-With" : "XMLHttpRequest",
                        "access-control-allow-credentials" : "true",
                        "access-control-allow-origin" : "https://www.formatic-centre.fr",
                        "cache-control" : "no-cache, must-revalidate, max-age=0",
                        "content-encoding": "gzip",
                        "content-length" :"2497",
                        "content-type" :"text/html; charset=UTF-8",
                        "date" :"Thu, 22 Oct 2020 18:42:54 GMT",
                        "expires" :"Wed, 11 Jan 1984 05:00:00 GMT",
                        "referrer-policy": "strict-origin-when-cross-origin",
                        "server": "Apache",
                        "set-cookie" : "jlFYkafUWiyJe=LGAWcXg_wUjFo; expires=Fri, 23-Oct-2020 18:42:54 GMT; Max-Age=86400; path=/; secure",
                        "set-cookie" : "z-byDgTnkdcQJSNH=03d1yiqH%40h8uZNtw; expires=Fri, 23-Oct-2020 18:42:54 GMT; Max-Age=86400; path=/; secure",
                        "set-cookie" : "YeAhrFumyo-HQwpn=5uOhD6viWy%5BYeq3o; expires=Fri, 23-Oct-2020 18:42:54 GMT; Max-Age=86400; path=/; secure",
                        "strict-transport-security" : "max-age=15552001; preload",
                        "vary" : "Accept-Encoding",
                        "x-content-type-options" : "nosniff",
                        "X-Firefox-Spdy" : "h2",
                        "x-frame-options" : "SAMEORIGIN",
                        "x-robots-tag" : "noindex"})]

The script works for the first page and I get the links, but when it gets to the FormRequest, nothing happens and I can't get the links from the next pages.

Any ideas?

Edit: I hadn't noticed, but the terminal gives me this error:

2020-10-23 03:51:30 [scrapy.core.engine] DEBUG: Crawled (400) <POST https://www.formatic-centre.fr/formation/> (referer: https://www.formatic-centre.fr/formation/) ['partial']
2020-10-23 03:51:30 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://www.formatic-centre.fr/formation/>: HTTP status code is not handled or not allowed

Maybe that helps?

Question discussion:

    Tags: python-3.x web-scraping scrapy


    Solution 1:

    There are a few problems with how you format and send both the headers and the payload itself.

    On top of that, you have to keep incrementing the page parameter so the server knows where you are and which response to send back.

    I didn't want to set up a new scrapy project, but this is how I got all the links, so hopefully it pushes you in the right direction:

    If it feels like a hack, well, that's because it is one.

    from urllib.parse import urlencode
    import requests
    from bs4 import BeautifulSoup
    
    
    headers = {
        "accept": "*/*",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-GB,en-US;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
        "origin": "https://www.formatic-centre.fr",
        "referer": "https://www.formatic-centre.fr/formation/",
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.99 Safari/537.36",
        "x-requested-with": "XMLHttpRequest",
    }
    
    # The raw form body exactly as captured in the dev tools, kept for reference (not used below):
    raw_string = "action=swlabscore&module%5B%5D=top.Top_Controller&module%5B%5D=ajax_get_course_pagination&params%5B0%5D%5Bpage%5D=2&params%5B0%5D%5Batts%5D%5Blayout%5D=course&params%5B0%5D%5Batts%5D%5Blimit_post%5D=&params%5B0%5D%5Batts%5D%5Boffset_post%5D=0&params%5B0%5D%5Batts%5D%5Bsort_by%5D=&params%5B0%5D%5Batts%5D%5Bpagination%5D=yes&params%5B0%5D%5Batts%5D%5Blocation_slug%5D=&params%5B0%5D%5Batts%5D%5Bcolumns%5D=2&params%5B0%5D%5Batts%5D%5Bpaged%5D=&params%5B0%5D%5Batts%5D%5Bcur_limit%5D=&params%5B0%5D%5Batts%5D%5Brows%5D=0&params%5B0%5D%5Batts%5D%5Bbtn_content%5D=En+savoir+plus&params%5B0%5D%5Batts%5D%5Buniq_id%5D=block-13759488265f916bca45c89&params%5B0%5D%5Batts%5D%5Bthumb-size%5D%5Blarge%5D=swedugate-thumb-300x225&params%5B0%5D%5Batts%5D%5Bthumb-size%5D%5Bno-image%5D=thumb-300x225.gif&params%5B0%5D%5Batts%5D%5Bthumb-size%5D%5Bsmall%5D=swedugate-thumb-300x225&params%5B0%5D%5Blayout_course%5D=style-grid&ZmfUNQ=63y[Jt&PmhpIuZ_cTnUxqg=7v@IahmJNMplbCu&cZWVDbSPzTXRe=n9oa2k5u4GHWm&eOBITfdGRuriQ=hBPN5nObe.ktH"
    
    payloadd = [
        ('action', 'swlabscore'),
        ('module[]', 'top.Top_Controller'),
        ('module[]', 'ajax_get_course_pagination'),
        ('params[0][page]', '1'),
        ('params[0][atts][layout]', 'course'),
        ('params[0][atts][offset_post]', '0'),
        ('params[0][atts][pagination]', 'yes'),
        ('params[0][atts][columns]', '2'),
        ('params[0][atts][rows]', '0'),
        ('params[0][atts][btn_content]', 'En savoir plus'),
        ('params[0][atts][uniq_id]', 'block-13759488265f916bca45c89'),
        ('params[0][atts][thumb-size][large]', 'swedugate-thumb-300x225'),
        ('params[0][atts][thumb-size][no-image]', 'thumb-300x225.gif'),
        ('params[0][atts][thumb-size][small]', 'swedugate-thumb-300x225'),
        ('params[0][layout_course]', 'style-grid'),
        ('ZmfUNQ', '63y[Jt'),
        ('PmhpIuZ_cTnUxqg', '7v@IahmJNMplbCu'),
        ('cZWVDbSPzTXRe', 'n9oa2k5u4GHWm'),
        ('eOBITfdGRuriQ', 'hBPN5nObe.ktH'),
    ]
    
    all_links = []
    for page in range(1, 10):
        # overwrite the page entry (index 3 in the payload) on each iteration
        payloadd[3] = ('params[0][page]', str(page))
        response = requests.post(
            "https://www.formatic-centre.fr/wp-admin/admin-ajax.php?",
            headers=headers,
            data=urlencode(payloadd)
        )
        print(f"Getting links from page {page}...")
        soup = BeautifulSoup(response.text, "html.parser").find_all("a", class_="btn btn-green")
        links = [i["href"] for i in soup]
        print('\n'.join(links))
        all_links.extend(links)
    
    
    with open("formatic-center_links.txt", "w") as f:
        f.writelines("\n".join(all_links) + "\n")
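
    One detail worth calling out: the payload is a list of tuples rather than a dict because the `module[]` key appears twice, and a dict would silently keep only one of the two entries. `urlencode` preserves repeated keys when given a sequence of pairs:

    ```python
    from urllib.parse import urlencode

    # A dict would collapse the two module[] entries into one;
    # a list of tuples keeps both, in order.
    pairs = [
        ("action", "swlabscore"),
        ("module[]", "top.Top_Controller"),
        ("module[]", "ajax_get_course_pagination"),
    ]
    print(urlencode(pairs))
    # action=swlabscore&module%5B%5D=top.Top_Controller&module%5B%5D=ajax_get_course_pagination
    ```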
    
    

    This produces a file containing all the links behind the EN SAVOIR PLUS buttons.

    https://www.formatic-centre.fr/formation/les-regles-juridiques-du-teletravail/
    https://www.formatic-centre.fr/formation/mieux-gerer-son-stress-en-periode-du-covid-19/
    https://www.formatic-centre.fr/formation/dynamiser-vos-equipes-special-post-confinement/
    https://www.formatic-centre.fr/formation/conduire-ses-entretiens-specifique-post-confinement/
    https://www.formatic-centre.fr/formation/cours-excel/
    https://www.formatic-centre.fr/formation/autocad-3d-2/
    https://www.formatic-centre.fr/formation/concevoir-et-developper-une-strategie-marketing/
    https://www.formatic-centre.fr/formation/preparer-soutenance/
    https://www.formatic-centre.fr/formation/mettre-en-place-une-campagne-adwords/
    https://www.formatic-centre.fr/formation/utiliser-google-analytics/
    and so on ...
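
    For what it's worth, the same POST can also be issued from Scrapy itself. Below is an untested sketch along the lines of the requests script above (the spider name and callback are made up); the key point is that browser headers go into `headers=`, never into `formdata`, and `formdata` accepts a list of tuples so the repeated `module[]` keys survive:

    ```python
    import scrapy
    from scrapy.http import FormRequest


    class CourseLinkSpider(scrapy.Spider):
        # Hypothetical spider; endpoint and payload copied from the requests script above.
        name = "course_links"

        def start_requests(self):
            for page in range(1, 10):
                # formdata also accepts a list of tuples, so the repeated
                # module[] keys survive the encoding.
                payload = [
                    ("action", "swlabscore"),
                    ("module[]", "top.Top_Controller"),
                    ("module[]", "ajax_get_course_pagination"),
                    ("params[0][page]", str(page)),
                    ("params[0][atts][layout]", "course"),
                    ("params[0][atts][pagination]", "yes"),
                    ("params[0][atts][columns]", "2"),
                    ("params[0][atts][rows]", "0"),
                    ("params[0][atts][btn_content]", "En savoir plus"),
                    ("params[0][atts][uniq_id]", "block-13759488265f916bca45c89"),
                    ("params[0][layout_course]", "style-grid"),
                ]
                yield FormRequest(
                    "https://www.formatic-centre.fr/wp-admin/admin-ajax.php",
                    formdata=payload,
                    # Request headers only; response headers, cookies and the like
                    # never belong in the form body.
                    headers={
                        "X-Requested-With": "XMLHttpRequest",
                        "Referer": "https://www.formatic-centre.fr/formation/",
                    },
                    callback=self.parse_page,
                )

        def parse_page(self, response):
            for href in response.css("a.btn.btn-green::attr(href)").getall():
                yield {"link": href}
    ```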
    

    Discussion:

    • Wow, what a great answer. I'll check it out and keep you posted. :) Thanks a lot
    • This works like a charm, well done! Just out of curiosity, wasn't it feasible with scrapy? And do you think this script could be reused for other sites, changing the headers and so on, assuming the site loads its content dynamically with AJAX?
    • To be honest I'm not that familiar with scrapy, but I'm sure it can do this too. If you found the answer useful, please accept and/or upvote it.
    • Upvoted and accepted :) Do you think this script could be reused for other sites, changing the headers and so on, assuming they load their content dynamically with AJAX?
    • Oh, I'm not sure. AJAX-driven sites are best handled case by case.