Question: Crawling all pages with scrapy and FormRequest
Posted: 2021-02-05 22:25:43
Description:

I want to scrape all the links from this site: https://www.formatic-centre.fr/formation/

Apparently the next pages are loaded dynamically with AJAX, so I need to use Scrapy's FormRequest to emulate those requests.

Here's what I did: I looked up the parameters with the dev tools: ajax1

I put those parameters into FormRequest, but apparently that wasn't enough: it doesn't work unless I also include the headers, so that's what I did next: ajax2

But that doesn't work either... I guess I'm doing something wrong, but what?

Here's my script if you want to look at it (sorry, it's long because I put in all the parameters and headers):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from lxml import html
from scrapy.http import FormRequest

class LinkSpider(scrapy.Spider):
    name = "link"
    #allow_domains = ['https://www.formatic-centre.fr/']
    start_urls = ['https://www.formatic-centre.fr/formation/']

    rules = (Rule(LinkExtractor(allow=r'formation'), callback="parse", follow= True),)

    def parse(self, response):
        card = response.xpath('//a[@class="title"]')
        for a in card:
            yield {'links': a.xpath('@href').get()}

        return [FormRequest(url="https://www.formatic-centre.fr/formation/",
            formdata={'action' :    "swlabscore",
                        'module[0]' : "top.Top_Controller",
                        'module[1]' : "ajax_get_course_pagination",
                        'page' :    "2",
                        'layout' :  "course",
                        'limit_post' :  "",
                        'offset_post' : "0",
                        'sort_by' : "",
                        'pagination' :  "yes",
                        'location_slug' :   "",
                        'columns' : "2",
                        'paged' :   "",
                        'cur_limit' :   "",
                        'rows': "0",
                        'btn_content' : "En+savoir+plus",
                        'uniq_id' : "block-13759488265f916bca45c89",
                        'ZmfUNQ': "63y[Jt",
                        'PmhpIuZ_cTnUxqg' : "7v@IahmJNMplbCu",
                        'cZWVDbSPzTXRe' : "n9oa2k5u4GHWm",
                        'eOBITfdGRuriQ' :   "hBPN5nObe.ktH",
                        "Accept" : "*/*",
                        "Accept-Encoding" : "gzip, deflate, br",
                        "Accept-Language" : "fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3",
                        "Connection" : "keep-alive",
                        "Content-Length" : "1010",
                        "Content-Type" : "application/x-www-form-urlencoded; charset=UTF-8",
                        "Cookie" : "_ga=GA1.2.815964309.1603392091; _gid=GA1.2.1686929506.1603392091; jlFYkafUWiyJe=LGAWcXg_wUjFo; z-byDgTnkdcQJSNH=03d1yiqH%40h8uZNtw; YeAhrFumyo-HQwpn=5uOhD6viWy%5BYeq3o",
                        "Host" : "www.formatic-centre.fr",
                        "Origin" : "https://www.formatic-centre.fr",
                        "Referer" : "https://www.formatic-centre.fr/formation/",
                        "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0",
                        "X-Requested-With" : "XMLHttpRequest",
                        "access-control-allow-credentials" : "true",
                        "access-control-allow-origin" : "https://www.formatic-centre.fr",
                        "cache-control" : "no-cache, must-revalidate, max-age=0",
                        "content-encoding": "gzip",
                        "content-length" :"2497",
                        "content-type" :"text/html; charset=UTF-8",
                        "date" :"Thu, 22 Oct 2020 18:42:54 GMT",
                        "expires" :"Wed, 11 Jan 1984 05:00:00 GMT",
                        "referrer-policy": "strict-origin-when-cross-origin",
                        "server": "Apache",
                        "set-cookie" : "jlFYkafUWiyJe=LGAWcXg_wUjFo; expires=Fri, 23-Oct-2020 18:42:54 GMT; Max-Age=86400; path=/; secure",
                        "set-cookie" : "z-byDgTnkdcQJSNH=03d1yiqH%40h8uZNtw; expires=Fri, 23-Oct-2020 18:42:54 GMT; Max-Age=86400; path=/; secure",
                        "set-cookie" : "YeAhrFumyo-HQwpn=5uOhD6viWy%5BYeq3o; expires=Fri, 23-Oct-2020 18:42:54 GMT; Max-Age=86400; path=/; secure",
                        "strict-transport-security" : "max-age=15552001; preload",
                        "vary" : "Accept-Encoding",
                        "x-content-type-options" : "nosniff",
                        "X-Firefox-Spdy" : "h2",
                        "x-frame-options" : "SAMEORIGIN",
                        "x-robots-tag" : "noindex"})]

The script works for the first page and I get the links, but when it gets to the FormRequest, nothing happens and I can't get the links from the next pages.

Any ideas?

Edit: I hadn't noticed, but the terminal gives me this error:

2020-10-23 03:51:30 [scrapy.core.engine] DEBUG: Crawled (400) <POST https://www.formatic-centre.fr/formation/> (referer: https://www.formatic-centre.fr/formation/) ['partial']
2020-10-23 03:51:30 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://www.formatic-centre.fr/formation/>: HTTP status code is not handled or not allowed

Maybe that helps?

Question discussion:

    Tags: python-3.x web-scraping scrapy


    Solution 1:

    There are a few problems with how you format and send both the headers and the payload itself.

    On top of that, you have to keep incrementing the page parameter so the server knows where you are and which response to send back.

    I didn't want to set up a new scrapy project, but this is how I got all the links, so hopefully it pushes you in the right direction:

    If it feels like a hack, well, that's because it is one.

    from urllib.parse import urlencode
    import requests
    from bs4 import BeautifulSoup
    
    
    headers = {
        "accept": "*/*",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-GB,en-US;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
        "origin": "https://www.formatic-centre.fr",
        "referer": "https://www.formatic-centre.fr/formation/",
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.99 Safari/537.36",
        "x-requested-with": "XMLHttpRequest",
    }
    
    # The raw form body exactly as captured in the dev tools, kept for reference (not used below):
    raw_string = "action=swlabscore&module%5B%5D=top.Top_Controller&module%5B%5D=ajax_get_course_pagination&params%5B0%5D%5Bpage%5D=2&params%5B0%5D%5Batts%5D%5Blayout%5D=course&params%5B0%5D%5Batts%5D%5Blimit_post%5D=&params%5B0%5D%5Batts%5D%5Boffset_post%5D=0&params%5B0%5D%5Batts%5D%5Bsort_by%5D=&params%5B0%5D%5Batts%5D%5Bpagination%5D=yes&params%5B0%5D%5Batts%5D%5Blocation_slug%5D=&params%5B0%5D%5Batts%5D%5Bcolumns%5D=2&params%5B0%5D%5Batts%5D%5Bpaged%5D=&params%5B0%5D%5Batts%5D%5Bcur_limit%5D=&params%5B0%5D%5Batts%5D%5Brows%5D=0&params%5B0%5D%5Batts%5D%5Bbtn_content%5D=En+savoir+plus&params%5B0%5D%5Batts%5D%5Buniq_id%5D=block-13759488265f916bca45c89&params%5B0%5D%5Batts%5D%5Bthumb-size%5D%5Blarge%5D=swedugate-thumb-300x225&params%5B0%5D%5Batts%5D%5Bthumb-size%5D%5Bno-image%5D=thumb-300x225.gif&params%5B0%5D%5Batts%5D%5Bthumb-size%5D%5Bsmall%5D=swedugate-thumb-300x225&params%5B0%5D%5Blayout_course%5D=style-grid&ZmfUNQ=63y[Jt&PmhpIuZ_cTnUxqg=7v@IahmJNMplbCu&cZWVDbSPzTXRe=n9oa2k5u4GHWm&eOBITfdGRuriQ=hBPN5nObe.ktH"
    
    payloadd = [
        ('action', 'swlabscore'),
        ('module[]', 'top.Top_Controller'),
        ('module[]', 'ajax_get_course_pagination'),
        ('params[0][page]', '1'),
        ('params[0][atts][layout]', 'course'),
        ('params[0][atts][offset_post]', '0'),
        ('params[0][atts][pagination]', 'yes'),
        ('params[0][atts][columns]', '2'),
        ('params[0][atts][rows]', '0'),
        ('params[0][atts][btn_content]', 'En savoir plus'),
        ('params[0][atts][uniq_id]', 'block-13759488265f916bca45c89'),
        ('params[0][atts][thumb-size][large]', 'swedugate-thumb-300x225'),
        ('params[0][atts][thumb-size][no-image]', 'thumb-300x225.gif'),
        ('params[0][atts][thumb-size][small]', 'swedugate-thumb-300x225'),
        ('params[0][layout_course]', 'style-grid'),
        ('ZmfUNQ', '63y[Jt'),
        ('PmhpIuZ_cTnUxqg', '7v@IahmJNMplbCu'),
        ('cZWVDbSPzTXRe', 'n9oa2k5u4GHWm'),
        ('eOBITfdGRuriQ', 'hBPN5nObe.ktH'),
    ]
    
    all_links = []
    for page in range(1, 10):
        # overwrite the page entry (index 3 in the payload) on each iteration
        payloadd[3] = ('params[0][page]', str(page))
        response = requests.post(
            "https://www.formatic-centre.fr/wp-admin/admin-ajax.php?",
            headers=headers,
            data=urlencode(payloadd)
        )
        print(f"Getting links from page {page}...")
        soup = BeautifulSoup(response.text, "html.parser").find_all("a", class_="btn btn-green")
        links = [i["href"] for i in soup]
        print('\n'.join(links))
        all_links.extend(links)
    
    
    with open("formatic-center_links.txt", "w") as f:
        f.writelines("\n".join(all_links) + "\n")
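
    One detail worth calling out: the payload is a list of tuples rather than a dict because the `module[]` key appears twice, and a dict would silently keep only one of the two entries. `urlencode` preserves repeated keys when given a sequence of pairs:

    ```python
    from urllib.parse import urlencode

    # A dict would collapse the two module[] entries into one;
    # a list of tuples keeps both, in order.
    pairs = [
        ("action", "swlabscore"),
        ("module[]", "top.Top_Controller"),
        ("module[]", "ajax_get_course_pagination"),
    ]
    print(urlencode(pairs))
    # action=swlabscore&module%5B%5D=top.Top_Controller&module%5B%5D=ajax_get_course_pagination
    ```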
    
    

    This produces a file containing all the links behind the EN SAVOIR PLUS buttons.

    https://www.formatic-centre.fr/formation/les-regles-juridiques-du-teletravail/
    https://www.formatic-centre.fr/formation/mieux-gerer-son-stress-en-periode-du-covid-19/
    https://www.formatic-centre.fr/formation/dynamiser-vos-equipes-special-post-confinement/
    https://www.formatic-centre.fr/formation/conduire-ses-entretiens-specifique-post-confinement/
    https://www.formatic-centre.fr/formation/cours-excel/
    https://www.formatic-centre.fr/formation/autocad-3d-2/
    https://www.formatic-centre.fr/formation/concevoir-et-developper-une-strategie-marketing/
    https://www.formatic-centre.fr/formation/preparer-soutenance/
    https://www.formatic-centre.fr/formation/mettre-en-place-une-campagne-adwords/
    https://www.formatic-centre.fr/formation/utiliser-google-analytics/
    and so on ...
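
    For what it's worth, the same POST can also be issued from Scrapy itself. Below is an untested sketch along the lines of the requests script above (the spider name and callback are made up); the key point is that browser headers go into `headers=`, never into `formdata`, and `formdata` accepts a list of tuples so the repeated `module[]` keys survive:

    ```python
    import scrapy
    from scrapy.http import FormRequest


    class CourseLinkSpider(scrapy.Spider):
        # Hypothetical spider; endpoint and payload copied from the requests script above.
        name = "course_links"

        def start_requests(self):
            for page in range(1, 10):
                # formdata also accepts a list of tuples, so the repeated
                # module[] keys survive the encoding.
                payload = [
                    ("action", "swlabscore"),
                    ("module[]", "top.Top_Controller"),
                    ("module[]", "ajax_get_course_pagination"),
                    ("params[0][page]", str(page)),
                    ("params[0][atts][layout]", "course"),
                    ("params[0][atts][pagination]", "yes"),
                    ("params[0][atts][columns]", "2"),
                    ("params[0][atts][rows]", "0"),
                    ("params[0][atts][btn_content]", "En savoir plus"),
                    ("params[0][atts][uniq_id]", "block-13759488265f916bca45c89"),
                    ("params[0][layout_course]", "style-grid"),
                ]
                yield FormRequest(
                    "https://www.formatic-centre.fr/wp-admin/admin-ajax.php",
                    formdata=payload,
                    # Request headers only; response headers, cookies and the like
                    # never belong in the form body.
                    headers={
                        "X-Requested-With": "XMLHttpRequest",
                        "Referer": "https://www.formatic-centre.fr/formation/",
                    },
                    callback=self.parse_page,
                )

        def parse_page(self, response):
            for href in response.css("a.btn.btn-green::attr(href)").getall():
                yield {"link": href}
    ```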
    

    Discussion:

    • Wow, what a great answer. I'll check it out and keep you posted. :) Thanks a lot
    • This works like a charm, well done! Just out of curiosity, wasn't it feasible with scrapy? And do you think this script could be reused for other sites, changing the headers and so on, assuming the site loads its content dynamically with AJAX?
    • To be honest I'm not that familiar with scrapy, but I'm sure it can do this too. If you found the answer useful, please accept and/or upvote it.
    • Upvoted and accepted :) Do you think this script could be reused for other sites, changing the headers and so on, assuming they load their content dynamically with AJAX?
    • Oh, I'm not sure. AJAX-driven sites are best handled case by case.