【Question Title】: unable to scrape website pages with unchanged url - python
【Posted】: 2020-11-10 16:02:36
【Question】:

I am trying to get the names of all the games on the site "https://slotcatalog.com/en/The-Best-Slots#anchorFltrList". To do so, I use the following code:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

url = "https://slotcatalog.com/en/The-Best-Slots#anchorFltrList"

page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

data = []
table = soup.find_all('div', attrs={'class':'providerCard'})

for game in table:
    print(game.find('a')['title'])

This gives me exactly what I want. I would like to do the same for all the pages available on the site, but since the URL does not change between pages, I looked at the network (XHR) events fired when clicking through to a different page and tried to replicate the request with the following code:

for page_no in range(1, 100):
    data = {
            "blck":"fltrGamesBlk",
            "ajax":"1",
            "lang":"end",
            "p":str(page_no),
            "translit":"The-Best-Slots",
            "tag":"TOP",
            "dt1":"",
            "dt2":"",
            "sorting":"SRANK",
            "cISO":"GB",
            "dt_period":"",
            "rtp_1":"50.00",
            "rtp_2":"100.00",
            "max_exp_1":"2.00",
            "max_exp_2":"250000.00",
            "min_bet_1":"0.01",
            "min_bet_2":"5.00",
            "max_bet_1":"3.00",
            "max_bet_2":"10000.00"
        }
    page = requests.post('https://slotcatalog.com/index.php',
                         data=data,
                         headers={'Host': 'slotcatalog.com',
                                  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0'})


    soup = BeautifulSoup(page.content, 'html.parser')
    for row in soup.find_all('div', attrs={'class':'providerCard'}):
        name = row.find('a')['title']
        print(name)
        

Result: "KeyError: 'title'" - meaning it does not find the class "providerCard". Is the request to the website made in the wrong way? If so, where in the code should I change it? Thanks in advance.

【Question Comments】:

Tags: python web-scraping beautifulsoup python-requests


【Solution 1】:

OK, so, you have a typo. XD It's this "lang":"end" in the payload - it should be "lang": "en" instead.

Anyhow, I've cleaned up your code a bit and it works as expected. If you want, you can keep looping over all the pages to collect every game (see the sketch after the sample output below).

import requests
from bs4 import BeautifulSoup

headers = {
    "referer": "https://slotcatalog.com/en/The-Best-Slots",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/50.0.2661.102 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}

payload = {
    "blck": "fltrGamesBlk",
    "ajax": "1",
    "lang": "en",
    "p": 1,
    "translit": "The-Best-Slots",
    "tag": "TOP",
    "dt1": "",
    "dt2": "",
    "sorting": "SRANK",
    "cISO": "EN",
    "dt_period": "",
    "rtp_1": "50.00",
    "rtp_2": "100.00",
    "max_exp_1": "2.00",
    "max_exp_2": "250000.00",
    "min_bet_1": "0.01",
    "min_bet_2": "5.00",
    "max_bet_1": "3.00",
    "max_bet_2": "10000.00"
}
page = requests.post(
    "https://slotcatalog.com/index.php",
    data=payload,
    headers=headers,
)
soup = BeautifulSoup(page.content, "html.parser")
print([i.get("title") for i in soup.find_all("a", {"class": "providerName"})])


Output (page 1 only):

['Starburst', 'Bonanza', 'Rainbow Riches', 'Book of Dead', "Fishin' Frenzy", 'Wolf Gold', 'Twin Spin', 'Slingo Rainbow Riches', "Gonzo's Quest", "Gonzo's Quest Megaways", 'Eye of Horus (Reel Time Gaming)', 'Age of the Gods God of Storms', 'Lightning Roulette', 'Buffalo Blitz', "Fishin' Frenzy Megaways", 'Fluffy Favourites', 'Blue Wizard', 'Legacy of Dead', '9 Pots of Gold', 'Buffalo Blitz II', 'Cleopatra (IGT)', 'Quantum Roulette', 'Reel King Mega', 'Mega Moolah', '7s Deluxe', "Rainbow Riches Pick'n'Mix", "Shaman's Dream"]
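To collect every page rather than just the first, you can keep incrementing "p" until a page comes back with no results. Here is a minimal sketch reusing the payload and headers dicts defined above; the empty-page stop condition and the one-second delay are my assumptions, not documented behaviour of the site:

import time

all_games = []
page_no = 1
while True:
    payload["p"] = page_no  # request the next page of results
    page = requests.post(
        "https://slotcatalog.com/index.php",
        data=payload,
        headers=headers,
    )
    soup = BeautifulSoup(page.content, "html.parser")
    titles = [a.get("title") for a in soup.find_all("a", {"class": "providerName"})]
    if not titles:  # assumption: an empty page means we are past the last one
        break
    all_games.extend(titles)
    page_no += 1
    time.sleep(1)  # small delay so we do not hammer the server

print(len(all_games))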

【Comments】:
