Scraping multiple pages with bs4 Beautiful Soup - only scrapes the first page

【Question title】: Scraping multiple pages with bs4 Beautiful Soup - only scrapes the first page
【Posted】: 2021-01-12 15:10:47
【Question】:

*** My code is for practice only!

I'm trying to scrape the name and teams of every player in FPL from their website, https://www.premierleague.com/, but I'm running into some problems with my code.

The problem is that it only gets the page with "-1" at the end of the URL, which I haven't even included in my list of pages!

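The pages follow no logic: the base URL is https://www.premierleague.com/players?se=363&cl= and the number after the "=" seems to be random. So I created a list of numbers and used a for loop to append each one to the URL: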

My code:

import requests
from bs4 import BeautifulSoup
import pandas

plplayers = []

pl_url = 'https://www.premierleague.com/players?se=363&cl='
pages_list = ['1', '2', '131', '34']
for page in pages_list:
    r = requests.get(pl_url + page)
    c = r.content
    soup = BeautifulSoup(c, 'html.parser')
    player_names = soup.find_all('a', {'class': 'playerName'})



    for x in player_names:
        player_d = {}
        player_teams = []
        player_href = x.get('href')
        player_info_url = 'https://www.premierleague.com/' + player_href
        player_r = requests.get(player_info_url, headers=headers)
        player_c = player_r.content
        player_soup = BeautifulSoup(player_c, 'html.parser')
        team_tag = player_soup.find_all('td', {'class': 'team'})
        for team in team_tag:
            try:
                team_name = team.find('span', {'class': 'long'}).text
                if '(Loan)' in team_name:
                    # str.replace returns a new string, so keep the result
                    team_name = team_name.replace('  (Loan) ', '')
                if team_name not in player_teams:
                    player_teams.append(team_name)
                player_d['NAME'] = x.text
                player_d['TEAMS'] = player_teams
            except:
                pass
        plplayers.append(player_d)


df = pandas.DataFrame(plplayers)
df.to_csv('plplayers.txt')

【Question comments】:

Tags: python web-scraping beautifulsoup

【Solution 1】:

I would have posted this as a comment, but I'm new here and don't have enough reputation, so I have to leave it as an answer.

It looks like the request stored in player_r passes a headers argument, but a headers variable is never actually created.

If you replace player_r = requests.get(player_info_url, headers=headers) with player_r = requests.get(player_info_url), your code should run perfectly. At least, it does on my machine.
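
If you would rather keep the headers argument, the other option is to define the variable before the loop. Below is a minimal sketch of that variant, not a tested scraper for this site. The names HEADERS, BASE_URL, clean_team_name, extract_teams and fetch_player_teams are made up for illustration, the User-Agent value is only an assumption (the question never shows what headers was meant to contain), and the 'team' / 'long' class names are taken from the question's code as-is.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = 'https://www.premierleague.com/'

# Assumption: a browser-like User-Agent. The question never shows what
# `headers` was supposed to hold, so this value is only a placeholder.
HEADERS = {'User-Agent': 'Mozilla/5.0'}


def clean_team_name(name):
    """Drop a '(Loan)' marker and surrounding whitespace from a team name."""
    # str.replace returns a new string, so the result has to be kept.
    return name.replace('(Loan)', '').strip()


def extract_teams(player_html):
    """Return the unique team names found in a player page, in page order."""
    soup = BeautifulSoup(player_html, 'html.parser')
    teams = []
    for cell in soup.find_all('td', {'class': 'team'}):
        span = cell.find('span', {'class': 'long'})
        if span is None:  # cell without a long team name: skip it
            continue
        name = clean_team_name(span.text)
        if name not in teams:
            teams.append(name)
    return teams


def fetch_player_teams(player_href):
    """Download one player page and return its team names."""
    # urljoin avoids a doubled slash if the href already starts with '/'.
    url = urljoin(BASE_URL, player_href)
    # HEADERS exists at this point, so no NameError is raised here.
    response = requests.get(url, headers=HEADERS, timeout=10)
    return extract_teams(response.content)
```

Inside the inner loop of the question's code, the body would then shrink to something like player_d = {'NAME': x.text, 'TEAMS': fetch_player_teams(x.get('href'))}. Keeping the parsing in extract_teams, separate from the download, also means it can be tried on a saved HTML snippet without hitting the site.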

【Comments】: