【Question title】: Scraping multiple pages with bs4 Beautiful Soup - only scrapes the first page
【Posted】: 2021-01-12 15:10:47
【Question】:

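*** My code is for practice only!

I am trying to scrape the name and teams of every player in the FPL from their website, https://www.premierleague.com/, but I have run into some problems with my code.

The problem is that it only fetches the page whose URL ends in "-1", and I have not even included that in my list of pages!

There is no logic to the page numbering - the base URL is https://www.premierleague.com/players?se=363&cl= and the number after the "=" seems to be random. So I created a list of numbers and used a for loop to append each one to the URL:

My code: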

import requests
from bs4 import BeautifulSoup
import pandas

plplayers = []

pl_url = 'https://www.premierleague.com/players?se=363&cl='
pages_list = ['1', '2', '131', '34']
for page in pages_list:
    r = requests.get(pl_url + page)
    c = r.content
    soup = BeautifulSoup(c, 'html.parser')
    player_names = soup.find_all('a', {'class': 'playerName'})



    for x in player_names:
        player_d = {}
        player_teams = []
        player_href = x.get('href')
        player_info_url = 'https://www.premierleague.com/' + player_href
        player_r = requests.get(player_info_url, headers=headers)
        player_c = player_r.content
        player_soup = BeautifulSoup(player_c, 'html.parser')
        team_tag = player_soup.find_all('td', {'class': 'team'})
        for team in team_tag:
            try:
                team_name = team.find('span', {'class': 'long'}).text
                if '(Loan)' in team_name:
                    team_name.replace('  (Loan) ', '')
                if team_name not in player_teams:
                    player_teams.append(team_name)
                player_d['NAME'] = x.text
                player_d['TEAMS'] = player_teams
            except:
                pass
        plplayers.append(player_d)


df = pandas.DataFrame(plplayers)
df.to_csv('plplayers.txt')

【Comments】:

    Tags: python web-scraping beautifulsoup


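    【Solution 1】:

    I would have posted this as a comment, but I am new and do not have enough reputation, so I have to leave it as an answer.

    It looks like when you make the request stored in player_r, you pass a headers argument but never actually create a headers variable.

    If you replace player_r = requests.get(player_info_url, headers=headers) with player_r = requests.get(player_info_url), your code should work perfectly. At least, it does on my machine.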
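To illustrate the point without touching the network, here is a minimal sketch. The function fetch is a hypothetical stand-in for requests.get and is not part of the original code. Python evaluates the arguments before making the call, so a name that was never assigned raises NameError before any request is sent.

```python
def fetch(url, headers=None):
    """Hypothetical stand-in for requests.get; it only echoes its arguments."""
    return url, headers


def call_with_undefined_headers():
    # `headers` is not defined anywhere, so this line raises NameError
    # while the arguments are being evaluated, before fetch() ever runs.
    return fetch('https://www.premierleague.com/players', headers=headers)


try:
    call_with_undefined_headers()
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__

print(error_name)
```

In the question's code the request sits outside the try block, so nothing there would catch this exception.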
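If you would rather keep the headers argument than drop it, one possible sketch of the per-player step is below. The names HEADERS, extract_teams and fetch_teams are hypothetical, and the User-Agent value is only an assumption about what the author intended to send. The sketch also assigns the result of str.replace, because strings are immutable and the call in the question discards its return value.

```python
import requests
from bs4 import BeautifulSoup

# Assumed value; the question's code never defines `headers`.
HEADERS = {'User-Agent': 'Mozilla/5.0'}


def extract_teams(html):
    """Return the unique team names found in a player page, in page order."""
    soup = BeautifulSoup(html, 'html.parser')
    teams = []
    for cell in soup.find_all('td', {'class': 'team'}):
        span = cell.find('span', {'class': 'long'})
        if span is None:
            # Skip cells without a long team name instead of a bare except.
            continue
        # str.replace returns a new string, so the result must be kept.
        name = span.text.replace('(Loan)', '').strip()
        if name not in teams:
            teams.append(name)
    return teams


def fetch_teams(player_url):
    """Download one player page and parse its team names."""
    response = requests.get(player_url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return extract_teams(response.text)
```

Keeping the parsing in its own function means it can be tried on a saved HTML string before any request is made.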

    【Comments】:
