【Posted】: 2021-01-12 15:10:47
【Problem description】:
*** My code is just for practice!
I'm trying to scrape the name and team of every FPL player from the site https://www.premierleague.com/, but I've run into a problem with my code.
The problem is that it only ever fetches the page whose URL ends in "-1", even though I never even included that number in my list of pages!
There is no obvious logic to the page numbers: the base URL is https://www.premierleague.com/players?se=363&cl=, and the numbers after the final "=" seem arbitrary. So I built a list of those numbers and appended each one to the URL in a for loop (full script below).
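Before the full script, here is a minimal check (a sketch reusing the same base URL and the playerName selector from my code) that prints, for each page number, the status code, the final URL after any redirects, how many redirects happened, and how many player links the returned HTML actually contains:

import requests
from bs4 import BeautifulSoup

# Sketch: inspect what each listing request actually returns.
# r.history lists any redirect responses; r.url is the final URL served.
for page in ['1', '2', '131', '34']:
    r = requests.get('https://www.premierleague.com/players?se=363&cl=' + page)
    soup = BeautifulSoup(r.content, 'html.parser')
    links = soup.find_all('a', {'class': 'playerName'})
    print(page, r.status_code, r.url, len(r.history), len(links))

If every page number reports the same final URL or the same set of links, the server is returning the same markup regardless of the cl value.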
My code:
import requests
from bs4 import BeautifulSoup
import pandas

# 'headers' was used below but never defined in my snippet; a plain
# User-Agent header is what I intended (the value here is just an example).
headers = {'User-Agent': 'Mozilla/5.0'}

plplayers = []
pl_url = 'https://www.premierleague.com/players?se=363&cl='
pages_list = ['1', '2', '131', '34']

for page in pages_list:
    # Fetch one listing page and collect every player link on it.
    r = requests.get(pl_url + page)
    soup = BeautifulSoup(r.content, 'html.parser')
    player_names = soup.find_all('a', {'class': 'playerName'})

    for x in player_names:
        player_d = {}
        player_teams = []
        player_href = x.get('href')
        player_info_url = 'https://www.premierleague.com/' + player_href
        player_r = requests.get(player_info_url, headers=headers)
        player_soup = BeautifulSoup(player_r.content, 'html.parser')

        # Every <td class="team"> cell on the player page names a club.
        team_tag = player_soup.find_all('td', {'class': 'team'})
        for team in team_tag:
            try:
                team_name = team.find('span', {'class': 'long'}).text
                if '(Loan)' in team_name:
                    # str.replace returns a new string, so assign it back
                    team_name = team_name.replace(' (Loan) ', '')
                if team_name not in player_teams:
                    player_teams.append(team_name)
            except AttributeError:
                # the cell has no <span class="long"> child
                pass

        player_d['NAME'] = x.text
        player_d['TEAMS'] = player_teams
        plplayers.append(player_d)

df = pandas.DataFrame(plplayers)
df.to_csv('plplayers.txt')
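As an aside, the query string (?se=363&cl=<page>) could also be assembled by requests itself via its params argument instead of string concatenation; a minimal sketch:

import requests

# Sketch: let requests build the ?se=363&cl=<page> query string.
for page in ['1', '2', '131', '34']:
    r = requests.get('https://www.premierleague.com/players',
                     params={'se': '363', 'cl': page})
    print(r.url)  # the exact URL that was requested

Printing r.url makes it easy to confirm the loop really requests a different URL on every iteration.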
Tags: python web-scraping beautifulsoup