【发布时间】:2015-04-11 10:03:12
【问题描述】:
我正在尝试从 EPSN 解析球员级别的 NBA 得分数据。以下是我尝试的初始部分:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime, date
request = requests.get('http://espn.go.com/nba/boxscore?gameId=400277722')
soup = BeautifulSoup(request.text,'html.parser')
table = soup.find_all('table')
BeautifulSoup 似乎给了我一个奇怪的结果。源代码中的最后一个“表”包含玩家数据,这就是我要提取的内容。在线查看源代码显示,该表在第 421 行关闭,这是两支球队的盒子得分之后。但是,如果我们查看“汤”,则会在迈阿密统计数据之前添加一条关闭表格的行。这发生在在线源代码的第 350 行。
解析器“html.parser”的输出是:
Game 1: Tuesday, October 30thCeltics107FinalHeat120Recap »Boxscore »
Game 2: Sunday, January 27thHeat98Final2OTCeltics100Recap »Boxscore »
Game 3: Monday, March 18thHeat105FinalCeltics103Recap »Boxscore »
Game 4: Friday, April 12thCeltics101FinalHeat109Recap »Boxscore »
1 2 3 4 T
BOS 25 29 22 31107MIA 31 31 31 27120
Boston Celtics
STARTERS
MIN
FGM-A
3PM-A
FTM-A
OREB
DREB
REB
AST
STL
BLK
TO
PF
+/-
PTS
Kevin Garnett, PF324-80-01-11111220254-49
Brandon Bass, PF286-110-03-4651110012-815
Paul Pierce, SF416-152-49-905552003-1723
Rajon Rondo, PG449-140-22-4077130044-1320
Courtney Lee, SG245-61-10-001110015-711
BENCH
MIN
FGM-A
3PM-A
FTM-A
OREB
DREB
REB
AST
STL
BLK
TO
PF
+/-
PTS
Jared Sullinger, PF81-20-00-001100001-32
Jeff Green, SF230-40-03-403301010-73
Jason Terry, SG252-70-34-400011033-108
Leandro Barbosa, SG166-83-31-201110001+416
Chris Wilcox, PFDNP COACH'S DECISION
Kris Joseph, SFDNP COACH'S DECISION
Jason Collins, CDNP COACH'S DECISION
Darko Milicic, CDNP COACH'S DECISIONTOTALS
FGM-A
3PM-A
FTM-A
OREB
如您所见,它在“OREB”中位居榜首,并且从未进入迈阿密热火队。使用'lxml'解析器的输出是:
Game 1: Tuesday, October 30thCeltics107FinalHeat120Recap »Boxscore »
Game 2: Sunday, January 27thHeat98Final2OTCeltics100Recap »Boxscore »
Game 3: Monday, March 18thHeat105FinalCeltics103Recap »Boxscore »
Game 4: Friday, April 12thCeltics101FinalHeat109Recap »Boxscore »
1 2 3 4T
BOS 25 29 22 31107MIA 31 31 31 27120
这根本不包括盒子分数。我正在使用的完整代码(由于 Daniel Rodriguez)看起来像:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime, date
games = pd.read_csv('games_13.csv').set_index('id')
BASE_URL = 'http://espn.go.com/nba/boxscore?gameId={0}'
request = requests.get(BASE_URL.format(games.index[0]))
table = BeautifulSoup(request.text,'html.parser').find('table', class_='mod-data')
heads = table.find_all('thead')
headers = heads[0].find_all('tr')[1].find_all('th')[1:]
headers = [th.text for th in headers]
columns = ['id', 'team', 'player'] + headers
players = pd.DataFrame(columns=columns)
def get_players(players, team_name):
array = np.zeros((len(players), len(headers)+1), dtype=object)
array[:] = np.nan
for i, player in enumerate(players):
cols = player.find_all('td')
array[i, 0] = cols[0].text.split(',')[0]
for j in range(1, len(headers) + 1):
if not cols[1].text.startswith('DNP'):
array[i, j] = cols[j].text
frame = pd.DataFrame(columns=columns)
for x in array:
line = np.concatenate(([index, team_name], x)).reshape(1,len(columns))
new = pd.DataFrame(line, columns=frame.columns)
frame = frame.append(new)
return frame
for index, row in games.iterrows():
print(index)
request = requests.get(BASE_URL.format(index))
table = BeautifulSoup(request.text, 'html.parser').find('table', class_='mod-data')
heads = table.find_all('thead')
bodies = table.find_all('tbody')
team_1 = heads[0].th.text
team_1_players = bodies[0].find_all('tr') + bodies[1].find_all('tr')
team_1_players = get_players(team_1_players, team_1)
players = players.append(team_1_players)
team_2 = heads[3].th.text
team_2_players = bodies[3].find_all('tr') + bodies[4].find_all('tr')
team_2_players = get_players(team_2_players, team_2)
players = players.append(team_2_players)
players = players.set_index('id')
print(players)
players.to_csv('players_13.csv')
我想要的输出示例是:
,id,team,player,MIN,FGM-A,3PM-A,FTM-A,OREB,DREB,REB,AST,STL,BLK,TO,PF,+/-,PTS
0,400277722,Boston Celtics,Brandon Bass,28,6-11,0-0,3-4,6,5,11,1,0,0,1,2,-8,15
0,400277722,Boston Celtics,Paul Pierce,41,6-15,2-4,9-9,0,5,5,5,2,0,0,3,-17,23
...
0,400277722,Miami Heat,Shane Battier,29,2-4,2-3,0-0,0,2,2,1,1,0,0,3,+12,6
0,400277722,Miami Heat,LeBron James,29,10-16,2-4,4-5,1,9,10,3,2,0,0,2,+12,26
【问题讨论】:
-
浏览所有的html非常不方便。你到底想刮什么?给出一些示例输出。
-
今天早上我意识到这很混乱而且没有帮助。对于那个很抱歉。我在上面发布了一些输出 - 第一个块几乎是我想要的,除了它在不包括表格的其余部分(即迈阿密热火)的情况下结束。这有帮助吗?
-
@DrDunkenstein,你应该尝试使用不同的解析器来处理 beautifulsoup
-
仅供参考 - 您可以从 nba.com 获取 json 格式的 boxscore 数据;不需要 HTML 抓取。
标签: python web-scraping beautifulsoup