Problems Parsing NBA Boxscore Data with BeautifulSoup

【Question title】: Problems Parsing NBA Boxscore Data with BeautifulSoup
【Posted】: 2015-04-11 10:03:12
【Question description】:

I am trying to parse player-level NBA box score data from ESPN. Here is the initial part of my attempt:

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime, date

request = requests.get('http://espn.go.com/nba/boxscore?gameId=400277722')
soup = BeautifulSoup(request.text,'html.parser')
table = soup.find_all('table')

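BeautifulSoup seems to give me a strange result. The last "table" in the page source contains the player data, which is what I want to extract. Viewing the source online shows that this table closes at line 421, after both teams' box scores. In the "soup", however, a line closing the table is added before the Miami stats, at what is line 350 of the online source.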

The output with the 'html.parser' parser is:

Game 1: Tuesday, October 30thCeltics107FinalHeat120Recap »Boxscore »
Game 2: Sunday, January 27thHeat98Final2OTCeltics100Recap »Boxscore »
Game 3: Monday, March 18thHeat105FinalCeltics103Recap »Boxscore »
Game 4: Friday, April 12thCeltics101FinalHeat109Recap »Boxscore »

1 2 3 4 T

BOS 25 29 22 31107MIA 31 31 31 27120

Boston Celtics
STARTERS    
MIN
FGM-A
3PM-A
FTM-A
OREB
DREB
REB
AST
STL
BLK
TO
PF
+/-
PTS

Kevin Garnett, PF324-80-01-11111220254-49
Brandon Bass, PF286-110-03-4651110012-815
Paul Pierce, SF416-152-49-905552003-1723
Rajon Rondo, PG449-140-22-4077130044-1320
Courtney Lee, SG245-61-10-001110015-711
BENCH
MIN
FGM-A
3PM-A
FTM-A
OREB
DREB
REB
AST
STL
BLK
TO
PF
+/-
PTS

Jared Sullinger, PF81-20-00-001100001-32
Jeff Green, SF230-40-03-403301010-73
Jason Terry, SG252-70-34-400011033-108
Leandro Barbosa, SG166-83-31-201110001+416
Chris Wilcox, PFDNP COACH'S DECISION
Kris Joseph, SFDNP COACH'S DECISION
Jason Collins, CDNP COACH'S DECISION
Darko Milicic, CDNP COACH'S DECISIONTOTALS
FGM-A
3PM-A  
FTM-A
OREB

As you can see, it stops at "OREB" and never reaches the Miami Heat. The output with the 'lxml' parser is:

Game 1: Tuesday, October 30thCeltics107FinalHeat120Recap »Boxscore »
Game 2: Sunday, January 27thHeat98Final2OTCeltics100Recap »Boxscore »
Game 3: Monday, March 18thHeat105FinalCeltics103Recap »Boxscore »
Game 4: Friday, April 12thCeltics101FinalHeat109Recap »Boxscore »

1 2 3 4T

BOS 25 29 22 31107MIA 31 31 31 27120

This does not include the box score at all. The full code I am using (thanks to Daniel Rodriguez) looks like this:

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime, date

games = pd.read_csv('games_13.csv').set_index('id')
BASE_URL = 'http://espn.go.com/nba/boxscore?gameId={0}'

request = requests.get(BASE_URL.format(games.index[0]))
table = BeautifulSoup(request.text,'html.parser').find('table', class_='mod-data')
heads = table.find_all('thead')
headers = heads[0].find_all('tr')[1].find_all('th')[1:]
headers = [th.text for th in headers]
columns = ['id', 'team', 'player'] + headers

players = pd.DataFrame(columns=columns)

def get_players(players, team_name):
    array = np.zeros((len(players), len(headers)+1), dtype=object)
    array[:] = np.nan
    for i, player in enumerate(players):
        cols = player.find_all('td')
        array[i, 0] = cols[0].text.split(',')[0]
        for j in range(1, len(headers) + 1):
            if not cols[1].text.startswith('DNP'):
                array[i, j] = cols[j].text

    frame = pd.DataFrame(columns=columns)
    for x in array:
        line = np.concatenate(([index, team_name], x)).reshape(1,len(columns))
        new = pd.DataFrame(line, columns=frame.columns)
        frame = frame.append(new)
    return frame

for index, row in games.iterrows():
    print(index)
    request = requests.get(BASE_URL.format(index))
    table = BeautifulSoup(request.text, 'html.parser').find('table', class_='mod-data')
    heads = table.find_all('thead')
    bodies = table.find_all('tbody')

    team_1 = heads[0].th.text
    team_1_players = bodies[0].find_all('tr') + bodies[1].find_all('tr')
    team_1_players = get_players(team_1_players, team_1)
    players = players.append(team_1_players)

    team_2 = heads[3].th.text
    team_2_players = bodies[3].find_all('tr') + bodies[4].find_all('tr')
    team_2_players = get_players(team_2_players, team_2)
    players = players.append(team_2_players)

players = players.set_index('id')
print(players)
players.to_csv('players_13.csv')

An example of the output I want:

,id,team,player,MIN,FGM-A,3PM-A,FTM-A,OREB,DREB,REB,AST,STL,BLK,TO,PF,+/-,PTS
0,400277722,Boston Celtics,Brandon Bass,28,6-11,0-0,3-4,6,5,11,1,0,0,1,2,-8,15
0,400277722,Boston Celtics,Paul Pierce,41,6-15,2-4,9-9,0,5,5,5,2,0,0,3,-17,23
...
0,400277722,Miami Heat,Shane Battier,29,2-4,2-3,0-0,0,2,2,1,1,0,0,3,+12,6
0,400277722,Miami Heat,LeBron James,29,10-16,2-4,4-5,1,9,10,3,2,0,0,2,+12,26

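【Question discussion】:

Going through all of that HTML is very inconvenient. What exactly are you trying to scrape? Please give some sample output.
I realized this morning that this was confusing and unhelpful. Sorry about that. I have posted some output above - the first block is almost what I want, except that it ends without the rest of the table (i.e. the Miami Heat). Does that help?
@DrDunkenstein, you should try a different parser with BeautifulSoup.
FYI - you can get box score data in JSON format from nba.com; no HTML scraping is needed.

Tags: python web-scraping beautifulsoup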

【Solution 1】:

The code returns the correct data with the default parser, which will probably be lxml if you have it installed:

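import requests
from bs4 import BeautifulSoup

# No parser is named, so bs4 picks the best one installed (lxml if available)
req = requests.get('http://espn.go.com/nba/boxscore?gameId=400277722')
soup = BeautifulSoup(req.content)
table = soup.find_all('table')
print(table)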

....................
<td nowrap="" style="text-align:left"><a href="http://espn.go.com/nba/player/_/id/2009/james-jones">James Jones</a>, SF</td><td colspan="14" style="text-align:center">DNP COACH'S DECISION</td></tr><tr align="right" class="odd player-46-6490" valign="middle">
<td nowrap="" style="text-align:left"><a href="http://espn.go.com/nba/player/_/id/6490/terrel-harris">Terrel Harris</a>, SG</td><td colspan="14" style="text-align:center">DNP COACH'S DECISION</td></tr></tbody><thead><tr align="right"><th style="text-align:left;">TOTALS</th><th></th>
<th nowrap="">FGM-A</th>
<th>3PM-A</th>
<th>FTM-A</th>
<th>OREB
</th><th>DREB</th>
<th>REB</th>
<th>AST</th>
<th>STL</th>
<th>BLK</th>
<th>TO</th>
<th>PF</th>
<th> </th>
<th>PTS</th>
</tr></thead><tbody><tr align="right" class="even"><td colspan="2" style="text-align:left"></td><td><strong>43-79</strong></td><td><strong>8-16</strong></td><td><strong>26-32</strong></td><td><strong>5</strong></td><td><strong>31</strong></td><td><strong>36</strong></td><td><strong>25</strong></td><td><strong>8</strong></td><td><strong>5</strong></td><td><strong>8</strong></td><td><strong>20</strong></td><td> </td><td><strong>120</strong></td></tr><tr align="right" class="odd"><td colspan="2" style="text-align:left"><strong></strong></td><td><strong>54.4%</strong></td><td><strong>50.0%</strong></td><td><strong>81.3%</strong></td><td colspan="13"></td></tr><tr bgcolor="#ffffff"><td align="right" colspan="15" style="padding:10px;"><div style="float: right;"><strong>Fast break points:</strong>   12<br/><strong>Points in the paint:</strong>   46<br/><strong>Total Team Turnovers (Points off turnovers):</strong>   8 (6)</div><div style="float: left;">+/- denotes team's net points while the player is on the court.</div></td></tr></tbody></table>]

Using "html.parser" gives the same truncated output as in your question, but as you can see above, it works fine when no parser is specified.
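
One way to see which parser truncates the table is to count the rows each installed parser recovers from the same markup. The sketch below is only an illustration and not part of the original answer: `rows_per_parser` and the sample HTML are hypothetical, and any parser that is not installed is skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def rows_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number of <tr> elements found} for installed parsers."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # This parser library is not installed; skip it.
            continue
        counts[name] = len(soup.find_all("tr"))
    return counts

# Hypothetical, well-formed sample. On the real page the counts may differ
# between parsers, which is the symptom described in the question.
sample = (
    "<table><tbody>"
    "<tr><td>Kevin Garnett, PF</td><td>32</td></tr>"
    "<tr><td>LeBron James, SF</td><td>29</td></tr>"
    "</tbody></table>"
)
counts = rows_per_parser(sample)
print(counts)
```

Run against the real page text, a parser whose count is much lower than the others is probably the one closing the table early.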

This was with bs4 '4.3.2' on both Python 2.7 and 3.4; my lxml version is 3.3.3.0.
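
When comparing results across machines, it can help to print the same version details. This is a small sketch under the assumption that bs4 is installed; `parser_versions` is a hypothetical helper name, and lxml is reported as `None` when it is missing.

```python
import sys
import bs4

def parser_versions():
    """Collect the Python, bs4 and (if installed) lxml versions as strings."""
    versions = {
        "python": "%d.%d" % sys.version_info[:2],
        "bs4": bs4.__version__,
    }
    try:
        from lxml import etree
        # LXML_VERSION is a tuple of integers such as (3, 3, 3, 0).
        versions["lxml"] = ".".join(str(n) for n in etree.LXML_VERSION)
    except ImportError:
        versions["lxml"] = None  # lxml is not installed
    return versions

versions = parser_versions()
print(versions)
```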

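If you do not have the latest bs4 yet, you should update. You can also use the diagnose method, which prints a report showing how the different parsers handle the document and tells you whether you are missing a parser that Beautiful Soup could be using.

So, to get the report for your HTML: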

from bs4.diagnose import diagnose
diagnose(request.text)

Parsing HTML with regular expressions is well documented as not being a good approach: a trivial change to the HTML can break the regex.
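
Once a parser returns the whole table, the player rows can be read from the parsed tree instead of with regular expressions. The following is a minimal sketch, not the original answerer's code: the `SAMPLE` markup is a hypothetical fragment modelled on the rows printed above (with only two stat columns), and `player_rows` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the rows printed above (two stat columns only).
SAMPLE = """
<table class="mod-data">
<tbody>
<tr><td nowrap=""><a href="#">LeBron James</a>, SF</td><td>29</td><td>26</td></tr>
<tr><td nowrap=""><a href="#">James Jones</a>, SF</td>
    <td colspan="14">DNP COACH'S DECISION</td></tr>
</tbody>
</table>
"""

def player_rows(html, parser="html.parser"):
    """Return one list per player row: [name, stat, stat, ...]."""
    soup = BeautifulSoup(html, parser)
    rows = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        # Skip rows without a linked player name (totals, footnotes).
        if not cells or cells[0].a is None:
            continue
        name = cells[0].a.get_text(strip=True)
        stats = [td.get_text(strip=True) for td in cells[1:]]
        rows.append([name] + stats)
    return rows

rows = player_rows(SAMPLE)
for row in rows:
    print(row)
```

The `parser` argument defaults to "html.parser" only so the sketch runs without extra packages; for the real page it should be whichever parser returns the complete table. A DNP row simply yields a single text cell after the name.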

【Discussion】:

【Solution 2】:

BeautifulSoup truncated part of the results for me as well, so I replaced the soup.find_all calls with re.findall:

import re
from bs4 import BeautifulSoup

# `br` is not defined in the original answer; it is assumed here to be a
# mechanize.Browser() instance. The answer targets Python 2, where r.read()
# returns a str.
r = br.open('http://espn.go.com/nba/boxscore?gameId=400277722')
html = r.read()
soup = BeautifulSoup(html)

statnames = re.search('STARTERS</th>.*?PTS</th>', html, re.DOTALL).group()
th = re.findall('th.*</th', statnames)  # each th tag contains a stat name
names = ['Name', 'Team']
for t in th:
    t = re.sub('.*>', '', t)
    t = t.replace('</th', '')
    names.append(t)
print(names)

celts = re.search('Boston Celtics.*?Total Team Turnovers', html, re.DOTALL).group()
heat = re.search('nba-small-mia floatleft.*?Total Team Turnovers', html, re.DOTALL).group()

players = str(soup).split('td nowrap')
for player in players[1:len(players)]:
    try:
        stats = [re.search(r'[A-Z]?[a-z]?[A-Z][a-z]{1,} [A-Z][a-z]{1,}', player).group()]  # player name
    except AttributeError:  # no match: fall back to names with initials
        stats = [re.search(r'[A-Z]\.?[A-Z]?\.? [A-Z][a-z]{1,}', player).group()]
    # assign the team for every player, not only in the fallback case
    if stats[0] in celts:
        stats.append('Boston Celtics')
    elif stats[0] in heat:
        stats.append('Miami Heat')
    td = re.findall('td.*?/td', player)  # each td tag contains a stat
    for t in td:
        t = re.findall('>.*<', t)
        t = re.sub('.*>', '', t[0])
        t = t.replace('<', '')
        if t != '' and t != '\xc2\xa0':  # skip empty and non-breaking-space cells
            stats.append(t)
    print(stats)

Output:

['Name', 'Team', 'MIN', 'FGM-A', '3PM-A', 'FTM-A', 'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PF', '+/-', 'PTS']
['Kevin Garnett', 'Boston Celtics', '32', '4-8', '0-0', '1-1', '1', '11', '12', '2', '0', '2', '5', '4', '-4', '9']
['Brandon Bass', 'Boston Celtics', '28', '6-11', '0-0', '3-4', '6', '5', '11', '1', '0', '0', '1', '2', '-8', '15']
['Paul Pierce', 'Boston Celtics', '41', '6-15', '2-4', '9-9', '0', '5', '5', '5', '2', '0', '0', '3', '-17', '23']
['Rajon Rondo', 'Boston Celtics', '44', '9-14', '0-2', '2-4', '0', '7', '7', '13', '0', '0', '4', '4', '-13', '20']
['Courtney Lee', 'Boston Celtics', '24', '5-6', '1-1', '0-0', '0', '1', '1', '1', '0', '0', '1', '5', '-7', '11']
['Jared Sullinger', 'Boston Celtics', '8', '1-2', '0-0', '0-0', '0', '1', '1', '0', '0', '0', '0', '1', '-3', '2']
['Jeff Green', 'Boston Celtics', '23', '0-4', '0-0', '3-4', '0', '3', '3', '0', '1', '0', '1', '0', '-7', '3']
['Jason Terry', 'Boston Celtics', '25', '2-7', '0-3', '4-4', '0', '0', '0', '1', '1', '0', '3', '3', '-10', '8']
['Leandro Barbosa', 'Boston Celtics', '16', '6-8', '3-3', '1-2', '0', '1', '1', '1', '0', '0', '0', '1', '+4', '16']
['Chris Wilcox', 'Boston Celtics', "DNP COACH'S DECISION"]
['Kris Joseph', 'Boston Celtics', "DNP COACH'S DECISION"]
['Jason Collins', 'Boston Celtics', "DNP COACH'S DECISION"]
['Darko Milicic', 'Boston Celtics', "DNP COACH'S DECISION"]
['Shane Battier', 'Miami Heat', '29', '2-4', '2-3', '0-0', '0', '2', '2', '1', '1', '0', '0', '3', '+12', '6']
['LeBron James', 'Miami Heat', '29', '10-16', '2-4', '4-5', '1', '9', '10', '3', '2', '0', '0', '2', '+12', '26']
['Chris Bosh', 'Miami Heat', '37', '8-15', '0-1', '3-4', '2', '8', '10', '1', '0', '3', '1', '3', '+15', '19']
['Mario Chalmers', 'Miami Heat', '36', '3-7', '0-1', '2-2', '0', '1', '1', '11', '3', '0', '1', '3', '+11', '8']
['Dwyane Wade', 'Miami Heat', '35', '10-22', '0-0', '9-11', '2', '1', '3', '4', '2', '1', '4', '3', '-6', '29']
['Udonis Haslem', 'Miami Heat', '11', '0-1', '0-0', '0-0', '0', '3', '3', '0', '0', '0', '1', '1', '-2', '0']
['Rashard Lewis', 'Miami Heat', '19', '4-5', '1-2', '1-2', '0', '5', '5', '1', '0', '1', '0', '1', '+1', '10']
['Norris Cole', 'Miami Heat', '6', '1-2', '1-2', '0-0', '0', '0', '0', '1', '0', '0', '1', '2', '+5', '3']
['Ray Allen', 'Miami Heat', '31', '5-7', '2-3', '7-8', '0', '2', '2', '2', '0', '0', '0', '1', '+9', '19']
['Mike Miller', 'Miami Heat', '7', '0-0', '0-0', '0-0', '0', '0', '0', '1', '0', '0', '0', '1', '+8', '0']
['Josh Harrellson', 'Miami Heat', "DNP COACH'S DECISION"]
['James Jones', 'Miami Heat', "DNP COACH'S DECISION"]
['Terrel Harris', 'Miami Heat', "DNP COACH'S DECISION"]

To catch D.J. Augustin, the simplest (though not the most concise) code is:

try:
    stats = [re.search(r'[A-Z]?[a-z]?[A-Z][a-z]{1,} [A-Z][a-z]{1,}', player).group()]
except AttributeError:  # re.search returned None, so try the initials pattern
    stats = [re.search(r'[A-Z]\.?[A-Z]?\.? [A-Z][a-z]{1,}', player).group()]
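
Because each player cell wraps the name in a link (as in the `<td nowrap="" ...><a href=...>James Jones</a>, SF</td>` markup printed in Solution 1), another option is to capture the link text itself rather than guess the shape of a name. This is a sketch of that alternative and not the answerer's code; `player_name`, `NAME_LINK`, and the second and third sample strings are hypothetical.

```python
import re

# Capture whatever text sits between <a ...> and </a>.
NAME_LINK = re.compile(r'<a\s[^>]*>([^<]+)</a>')

def player_name(chunk):
    """Return the text of the first link in an HTML fragment, or None."""
    match = NAME_LINK.search(chunk)
    return match.group(1).strip() if match else None

samples = [
    '<td nowrap=""><a href="http://espn.go.com/nba/player/_/id/2009/james-jones">James Jones</a>, SF</td>',
    '<td nowrap=""><a href="#">D.J. Augustin</a>, PG</td>',  # hypothetical link target
    '<td colspan="14">DNP COACH\'S DECISION</td>',  # no link, so no name
]
names = [player_name(s) for s in samples]
print(names)
```

Names with initials, apostrophes, or mixed capitalisation need no special pattern here, since nothing is assumed about how a name is spelled.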

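【Discussion】:

Thanks! This works except for one player, Darko Milicic. I have posted the resulting output in the question above.
Fixed for Darko. Terrel Harris was also missing earlier, when the slice ended at len(players)-1 rather than len(players).
Just added [A-Z]?[a-z]? to the stats = ... line for the player name; before that, LeBron came out as just Bron.
Thanks a lot - however, when I try to loop through all the games I sometimes get errors (sorry for the ongoing questions). For example, "espn.go.com/nba/boxscore?gameId=400277724" raises an error because of D.J. Augustin.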

【Solution 3】:

Try using a different parser (lxml):

soup = BeautifulSoup(request.text, 'lxml')
tables = soup.find_all('table')

for t in tables:
    print(t.text)

It should detect the page structure better.
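
Since no single parser is guaranteed to handle a given page, one option is to check the parsed text for content that must be present before trusting the result. The sketch below assumes that both team names should appear in a complete parse; `first_complete_parse` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def first_complete_parse(html, required, parsers=("lxml", "html5lib", "html.parser")):
    """Return (parser_name, soup) for the first parser whose text has every required string."""
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            continue  # parser not installed
        text = soup.get_text()
        if all(item in text for item in required):
            return name, soup
    return None, None  # no installed parser produced all required strings

# Hypothetical sample standing in for the real page.
sample = "<table><tr><th>Boston Celtics</th></tr><tr><th>Miami Heat</th></tr></table>"
parser, soup = first_complete_parse(sample, ["Boston Celtics", "Miami Heat"])
print(parser)
```

A `(None, None)` result signals that every installed parser dropped part of the page, which is easier to act on than a silently truncated table.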

【Discussion】:

That does not seem to work either... I still do not get the player box scores in the parse.