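【Question Title】: How to Parse the MLB Team and Player data using Pandas DataFrame?
【Posted】: 2020-12-07 03:02:18
【Question】:

I'm still learning and could use some help. I want to parse the starting pitchers and their respective teams.

I want the data in a Pandas DataFrame, but I don't know how to parse it correctly. Any suggestions would be very helpful. Thank you for your time!

Here is an example of the desired output: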

Game   Team     Name

       OAK     Chris Bassitt
1
       ARI     Zac Gallen


       SEA     Justin Dunn
2
       LAD     Ross Stripling

Here is my code:

#url = https://www.baseball-reference.com/previews/index.shtml

#Data needed: 1) Team  2) Pitcher Name

import pandas as pd

url = 'https://www.baseball-reference.com/previews/index.shtml'

test = pd.read_html(url)

for t in test:
    name = t[1]
    team = t[0]
   
    print(team)
   
    print(name)

I think I need to create a Pandas DataFrame and append the team and name to it, but I'm not sure how to parse out only the desired output.

【Comments】:

    Tags: python pandas dataframe parsing web-scraping


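    【Solution 1】:
    • pandas.read_html returns a list of DataFrames, one for each table found at the given URL
    • The DataFrames in that list can be selected with ordinary list slicing and indexing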
    import pandas as pd
    
    url = 'https://www.baseball-reference.com/previews/index.shtml'
    
    list_of_dataframes = pd.read_html(url)
    
    # select and combine the dataframes for games; every other dataframe from 0 (even)
    games = pd.concat(list_of_dataframes[0::2])
    
    # display(games.head())
                     0   1        2
    0      Cubs (13-6) NaN  Preview
    1  Cardinals (4-4) NaN  12:00AM
    0  Cardinals (4-4) NaN  Preview
    1      Cubs (13-6) NaN   5:15PM
    0   Red Sox (6-16) NaN  Preview
    
    # select the players from list_of_dataframes; every other dataframe from 1 (odd)
    players = list_of_dataframes[1::2]
    
    # add the Game to the dataframes
    for i, df in enumerate(players, 1):
        df['Game'] = i
        players[i-1] = df
    
    # combine all the dataframes into one
    players = pd.concat(players).reset_index(drop=True)
    
    # create a name column holding only the pitcher's name (the text before the first '(')
    players['name'] = players[1].str.split('(', expand=True)[0].str.strip()
    
    # rename the team column
    players.rename(columns={0: 'Team'}, inplace=True)
    
    # drop the original column 1, which is no longer needed
    players.drop(columns=[1], inplace=True)
    
    # display(players.head(6))
      Team  Game               name
    0  CHC     1       Tyson Miller
    1  STL     1         Alex Reyes
    2  STL     2     Kwang Hyun Kim
    3  CHC     2     Kyle Hendricks
    4  BOS     3       Martin Perez
    5  NYY     3  Jordan Montgomery
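
To get closer to the layout asked for in the question (rows grouped by Game, with Team and Name columns), one option is to let pd.concat label each table with its game number through keys=. The following is only a sketch: the two small frames are hand-built stand-ins for the odd-indexed tables from pd.read_html, the "(0-0, 0.00)" suffix is a made-up placeholder for whatever the live page prints after the name, and the level name "row" is an arbitrary choice.

```python
import pandas as pd

# Hypothetical stand-ins for list_of_dataframes[1::2]:
# column 0 is the team code, column 1 is the pitcher followed by a parenthesised record.
tables = [
    pd.DataFrame({0: ["OAK", "ARI"],
                  1: ["Chris Bassitt (0-0, 0.00)", "Zac Gallen (0-0, 0.00)"]}),
    pd.DataFrame({0: ["SEA", "LAD"],
                  1: ["Justin Dunn (0-0, 0.00)", "Ross Stripling (0-0, 0.00)"]}),
]

# keys= tags every row with the game number of the table it came from (1, 2, ...)
game_numbers = list(range(1, len(tables) + 1))
starters = pd.concat(tables, keys=game_numbers, names=["Game", "row"])
starters = starters.rename(columns={0: "Team", 1: "Name"})

# keep only the text before the first "(" and trim the trailing space
starters["Name"] = starters["Name"].str.split("(").str[0].str.strip()

# drop the inner row level so the frame is indexed by Game alone
starters = starters.droplevel("row")
print(starters)
```

Compared with the enumerate loop above, this avoids mutating each table in place, at the cost of a short detour through a two-level index.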
    

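    【Comments】:

    • Awesome! Thank you so much for the insight! Very informative... I will learn a lot from this.
    • I really appreciate your response, Mr. @TrentonMcKinney. Excellent work! =)
    【Solution 2】: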

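    Love those sports-reference.com sites. Trenton's solution is perfect, so don't change the accepted answer; I just wanted to offer this alternative data source for probable pitchers in case you're interested.

    It looks like mlb.com has a publicly available API for pulling this information (I assume this may be where Baseball Reference gets the data for its probable pitchers page). What I like about it is that you get a lot more data back to analyze, and you can request a wider date range to get historical data, and possibly look 2 or 3 days ahead as well. So take a look at this code too, play with it, practice with it.

    It might also get you started on your first go at machine learning.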
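
As a rough illustration of the wider date range mentioned above, the start and end dates can be computed from an anchor date with timedelta. This is a sketch only: schedule_window is a hypothetical helper that is not part of any library, the parameter names sportId, startDate and endDate are taken from the payload used in this answer's code, and no request is sent here.

```python
from datetime import date, timedelta


def schedule_window(anchor, days_back=0, days_ahead=0):
    """Build the date part of the query parameters (hypothetical helper).

    Dates are formatted as YYYY-MM-DD, matching the '%Y-%m-%d'
    format used in this answer's code.
    """
    start = anchor - timedelta(days=days_back)
    end = anchor + timedelta(days=days_ahead)
    return {
        "sportId": "1",
        "startDate": start.isoformat(),
        "endDate": end.isoformat(),
    }


# A fixed anchor date keeps the example reproducible; date.today() would be the usual choice.
params = schedule_window(date(2020, 8, 19), days_back=7, days_ahead=3)
print(params)
# {'sportId': '1', 'startDate': '2020-08-12', 'endDate': '2020-08-22'}
```

The resulting dictionary could be merged with the hydrate entry before being passed as params to requests.get.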

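    PS: If you even bother to look into that data and figure out what strikeZoneBottom and strikeZoneTop mean, let me know. I haven't been able to work out what they represent.

    I also wonder whether there is any data on ballparks. The pitcher stats include things like a fly ball to ground ball ratio. If there were ballpark data, then, for example, a fly-ball pitcher in a park that gives up a lot of home runs might fare differently in a park where fly balls don't carry as far, or where the fences are deeper (basically home runs turn into warning-track fly outs, and vice versa)?

    Code: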

    import requests
    import pandas as pd
    from datetime import datetime, timedelta
    
    url = 'https://statsapi.mlb.com/api/v1/schedule'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
    
    yesterday = datetime.strftime(datetime.now() - timedelta(1), '%Y-%m-%d')
    today = datetime.strftime(datetime.now(), '%Y-%m-%d')
    tomorrow = datetime.strftime(datetime.now() + timedelta(1), '%Y-%m-%d') 
    
    #To get 7 days earlier; notice the minus sign
    #pastDate = datetime.strftime(datetime.now() - timedelta(7), '%Y-%m-%d')
    
    #To get 3 days later; notice the plus sign
    #futureDate = datetime.strftime(datetime.now() + timedelta(3), '%Y-%m-%d')
    
    #hydrate parameter is to get back certain data elements. Not sure how to alter it exactly yet, would have to play around
    #But without hydrate, it doesn't return probable pitchers
    payload = {
    'sportId': '1',
    'startDate': today, #<-- Change this to widen the range of games (e.g. an earlier date to pull historical stats for machine learning)
    'endDate': today, #<-- Change this to widen the range of games (e.g. a later date to see probable pitchers for the next few days; adjust timedelta above)
    'hydrate': 'team(leaders(showOnPreview(leaderCategories=[homeRuns,runsBattedIn,battingAverage],statGroup=[pitching,hitting]))),linescore(matchup,runners),flags,liveLookin,review,broadcasts(all),venue(location),decisions,person,probablePitcher,stats,homeRuns,previousPlay,game(content(media(featured,epg),summary),tickets),seriesStatus(useOverride=true)'}
    
    jsonData = requests.get(url, headers=headers, params=payload).json()
    dates = jsonData['dates']
    
    rows = []
    for date in dates:
        games = date['games']
        for game in games:
            dayNight = game['dayNight']
            gameDate = game['gameDate']
            city = game['venue']['location']['city']
            venue = game['venue']['name']
            teams = game['teams']
            for k, v in teams.items():
                row = {}
                
                row.update({'dayNight':dayNight, 
                        'gameDate':gameDate, 
                        'city':city, 
                        'venue':venue})
                
                homeAway = k
                teamName = v['team']['name']
                
                if 'probablePitcher' not in v.keys():
                    row.update({'homeAway':homeAway,
                               'teamName':teamName})
                    rows.append(row)
                    
                else:
                    probablePitcher = v['probablePitcher']
                    fullName = probablePitcher['fullName']
                    pitchHand = probablePitcher['pitchHand']['code']
                    strikeZoneBottom = probablePitcher['strikeZoneBottom']
                    strikeZoneTop = probablePitcher['strikeZoneTop']
                    
                    row.update({'homeAway':homeAway,
                               'teamName':teamName, 
                               'probablePitcher':fullName,
                               'pitchHand':pitchHand,
                               'strikeZoneBottom':strikeZoneBottom,
                               'strikeZoneTop':strikeZoneTop})
                    
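                    # 'stats' may be missing for some pitchers, so default to an empty list
                    stats = probablePitcher.get('stats', [])
                    for stat in stats:
                        if stat['type']['displayName'] == 'statsSingleSeason' and stat['group']['displayName'] == 'pitching':
                            row.update(stat['stats'])
                            break
                    
                    # append exactly once, even when no season pitching stats were found
                    rows.append(row)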
                        
    df = pd.DataFrame(rows)            
    

    Output: first 10 rows

    print (df.head(10).to_string())
       airOuts  atBats  balks  baseOnBalls  blownSaves  catchersInterference  caughtStealing         city  completeGames dayNight  doubles  earnedRuns    era              gameDate  gamesFinished  gamesPitched  gamesPlayed  gamesStarted  groundOuts groundOutsToAirouts  hitBatsmen  hitByPitch  hits hitsPer9Inn  holds homeAway  homeRuns homeRunsPer9  inheritedRunners  inheritedRunnersScored inningsPitched  intentionalWalks  losses   obp  outs  pickoffs pitchHand probablePitcher  rbi  runs runsScoredPer9  sacBunts  sacFlies  saveOpportunities  saves  shutouts stolenBasePercentage  stolenBases  strikeOuts  strikeZoneBottom  strikeZoneTop strikeoutWalkRatio strikeoutsPer9Inn               teamName  triples                        venue walksPer9Inn  whip  wildPitches winPercentage  wins
    0     15.0    44.0    0.0          9.0         0.0                   0.0             0.0    Baltimore            0.0      day      2.0         8.0   6.00  2020-08-19T17:05:00Z            0.0           3.0          3.0           3.0         9.0                0.60         0.0         0.0  10.0        7.50    0.0     away       3.0         2.25               0.0                     0.0           12.0               0.0     1.0  .358  36.0       0.0         R    Tanner Roark  0.0   8.0           6.00       0.0       0.0                0.0    0.0       0.0                1.000          1.0        10.0             1.589          3.467               1.11              7.50      Toronto Blue Jays      0.0  Oriole Park at Camden Yards         6.75  1.58          0.0          .500   1.0
    1     18.0    74.0    0.0          3.0         0.0                   0.0             0.0    Baltimore            0.0      day      5.0         8.0   4.00  2020-08-19T17:05:00Z            0.0           4.0          4.0           4.0        18.0                1.00         1.0         1.0  22.0       11.00    0.0     home       1.0         0.50               0.0                     0.0           18.0               0.0     2.0  .329  54.0       1.0         L    Tommy Milone  0.0  11.0           5.50       1.0       1.0                0.0    0.0       0.0                1.000          1.0        18.0             1.535          3.371               6.00              9.00      Baltimore Orioles      1.0  Oriole Park at Camden Yards         1.50  1.39          1.0          .333   1.0
    2     14.0    59.0    0.0          2.0         0.0                   0.0             0.0       Boston            0.0      day      3.0         7.0   4.02  2020-08-19T17:35:00Z            0.0           3.0          3.0           3.0        14.0                1.00         0.0         0.0  17.0        9.77    0.0     away       2.0         1.15               0.0                     0.0           15.2               0.0     2.0  .311  47.0       0.0         R    Jake Arrieta  0.0   7.0           4.02       0.0       0.0                0.0    0.0       0.0                 .---          0.0        14.0             1.627          3.549               7.00              8.04  Philadelphia Phillies      0.0                  Fenway Park         1.15  1.21          2.0          .333   1.0
    3      2.0    14.0    1.0          3.0         0.0                   0.0             0.0       Boston            0.0      day      1.0         5.0  22.50  2020-08-19T17:35:00Z            0.0           1.0          1.0           1.0         1.0                0.50         0.0         0.0   7.0       31.50    0.0     home       2.0         9.00               0.0                     0.0            2.0               0.0     1.0  .588   6.0       0.0         L       Kyle Hart  0.0   7.0          31.50       0.0       0.0                0.0    0.0       0.0                 .---          0.0         4.0             1.681          3.575               1.33             18.00         Boston Red Sox      0.0                  Fenway Park        13.50  5.00          0.0          .000   0.0
    4      8.0    27.0    0.0          0.0         0.0                   0.0             0.0      Chicago            0.0      day      0.0         2.0   2.57  2020-08-19T18:20:00Z            0.0           1.0          1.0           1.0         7.0                0.88         0.0         0.0   6.0        7.71    0.0     away       0.0         0.00               0.0                     0.0            7.0               0.0     0.0  .222  21.0       0.0         R   Jack Flaherty  0.0   2.0           2.57       0.0       0.0                0.0    0.0       0.0                 .---          0.0         6.0             1.627          3.549               -.--              7.71    St. Louis Cardinals      0.0                Wrigley Field         0.00  0.86          0.0         1.000   1.0
    5     13.0    65.0    0.0          6.0         0.0                   0.0             1.0      Chicago            0.0      day      2.0         6.0   2.84  2020-08-19T18:20:00Z            0.0           3.0          3.0           3.0        28.0                2.15         1.0         1.0  10.0        4.74    0.0     home       2.0         0.95               0.0                     0.0           19.0               0.0     1.0  .236  57.0       0.0         R      Alec Mills  0.0   6.0           2.84       0.0       0.0                0.0    0.0       0.0                 .000          0.0        14.0             1.627          3.549               2.33              6.63           Chicago Cubs      0.0                Wrigley Field         2.84  0.84          0.0          .667   2.0
    6      NaN     NaN    NaN          NaN         NaN                   NaN             NaN      Chicago            NaN    night      NaN         NaN    NaN  2020-08-19T03:33:00Z            NaN           NaN          NaN           NaN         NaN                 NaN         NaN         NaN   NaN         NaN    NaN     away       NaN          NaN               NaN                     NaN            NaN               NaN     NaN   NaN   NaN       NaN       NaN             NaN  NaN   NaN            NaN       NaN       NaN                NaN    NaN       NaN                  NaN          NaN         NaN               NaN            NaN                NaN               NaN           Chicago Cubs      NaN                Wrigley Field          NaN   NaN          NaN           NaN   NaN
    7      NaN     NaN    NaN          NaN         NaN                   NaN             NaN      Chicago            NaN    night      NaN         NaN    NaN  2020-08-19T03:33:00Z            NaN           NaN          NaN           NaN         NaN                 NaN         NaN         NaN   NaN         NaN    NaN     home       NaN          NaN               NaN                     NaN            NaN               NaN     NaN   NaN   NaN       NaN       NaN             NaN  NaN   NaN            NaN       NaN       NaN                NaN    NaN       NaN                  NaN          NaN         NaN               NaN            NaN                NaN               NaN    St. Louis Cardinals      NaN                Wrigley Field          NaN   NaN          NaN           NaN   NaN
    8     13.0    92.0    0.0          8.0         0.0                   0.0             1.0  Kansas City            0.0      day      6.0        10.0   3.91  2020-08-19T21:05:00Z            0.0           4.0          4.0           4.0        24.0                1.85         0.0         0.0  25.0        9.78    0.0     away       1.0         0.39               0.0                     0.0           23.0               0.0     2.0  .327  69.0       0.0         R   Luis Castillo  0.0  12.0           4.70       0.0       1.0                0.0    0.0       0.0                 .000          0.0        31.0             1.589          3.467               3.88             12.13        Cincinnati Reds      1.0             Kauffman Stadium         3.13  1.43          0.0          .000   0.0
    9     10.0    36.0    0.0          5.0         0.0                   0.0             0.0  Kansas City            0.0      day      0.0         0.0   0.00  2020-08-19T21:05:00Z            0.0           2.0          2.0           2.0        11.0                1.10         1.0         1.0   5.0        4.09    0.0     home       0.0         0.00               0.0                     0.0           11.0               0.0     0.0  .262  33.0       0.0         R     Brad Keller  0.0   0.0           0.00       0.0       0.0                0.0    0.0       0.0                 .---          0.0        10.0             1.681          3.575               2.00              8.18     Kansas City Royals      0.0             Kauffman Stadium         4.09  0.91          0.0         1.000   2.0
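
Values such as .--- and -.-- in the output above suggest that several rate columns come back as strings rather than numbers, which would get in the way of sorting or modelling. A minimal sketch of one way to handle that, assuming those columns really are strings: the frame below is a small stand-in copied from a few cells of the printed output, not a live API response, and the choice of columns is illustrative.

```python
import pandas as pd

# Stand-in frame built from a few values shown in the output above
df = pd.DataFrame({
    "probablePitcher": ["Tanner Roark", "Jake Arrieta", "Jack Flaherty"],
    "era": ["6.00", "4.02", "2.57"],
    "stolenBasePercentage": ["1.000", ".---", ".---"],
    "strikeoutWalkRatio": ["1.11", "7.00", "-.--"],
})

rate_columns = ["era", "stolenBasePercentage", "strikeoutWalkRatio"]

# errors="coerce" converts anything unparseable (".---", "-.--") to NaN
df[rate_columns] = df[rate_columns].apply(pd.to_numeric, errors="coerce")

print(df.dtypes)
print(df)
```

After the conversion the placeholder entries are NaN, so they can be dropped or filled explicitly instead of silently staying as text.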
    

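    【Comments】:

    • Amazing find, @chitown88! This will be very helpful. =) I'll look into the information you asked about and hope to get back to you soon. Thank you for your time, sir. Much appreciated.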