【Question Title】: Scraping NBA advanced stats with Python BeautifulSoup
【Posted】: 2019-03-05 02:43:59
【Question】:

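I want to scrape NBA advanced stats. To start, I just want to be able to scrape the team names, but my code does not collect anything. I am probably looking for the wrong thing in the find_all call. Any help is appreciated!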

import requests
from bs4 import BeautifulSoup

url = "https://stats.nba.com/teams/elbow-touch/?sort=ELBOW_TOUCHES&dir=-1"
result = requests.get(url)
c = result.content

soup = BeautifulSoup(c, "html.parser")

title = soup.title.text
print(title)

teams = soup.find_all('td',{'class':'team'})

for element in teams:
    print(element.text)

The site I am trying to scrape is the stats.nba.com page whose URL appears in the code above.

【Comments】:

    Tags: python web-scraping beautifulsoup


    【Solution 1】:

    The site is dynamic (the table is rendered by JavaScript after the page loads), so you will need to use selenium:

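    from selenium import webdriver
    from bs4 import BeautifulSoup as soup
    # Selenium 3 style call; newer Selenium releases take a Service object instead of a path string
    d = webdriver.Chrome('/path/to/chromedriver')
    d.get('https://stats.nba.com/teams/elbow-touch/?sort=ELBOW_TOUCHES&dir=-1')
    # Parse the rendered page source and locate the stats table
    s = soup(d.page_source, 'html.parser').find('table', {'class':'table'})
    # headers: text of every <th>; data: one list of <td> texts per <tr>, minus the first (header) row
    headers, [_, *data] = [i.text for i in s.find_all('th')], [[i.text for i in b.find_all('td')] for b in s.find_all('tr')]
    # Drop rows that do not hold a full set of cells
    final_data = [i for i in data if len(i) > 1]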
    

    final_data now holds the results for every team:

    [['Houston Rockets', '63', '38', '25', '242.0', '367.0', '8.8', '2.4', '3.8', '64.2', '0.4', '0.7', '62.8', '5.5', '-', '3.7', '-', '0.5', '14.0', '0.5', '5.4', '0.3', '-'], ['Milwaukee Bucks', '63', '48', '15', '241.2', '409.5', '9.5', '2.3', '3.6', '62.4', '0.7', '1.0', '73.3', '5.4', '-', '4.3', '-', '0.6', '13.0', '0.5', '5.2', '0.4', '-'], ['New York Knicks', '62', '13', '49', '241.6', '420.4', '9.5', '2.0', '3.4', '56.8', '0.7', '1.0', '69.8', '4.8', '-', '4.7', '-', '0.6', '13.7', '0.5', '5.3', '0.5', '-'], ['Charlotte Hornets', '63', '29', '34', '242.0', '409.7', '9.6', '1.7', '3.5', '50.0', '1.1', '1.5', '71.9', '4.7', '-', '4.6', '-', '0.7', '14.2', '0.4', '4.5', '0.7', '-'], ['Detroit Pistons', '62', '31', '31', '242.8', '437.0', '10.0', '1.6', '3.2', '51.3', '0.9', '1.2', '75.3', '4.4', '-', '5.0', '-', '0.9', '17.6', '0.7', '6.8', '0.6', '-'], ['Washington Wizards', '62', '25', '37', '243.2', '420.2', '10.5', '2.5', '4.3', '58.4', '0.9', '1.2', '76.4', '6.1', '-', '4.6', '-', '0.7', '15.5', '0.6', '5.6', '0.5', '-'], ['Atlanta Hawks', '64', '22', '42', '242.3', '434.9', '11.0', '2.2', '3.7', '58.6', '1.2', '1.5', '77.3', '5.7', '-', '5.3', '-', '0.7', '12.9', '0.7', '6.5', '0.7', '-'], ['Brooklyn Nets', '65', '32', '33', '243.8', '440.3', '11.2', '2.5', '4.4', '58.3', '1.2', '1.7', '70.8', '6.4', '-', '4.6', '-', '0.7', '14.9', '0.9', '7.9', '0.8', '-'], ['San Antonio Spurs', '64', '35', '29', '241.6', '402.3', '11.3', '2.3', '4.1', '55.5', '0.8', '1.0', '85.7', '5.6', '-', '5.8', '-', '1.1', '18.7', '0.5', '4.8', '0.4', '-'], ['Boston Celtics', '64', '38', '26', '241.6', '420.8', '11.5', '2.5', '4.2', '58.4', '0.5', '0.7', '71.7', '5.5', '-', '5.7', '-', '0.9', '15.0', '0.6', '5.6', '0.3', '-'], ['Toronto Raptors', '64', '46', '18', '242.3', '418.0', '11.5', '3.5', '5.9', '59.6', '1.2', '1.5', '78.1', '8.3', '-', '4.1', '-', '0.7', '16.3', '0.4', '3.7', '0.7', '-'], ['Portland Trail Blazers', '63', '39', '24', '241.6', '409.8', '11.8', '2.4', 
'4.6', '51.9', '1.2', '1.5', '80.2', '6.1', '-', '5.5', '-', '1.0', '18.8', '0.7', '5.7', '0.7', '-'], ['Utah Jazz', '61', '36', '25', '240.8', '435.9', '11.9', '2.0', '3.8', '51.1', '1.4', '2.2', '66.7', '5.4', '-', '5.9', '-', '1.0', '17.1', '0.7', '5.9', '1.0', '-'], ['Minnesota Timberwolves', '63', '29', '34', '241.6', '412.4', '12.0', '2.9', '5.0', '57.3', '1.3', '1.6', '79.8', '7.3', '-', '5.2', '-', '1.0', '19.5', '0.6', '5.2', '0.7', '-'], ['Chicago Bulls', '63', '18', '45', '243.2', '411.3', '12.4', '2.8', '4.8', '57.9', '0.7', '0.9', '77.6', '6.4', '-', '6.3', '-', '0.8', '12.4', '0.6', '4.5', '0.4', '-'], ['LA Clippers', '65', '36', '29', '241.9', '430.4', '12.4', '2.9', '5.1', '56.9', '1.0', '1.5', '69.5', '7.0', '-', '5.4', '-', '0.9', '15.9', '0.7', '5.5', '0.6', '-'], ['Miami Heat', '62', '28', '34', '240.4', '426.1', '12.6', '2.0', '4.0', '50.2', '0.7', '1.3', '56.8', '4.9', '-', '7.0', '-', '1.1', '15.4', '0.4', '3.4', '0.5', '-'], ['New Orleans Pelicans', '65', '29', '36', '240.0', '435.0', '12.6', '3.5', '6.4', '54.8', '1.2', '1.6', '74.5', '8.4', '-', '4.4', '-', '0.9', '20.4', '0.7', '5.2', '0.8', '-'], ['Phoenix Suns', '64', '13', '51', '242.3', '435.8', '12.9', '2.8', '5.0', '56.7', '1.0', '1.3', '73.5', '6.8', '-', '6.2', '-', '0.8', '13.7', '0.6', '4.7', '0.6', '-'], ['Oklahoma City Thunder', '63', '39', '24', '242.0', '364.8', '13.6', '3.2', '5.8', '54.5', '1.0', '1.4', '65.9', '7.5', '-', '5.8', '-', '0.9', '14.7', '0.7', '4.8', '0.6', '-'], ['Dallas Mavericks', '62', '27', '35', '240.8', '435.4', '13.9', '1.8', '3.1', '55.9', '1.2', '1.6', '76.5', '5.0', '-', '8.6', '-', '1.1', '13.1', '0.8', '5.7', '0.7', '-'], ['Golden State Warriors', '63', '44', '19', '241.6', '442.3', '13.9', '2.8', '4.8', '57.0', '1.2', '1.5', '81.7', '6.9', '-', '7.2', '-', '1.6', '21.7', '0.8', '5.8', '0.7', '-'], ['Orlando Magic', '63', '28', '35', '241.2', '405.0', '14.0', '3.2', '5.7', '55.8', '1.1', '1.4', '80.9', '7.7', '-', '6.5', '-', '1.4', '21.8', '0.6', 
'4.0', '0.7', '-'], ['Los Angeles Lakers', '63', '30', '33', '241.6', '405.9', '14.2', '3.3', '5.7', '57.8', '1.1', '1.6', '67.0', '7.8', '-', '6.3', '-', '1.3', '20.7', '0.9', '6.3', '0.7', '-'], ['Denver Nuggets', '62', '42', '20', '240.8', '435.2', '15.0', '3.1', '5.3', '59.1', '1.1', '1.5', '72.5', '7.5', '-', '7.4', '-', '1.7', '22.3', '1.0', '6.4', '0.7', '-'], ['Indiana Pacers', '64', '41', '23', '240.4', '431.7', '15.3', '4.4', '7.2', '60.6', '1.4', '1.9', '74.2', '10.4', '-', '5.8', '-', '1.2', '20.9', '0.9', '6.0', '0.9', '-'], ['Cleveland Cavaliers', '64', '16', '48', '241.2', '407.3', '16.1', '2.3', '4.5', '51.6', '0.9', '1.1', '80.0', '5.6', '-', '10.0', '-', '1.2', '12.3', '0.5', '3.4', '0.4', '-'], ['Philadelphia 76ers', '63', '40', '23', '242.0', '446.9', '16.6', '2.5', '4.7', '52.7', '1.4', '1.7', '82.6', '6.6', '-', '9.6', '-', '1.8', '18.6', '0.7', '4.3', '0.7', '-'], ['Sacramento Kings', '62', '31', '31', '240.8', '425.2', '16.7', '3.2', '6.3', '50.3', '1.1', '1.6', '65.3', '7.5', '-', '8.0', '-', '1.5', '18.3', '1.0', '6.2', '0.7', '-'], ['Memphis Grizzlies', '65', '25', '40', '241.9', '452.1', '20.5', '3.4', '6.7', '51.3', '1.5', '1.9', '81.1', '8.6', '-', '11.2', '-', '1.6', '14.1', '0.8', '4.1', '0.8', '-']]
    

    To get only the team names:

    teams = [a for a, *_ in final_data]
    

    Output:

    ['Houston Rockets', 'Milwaukee Bucks', 'New York Knicks', 'Charlotte Hornets', 'Detroit Pistons', 'Washington Wizards', 'Atlanta Hawks', 'Brooklyn Nets', 'San Antonio Spurs', 'Boston Celtics', 'Toronto Raptors', 'Portland Trail Blazers', 'Utah Jazz', 'Minnesota Timberwolves', 'Chicago Bulls', 'LA Clippers', 'Miami Heat', 'New Orleans Pelicans', 'Phoenix Suns', 'Oklahoma City Thunder', 'Dallas Mavericks', 'Golden State Warriors', 'Orlando Magic', 'Los Angeles Lakers', 'Denver Nuggets', 'Indiana Pacers', 'Cleveland Cavaliers', 'Philadelphia 76ers', 'Sacramento Kings', 'Memphis Grizzlies']
    

    To get a specific statistic, the simplest approach is to build a list of dictionaries by pairing the header values with each row of data:

    data_attrs = [dict(zip(headers, i)) for i in final_data]
    all_touches = [i['Touches'] for i in data_attrs]
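
    Every scraped cell is a string, so comparing or sorting by a statistic needs a conversion step first. A minimal sketch, assuming a shortened header list (`TEAM`, `GP`, `W`, `L` are illustrative names, not necessarily the exact headers on the page) and the first four columns of three rows from the output above:

```python
# Hypothetical header names; the real ones come from the table's <th> cells.
headers = ['TEAM', 'GP', 'W', 'L']

# First four columns of three rows taken from final_data above.
final_data = [
    ['Houston Rockets', '63', '38', '25'],
    ['Milwaukee Bucks', '63', '48', '15'],
    ['New York Knicks', '62', '13', '49'],
]

def to_number(value):
    """Return value as a float when possible, otherwise leave it unchanged."""
    try:
        return float(value)
    except ValueError:
        return value  # team names and '-' placeholders stay as strings

# One dict per team, with numeric cells converted to floats
data_attrs = [
    {key: to_number(cell) for key, cell in zip(headers, row)}
    for row in final_data
]

# Team with the most wins among the sample rows
best = max(data_attrs, key=lambda row: row['W'])
print(best['TEAM'], best['W'])
```

    Leaving non-numeric cells untouched keeps the `'-'` placeholders visible instead of silently turning them into zeros.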
    

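    【Comments】:

    • Thanks for the help. Is there anywhere you can point me to so I can understand what you did in these lines: headers, [_, *data] = [i.text for i in s.find_all('th')], [[i.text for i in b.find_all('td')] for b in s.find_all('tr')] final_data = [i for i in data if len(i) > 1] *Not sure how to format this as a comment
    • Also, how can I access a specific statistic with this code?
    • @BrennanMosher Those lines are list comprehensions that build the table data from the bs4 objects. [[i.text for i in b.find_all('td')] for b in s.find_all('tr')] is a nested comprehension that creates the table data as a list of lists, while [i for i in data if len(i) > 1] removes any extraneous elements found during parsing by checking the length of each sublist. The Python documentation covers list comprehensions in more detail. To access a specific statistic, see my recent edit.
    • @Ajax1234 - If you have the time and energy, could you explain what [_, *data] does?
    • @JackFleeting No problem. [_, *data] is known as unpacking. The result being assigned, [[i.text for i in b.find_all('td')] for b in s.find_all('tr')], has an empty list as its first element (the result of iterating over the header row), and using the throwaway variable _ is a cleaner way to drop it and store the remaining results in a list. The Python documentation has more on unpacking.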
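    【Solution 2】:

    Another approach is to send a GET request to the site's API and receive a JSON response. By changing the parameters, you can get different results.

    You can find where the browser sends its requests by looking in Chrome's developer tools.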

    import requests
    
    url = "https://stats.nba.com/stats/leaguedashptstats?"
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"
    }
    
    params = {
        "PerMode": "PerGame",
        "PlayerOrTeam": "Team",
        "PtMeasureType": "ElbowTouch",
        "Season": "2018-19",
        "SeasonType": "Regular Season",
        "StarterBench": "",
        "PlayerPosition": "",
        "PlayerExperience": "",
        "GameScope": "",
        "VsConference": "",
        "VsDivision": "",
        "DateFrom": "",
        "DateTo": "",
        "SeasonSegment": "",
        "Location": "",
        "Outcome": "",
        "LastNGames": "0",
        "Month": "0",
        "OpponentTeamID": "0"
    }
    
    r = requests.get(url, params=params, headers=headers)
    data = r.json()
    results = data['resultSets'][0]['rowSet']
    
    for result in results:
        print(result)
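
    Each entry in `rowSet` is a bare list of values with no column names attached. Assuming the response also carries a `headers` list next to `rowSet` (worth confirming by printing `data['resultSets'][0].keys()`), the two can be zipped into dictionaries. The payload below is a hand-written stand-in with illustrative column names and values, not real API output:

```python
# Hand-written stand-in for r.json(); column names and values are illustrative only.
data = {
    'resultSets': [{
        'headers': ['TEAM_NAME', 'GP', 'ELBOW_TOUCHES'],
        'rowSet': [
            ['Team A', 63, 8.8],
            ['Team B', 65, 20.5],
        ],
    }]
}

result_set = data['resultSets'][0]
columns = result_set['headers']

# One dict per team, keyed by column name
records = [dict(zip(columns, row)) for row in result_set['rowSet']]

for record in records:
    print(record['TEAM_NAME'], record['ELBOW_TOUCHES'])
```

    Keyed records avoid relying on column positions, which may shift when the request parameters change.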
    

    【Comments】:

      【Solution 3】:

      A variation on @Ajax1234's answer lets you load the whole table into a DataFrame:

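      import pandas as pd
      from io import StringIO

      # s is the <table> tag from Solution 1; read_html returns a list of DataFrames
      df = pd.read_html(StringIO(str(s)))[0]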
      

      And there is your table.

      【Comments】:
