【Question Title】: Scraping table by beautiful soup 4
【Posted】: 2020-02-29 11:58:27
【Question】:

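Hello, I am trying to scrape the table at this URL: https://www.espn.com/nfl/stats/player/_/stat/rushing/season/2018/seasontype/2/table/rushing/sort/rushingYards/dir/desc

The table shows 50 rows. If you click Show more (just below the table), more rows appear. My Beautiful Soup code works, but it only retrieves the first 50 rows and not the rows that appear after clicking Show more. How can I get all the rows: the first 50 plus the ones that appear after clicking Show more? Here is the code: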

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Request the target ESPN stats page
rqst = requests.get("https://www.espn.com/nfl/stats/player/_/stat/rushing/season/2018/seasontype/2/table/rushing/sort/rushingYards/dir/desc")
soup = BeautifulSoup(rqst.content,'lxml')
table = soup.find_all('table')
NFL_player_stats = pd.read_html(str(table))
players = NFL_player_stats[0]
players.shape
out[0]:  (50,1) 

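【Comments】:

  • The page uses JavaScript to display more rows, but BeautifulSoup does not run JavaScript. You may need Selenium to control a web browser that can run JavaScript. Alternatively, you have to find the URL that the JavaScript/AJAX code uses to fetch new data and then use that URL with requests. With DevTools in Chrome/Firefox you can see all requests sent from the browser to the server.
  • Using DevTools in Firefox, I see that it gets the data for the next page (in JSON format) from site.web.api.espn.com/apis/common/v3/sports/football/nfl/… If you change the value in page=, you should get the other pages.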

Tags: python web-scraping beautifulsoup


【Solution 1】:

Using DevTools in Firefox, I can see that the page fetches its data (in JSON format) from the following URL:

https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page=2

If you change the value in page=, you can fetch the other pages.

import requests

url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='

for page in range(1, 4):
    print('\n---', page, '---\n')

    r = requests.get(url + str(page))
    data = r.json()

    #print(data.keys())

    for item in data['athletes']:
        print(item['athlete']['displayName'])

Result:

--- 1 ---

Ezekiel Elliott
Saquon Barkley
Todd Gurley II
Joe Mixon
Chris Carson
Christian McCaffrey
Derrick Henry
Adrian Peterson
Phillip Lindsay
Nick Chubb
Lamar Miller
James Conner
David Johnson
Jordan Howard
Sony Michel
Marlon Mack
Melvin Gordon
Alvin Kamara
Peyton Barber
Kareem Hunt
Matt Breida
Tevin Coleman
Aaron Jones
Doug Martin
Frank Gore
Gus Edwards
Lamar Jackson
Isaiah Crowell
Mark Ingram II
Kerryon Johnson
Josh Allen
Dalvin Cook
Latavius Murray
Carlos Hyde
Austin Ekeler
Deshaun Watson
Kenyan Drake
Royce Freeman
Dion Lewis
LeSean McCoy
Mike Davis
Josh Adams
Alfred Blue
Cam Newton
Jamaal Williams
Tarik Cohen
Leonard Fournette
Alfred Morris
James White
Mitchell Trubisky

--- 2 ---

Rashaad Penny
LeGarrette Blount
T.J. Yeldon
Alex Collins
C.J. Anderson
Chris Ivory
Marshawn Lynch
Russell Wilson
Blake Bortles
Wendell Smallwood
Marcus Mariota
Bilal Powell
Jordan Wilkins
Kenneth Dixon
Ito Smith
Nyheim Hines
Dak Prescott
Jameis Winston
Elijah McGuire
Patrick Mahomes
Aaron Rodgers
Jeff Wilson Jr.
Zach Zenner
Raheem Mostert
Corey Clement
Jalen Richard
Damien Williams
Jaylen Samuels
Marcus Murphy
Spencer Ware
Cordarrelle Patterson
Malcolm Brown
Giovani Bernard
Chase Edmonds
Justin Jackson
Duke Johnson
Taysom Hill
Kalen Ballage
Ty Montgomery
Rex Burkhead
Jay Ajayi
Devontae Booker
Chris Thompson
Wayne Gallman
DJ Moore
Theo Riddick
Alex Smith
Robert Woods
Brian Hill
Dwayne Washington

--- 3 ---

Ryan Fitzpatrick
Tyreek Hill
Andrew Luck
Ryan Tannehill
Josh Rosen
Sam Darnold
Baker Mayfield
Jeff Driskel
Rod Smith
Matt Ryan
Tyrod Taylor
Kirk Cousins
Cody Kessler
Darren Sproles
Josh Johnson
DeAndre Washington
Trenton Cannon
Javorius Allen
Jared Goff
Julian Edelman
Jacquizz Rodgers
Kapri Bibbs
Andy Dalton
Ben Roethlisberger
Dede Westbrook
Case Keenum
Carson Wentz
Brandon Bolden
Curtis Samuel
Stevan Ridley
Keith Ford
Keenan Allen
John Kelly
Kenjon Barner
Matthew Stafford
Tyler Lockett
C.J. Beathard
Cameron Artis-Payne
Devonta Freeman
Brandin Cooks
Isaiah McKenzie
Colt McCoy
Stefon Diggs
Taylor Gabriel
Jarvis Landry
Tavon Austin
Corey Davis
Emmanuel Sanders
Sammy Watkins
Nathan Peterman
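
As a side note on the URL used above: instead of concatenating the page number onto one long string, the query string can be assembled from a dict. This is only a sketch. `build_page_url` and its `season` and `limit` arguments are hypothetical names introduced for illustration, and whether the endpoint accepts values other than the ones shown in the answer is an assumption that has not been verified here.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL seen in DevTools (everything before the "?").
BASE_URL = (
    "https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/"
    "statistics/byathlete"
)


def build_page_url(page, season=2018, limit=50):
    """Return the API URL for one page (hypothetical helper).

    The parameter names and values are copied from the URL in the answer;
    only `page`, `season` and `limit` are made configurable.
    """
    params = {
        "region": "us",
        "lang": "en",
        "contentorigin": "espn",
        "isqualified": "false",
        "limit": limit,
        "category": "offense:rushing",        # urlencode turns ":" into %3A
        "sort": "rushing.rushingYards:desc",
        "season": season,
        "seasontype": 2,
        "page": page,
    }
    return BASE_URL + "?" + urlencode(params)


print(build_page_url(2))
```

With requests, the same dict can also be passed directly as `requests.get(BASE_URL, params=params)`, which lets the library do the encoding.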

Edit: get all the data as a DataFrame

import requests
import pandas as pd

url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='

# Collect plain lists first; DataFrame.append() was removed in pandas 2.0
rows = []

for page in range(1, 4):
    print('page:', page)

    r = requests.get(url + str(page))
    r.raise_for_status()  # stop early on an HTTP error
    data = r.json()

    #print(data.keys())

    for item in data['athletes']:
        player_name = item['athlete']['displayName']
        position = item['athlete']['position']['abbreviation']
        gp = item['categories'][0]['totals'][0]
        other_values = item['categories'][2]['totals']
        row = [player_name, position, gp] + other_values

        rows.append(row) # collect one row

# Build the DataFrame once from all collected rows
df = pd.DataFrame(rows, columns=['NAME', 'POS', 'GP', 'ATT', 'YDS', 'AVG', 'LNG', 'BIG', 'TD', 'YDS/G', 'FUM', 'LST', 'FD'])

print(len(df)) # 150
print(df.head(20))
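
The loops above hard-code three pages. A possible variation, sketched below, keeps requesting pages until one comes back without athletes. This rests on an assumption that has not been checked against the real API: that a page past the end returns an empty or missing 'athletes' list. `collect_athletes` and `fetch_page` are hypothetical names; the fetch function is passed in so the loop can be tried without network access.

```python
def collect_athletes(fetch_page, max_pages=100):
    """Gather the 'athletes' entries from consecutive pages.

    fetch_page: callable taking a page number and returning the decoded
                JSON dict for that page (hypothetical interface).
    max_pages:  safety cap so the loop always terminates.
    """
    athletes = []
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        # Assumption: an empty or missing list means there is no more data.
        batch = data.get("athletes") or []
        if not batch:
            break
        athletes.extend(batch)
    return athletes


# Offline demonstration with made-up pages instead of real HTTP requests.
fake_pages = {
    1: {"athletes": [{"athlete": {"displayName": "Player A"}}]},
    2: {"athletes": [{"athlete": {"displayName": "Player B"}}]},
    3: {"athletes": []},
}
items = collect_athletes(lambda page: fake_pages.get(page, {}))
print([item["athlete"]["displayName"] for item in items])
```

Against the real endpoint, the fetch function could be something like `lambda page: requests.get(url + str(page)).json()`, with `url` as defined in the answer.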

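【Comments】:

  • Thanks. May I ask how you came up with this expanded form of the link? I have to adapt the code to get the remaining columns in the table.
  • I used DevTools in Firefox to see all requests sent from the browser to the server. This URL shows up when I press Show More.
  • I tried to retrieve the remaining columns but could not... Could you please help me retrieve the remaining columns, from "POS" to "FD"?
  • You may need a more complex for-loop to pull out the values. For example, POS is in item['athlete']['position'], but that is a dictionary with many elements, so you have to pick the one you need: item['athlete']['position']['abbreviation'] gives QB, while item['athlete']['position']['name'] gives the full name 'Quarterback'.
  • You can use print(), type(item) and item.keys() to check which keys the dictionary uses. Or you can save data to a file and look at all of it in a text editor.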
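
The inspection advice in the last comment can be sketched roughly as follows. `outline` and `save_json` are hypothetical helper names, and the sample dict is made up to mirror only the keys mentioned in this thread; the real response contains many more.

```python
import json


def outline(obj, prefix="item", max_depth=3):
    """Return lines such as "item['athlete']: dict" for nested JSON data."""
    if max_depth <= 0:
        return []
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}[{key!r}]"
            lines.append(f"{path}: {type(value).__name__}")
            lines.extend(outline(value, path, max_depth - 1))
    elif isinstance(obj, list) and obj:
        # Only look at the first element; list entries usually share a shape.
        lines.extend(outline(obj[0], f"{prefix}[0]", max_depth - 1))
    return lines


def save_json(data, path):
    """Write the data to a file so it can be read in a text editor."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)


# Made-up sample using only keys that appear in the answer and comments.
sample = {
    "athlete": {
        "displayName": "Some Player",
        "position": {"abbreviation": "QB", "name": "Quarterback"},
    }
}
for line in outline(sample):
    print(line)
```

Calling `save_json(data, 'page1.json')` on a real response (the file name is arbitrary) would give a file to browse for the remaining column values.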