【Question Title】: Why can't I webscrape the table that I want?
【Posted】: 2020-06-01 01:12:45
【Question】:

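I'm new to BeautifulSoup and wanted to try some web scraping. For a small project, I want to get the Golden State Warriors' win rate from Wikipedia. My plan is to load the table that holds this information into a pandas DataFrame so I can plot it across the years. However, my code selects the Table Key table instead of the Seasons table. I know this happens because both are the same type of table (wikitable), but I don't know how to get around it. I'm sure there is a simple explanation I'm missing. Can someone explain how to fix my code, and how I can choose which table to scrape in the future? Thanks!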

import urllib.request
from bs4 import BeautifulSoup

c_data = "https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons"  # Wikipedia page
c_page = urllib.request.urlopen(c_data)
c_soup = BeautifulSoup(c_page, "lxml")
c_table = c_soup.find('table', class_='wikitable')  # this is the problem: find() returns only the first match
c_year = []
c_rate = []
for row in c_table.findAll('tr'):  # setup for dataframe
    cells = row.findAll('td')
    if len(cells) == 13:
        # list.append mutates the list in place and returns None, so its result is not reassigned
        c_year.append(cells[0].find(text=True))
        c_rate.append(cells[9].find(text=True))
print(c_year, c_rate)

【Comments】:

  • I also imported BeautifulSoup and urllib.request.

Tags: python python-3.x dataframe beautifulsoup wikipedia


【Solution 1】:

Use pd.read_html to get all tables

  • This function returns a list of DataFrames
    • tables[0] through tables[17], in this case
import pandas as pd

# read tables
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons')

print(len(tables))
>>> 18

tables[0]
          0                                             1
0       AHC                  NBA All-Star Game Head Coach
1      AMVP            All-Star Game Most Valuable Player
2       COY                             Coach of the Year
3      DPOY                  Defensive Player of the Year
4    Finish          Final position in division standings
5        GB  Games behind first-place team in division[b]
6   Italics                            Season in progress
7    Losses               Number of regular season losses
8       EOY                         Executive of the Year
9      FMVP                   Finals Most Valuable Player
10      MVP                          Most Valuable Player
11      ROY                            Rookie of the Year
12      SIX                         Sixth Man of the Year
13     SPOR                           Sportsmanship Award
14     Wins                 Number of regular season wins

# display all dataframes in tables
# (display() is available in Jupyter/IPython; use print(table) in a plain script)
for i, table in enumerate(tables):
    print(f'Table {i}')
    display(table)
    print('\n')
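
As an alternative to scanning every table by eye, pd.read_html accepts a match argument that keeps only tables containing text matching a given string or regex. The sketch below runs against a small made-up HTML snippet rather than the live page, so the column names (Season, Win%) and the two sample rows are assumptions for illustration; check the real page's headers before picking the text to match.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for the Wikipedia page: a key table followed by a seasons table.
# Both share the "wikitable" class, as in the question.
html = """
<table class="wikitable">
  <tr><th>Abbreviation</th><th>Meaning</th></tr>
  <tr><td>MVP</td><td>Most Valuable Player</td></tr>
</table>
<table class="wikitable">
  <tr><th>Season</th><th>Wins</th><th>Losses</th><th>Win%</th></tr>
  <tr><td>2014-15</td><td>67</td><td>15</td><td>.817</td></tr>
  <tr><td>2015-16</td><td>73</td><td>9</td><td>.890</td></tr>
</table>
"""

# match keeps only the tables that contain text matching this string/regex.
# "Win%" is assumed to appear in the seasons table but not in the key table.
tables = pd.read_html(StringIO(html), match="Win%")

seasons = tables[0]
print(len(tables))
print(seasons.columns.tolist())
```

The text passed to match should be distinctive: a word such as "Wins" would also appear in the key table shown above, so both tables would come back.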

Select a specific table

df_i_want = tables[x]  # x is the index of the table you want (0-indexed)

# delete the list of tables once the one you need has been picked out
del tables
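
If you would rather stay with BeautifulSoup, as in the question, the same idea applies: find() returns only the first matching table, while find_all() returns every one, so you can pick the table by its header text. This is a minimal sketch on a made-up HTML snippet; the helper find_table_with_header, the header names, and the column positions are hypothetical and would need adjusting for the real page. It uses the built-in "html.parser" so the example has no lxml dependency.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the Wikipedia page (not the real markup).
html = """
<table class="wikitable">
  <tr><th>Abbreviation</th><th>Meaning</th></tr>
  <tr><td>MVP</td><td>Most Valuable Player</td></tr>
</table>
<table class="wikitable">
  <tr><th>Season</th><th>Wins</th><th>Losses</th><th>Win%</th></tr>
  <tr><td>2014-15</td><td>67</td><td>15</td><td>.817</td></tr>
  <tr><td>2015-16</td><td>73</td><td>9</td><td>.890</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def find_table_with_header(soup, header_text):
    """Hypothetical helper: first wikitable with a <th> equal to header_text, else None."""
    for table in soup.find_all("table", class_="wikitable"):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        if header_text in headers:
            return table
    return None


table = find_table_with_header(soup, "Season")

years, rates = [], []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if cells:  # the header row has only <th> cells, so it is skipped
        years.append(cells[0].get_text(strip=True))
        rates.append(cells[3].get_text(strip=True))  # column 3 is Win% in this stand-in

print(years, rates)
```

Selecting by header text tends to be sturdier than a fixed index such as tables[x], since the index shifts whenever a table is added to or removed from the page.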

【Comments】:
