Answers: Table data scraping using BeautifulSoup and Requests in Python

【Question title】: Table data scraping using BeautifulSoup and Requests in Python
【Posted】: 2020-09-02 10:48:10
【Question description】:

I am trying to scrape table data from the following site using BeautifulSoup and Requests: https://www.worldometers.info/world-population/

When I run the code, I get this error:

> Traceback (most recent call last):
>   File "d:\python\population\worldpop.py", line 16, in <dictcomp>
>     result=[{ header[index]:cells.text for index,cells in enumerate(row.find_all('td'))} for row in rows_data]
> IndexError: list index out of range

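I know this kind of error occurs when accessing an index that is out of range, but I am stuck on this particular case. I am looking for a proper solution to this problem.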

# Working on scraping table data from worldometers.info and converting it to a CSV file.

from bs4 import BeautifulSoup
import requests
import pandas

url='https://www.worldometers.info/world-population/'

def world_population():
    page=requests.get(url)
    soup=BeautifulSoup(page.content,'html.parser')
    pop_data=soup.find('table', class_='table table-striped table-bordered table-hover table-condensed table-list')
    header=[heading.text for heading in pop_data.find_all('th')]
    #print(header)
    rows_data=[row for row in pop_data.find_all('tr')]

    result=[{ header[index]:cells.text for index,cells in enumerate(row.find_all('td'))} for row in rows_data]
    
    df=pandas.DataFrame(result)
    df.to_csv('pop.csv')

world_population()
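
One plausible reading of the IndexError, not verified against the live page: `header[index]` is looked up for every `td` in a row, so any row with more cells than there are collected headers runs off the end of the list. The sketch below reproduces that failure mode with plain lists standing in for the `th` and `td` text (the sample values and both helper names are hypothetical) and shows how pairing with `zip()` avoids the out-of-range lookup.

```python
# Hypothetical stand-ins for the text of the <th> and <td> cells.
header = ["Year", "Population"]
rows = [
    [],                                   # header row: holds <th> only, so no <td> cells
    ["2020", "7,794,798,739"],
    ["2019", "7,713,468,100", "extra"],   # one more cell than there are headers
]


def rows_to_dicts_indexed(header, rows):
    """Same lookup as in the question: header[index] for every cell."""
    return [{header[i]: cell for i, cell in enumerate(row)} for row in rows]


def rows_to_dicts_zipped(header, rows):
    """Pair headers with cells; zip() stops at the shorter sequence."""
    # Empty rows (such as the header row) are skipped.
    return [dict(zip(header, row)) for row in rows if row]


try:
    rows_to_dicts_indexed(header, rows)
except IndexError as exc:
    print("indexed lookup failed:", exc)

records = rows_to_dicts_zipped(header, rows)
print(records)
```

Note that `zip()` silently drops surplus cells, so it hides a header/cell mismatch rather than explaining it; printing `len(header)` next to the cell count of each row is a quick way to see which rows disagree.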

【Question comments】:

Which table are you after?

Tags: python-3.x pandas web-scraping beautifulsoup python-requests

【Solution 1】:

You can use pandas' .read_html() to parse the <table> tags. It returns the tables as a list of DataFrames, so you then just need to pull out the table you want by its index.

import requests
import pandas as pd

url='https://www.worldometers.info/world-population/'

def world_population():
    page=requests.get(url)
    df = pd.read_html(page.text)[0]
    df.to_csv('pop.csv')

world_population()

Output:

print(df.to_string())
   Year (July 1)    Population Yearly % Change Yearly Change Median Age Fertility Rate Density (P/Km²) Urban Pop % Urban Population
            2020 7,794,798,739          1.05 %    81,330,639       30.9           2.47              52      56.2 %    4,378,993,944
            2019 7,713,468,100          1.08 %    82,377,060       29.8           2.51              52      55.7 %    4,299,438,618
0           2018    7631091040          1.10 %      83232115       29.8           2.51              51      55.3 %       4219817318
1           2017    7547858925          1.12 %      83836876       29.8           2.51              51      54.9 %       4140188594
2           2016    7464022049          1.14 %      84224910       29.8           2.51              50      54.4 %       4060652683
3           2015    7379797139          1.19 %      84594707       30.0           2.52              50      54.0 %       3981497663
4           2010    6956823603          1.24 %      82983315       28.0           2.58              47      51.7 %       3594868146
5           2005    6541907027          1.26 %      79682641       27.0           2.65              44      49.2 %       3215905863
6           2000    6143493823          1.35 %      79856169       26.0           2.78              41      46.7 %       2868307513
7           1995    5744212979          1.52 %      83396384       25.0           3.01              39      44.8 %       2575505235
8           1990    5327231061          1.81 %      91261864       24.0           3.44              36      43.0 %       2290228096
9           1985    4870921740          1.79 %      82583645       23.0           3.59              33      41.2 %       2007939063
10          1980    4458003514          1.79 %      75704582       23.0           3.86              30      39.3 %       1754201029
11          1975    4079480606          1.97 %      75808712       22.0           4.47              27      37.7 %       1538624994
12          1970    3700437046          2.07 %      72170690       22.0           4.93              25      36.6 %       1354215496
13          1965    3339583597          1.93 %      60926770       22.0           5.02              22        N.A.             N.A.
14          1960    3034949748          1.82 %      52385962       23.0           4.90              20      33.7 %       1023845517
15          1955    2773019936          1.80 %      47317757       23.0           4.97              19        N.A.             N.A.
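
Picking a table by position (`[0]`) depends on the page's table order staying the same. As a minimal sketch, assuming a parser backend such as lxml is installed, the example below runs `pd.read_html()` on a small inline page (the HTML is made up for illustration, not taken from the site) and uses the `match` argument to keep only tables whose text contains a given string. The markup is wrapped in `StringIO` because recent pandas versions deprecate passing literal HTML strings.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page standing in for the real site.
HTML = """
<table>
  <tr><th>Rank</th><th>Country</th></tr>
  <tr><td>1</td><td>China</td></tr>
</table>
<table>
  <tr><th>Year</th><th>Population</th></tr>
  <tr><td>2020</td><td>7,794,798,739</td></tr>
  <tr><td>2019</td><td>7,713,468,100</td></tr>
</table>
"""

# Without a filter, every <table> on the page comes back as a DataFrame.
tables = pd.read_html(StringIO(HTML))
print(len(tables))

# match keeps only the tables whose text matches the string or regex.
population = pd.read_html(StringIO(HTML), match="Population")[0]
print(list(population.columns))
print(len(population))
```

With a real page, the same call would take `StringIO(page.text)` from the `requests` response in place of the inline `HTML`.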

【Comments】: