Scrape tables from Wikipedia using Python? Answers

【Question title】: Scrape tables from Wikipedia using Python?
【Posted】: 2020-04-06 07:58:02
【Question description】:

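I am trying to scrape table data from this Wikipedia page: https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Nepal I tried using pandas pd.read_html, but it does not work for the table I want to scrape (confirmed COVID-19 cases in Nepal by district).

I then tried combining BeautifulSoup and pandas to scrape the data, but that does not work either: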

url = 'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Nepal'
r = requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')
table = soup.find('table', {'class': 'wikitable'})
dfs=pd.read_html(table)
dfs[0]

【Question comments】:

Tags: python pandas web-scraping beautifulsoup

【Solution 1】:

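import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Nepal'
# Alternative: let pandas fetch and parse the page itself.
# dfs = pd.read_html("https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Nepal", flavor="lxml")
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# find() returns only the first table whose class is "wikitable".
table = soup.find('table', {'class': 'wikitable'})
# read_html expects HTML text rather than a bs4 Tag, hence str(table).
# The replace() strips the stray ";" from rowspan="2;" / colspan="2;".
dfs = pd.read_html(str(table).replace("2;", "2"))
print(dfs[0])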

This works. You need to convert the table to a string before read_html can parse it.

For some reason, the rowspan and colspan attributes show up as "2;". I could not find a good way to fix that, and pd.read_html() does not like it, so I just used .replace().
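A blanket .replace("2;", "2") also rewrites any cell text that happens to contain "2;", and it misses other values such as "3;". A narrower variant is sketched below. It assumes the attributes look like rowspan="2;", as described above; the helper name strip_span_semicolons and the sample markup are made up for illustration and are not part of pandas or BeautifulSoup.

```python
import re

# Matches rowspan/colspan attributes whose numeric value is followed by a
# stray semicolon, e.g. rowspan="2;". The exact markup is an assumption
# based on the description above.
SPAN_WITH_SEMICOLON = re.compile(
    r"""((?:row|col)span\s*=\s*["']?\s*\d+)\s*;""",
    re.IGNORECASE,
)


def strip_span_semicolons(html: str) -> str:
    """Return html with the trailing ';' removed from rowspan/colspan values."""
    return SPAN_WITH_SEMICOLON.sub(r"\1", html)


# Hypothetical sample: the attribute values are cleaned, while the cell text
# "2; 3" is left untouched.
sample = '<td rowspan="2;">Kathmandu</td><td colspan="3;">2; 3</td>'
cleaned = strip_span_semicolons(sample)
print(cleaned)
```

With this helper, the read_html call in the code above could become pd.read_html(strip_span_semicolons(str(table))), which keeps the workaround limited to the attributes that read_html trips over.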

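In theory, passing the URL straight to pd.read_html should accomplish the same thing with less code, but it runs into the same rowspan problem: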

dfs = pd.read_html("https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Nepal", flavor="lxml")
print(dfs[0])  # whatever the index of the table is

This appears to be a possible bug in read_html (pandas version 1.0.3).

【Comments】: