Unable to webscrape HTML table with BeautifulSoup and load it into a Pandas dataframe with Python

【Question title】: Unable to webscrape HTML table with BeautifulSoup and load it into a Pandas dataframe with Python
【Posted】: 2020-05-17 17:15:07
【Question description】:

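My goal is to access the table on the following web page, https://www.countries-ofthe-world.com/world-currencies.html, and convert it into a Pandas dataframe with the columns "Country or territory", "Currency", and "ISO-4217".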

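I am able to access the columns correctly, but I am having a hard time figuring out how to append each row to the dataframe. Does anyone have suggestions for how I can do this? For example, on the web page the first row of the table is the letter "A". However, I need the first row in the dataframe to be Afghanistan, Afghan afghani, and AFN.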

Here is what I have so far:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.countries-ofthe-world.com/world-currencies.html"
req = Request(url, headers={"User-Agent":"Mozilla/5.0"})
webpage=urlopen(req).read()
soup = BeautifulSoup(webpage, "html.parser")
table = soup.find("table", {"class":"codes"})
rows = table.find_all('tr')
columns = [v.text for v in rows[0].find_all('th')] 
print(columns) # ['Country or territory', 'Currency', 'ISO-4217']

Please also take a look at this image.

Thank you all for your time.

Tony

【Question discussion】:

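You may want to check response.status_code - I get a 403 Forbidden from that site, so response.text won't contain anything useful.
Thanks! I'll take a closer look. I see it now.
I fixed the bad request. Please see the updated question.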
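
The comment above mentions response.status_code, which is how the requests library exposes the status. With urllib, as used in the question, a 4xx or 5xx reply is raised as an HTTPError instead, and the status sits on the exception. A minimal sketch of checking for the 403, where the helper name fetch and its user_agent parameter are made up for illustration:

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def fetch(url, user_agent="Mozilla/5.0"):
    """Return (status, body); body is None when the server answers with an error status."""
    req = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(req, timeout=10) as resp:
            return resp.status, resp.read()
    except HTTPError as err:
        # urlopen raises on 4xx/5xx replies; the status code is on the exception
        return err.code, None
```

Calling fetch("https://www.countries-ofthe-world.com/world-currencies.html") and looking at the first element of the result would show whether the site still answers 403 for a given User-Agent before any parsing is attempted.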

Tags: python-3.x pandas web-scraping beautifulsoup

【Solution 1】:

With your fix, pd.read_html can parse it easily:

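from urllib.request import Request, urlopen
import pandas as pd

url = "https://www.countries-ofthe-world.com/world-currencies.html"
req = Request(url, headers={"User-Agent":"Mozilla/5.0"})
webpage = urlopen(req).read()

# read_html returns a list of DataFrames, one per table found; take the first
df = pd.read_html(webpage)[0]
print(df.head())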

         Country or territory        Currency ISO-4217
0                           A               A        A
1                 Afghanistan  Afghan afghani      AFN
2  Akrotiri and Dhekelia (UK)   European euro      EUR
3     Aland Islands (Finland)   European euro      EUR
4                     Albania    Albanian lek      ALL

It still has those alphabet header rows, but you can drop them with something like df = df[df['Currency'] != df['ISO-4217']]
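
To see why that comparison works without hitting the site, here is a small sketch that rebuilds the five rows from the df.head() output above by hand and applies the same filter. The reset_index call is an addition of mine so that Afghanistan ends up at index 0; it is not part of the original one-liner.

```python
import pandas as pd

# Rows copied from the df.head() output above
df = pd.DataFrame(
    [
        ["A", "A", "A"],
        ["Afghanistan", "Afghan afghani", "AFN"],
        ["Akrotiri and Dhekelia (UK)", "European euro", "EUR"],
        ["Aland Islands (Finland)", "European euro", "EUR"],
        ["Albania", "Albanian lek", "ALL"],
    ],
    columns=["Country or territory", "Currency", "ISO-4217"],
)

# Letter rows carry the same value in every column, so comparing two columns flags them
clean = df[df["Currency"] != df["ISO-4217"]].reset_index(drop=True)
print(clean.iloc[0].tolist())  # ['Afghanistan', 'Afghan afghani', 'AFN']
```

The filter relies on no real row having a currency name identical to its ISO code, which holds for the sample rows shown here.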

【Discussion】:
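
If you would rather stay with BeautifulSoup, as in the question, one possible approach is to collect the td cells of each row and keep only rows with as many cells as there are column headers. The HTML below is a hypothetical stand-in for the real page: I am assuming the letter rows are a single cell spanning all three columns, which the actual markup may or may not match.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical sample markup; the structure of the real page is an assumption
html = """
<table class="codes">
  <tr><th>Country or territory</th><th>Currency</th><th>ISO-4217</th></tr>
  <tr><td colspan="3">A</td></tr>
  <tr><td>Afghanistan</td><td>Afghan afghani</td><td>AFN</td></tr>
  <tr><td>Albania</td><td>Albanian lek</td><td>ALL</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "codes"})
rows = table.find_all("tr")
columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]

records = []
for tr in rows[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Letter rows have fewer td cells than there are columns, so they are skipped
    if len(cells) == len(columns):
        records.append(cells)

# Build the frame once from the collected rows instead of appending row by row
df = pd.DataFrame(records, columns=columns)
print(df)
```

Collecting plain lists first and constructing the DataFrame at the end avoids growing the frame inside the loop.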