Scraping table data from a wiki using Beautiful Soup and Python: answers

【Question title】: Scraping table data from wiki using Beautiful Soup and Python
【Posted】: 2019-12-06 04:39:41
【Question】:

How can I use Beautiful Soup in Python to extract the Alpha-3 codes from the first two tables on the following wiki page?

https://en.wikipedia.org/wiki/List_of_territorial_entities_where_English_is_an_official_language

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

r = requests.get('https://en.wikipedia.org/wiki/List_of_territorial_entities_where_English_is_an_official_language')
soup = bs(r.content, 'lxml')

# Take the first table that carries the "wikitable" class
table = soup.find_all('table', class_='wikitable')[0]

# Collect the text of every <td> cell, one list per table row
# (the header row contains only <th> cells, so its list is empty)
output_rows = []
for table_row in table.find_all('tr'):
    columns = table_row.find_all('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)

# Print the third cell of the first four data rows without the trailing newline
for row in output_rows[1:5]:
    print(row[2].rstrip('\n'))

【Comments】:

Please show the code you have written so far.
What is your expected output?
I just want all the Alpha-3 codes in a single array.

Tags: python-3.x web-scraping beautifulsoup

【Solution 1】:

Use pandas to read the tables, then either combine just the first two tables (if you want all of their data) or pull out only the Alpha-3 column.

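import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_territorial_entities_where_English_is_an_official_language'
# read_html returns a list with one DataFrame per <table> found on the page
dfs = pd.read_html(url)

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
# The [:3] slice is kept from the original answer: read_html counts every
# table on the page, so check which indices hold the tables you want.
df = pd.concat(dfs[:3], sort=True, ignore_index=True)

# Keep only the Alpha-3 column and drop rows that have no code
alpha3 = list(df['Alpha-3 code'].dropna())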

Output:

print(alpha3)
['AUS', 'NZL', 'GBR', 'USA', 'ATG', 'BHS', 'BRB', 'BLZ', 'BWA', 'BDI', 'CMR', 'CAN', 'COK', 'DMA', 'SWZ', 'FJI', 'GMB', 'GHA', 'GRD', 'GUY', 'IND', 'IRL', 'JAM', 'KEN', 'KIR', 'LSO', 'LBR', 'MWI', 'MLT', 'MHL', 'MUS', 'FSM', 'NAM', 'NGA', 'NIU', 'PAK', 'PLW', 'PNG', 'PHL', 'RWA', 'KNA', 'LCA', 'VCT', 'WSM', 'SYC', 'SLE', 'SGP', 'SLB', 'ZAF', 'SSD', 'SDN', 'TZA', 'TON', 'TTO', 'TUV', 'UGA', 'VUT', 'ZMB', 'ZWE']
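
A note on the table slice in Solution 1: because `pd.read_html` returns every table on the page, a positional slice depends on the page layout at the time of scraping. One possible way to make the selection less fragile is the `match` parameter of `pd.read_html`, which keeps only tables whose text matches a regex. The sketch below is an assumption-laden illustration, not the answerer's code: `SAMPLE_HTML` and the helper `alpha3_codes` are hypothetical, and the sample markup is a simplified stand-in for the real Wikipedia page.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the downloaded page; the real markup is more complex.
SAMPLE_HTML = """
<table>
  <tr><th>Note</th></tr>
  <tr><td>This table has no code column</td></tr>
</table>
<table class="wikitable">
  <tr><th>Country</th><th>Alpha-3 code</th></tr>
  <tr><td>Australia</td><td>AUS</td></tr>
  <tr><td>New Zealand</td><td>NZL</td></tr>
</table>
<table class="wikitable">
  <tr><th>Country</th><th>Alpha-3 code</th></tr>
  <tr><td>Bahamas</td><td>BHS</td></tr>
  <tr><td>Malta</td><td>MLT</td></tr>
</table>
"""


def alpha3_codes(html, column="Alpha-3 code", n_tables=2):
    """Return the values of `column` from the first `n_tables` matching tables."""
    # match= keeps only tables containing text that matches the regex,
    # so unrelated tables earlier on the page no longer shift the indices.
    tables = pd.read_html(StringIO(html), match=column)
    combined = pd.concat(tables[:n_tables], ignore_index=True)
    return combined[column].dropna().tolist()


print(alpha3_codes(SAMPLE_HTML))
```

Against the live page you would pass the downloaded HTML (or the URL itself, as in Solution 1) instead of `SAMPLE_HTML`. Whether the first two matching tables are the ones you want still needs to be checked by eye.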

【Comments】:

I'm a beginner in Python. Thank you very much for your help.
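
Since the question asked specifically about Beautiful Soup, here is a minimal sketch of the same extraction without pandas. It looks the column up by its header text rather than by a fixed index. Everything named here is an assumption for illustration: `SAMPLE_HTML` is a simplified stand-in for the real page, `column_values` is a hypothetical helper, and cells that use `rowspan` or `colspan` are not handled.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Nr</th><th>Country</th><th>Alpha-3 code</th></tr>
  <tr><td>1</td><td>Australia</td><td>AUS
</td></tr>
  <tr><td>2</td><td>New Zealand</td><td>NZL</td></tr>
</table>
<table class="wikitable">
  <tr><th>Country</th><th>Alpha-3 code</th><th>Region</th></tr>
  <tr><td>Bahamas</td><td>BHS</td><td>Caribbean</td></tr>
  <tr><td>Malta</td><td>MLT</td><td>Europe</td></tr>
</table>
<table class="wikitable">
  <tr><th>Entity</th><th>Alpha-3 code</th></tr>
  <tr><td>Anguilla</td><td>AIA</td></tr>
</table>
"""


def column_values(html, header="Alpha-3 code", n_tables=2):
    """Collect the cells under `header` from the first `n_tables` wikitables."""
    soup = BeautifulSoup(html, "html.parser")
    values = []
    for table in soup.find_all("table", class_="wikitable")[:n_tables]:
        rows = table.find_all("tr")
        if not rows:
            continue
        # Find the column position from the header row instead of hard-coding it.
        headers = [c.get_text(strip=True) for c in rows[0].find_all(["th", "td"])]
        if header not in headers:
            continue
        idx = headers.index(header)
        for row in rows[1:]:
            cells = row.find_all(["th", "td"])
            if idx < len(cells):
                text = cells[idx].get_text(strip=True)  # also drops trailing newlines
                if text:
                    values.append(text)
    return values


alpha3 = column_values(SAMPLE_HTML)
print(alpha3)  # ['AUS', 'NZL', 'BHS', 'MLT']
```

To try it on the live page, pass `requests.get(url).text` as `html`, with `url` set to the Wikipedia address from the question.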