Creating a CSV File from a Wikipedia Table Using Beautiful Soup

【Question title】: Creating a CSV File from a Wikipedia table using Beautiful Soup
【Posted】: 2021-05-25 19:22:52
【Question】:

I am trying to scrape the first 3 columns of the table on this Wikipedia page using Beautiful Soup.

I implemented the solution found here.

import requests
import lxml
import pandas as pd
from bs4 import BeautifulSoup

#requesting the page
url = 'https://en.wikipedia.org/wiki/List_of_winners_and_shortlisted_authors_of_the_Booker_Prize'
page = requests.get(url).text 

#parsing the page
soup = BeautifulSoup(page, "lxml")

#selecting the table that matches the given class
table = soup.find('table',class_="sortable wikitable")

df = pd.read_html(str(table))
df = pd.concat(df)
print(df)
df.to_csv("booker.csv", index = False)

It works like a charm and gives me the output I was looking for:

Expected Output 1

However, the solution above uses pandas.

I want to produce the same output without using pandas.

I referred to the solution here, but the output I get looks like this:

Output 2

Here is the code that generates "Output 2":

import csv
import requests
import lxml
from bs4 import BeautifulSoup

#requesting the page
url = 'https://en.wikipedia.org/wiki/List_of_winners_and_shortlisted_authors_of_the_Booker_Prize'
page = requests.get(url).text 

#parsing the page
soup = BeautifulSoup(page, "lxml")

#selecting the table that matches the given class
table = soup.find('table',class_="sortable wikitable")

with open('output.csv', 'w', newline="") as file:
    writer = csv.writer(file)
    writer.writerow(['Year','Author','Title'])
    for tr in table.find_all('tr'):
        try:
            td_1 = tr.find_all('td')[0].get_text(strip=True)
        except IndexError:
            td_1 = ""
        try:
            td_2 = tr.find_all('td')[1].get_text(strip=True)
        except IndexError:
            td_2 = ""
        try:
            td_3 = tr.find_all('td')[3].get_text(strip=True)
        except IndexError:
            td_3 = ""
        writer.writerow([td_1, td_2,td_3])

So my question is: how do I get the expected output without using pandas?

P.S.: I tried parsing the rows of the table like this:

import requests
import lxml
from bs4 import BeautifulSoup

#requesting the page
url = 'https://en.wikipedia.org/wiki/List_of_winners_and_shortlisted_authors_of_the_Booker_Prize'
page = requests.get(url).text 

#parsing the page
soup = BeautifulSoup(page, "lxml")

#selecting the table that matches the given class
table = soup.find('table',class_="sortable wikitable")

rows = table.find_all('tr')

for row in rows:
    cell = row.td
    if cell is not None:
        print(cell.get_text())
        print(cell.next_sibling.next_sibling.get_text())
    else:
        print("heehee")

But the output I get looks like this:

heehee
1969
Barry England
Nicholas Mosley
Iris Murdoch
Muriel Spark
Gordon Williams
1970
A. L. Barker
Elizabeth Bowen
Iris Murdoch
William Trevor
Terence Wheeler
1970   Awarded in 2010 as the  Lost Man Booker Prize[a]
Nina Bawden
Shirley Hazzard
Mary Renault
Muriel Spark
Patrick White
1971
Thomas Kilroy
Doris Lessing
Mordecai Richler
Derek Robinson
Elizabeth Taylor
1972
Susan Hill
Thomas Keneally

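【Comments】:

Tags: python-3.x web-scraping beautifulsoup

【Solution 1】:

Try the following approach to get the result you want. Make sure your bs4 version is up to date, or at least 4.7.0, so that it supports the pseudo CSS selectors used in the script.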

import csv
import lxml
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_winners_and_shortlisted_authors_of_the_Booker_Prize'

page = requests.get(url) 
soup = BeautifulSoup(page.text, "lxml")
with open('output.csv', 'w', newline="") as file:
    writer = csv.writer(file)
    writer.writerow(['Year','Author','Title'])
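    # [1:] skips the header row of the table
    for row in soup.select('table.wikitable > tbody > tr')[1:]:
        # select_one() returns None when nothing matches, so calling
        # .get_text() on it raises AttributeError; fall back to "" then
        try:
            # the year cell spans several rows, so only some rows contain it
            year = row.select_one("td[rowspan]").get_text(strip=True)
        except AttributeError:
            year = ""
        try:
            # author link inside a cell that is not the year cell
            author = row.select_one("td:not([rowspan]) > a[title]").get_text(strip=True)
        except AttributeError:
            author = ""
        try:
            # titles are italicised, with or without a link
            title = row.select_one("td > i > a[title], td > i").get_text(strip=True)
        except AttributeError:
            title = ""
        writer.writerow([year,author,title])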
        print(year,author,title)
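
Note that the script above writes an empty Year for every row that has no td[rowspan] cell, whereas the pandas output in the question repeats the year on each row. A minimal sketch of one way to carry the year forward is shown here. It is not part of the original answer: SAMPLE_HTML is a made-up, simplified stand-in for the Wikipedia markup, extract_rows and write_csv are hypothetical helper names, and the code assumes the year cell is the only cell with a rowspan attribute and always comes first in its row. It uses the built-in "html.parser" so that lxml is not required.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the Wikipedia table markup.
SAMPLE_HTML = """
<table class="wikitable sortable">
<tbody>
<tr><th>Year</th><th>Author</th><th>Title</th><th>Genre(s)</th></tr>
<tr><td rowspan="2">1969</td><td><a title="P. H. Newby">P. H. Newby</a></td><td><i>Something to Answer For</i></td><td>Novel</td></tr>
<tr><td><a title="Barry England">Barry England</a></td><td><i>Figures in a Landscape</i></td><td>Novel</td></tr>
<tr><td rowspan="1">1970</td><td>Bernice Rubens</td><td><i>The Elected Member</i></td><td>Novel</td></tr>
</tbody>
</table>
"""


def extract_rows(html):
    """Return [year, author, title] lists, repeating the year on every row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    current_year = ""
    for tr in soup.select("table.wikitable tr"):
        cells = tr.find_all("td")
        if not cells:
            continue  # header row: it only has <th> cells
        if cells[0].has_attr("rowspan"):
            # First row of a year group: remember the year, then drop the cell
            # so the remaining cells line up with the rows that lack it.
            current_year = cells[0].get_text(" ", strip=True)
            cells = cells[1:]
        if len(cells) < 2:
            continue  # not enough cells for an author and a title
        author = cells[0].get_text(strip=True)
        title = cells[1].get_text(strip=True)
        rows.append([current_year, author, title])
    return rows


def write_csv(rows, file):
    """Write a header line followed by the extracted rows to an open file."""
    writer = csv.writer(file)
    writer.writerow(["Year", "Author", "Title"])
    writer.writerows(rows)


rows = extract_rows(SAMPLE_HTML)
buffer = io.StringIO()  # in-memory file, so the sketch does not touch disk
write_csv(rows, buffer)
print(buffer.getvalue())
```

To use this on the real page, you could pass requests.get(url).text to extract_rows and call write_csv with a file opened via open('output.csv', 'w', newline=""). Whether the rowspan assumption holds for every row of the live table would need checking.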

【Comments】:

【Solution 2】:

The simplest approach is to use pandas directly:

import pandas as pd


url = "https://en.wikipedia.org/wiki/List_of_winners_and_shortlisted_authors_of_the_Booker_Prize"
df = pd.read_html(url)[0][["Year", "Author", "Title"]]
print(df)

Prints:

                                                  Year                Author                                        Title
0                                                 1969           P. H. Newby                      Something to Answer For
1                                                 1969         Barry England                       Figures in a Landscape
2                                                 1969       Nicholas Mosley                        The Impossible Object
3                                                 1969          Iris Murdoch                        The Nice and the Good
4                                                 1969          Muriel Spark                             The Public Image
5                                                 1969       Gordon Williams                       From Scenes Like These
6                                                 1970        Bernice Rubens                           The Elected Member
7                                                 1970          A. L. Barker                            John Brown's Body
8                                                 1970       Elizabeth Bowen                                    Eva Trout
9                                                 1970          Iris Murdoch                                Bruno's Dream
10                                                1970        William Trevor               Mrs Eckdorf in O'Neill's Hotel
11                                                1970       Terence Wheeler                              The Conjunction
12   1970 Awarded in 2010 as the Lost Man Booker Pr...         J. G. Farrell                                     Troubles
13   1970 Awarded in 2010 as the Lost Man Booker Pr...           Nina Bawden                       The Birds on the Trees
14   1970 Awarded in 2010 as the Lost Man Booker Pr...       Shirley Hazzard                              The Bay of Noon
15   1970 Awarded in 2010 as the Lost Man Booker Pr...          Mary Renault                             Fire From Heaven
16   1970 Awarded in 2010 as the Lost Man Booker Pr...          Muriel Spark                            The Driver's Seat
17   1970 Awarded in 2010 as the Lost Man Booker Pr...         Patrick White                               The Vivisector

...

To convert it to CSV:

df.to_csv("data.csv", index=None)

This creates data.csv.

【Comments】:

This is a nice solution. I had asked for a solution that does not use pandas, but this is a valid one.