【Posted】: 2021-05-25 19:22:52
【Question】:
I am trying to scrape the first 3 columns of the table on this Wikipedia page using Beautiful Soup.
I implemented the solution found here.
import requests
import lxml
import pandas as pd
from bs4 import BeautifulSoup
#requesting the page
url = 'https://en.wikipedia.org/wiki/List_of_winners_and_shortlisted_authors_of_the_Booker_Prize'
page = requests.get(url).text
#parsing the page
soup = BeautifulSoup(page, "lxml")
#selecting the table that matches the given class
table = soup.find('table',class_="sortable wikitable")
df = pd.read_html(str(table))
df = pd.concat(df)
print(df)
df.to_csv("booker.csv", index = False)
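It worked like a charm and gave me the output I was looking for:
However, the solution above uses pandas.
I want to produce the same output without using pandas.
I referred to the solution here, but the output I got looks like this:
This is the code that generated "Output 2":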
import csv
import requests
import lxml
from bs4 import BeautifulSoup
#requesting the page
url = 'https://en.wikipedia.org/wiki/List_of_winners_and_shortlisted_authors_of_the_Booker_Prize'
page = requests.get(url).text
#parsing the page
soup = BeautifulSoup(page, "lxml")
#selecting the table that matches the given class
table = soup.find('table',class_="sortable wikitable")
with open('output.csv', 'w', newline="") as file:
    writer = csv.writer(file)
    writer.writerow(['Year','Author','Title'])
    for tr in table.find_all('tr'):
        try:
            td_1 = tr.find_all('td')[0].get_text(strip=True)
        except IndexError:
            td_1 = ""
        try:
            td_2 = tr.find_all('td')[1].get_text(strip=True)
        except IndexError:
            td_2 = ""
        try:
            td_3 = tr.find_all('td')[3].get_text(strip=True)
        except IndexError:
            td_3 = ""
        writer.writerow([td_1, td_2,td_3])
So my question is: how can I get the expected output without using pandas?
P.S.: I tried parsing the rows of the table like this:
import requests
import lxml
from bs4 import BeautifulSoup
#requesting the page
url = 'https://en.wikipedia.org/wiki/List_of_winners_and_shortlisted_authors_of_the_Booker_Prize'
page = requests.get(url).text
#parsing the page
soup = BeautifulSoup(page, "lxml")
#selecting the table that matches the given class
table = soup.find('table',class_="sortable wikitable")
rows = table.find_all('tr')
for row in rows:
    cell = row.td
    if cell is not None:
        print(cell.get_text())
        print(cell.next_sibling.next_sibling.get_text())
    else:
        print("heehee")
But the output I get looks like this:
heehee
1969
Barry England
Nicholas Mosley
Iris Murdoch
Muriel Spark
Gordon Williams
1970
A. L. Barker
Elizabeth Bowen
Iris Murdoch
William Trevor
Terence Wheeler
1970 Awarded in 2010 as the Lost Man Booker Prize[a]
Nina Bawden
Shirley Hazzard
Mary Renault
Muriel Spark
Patrick White
1971
Thomas Kilroy
Doris Lessing
Mordecai Richler
Derek Robinson
Elizabeth Taylor
1972
Susan Hill
Thomas Keneally
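The output above is consistent with the Year cell carrying a rowspan attribute: only the first row of each year has a Year td, so in the following rows index 0 is already the author. Below is a minimal sketch of one way to handle that without pandas, by repeating each spanning cell into the rows it covers before slicing the first three columns. It is only an illustration under assumptions: SAMPLE_HTML is a made-up miniature of the table (not the real page), expand_rows and write_first_columns are hypothetical helper names, and "html.parser" is used instead of "lxml" only so the example has no extra dependency.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical miniature of the table: the first Year cell spans two rows.
SAMPLE_HTML = """
<table class="sortable wikitable">
  <tr><th>Year</th><th>Author</th><th>Title</th><th>Genre</th></tr>
  <tr><td rowspan="2">1969</td><td>Author A</td><td>Title A</td><td>Novel</td></tr>
  <tr><td>Author B</td><td>Title B</td><td>Novel</td></tr>
  <tr><td>1970</td><td>Author C</td><td>Title C</td><td>Novel</td></tr>
</table>
"""


def expand_rows(table):
    """Return the table as a list of rows, repeating rowspan cells downwards."""
    rows = []
    pending = {}  # column index -> (rows still to fill, cell text)
    for tr in table.find_all("tr"):
        row = []
        cells = iter(tr.find_all(["th", "td"]))
        cell = next(cells, None)
        while cell is not None or len(row) in pending:
            col = len(row)
            if col in pending:
                # This column is still covered by a cell from an earlier row.
                remaining, text = pending[col]
                row.append(text)
                if remaining == 1:
                    del pending[col]
                else:
                    pending[col] = (remaining - 1, text)
                continue
            text = cell.get_text(" ", strip=True)
            rowspan = int(cell.get("rowspan", 1))
            colspan = int(cell.get("colspan", 1))
            for _ in range(colspan):
                if rowspan > 1:
                    pending[len(row)] = (rowspan - 1, text)
                row.append(text)
            cell = next(cells, None)
        rows.append(row)
    return rows


def write_first_columns(rows, file, n=3):
    """Write the first n columns of every row as CSV."""
    writer = csv.writer(file)
    for row in rows:
        writer.writerow(row[:n])


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="sortable wikitable")
rows = expand_rows(table)

# An in-memory buffer stands in for open('output.csv', 'w', newline="").
buffer = io.StringIO()
write_first_columns(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To try this on the real page, the table found by soup.find in the code above could be passed to expand_rows in place of the sample. The sketch does not handle nested tables or malformed rowspan values, so those would need extra care.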
【Comments】:
Tags: python-3.x web-scraping beautifulsoup