【Posted】:2016-12-21 05:27:41
【Problem Description】:
I appreciate all the questions and answers here on Python/BeautifulSoup/scraping, but I haven't seen much that covers this situation and I'm stuck. Currently, my code successfully iterates through the pages of search results and creates a CSV document, but when it comes to each individual table it only copies the first row before moving on to the next results page.
For example, this page. Currently my output looks like this:
Brian Benoit,25-Jun-16,Conservative,12-May-16,25-Jun-16,Medicine Hat--Cardston--Warner,b'Medicine Hat--Cardston--Warner',Nikolai Punko
It should look like this:
Brian Benoit,25-Jun-16,Conservative,12-May-16,25-Jun-16,Medicine Hat--Cardston--Warner,b'Medicine Hat--Cardston--Warner',Nikolai Punko
Paul Hinman,25-Jun-16,Conservative,12-May-16,25-Jun-16,Medicine Hat--Cardston--Warner,b'Welling, Alberta',Robert B. Barfuss
Michael Jones,25-Jun-16,Conservative,12-May-16,25-Jun-16,Medicine Hat--Cardston--Warner,b'Raymond, Alberta',Dawn M. Hamon
(And so on, for every row in the table.)
My question is: how do I get it to loop through and scrape every row before moving on to the next results page? Thanks.
Here is my code:
from bs4 import BeautifulSoup
import requests
import re
import csv
url = "http://www.elections.ca/WPAPPS/WPR/EN/NC?province=-1&distyear=2013&district=-1&party=-1&pageno={}&totalpages=55&totalcount=1368&secondaryaction=prev25"
with open('scrapeAllRows.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    for i in range(1, 56):
        print(i)
        r = requests.get(url.format(i))
        data = r.text
        soup = BeautifulSoup(data, "html.parser")
        links = []
        for link in soup.find_all('a', href=re.compile('selectedid=')):
            links.append("http://www.elections.ca" + link.get('href'))
        for link in links:
            r = requests.get(link)
            data = r.text
            cat = BeautifulSoup(data, "html.parser")
            header = cat.find_all('span')
            tables = cat.find_all("table")[0].find_all("td")
            row = [
                # "name":
                re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="name/1")[0].contents[0]).strip(),
                # "date":
                header[2].contents[0],
                # "party":
                re.sub("[\n\r/]", "", cat.find("legend").contents[2]).strip(),
                # "start_date":
                header[3].contents[0],
                # "end_date":
                header[5].contents[0],
                # "electoral district":
                re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip(),
                # "registered association":
                re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip().encode('latin-1'),
                # "elected":
                re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="elected/1")[0].contents[0]).strip(),
                # "address":
                re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="address/1")[0].contents[0]).strip(),
                # "financial_agent":
                re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="fa/1")[0].contents[0]).strip()]
            csv_output.writerow(row)
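For reference, here is a minimal sketch of the kind of per-row loop being asked about. It assumes the first table on each candidate page holds one data row per <tr> with the cells in a fixed order, and that header rows use <th> cells (so they contain no <td> and get skipped); the function name scrape_candidate_table and the page_fields parameter are hypothetical, not part of the original code.

from bs4 import BeautifulSoup
import requests

def scrape_candidate_table(link, csv_output, page_fields):
    """Write one CSV row per data row in the first table at `link`."""
    cat = BeautifulSoup(requests.get(link).text, "html.parser")
    for tr in cat.find_all("table")[0].find_all("tr"):  # every row, not just the first
        cells = tr.find_all("td")
        if not cells:  # assumed: header rows use <th> only, so they have no <td>
            continue
        # per-page values (date, party, district, ...) repeat on every row
        row = page_fields + [td.get_text(strip=True) for td in cells]
        csv_output.writerow(row)

Inside the existing for link in links: loop, something like this would replace the single hardcoded row, with page_fields built once per page from the header and legend lookups above. Iterating over <tr> elements also sidesteps the row-numbered headers attributes (name/1, elected/1, ...), which is why hardcoding them returns only the first row.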
【Discussion】:
Tags: python python-3.x csv web-scraping beautifulsoup