【Question Title】: Web scraping: output CSV is messed up
【Posted】: 2017-05-05 09:30:52
【Question Description】:

This code is intended to loop over all the result pages, then over the results table on each page, and scrape all the data from the table along with some information stored outside the table.

However, the resulting CSV file does not appear to have any sensible organization: each row has different categories of information in different columns. What I am after is for each row to contain all the categories of information defined (date, party, start date, end date, electoral district, registered association, whether or not the candidate was elected, name of candidate, address, and financial agent). Some of this data is stored in the table on each page, while the rest (date, party, district, registered association) is stored outside the table and needs to be associated with each candidate in each table row on every page. Additionally, there does not seem to be any output at all for 'elected', 'address', or 'financial agent', and I am not sure where I am going wrong.

I would be grateful for any help in figuring out how to fix my code to achieve this output. The code follows:

from bs4 import BeautifulSoup
import requests
import re
import csv

url = "http://www.elections.ca/WPAPPS/WPR/EN/NC?province=-1&distyear=2013&district=-1&party=-1&pageno={}&totalpages=55&totalcount=1368&secondaryaction=prev25"

rows = []

for i in range(1, 56):
    print(i)
    r  = requests.get(url.format(i))
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    links = []

    for link in soup.find_all('a', href=re.compile('selectedid=')):
        links.append("http://www.elections.ca" + link.get('href'))

    for link in links:
        r  = requests.get(link)
        data = r.text
        cat = BeautifulSoup(data, "html.parser")
        header = cat.find_all('span')
        tables = cat.find_all("table")[0].find_all("td")        

        rows.append({
            #"date": 
            header[2].contents[0],
            #"party": 
            re.sub("[\n\r/]", "", cat.find("legend").contents[2]).strip(),
            #"start_date": 
            header[3].contents[0],
            #"end_date": 
            header[5].contents[0],
            #"electoral district": 
            re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip(),
            #"registered association": 
            re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip().encode('latin-1'),
            #"elected": 
            re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="elected/1")[0].contents[0]).strip(),
            #"name": 
            re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="name/1")[0].contents[0]).strip(),
            #"address": 
            re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="address/1")[0].contents[0]).strip(),
            #"financial_agent": 
            re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="fa/1")[0].contents[0]).strip()
        })

with open('scrapeOutput.csv', 'w') as f_output:
   csv_output = csv.writer(f_output)
   csv_output.writerows(rows)

【Question Discussion】:

  • Do you want to scrape all the entries? Or do you want to filter by one of the search criteria (province/territory, redistribution year, electoral district, political party, association keyword, contestant keyword, contest date)?
  • Which delimiter are you using? Also, if some of your fields contain the delimiter, are you using a quotechar?
  • @IvanChaer I want to scrape everything without any filtering, which my code so far basically does; it's just a matter of getting all the information stored on each page, plus the CSV output issue.
  • Couldn't we scrape directly from the detail pages (elections.ca/WPAPPS/WPR/EN/NC/…)? That way we avoid scraping a single item from two different URLs.
  • Thanks, everyone! The output table looks great. I have created a second question about how to rewrite my code so that it scrapes the names of all contestants in each table, not just the first. If you have any ideas, it's here.

Tags: python html python-3.x csv web-scraping


【Solution 1】:

I think your dictionaries are a bit off: you never assign keys, and in Python a brace literal containing bare values is actually a *set*, whose iteration order is arbitrary. That is exactly why your columns come out scrambled. With the csv library's DictWriter, once keys are assigned you can print a clean CSV without any extra juggling.
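As a quick standalone illustration (not part of the scraper itself): a brace literal with bare values is a set, so the column order is lost, while a dict with keys plus DictWriter pins each value to a named column.

```python
import csv
import io

# A brace literal with bare values is a *set*: unordered, so columns scramble.
row_as_set = {"2016-12-08", "Green Party", "Ryan Zedic"}
print(type(row_as_set).__name__)  # set

# With keys it is a dict, and DictWriter keeps every value under its column.
rows = [{"date": "2016-12-08", "party": "Green Party", "name": "Ryan Zedic"}]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["date", "party", "name"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```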

So, assign the keys:

rows.append({
        "date": 
        header[2].contents[0],
        "party": 
        re.sub("[\n\r/]", "", cat.find("legend").contents[2]).strip(),
        "start_date": 
        header[3].contents[0],
        "end_date": 
        header[5].contents[0],
        "electoral district": 
        re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip(),
        "registered association": 
        re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip().encode('latin-1'),
        "elected": 
        re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="elected/1")[0].contents[0]).strip(),
        "name": 
        re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="name/1")[0].contents[0]).strip(),
        "address": 
        re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="address/1")[0].contents[0]).strip(),
        "financial_agent": 
        re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="fa/1")[0].contents[0]).strip()
    })

Then write your CSV with DictWriter:

with open('scrapeOutput.csv', 'w') as f_output:
    csv_output = csv.DictWriter(f_output, rows[0].keys())
    csv_output.writeheader() # Write header to understand the csv
    csv_output.writerows(rows)

I tested this and it works, but be careful: some of your fields, like address or elected, are empty :)
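Those empty fields can also hide a latent IndexError: when a page has no matching `<td>` at all, indexing `[0]` raises. A small defensive wrapper (hypothetical `cell_text`, my naming, not part of the original code) would turn a missing cell into an empty string instead; a minimal sketch:

```python
import re

def cell_text(cells):
    """Return the cleaned text of the first cell in `cells`, or '' when
    the list is empty or the cell has no content (hypothetical helper)."""
    if not cells or not cells[0].contents:
        return ""
    return re.sub(r"[\n\r/]", "", str(cells[0].contents[0])).strip()
```

Each lookup then becomes, for example, `cell_text(cat.find_all("table")[0].find_all("td", headers="address/1"))`.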

Cheers!

【Discussion】:

    【Solution 2】:

    I suggest writing each row to the output CSV file as you go, rather than waiting until the end. Also, it is better to hold the data in a list rather than a dictionary, as a list preserves the column order.

    from bs4 import BeautifulSoup
    import requests
    import re
    import csv
    
    
    url = "http://www.elections.ca/WPAPPS/WPR/EN/NC?province=-1&distyear=2013&district=-1&party=-1&pageno={}&totalpages=55&totalcount=1368&secondaryaction=prev25"
    
    with open('scrapeOutput.csv', 'w', newline='') as f_output:
        csv_output = csv.writer(f_output)
    
        for i in range(1, 56):
            print(i)
            r  = requests.get(url.format(i))
            data = r.text
            soup = BeautifulSoup(data, "html.parser")
            links = []
    
            for link in soup.find_all('a', href=re.compile('selectedid=')):
                links.append("http://www.elections.ca" + link.get('href'))
    
            for link in links:
                r  = requests.get(link)
                data = r.text
                cat = BeautifulSoup(data, "html.parser")
                header = cat.find_all('span')
                tables = cat.find_all("table")[0].find_all("td")        
    
                row = [
                    #"date": 
                    header[2].contents[0],
                    #"party": 
                    re.sub("[\n\r/]", "", cat.find("legend").contents[2]).strip(),
                    #"start_date": 
                    header[3].contents[0],
                    #"end_date": 
                    header[5].contents[0],
                    #"electoral district": 
                    re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip(),
                    #"registered association": 
                    re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip().encode('latin-1'),
                    #"elected": 
                    re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="elected/1")[0].contents[0]).strip(),
                    #"name": 
                    re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="name/1")[0].contents[0]).strip(),
                    #"address": 
                    re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="address/1")[0].contents[0]).strip(),
                    #"financial_agent": 
                    re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="fa/1")[0].contents[0]).strip()]
    
                csv_output.writerow(row)    
                print(row)
    

    The resulting CSV starts like this:

    "December 08, 2016",Green Party,"September 21, 2016","December 08, 2016",Calgary Midnapore,b'Calgary Midnapore',,Ryan Zedic,,
    "November 29, 2016",NDP-New Democratic Party,"August 24, 2016","November 29, 2016",Ottawa--Vanier,b'Ottawa--Vanier',,Emilie Taman,,
    "September 28, 2016",Green Party,"September 04, 2016","September 28, 2016",Medicine Hat--Cardston--Warner,b'Medicine Hat--Cardston--Warner',,Kelly Dawson,,
    

    【Discussion】:

      【Solution 3】:

      If you want to crawl, you may want to look at CrawlSpider, from scrapy. I also use lxml.html simply because it offers more flexibility.

      To install these libraries, you can use:

      pip install scrapy

      pip install lxml

      To scaffold a basic scrapy project, you can use the command:

      scrapy startproject elections
      

      Then add the spider and the items:

      elections/spiders/spider.py

      from scrapy.spiders import CrawlSpider, Rule
      from elections.items import ElectionsItem
      from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
      from scrapy.selector import Selector
      
      from lxml import html
      
      class ElectionsSpider(CrawlSpider):
          name = "elections"
          allowed_domains = ["elections.ca"]
          start_urls = ["http://www.elections.ca/WPAPPS/WPR/EN/NC/Details?province=-1&distyear=2013&district=-1&party=-1&pageno=1&totalpages=55&totalcount=1372&viewall=1"]
      
          rules = (
      
              Rule(LxmlLinkExtractor(
                      allow = ('http://www.elections.ca/WPAPPS/WPR/EN/NC/Details.*'),
                  ),
                  callback='parse_item',
                  follow=True
              ),
      
      
            )
      
          def unindent(self, string):
              # On Python 3, strip each line of the (str) input; encode() here would yield bytes
              return ''.join(map(str.strip, string.splitlines(True)))
      
          def parse_item(self, response):
      
              item = ElectionsItem()
      
              original_html = Selector(response).extract()
      
              lxml_obj = html.fromstring(original_html)
      
              for entry in lxml_obj.xpath('.//fieldset[contains(@class,"wpr-detailgroup")]'):
      
      
                  date = entry.xpath('.//legend[contains(@class,"wpr-ltitle")]/span[contains(@class,"date")]')
                  if date:
                      item['date'] = self.unindent(date[0].text.strip())
                  party = entry.xpath('.//legend[contains(@class,"wpr-ltitle")]')
                  if party:
                      item['party'] = self.unindent(party[0].text.strip())
                  start_date = entry.xpath('.//div[contains(@class,"group")]/span[contains(@class,"date")][1]')
                  if start_date:
                      item['start_date'] = self.unindent(start_date[0].text.strip())
                  end_date = entry.xpath('.//div[contains(@class,"group")]/span[contains(@class,"date")][2]')
                  if end_date:
                      item['end_date'] = self.unindent(end_date[0].text.strip())
                  electoral_district = entry.xpath('.//div[contains(@class,"wpr-title")][contains(text(),"Electoral district:")]')
                  if electoral_district:
                      item['electoral_district'] = self.unindent(electoral_district[0].tail.strip())
                  registered_association = entry.xpath('.//div[contains(@class,"wpr-title")][contains(text(),"Registered association:")]')
                  if registered_association:
                      item['registered_association'] = self.unindent(registered_association[0].tail.strip())
      
                  for candidate in entry.xpath('.//table[contains(@class, "wpr-datatable")]//tr[not(@class)]'):
      
                      item['elected'] = len(candidate.xpath('.//img[contains(@alt, "contestant won this nomination contest")]'))
                      candidate_name = candidate.xpath('.//td[contains(@headers,"name")]')
                      if candidate_name:
                          item['candidate_name'] = self.unindent(candidate_name[0].text.strip())
                      item['address'] = self.unindent(candidate.xpath('.//td[contains(@headers,"address")]')[0].text_content().strip())
                      item['financial_agent'] = self.unindent(candidate.xpath('.//td[contains(@headers,"fa")]')[0].text_content().strip())
      
                      yield item
      

      elections/items.py

      from scrapy.item import Item, Field
      
      class ElectionsItem(Item):
      
          date = Field()
          party = Field()
          start_date = Field()
          end_date = Field()
          electoral_district = Field()
          registered_association = Field()
          elected = Field()
          candidate_name = Field()
          address = Field()
          financial_agent = Field()
      

      elections/settings.py

      BOT_NAME = 'elections'
      
      SPIDER_MODULES = ['elections.spiders']
      NEWSPIDER_MODULE = 'elections.spiders'
      
      ITEM_PIPELINES = {
         'elections.pipelines.ElectionsPipeline': 300,
      }
      

      elections/pipelines.py

      from scrapy import signals
      from scrapy.xlib.pydispatch import dispatcher
      from scrapy.exporters import CsvItemExporter
      
      class ElectionsPipeline(object):  # name must match ITEM_PIPELINES in settings.py
      
          def __init__(self):
              dispatcher.connect(self.spider_opened, signals.spider_opened)
              dispatcher.connect(self.spider_closed, signals.spider_closed)
              self.files = {}
      
          def spider_opened(self, spider):
              file = open('%s_ads.csv' % spider.name, 'w+b')
              self.files[spider] = file
              self.exporter = CsvItemExporter(file)
              self.exporter.start_exporting()
      
          def spider_closed(self, spider):
              self.exporter.finish_exporting()
              file = self.files.pop(spider)
              file.close()
      
          def process_item(self, item, spider):
              self.exporter.export_item(item)
              return item
      

      You can run the spider with the command:

      scrapy runspider elections/spiders/spider.py
      

      from the root of the project.

      It should create an elections.csv in the root of the project, looking like this:

      financial_agent,end_date,candidate_name,registered_association,electoral_district,elected,address,date,party,start_date
      "Jan BalcaThornhill, OntarioL4J 1V9","September 09, 2015",Leslyn Lewis,,Scarborough--Rouge Park,1,"Markham, OntarioL6B 0K9","September 09, 2015",,"September 07, 2015"
      "Mark HicksPasadena, Newfoundland and LabradorA0L 1K0","September 08, 2015",Roy Whalen,,Long Range Mountains,1,"Deer Lake, Newfoundland and LabradorA8A 3H6","September 08, 2015",,"August 21, 2015"
      ,"September 08, 2015",Wayne Ruth,,Long Range Mountains,0,"Kippens, Newfoundland and LabradorA2N 3B8","September 08, 2015",,"August 21, 2015"
      ,"September 08, 2015",Mark Krol,,St. John's South--Mount Pearl,1,"Woodbridge, OntarioL4L 1Y5","September 08, 2015",,"August 24, 2015"
      ,"September 08, 2015",William MacDonald Alexander,,Bow River,1,"Calgary, AlbertaT2V 0M1","September 08, 2015",,"September 04, 2015"
      (...)
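Note that the column order in this export is arbitrary, because Items export like dicts. If a fixed order matters, Scrapy's `FEED_EXPORT_FIELDS` setting pins it; a sketch of what could be added to settings.py (the field names below assume the item definition above):

```python
# elections/settings.py -- optional: pin the CSV column order
FEED_EXPORT_FIELDS = [
    'date', 'party', 'start_date', 'end_date',
    'electoral_district', 'registered_association',
    'elected', 'candidate_name', 'address', 'financial_agent',
]
```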
      

      【Discussion】:
