[Posted]: 2017-09-14 07:58:48
[Problem description]:
I'm getting an unwanted blank row between every line of scrapy output in the generated CSV output file.
I've migrated from Python 2 to Python 3, and I'm on Windows 10, so I'm adapting my scrapy project for Python 3. My current (and, for now, only) problem is that when I write the scrapy output to a CSV file, a blank row appears between every pair of rows. This has been raised in several posts here (it's Windows-related), but I haven't been able to find a working solution.
As it happens, I've also added some code to my pipelines.py file to make sure the CSV output follows a given column order rather than a random one. Because of that, I can run this code with a plain scrapy crawl charleschurch rather than scrapy crawl charleschurch -o charleschurch2017xxxx.csv.
Does anyone know how to skip/omit this blank row in the CSV output?
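For background, the blank rows can be reproduced with the stdlib alone: csv.writer already terminates each row with '\r\n', and a Windows text-mode stream translates the '\n' into '\r\n' again, producing '\r\r\n', which viewers show as an empty line. A minimal sketch (io.StringIO's newline argument stands in for the file's newline translation; the column values are made up):

```python
import csv
import io

def make_csv(newline):
    # io.StringIO's newline argument mimics a file's newline translation:
    # newline='\r\n' behaves like Windows text mode, newline='' disables it.
    buf = io.StringIO(newline=newline)
    writer = csv.writer(buf)
    writer.writerow(["plotid", "plotprice"])
    writer.writerow(["76", "199950"])
    return buf.getvalue()

print(repr(make_csv("\r\n")))  # rows end in '\r\r\n' -> blank lines in Excel/Notepad
print(repr(make_csv("")))      # rows end in a clean '\r\n'
```

This is why the usual advice for the csv module is to open the output file in text mode with newline="".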
My pipelines.py code is below (I probably don't need the import csv line, but I suspect I may for the eventual answer):
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import csv
from scrapy import signals
from scrapy.exporters import CsvItemExporter

class CSVPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ["plotid", "plotprice", "plotname", "name", "address"]
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
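One possible workaround (a sketch only, not tested against this exact project) is to sidestep CsvItemExporter entirely and write rows with the stdlib csv module, opening the file in text mode with newline='' as the csv docs recommend. This keeps the fixed column order from the pipeline above; the class name is hypothetical, and open_spider/close_spider are the standard Scrapy pipeline hooks:

```python
import csv

class StdlibCsvPipeline(object):
    """Hypothetical drop-in for CSVPipeline above: csv.DictWriter plus
    newline='' keeps the column order and avoids doubled row endings."""

    FIELDS = ["plotid", "plotprice", "plotname", "name", "address"]

    def open_spider(self, spider):
        self.file = open('%s_items.csv' % spider.name, 'w',
                         newline='', encoding='utf-8')
        self.writer = csv.DictWriter(self.file, fieldnames=self.FIELDS,
                                     extrasaction='ignore')
        self.writer.writeheader()

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Scrapy Items convert cleanly with dict(); plain dicts pass through.
        self.writer.writerow(dict(item))
        return item
```

If you go this route, register StdlibCsvPipeline in ITEM_PIPELINES in place of CSVPipeline.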
I added this line to my settings.py file (not sure of the relevance of the 300):
ITEM_PIPELINES = {'CharlesChurch.pipelines.CSVPipeline': 300 }
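For what it's worth, the 300 is just the pipeline's order value: Scrapy runs the pipelines listed in ITEM_PIPELINES in ascending order of these integers, conventionally chosen in the 0-1000 range, so 300 only matters relative to other pipelines. For example (the second pipeline name is hypothetical):

```python
ITEM_PIPELINES = {
    'CharlesChurch.pipelines.CSVPipeline': 300,  # lower values run earlier
    # 'CharlesChurch.pipelines.SomeLaterPipeline': 800,  # would run after
}
```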
My scrapy spider code is below:
import scrapy
from urllib.parse import urljoin
from CharlesChurch.items import CharleschurchItem

class charleschurchSpider(scrapy.Spider):
    name = "charleschurch"
    allowed_domains = ["charleschurch.com"]
    start_urls = ["https://www.charleschurch.com/county-durham_willington/the-ridings-1111"]

    def parse(self, response):
        for sel in response.xpath('//*[@id="aspnetForm"]/div[4]'):
            item = CharleschurchItem()
            item['name'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/span[1]/b/text()').extract()
            item['address'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/div/*[@itemprop="postalCode"]/text()').extract()
            plotnames = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/text()').extract()
            plotnames = [plotname.strip() for plotname in plotnames]
            plotids = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/@href').extract()
            plotids = [plotid.strip() for plotid in plotids]
            plotprices = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__price"]/text()').extract()
            plotprices = [plotprice.strip() for plotprice in plotprices]
            result = zip(plotnames, plotids, plotprices)
            for plotname, plotid, plotprice in result:
                item['plotname'] = plotname
                item['plotid'] = plotid
                item['plotprice'] = plotprice
                yield item
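The strip-and-zip pattern at the end of parse can be sanity-checked on its own, independently of the XPath selectors, with toy data (all values below are invented stand-ins for the .extract() results):

```python
# Toy stand-ins for the three extracted lists (values are invented)
plotnames = [" Plot A ", " Plot B "]
plotids = ["/plot/76", "/plot/77"]
plotprices = ["199,950", "204,950"]

plotnames = [n.strip() for n in plotnames]
rows = list(zip(plotnames, plotids, plotprices))
print(rows[0])  # ('Plot A', '/plot/76', '199,950')
```

zip stops at the shortest list, so if one selector matches fewer nodes than the others, rows are silently dropped rather than misaligned.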
[Comments]:
-
Could you try changing the line
file = open('%s_items.csv' % spider.name, 'w+b') to file = open('%s_items.csv' % spider.name, 'w', newline="")? -
@Jean-FrançoisFabre When I try that I get the error
TypeError: write() argument must be str, not bytes. -
OK, then try
file = open('%s_items.csv' % spider.name, 'wb', newline="") -
@Jean-FrançoisFabre That gives the error
ValueError: binary mode doesn't take a newline argument
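Both errors from the comments can be reproduced directly, which shows why neither suggestion can work as-is: CsvItemExporter writes bytes, so it needs a binary file, but open() only accepts the newline argument in text mode:

```python
import os

# newline= is a text-mode-only argument; binary mode rejects it outright.
try:
    open(os.devnull, 'wb', newline='')
except ValueError as exc:
    print(exc)  # binary mode doesn't take a newline argument

# Conversely, writing bytes to a text-mode file raises the first error.
try:
    with open(os.devnull, 'w') as f:
        f.write(b'plotid,plotprice\r\n')
except TypeError as exc:
    print(exc)  # write() argument must be str, not bytes
```

So any fix has to either keep the file binary and change how the exporter's newlines are produced, or switch to a text-mode writer (e.g. the stdlib csv module with newline='').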
Tags: python csv web-scraping scrapy