【Question Title】: TypeError while creating one csv for one url with scrapy
【Posted】: 2017-05-18 09:22:28
【Question Description】:

Here is my web crawler, which yields items containing a title, URL and content name:

import scrapy
from ..items import ContentsPageSFBItem

class BasicSpider(scrapy.Spider):
    name = "contentspage_sfb"
    #allowed_domains = ["web"]
    start_urls = [
        'https://www.safaribooksonline.com/library/view/shell-programming-in/9780134496696/',
        'https://www.safaribooksonline.com/library/view/cisa-certified-information/9780134677453/'
    ]

    def parse(self, response):
        item = ContentsPageSFBItem()

        #from scrapy.shell import inspect_response
        #inspect_response(response, self)

        content_items = response.xpath('//ol[@class="detail-toc"]//a/text()').extract()

        for content_item in content_items:
            item['content_item'] = content_item
            item['full_url'] = response.url
            item['title'] = response.xpath('//title[1]/text()').extract()

            yield item

The code runs fine. However, given the nature of the crawl, a large amount of data is generated. My goal is to split the output by URL, so that the results for each parsed URL are stored in their own CSV file. I am using the following pipeline code:

from scrapy import signals
from scrapy.contrib.exporter import CsvItemExporter


class ContentspageSfbPipeline(object):
    def __init__(self):
        self.files = {}

    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline


    def spider_opened(self, contentspage_sfb):
        file = open('results/%s.csv' % contentspage_sfb.url, 'w+b')
        self.files[contentspage_sfb] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ['item']
        self.exporter.start_exporting()


    def spider_closed(self, contentspage_sfb):
        self.exporter.finish_exporting()
        file = self.files.pop(contentspage_sfb)
        file.close()


    def process_item(self, item, contentspage_sfb):
        self.exporter.export_item(item)
        return item

However, I get an error:

TypeError: unbound method from_crawler() must be called with ContentspageSfbPipeline instance as first argument (got Crawler instance instead)

As suggested, I added the decorator before the from_crawler function. However, now I get an attribute error:

Traceback (most recent call last):
  File "/home/eadaradhiraj/program_files/venv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/eadaradhiraj/program_files/pycharm_projects/javascriptlibraries/javascriptlibraries/pipelines.py", line 39, in process_item
    self.exporter.export_item(item)
AttributeError: 'ContentspageSfbPipeline' object has no attribute 'exporter'

My code is based on How to split output from a list of urls in scrapy.

【Question Comments】:

    Tags: python csv scrapy web-crawler


    【Solution 1】:

    You are missing the @classmethod decorator on your from_crawler() method.

    See the related Meaning of @classmethod and @staticmethod for beginner? for background on class methods.
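    To see why the decorator matters, here is a minimal plain-Python sketch (no Scrapy required; the class name is illustrative). Scrapy calls from_crawler() on the pipeline class itself, so Python must pass the class as the first argument, which only @classmethod arranges; without it, Python 2 raises exactly the "unbound method" TypeError from the question.

```python
class WithDecorator(object):
    @classmethod
    def from_crawler(cls, crawler):
        # cls is bound to the class itself, so an instance can be built here,
        # mirroring how Scrapy instantiates pipelines.
        return cls()

pipeline = WithDecorator.from_crawler(crawler=None)
print(isinstance(pipeline, WithDecorator))  # True
```

    Without the decorator, Python 2 treats from_crawler as an unbound method and rejects the call on the class; on Python 3 the call would silently bind the crawler to `self` instead, which is equally wrong.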

    Also, you don't need to connect any signals in a pipeline at all. According to the official docs, a pipeline can simply define open_spider and close_spider methods.
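    The hook-based pipeline can be sketched as follows. To keep the sketch runnable standalone, it substitutes the standard library's csv.DictWriter and an in-memory buffer for Scrapy's CsvItemExporter and the on-disk file; the hook names and call pattern match the pipeline interface, while FakeSpider and the field names are illustrative assumptions.

```python
import csv
import io

class ContentspageSfbPipeline(object):
    """Sketch: open_spider/close_spider replace the manual signal wiring,
    so no from_crawler (and no decorator) is needed at all."""

    fields_to_export = ['title', 'full_url', 'content_item']

    def open_spider(self, spider):
        # Real Scrapy code would open('results/%s.csv' % spider.name, 'wb')
        # and wrap it in CsvItemExporter; a StringIO keeps this runnable.
        self.file = io.StringIO()
        self.writer = csv.DictWriter(self.file, fieldnames=self.fields_to_export)
        self.writer.writeheader()

    def process_item(self, item, spider):
        self.writer.writerow(item)
        return item

    def close_spider(self, spider):
        # Real code: self.exporter.finish_exporting(); self.file.close()
        self.csv_output = self.file.getvalue()

# Minimal driver standing in for the Scrapy engine (illustrative only).
class FakeSpider(object):
    name = 'contentspage_sfb'

pipeline = ContentspageSfbPipeline()
spider = FakeSpider()
pipeline.open_spider(spider)
pipeline.process_item({'title': 'Shell Programming in Unix',
                       'full_url': 'https://example.com/book',
                       'content_item': 'Chapter 1'}, spider)
pipeline.close_spider(spider)
print(pipeline.csv_output.splitlines()[0])  # title,full_url,content_item
```

    Note that this gives one file per spider run; to get one file per URL as the question intends, process_item would additionally need to keep a dict of writers keyed by item['full_url'] and open a new file the first time each URL is seen.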

    【Discussion】:
