[Question Title]: Scrapy - Error in writing files
[Posted]: 2015-08-21 02:57:14
[Question Description]:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from Erowid.items import ErowidItem
import os

class ExperiencesSpider(CrawlSpider):
    name = "experiences"
    allowed_domains = ["www.erowid.org"]
    start_urls = ['https://www.erowid.org/experiences/exp_list.shtml']

    rules = [
        Rule(LinkExtractor(allow=('subs/exp_[a-zA-Z]+.shtml')), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=('subs/exp_[a-zA-Z]+.shtml')), follow=True),
    ]

    def parse_item(self, response):
        filename = str(response.url)[44:-6]
        selectors = response.css('table')
        if not os.path.exists('drugs-%s' % (filename)):  ## make the directory
            os.makedirs('drugs-%s' % (filename))
        list_of_experience = selectors.xpath('//table[@class="exp-cat-table"]/tr/td/a/@href').extract()

        for item in list_of_experience:
            request_url = str(item)
            Request(url="http://www.erowid.org" + request_url, callback = 'request_experience')
            def request_experience(self, response):
                selectors = response.css('div')
                for selector in selectors:
                    experience = ErowidItem()
                    experience['Author'] = selector.xpath('//div[@class="author"]/a/text()').extract()
                    experience['Title'] = selector.xpath('//div[@class="title"]/text()').extract()
                    experience['Substance'] = selector.xpath('//div[@class="substance"]/text()').extract()
                    experience['Text'] = selector.xpath("//div[@class = 'report-text-surround']/text()").extract()

                    title = str(experience['Substance']) + " "+ str(experience['Title'])
                    with open(os.path.join('drugs-%s' % (filename), title),"a") as fid:
                        fid.write(str(experience) + "\n")

I am trying to scrape data from Erowid with Scrapy, and I want to format the data so that for each substance I end up with a file named in the form "Substance - Experience Title".

My rules have the spider crawl a list of pages, including https://www.erowid.org/experiences/subs/exp_Acacia_confusa.shtml. From each of those pages I then collect all the links to individual experiences and submit them through a second request, intended to gather the data from each experience.

I intend to store the data in the format described above, "Substance - Experience Title". For each substance I want to create a directory containing the files from that page.

However, my code creates the directories but never writes the files I want.

What is causing this error?

[Question Comments]:

    Tags: python operating-system scrapy


    [Solution 1]:

    According to the documentation of scrapy.http.Request:

    callback (callable) – the function that will be called with the response of this request (once it's downloaded) as its first parameter.

    The callback should be a callable, not a string naming it, and you also need to define that function before you try to pass it as the callback of a Request object.

    Example:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    from Erowid.items import ErowidItem
    import os
    
    class ExperiencesSpider(CrawlSpider):
        name = "experiences"
        allowed_domains = ["www.erowid.org"]
        start_urls = ['https://www.erowid.org/experiences/exp_list.shtml']
    
        rules = [
            Rule(LinkExtractor(allow=('subs/exp_[a-zA-Z]+.shtml')), callback='parse_item', follow=True),
            Rule(LinkExtractor(allow=('subs/exp_[a-zA-Z]+.shtml')), follow=True),
        ]
    
        def request_experience(self, response):
            selectors = response.css('div')
            for selector in selectors:
                experience = ErowidItem()
                experience['Author'] = selector.xpath('//div[@class="author"]/a/text()').extract()
                experience['Title'] = selector.xpath('//div[@class="title"]/text()').extract()
                experience['Substance'] = selector.xpath('//div[@class="substance"]/text()').extract()
                experience['Text'] = selector.xpath("//div[@class = 'report-text-surround']/text()").extract()
    
                title = str(experience['Substance']) + " "+ str(experience['Title'])
                with open(os.path.join('drugs-%s' % (self.filename), title),"a") as fid:
                    fid.write(str(experience) + "\n")
    
        def parse_item(self, response):
            self.filename = str(response.url)[44:-6]
            selectors = response.css('table')
            if not os.path.exists('drugs-%s' % (self.filename)):  ## make the directory
                os.makedirs('drugs-%s' % (self.filename))
            list_of_experience = selectors.xpath('//table[@class="exp-cat-table"]/tr/td/a/@href').extract()
    
            for item in list_of_experience:
                request_url = str(item)
                Request(url="http://www.erowid.org" + request_url, callback = self.request_experience)
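One thing the comments below hint at: even with a callable callback, the Request built at the end of parse_item is constructed and then immediately discarded. Scrapy only schedules requests that a callback returns or yields, so request_experience is never invoked, which would explain why the directories appear but no files do. In the spider the loop would need something like `yield Request(url="http://www.erowid.org" + request_url, callback=self.request_experience)`. The underlying Python behaviour is the same as for any generator; a minimal stand-in sketch without Scrapy (the function names here are hypothetical):

```python
def build_silently(paths):
    """Builds each URL and immediately discards it; returns None."""
    for p in paths:
        "http://www.erowid.org" + p  # constructed, never returned or yielded


def build_yielding(paths):
    """Yields each URL so the caller actually receives it."""
    for p in paths:
        yield "http://www.erowid.org" + p


print(build_silently(["/experiences/exp.php?ID=1"]))        # None
print(list(build_yielding(["/experiences/exp.php?ID=1"])))  # one full URL
```

The broken version runs without error, just like the spider does, which is what makes this kind of bug silent.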
    

    [Discussion]:

    • I tried something like this before, but the problem is that the variable filename is then undefined inside request_experience, so it still doesn't work.
    • You can store the filename on self and access it from there. I updated my code to do that.
    • It still doesn't seem to work for me. All it does is create the directories; it doesn't produce the files.
    • Could you put a print or something inside the request_experience function and check whether it gets printed?
    • I added a print of the experience inside it, but nothing was printed.
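As an aside, the hard-coded slice `str(response.url)[44:-6]` used for the filename is fragile: it only works while the URL prefix is exactly 44 characters long. Deriving the substance name from the URL's path component is sturdier; a small sketch using only the standard library (the helper name is my own):

```python
import os
from urllib.parse import urlparse


def substance_from_url(url):
    """Extract 'Acacia_confusa' from .../subs/exp_Acacia_confusa.shtml."""
    name = os.path.basename(urlparse(url).path)  # exp_Acacia_confusa.shtml
    name = os.path.splitext(name)[0]             # exp_Acacia_confusa
    # drop the 'exp_' prefix if present
    return name[len("exp_"):] if name.startswith("exp_") else name


print(substance_from_url(
    "https://www.erowid.org/experiences/subs/exp_Acacia_confusa.shtml"))
# Acacia_confusa
```

This keeps working even if the scheme, host, or directory layout changes length.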