【Question Title】: Removing brackets from Scrapy JSON output
【Posted】: 2016-09-18 18:57:21
【Question】:

The final step of my code loads the data from my Scrapy pipeline into a pandas DataFrame.

A sample result looks like this:

{"Message": ["\r\n", " Profanity directed toward staff.  ", "\r\n Profanity directed toward warden ", "  \r\n  "], "Desc": "https://www.tdcj.state.tx.us/death_row/dr_info/nicholsjoseph.jpg"}

When loaded into the DataFrame, the [] brackets are still there, along with the "\r\n" characters. A quick search suggests this comes from how the text is extracted and is common when scraping.

Can anyone suggest a Pythonic way to get cleaner output?

I'm expecting something like this:

{"Message": "Profanity directed toward staff. Profanity directed toward warden", "Desc": "https://www.tdcj.state.tx.us/death_row/dr_info/nicholsjoseph.jpg"}
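As the first comment below suggests, one straightforward fix is to post-process each JSON record in plain Python: parse it with json.loads, strip each fragment in the Message list, drop the whitespace-only pieces, and join the rest. A minimal sketch, using the sample record from above:

```python
import json

def clean_message(parts):
    """Strip whitespace from each fragment, drop empties, join with a space.
    Plain strings (e.g. "No last statement") pass through unchanged."""
    if isinstance(parts, list):
        stripped = [p.strip() for p in parts]
        return " ".join(p for p in stripped if p)
    return parts

line = ('{"Message": ["\\r\\n", " Profanity directed toward staff.  ", '
        '"\\r\\n Profanity directed toward warden ", "  \\r\\n  "], '
        '"Desc": "https://www.tdcj.state.tx.us/death_row/dr_info/nicholsjoseph.jpg"}')

record = json.loads(line)
record["Message"] = clean_message(record["Message"])
print(record["Message"])
# Profanity directed toward staff. Profanity directed toward warden
```

This leaves the scraping untouched and only cleans the feed output, which is useful if the data has already been exported.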

Edit: adding the item class and spider:

items.py

from scrapy.item import Item, Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join


class DeathItem(Item):

    firstName = Field()
    lastName = Field()
    Age = Field()
    Date = Field()
    Race = Field()
    County = Field()
    Message = Field(
        input_processor=MapCompose(unicode.strip),
        output_processor=Join())
    Desc = Field()
    Mid = Field()

spider.py

from urlparse import urljoin
import scrapy
from texasdeath.items import DeathItem


class DeathSpider(scrapy.Spider):
    name = "death"
    allowed_domains = ["tdcj.state.tx.us"]
    start_urls = [
        "https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
    ]
    def parse(self, response):
        sites = response.xpath('//table/tbody/tr')
        for site in sites:
            item = DeathItem()
            item['Mid'] = site.xpath('td[1]/text()').extract()
            item['firstName'] = site.xpath('td[5]/text()').extract()
            item['lastName'] = site.xpath('td[4]/text()').extract()
            item['Age'] = site.xpath('td[7]/text()').extract()
            item['Date'] = site.xpath('td[8]/text()').extract()
            item['Race'] = site.xpath('td[9]/text()').extract()
            item['County'] = site.xpath('td[10]/text()').extract()

            url = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())
            urlLast = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())

            if url.endswith(("jpg","no_info_available.html")):
                item['Desc'] = url
                if urlLast.endswith("no_last_statement.html"):
                    item['Message'] = "No last statement"
                    yield item
                else:
                    request = scrapy.Request(urlLast, meta={"item" : item}, callback =self.parse_details2)
                    yield request
            else:        
                request = scrapy.Request(url, meta={"item": item,"urlLast" : urlLast}, callback=self.parse_details)
                yield request

    def parse_details(self, response):
        item = response.meta["item"]
        urlLast = response.meta["urlLast"]
        item['Desc'] = response.xpath("//*[@id='body']/p[3]/text()").extract()
        if urlLast.endswith("no_last_statement.html"):
            item["Message"] = "No last statement"
            return item
        else:
            request = scrapy.Request(urlLast, meta={"item": item}, callback=self.parse_details2)
            return request

    def parse_details2(self, response):
        item = response.meta["item"]
        item['Message'] = response.xpath("//div/p[contains(., 'Last Statement:')]/following-sibling::node()/descendant-or-self::text()").extract()
        return item

I basically want the output loaded into my pandas DataFrame as plain text, with all the unwanted characters, such as [], \r\n and \t, left out.

Basically so the data can be presented on the web.
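The same cleanup can also be applied at the DataFrame level. A hedged sketch, assuming the spider writes a JSON-lines feed (simulated inline here) and that list-valued Message cells should be joined while plain strings like "No last statement" pass through unchanged:

```python
import json
import pandas as pd

# Simulated contents of a JSON-lines feed file, e.g. one produced by
# "scrapy crawl death -o output.jl" (filename assumed for illustration).
lines = [
    '{"Message": ["\\r\\n", " Profanity directed toward staff.  ", '
    '"\\r\\n Profanity directed toward warden ", "  \\r\\n  "], '
    '"Desc": "https://www.tdcj.state.tx.us/death_row/dr_info/nicholsjoseph.jpg"}',
]

def flatten(value):
    # Join list-valued cells into one whitespace-normalized string;
    # leave plain strings untouched.
    if isinstance(value, list):
        return " ".join(part.strip() for part in value if part.strip())
    return value

df = pd.DataFrame(json.loads(line) for line in lines)
df["Message"] = df["Message"].apply(flatten)
print(df.loc[0, "Message"])
# Profanity directed toward staff. Profanity directed toward warden
```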

【Comments】:

  • Have you tried converting the JSON to a Python dictionary (e.g. json.loads(...))? If you have valid JSON data, string replacement is not the right solution. You should convert it to a Python object, modify the data, and then optionally convert it back to JSON.

Tags: python json scrapy


【Solution 1】:

You need to adjust how the extracted item field values are post-processed. For this, Scrapy provides Item Loaders with input and output processors. In your case you need Join() and MapCompose(unicode.strip):

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join

class MyItemLoader(ItemLoader):
    default_output_processor = TakeFirst()

    message_in = MapCompose(unicode, unicode.strip)
    message_out = Join()
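For illustration, here is a rough plain-Python stand-in for what those two processors do (no Scrapy required; the behavior is approximated, not Scrapy's exact implementation). One detail worth noting: MapCompose(unicode.strip) keeps strings that strip down to empty, so Join() can still leave stray spaces; the sketch drops empties by mapping them to None, which MapCompose-style processors discard.

```python
def map_compose(*functions):
    """Rough stand-in for scrapy.loader.processors.MapCompose:
    apply each function to every value, dropping values mapped to None."""
    def processor(values):
        for func in functions:
            next_values = []
            for value in values:
                result = func(value)
                if result is not None:
                    next_values.append(result)
            values = next_values
        return values
    return processor

def join(separator=u' '):
    """Rough stand-in for scrapy.loader.processors.Join."""
    def processor(values):
        return separator.join(values)
    return processor

# The raw fragments Scrapy extracts for the Message field:
raw = [u"\r\n", u" Profanity directed toward staff.  ",
       u"\r\n Profanity directed toward warden ", u"  \r\n  "]

message_in = map_compose(lambda s: s.strip() or None)  # strip, drop empties
message_out = join()

print(message_out(message_in(raw)))
# Profanity directed toward staff. Profanity directed toward warden
```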

【Discussion】:

  • I tried just this part, MapCompose(unicode.strip)(['\r\n']), and it returns an error: TypeError: descriptor 'strip' requires a 'unicode' object but received a 'str'
  • @BernardL Can you show the spider, item, and loader definitions? Thanks.
  • @BernardL OK, thanks. Can you try message_in = MapCompose(unicode, unicode.strip)?
  • The problem is that the join doesn't even work; I still get a list back.
  • @BernardL I think I see the problem: you have to use an item loader in your spider. See stackoverflow.com/questions/37245846/…