Title: Nested JSON items with Scrapy
Posted: 2017-11-23 06:03:10
Question:

Here is my basic Scrapy crawler:

  def parse(self, response):        
    item = CruiseItem()     

    item['Cruise'] = {}
    item['Cruise']['Cruiseline'] = response.xpath('//title/text()').extract()
    item['Cruise']['Itinerary'] = response.xpath('//*[@id="brochureName1"]/text()').extract()
    item['Cruise']['Price'] = response.xpath('//*[@id="interiorPrice1"]/text()').extract()
    item['Cruise']['PerNight'] = response.xpath('//*[@id="perNightinteriorPrice1"]/text()').extract()

    return item

This works well for extracting all the elements I want. For example, my JSON feed output looks like this:

[

{
    "Cruise": {
        "Cruiseline": [
            "Ship Name"
        ],
        "Itinerary": [
            "3 Night Bahamas ",
            "4 Night Western Caribbean ",
            "4 Night Bahamas ",
            "3 Night Bahamas ",
            "5 Night Western Caribbean ",
            "5 Night Eastern Caribbean ",
            "7 Night Western Caribbean ",
            "7 Night Southern Caribbean ",
            "6 Night Western Caribbean ",
            "7 Night Western Caribbean ",
            "8 Night Eastern Caribbean "
        ],
        "Price": [
            "$169",
            "$179",
            "$289",
            "$349",
            "$359",
            "$389",
            "$389",
            "$409",
            "$424",
            "$524",
            "$939"
        ],
        "PerNight": [
            "$56/night",
            "$45/night",
            "$72/night",
            "$116/night",
            "$72/night",
            "$78/night",
            "$56/night",
            "$58/night",
            "$71/night",
            "$75/night",
            "$117/night"
        ]
    }
}
]

But the target JSON output is different:

[

{
    "Cruise": {
        "Cruiseline": [
            "Ship Name"
        ],
        "Itinerary": [
            "3 Night Bahamas "
        ],
        "Price": [
            "$169"
        ],
        "PerNight": [
            "$56/night"

        ]
    },
    "Cruise": {
        "Cruiseline": [
            "Ship Name"
        ],
        "Itinerary": [
            "4 Night Bahamas "
        ],
        "Price": [
            "$79"
        ],
        "PerNight": [
            "$86/night"
        ]
    }
}
]

Basically, I want to return one entry per cruise, with just one ship, itinerary, price, and per-night value each.
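To make the goal concrete, here is a rough sketch (plain Python, with sample values shortened from the feed above) of the shape I'm after. Note that a JSON object can't literally repeat the "Cruise" key as in the sample above; it has to be a list with one object per cruise:

```python
import json

# Sample values, shortened from the scraped feed above
cruiseline = "Ship Name"
itineraries = ["3 Night Bahamas ", "4 Night Western Caribbean "]
prices = ["$169", "$179"]
per_night = ["$56/night", "$45/night"]

# One object per cruise: zip pairs up the parallel lists
wanted = [
    {"Cruise": {"Cruiseline": [cruiseline],
                "Itinerary": [itin],
                "Price": [price],
                "PerNight": [nightly]}}
    for itin, price, nightly in zip(itineraries, prices, per_night)
]

print(json.dumps(wanted, indent=4))
```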

Does that make sense? Happy to discuss.

Edit: I asked this a few days ago but decided to clarify and repost. Thanks!

Comments:

    Tags: python json scrapy


    Solution 1:

    Figured it out.

    def parse(self, response):

        item = WthItem()

        item['ship'] = response.xpath('//*[@id="shipName1"]/text()').extract()
        item['Itinerary'] = response.xpath('//*[@id="brochureName1"]/text()').extract()
        item['Price'] = response.xpath('//*[@id="interiorPrice1"]/text()').extract()
        item['PerNight'] = response.xpath('//*[@id="perNightinteriorPrice1"]/text()').extract()

        updated_list = []

        # Break the parallel lists into one sub-item per cruise
        for i in range(len(item['ship'])):
            sub_item = {'entry': {}}
            sub_item['entry']['ship'] = [item['ship'][i]]
            sub_item['entry']['Itinerary'] = [item['Itinerary'][i]]
            sub_item['entry']['Price'] = [item['Price'][i]]
            sub_item['entry']['PerNight'] = [item['PerNight'][i]]
            updated_list.append(sub_item)

        # Return once, after the loop has built every sub-item
        return updated_list


    Comments:

      Solution 2:

      Try reformatting the data with this script. The reformatted data will end up in updated_list:

      cruise_list = [
      
      {
          "Cruise": {
              "Cruiseline": [
                  "Ship Name"
              ],
              "Itinerary": [
                  "3 Night Bahamas ",
                  "4 Night Western Caribbean ",
                  "4 Night Bahamas ",
                  "3 Night Bahamas ",
                  "5 Night Western Caribbean ",
                  "5 Night Eastern Caribbean ",
                  "7 Night Western Caribbean ",
                  "7 Night Southern Caribbean ",
                  "6 Night Western Caribbean ",
                  "7 Night Western Caribbean ",
                  "8 Night Eastern Caribbean "
              ],
              "Price": [
                  "$169",
                  "$179",
                  "$289",
                  "$349",
                  "$359",
                  "$389",
                  "$389",
                  "$409",
                  "$424",
                  "$524",
                  "$939"
              ],
              "PerNight": [
                  "$56/night",
                  "$45/night",
                  "$72/night",
                  "$116/night",
                  "$72/night",
                  "$78/night",
                  "$56/night",
                  "$58/night",
                  "$71/night",
                  "$75/night",
                  "$117/night"
              ]
          }
      }
      ]
      
      updated_list = []
      
      for cruise_obj in cruise_list:
          cruise_data = cruise_obj['Cruise']
          for i in range(len(cruise_data['Itinerary'])):
              sub_item = {}
              sub_item['Cruise'] = {}
              sub_item['Cruise']['Cruiseline'] = cruise_data['Cruiseline']
              sub_item['Cruise']['Itinerary'] = [cruise_data['Itinerary'][i]]
              sub_item['Cruise']['Price'] = [cruise_data['Price'][i]]
              sub_item['Cruise']['PerNight'] = [cruise_data['PerNight'][i]]
              updated_list.append(sub_item)
      

      Some other thoughts:

      • If the only thing stored in the JSON is Cruise objects, the initial Cruise key is a bit redundant.

      • In a lot of places you are storing things in arrays that don't need to be. I realize this is a nitpick, but you could try modifying my script slightly to remove the arrays around singular values. E.g. a Cruise object should never have more than one Cruiseline. Let me know if you need help.
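      A sketch of that tweak (my guess at what it would look like, dropping both the redundant Cruise wrapper and the one-element arrays; cruise_list is shortened here to keep the example small):

```python
# Shortened sample of the scraped feed from the question
cruise_list = [{
    "Cruise": {
        "Cruiseline": ["Ship Name"],
        "Itinerary": ["3 Night Bahamas ", "4 Night Western Caribbean "],
        "Price": ["$169", "$179"],
        "PerNight": ["$56/night", "$45/night"],
    }
}]

updated_list = []
for cruise_obj in cruise_list:
    cruise_data = cruise_obj["Cruise"]
    cruiseline = cruise_data["Cruiseline"][0]  # single value, no array needed
    for itin, price, nightly in zip(cruise_data["Itinerary"],
                                    cruise_data["Price"],
                                    cruise_data["PerNight"]):
        updated_list.append({
            "Cruiseline": cruiseline,   # scalars instead of one-element lists
            "Itinerary": itin,
            "Price": price,
            "PerNight": nightly,
        })
```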

      Comments:

      • Thank you, I'd love to try your ideas, but I might need some help rewriting the script.
      • This isn't much use; I don't know where I'm supposed to implement this updated code.
      • I can't really tell you where the code should go without seeing your whole codebase. I assume parse is being called multiple times somewhere, since your final data so far is an array. So basically, find whatever variable your JSON feed is stored in (say it's called cruise_list) and paste my code right after it. (My code only works if your data variable is named cruise_list, so if your data variable is called x, do something like cruise_list = x before aggregating the data, or replace x with cruise_list.)
      • This actually works fine in a regular Python shell, but I don't know how to implement it in a Scrapy spider.
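      For what it's worth, one way to wire this into the spider (an untested sketch, assuming the XPaths from the question) is to move the reshaping into a helper and have parse yield one item per cruise; Scrapy's JSON exporter then writes one object each. `build_items` is a hypothetical name:

```python
# Pure reshaping logic, testable without Scrapy installed
def build_items(cruiseline, itineraries, prices, per_night):
    return [
        {"Cruise": {"Cruiseline": [cruiseline],
                    "Itinerary": [itin],
                    "Price": [price],
                    "PerNight": [nightly]}}
        for itin, price, nightly in zip(itineraries, prices, per_night)
    ]

# Inside the spider, parse would then be roughly:
#
#     def parse(self, response):
#         yield from build_items(
#             response.xpath('//title/text()').extract_first(),
#             response.xpath('//*[@id="brochureName1"]/text()').extract(),
#             response.xpath('//*[@id="interiorPrice1"]/text()').extract(),
#             response.xpath('//*[@id="perNightinteriorPrice1"]/text()').extract(),
#         )
```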