Remove duplicate value and append the rest of the row value: answers

【Question title】: Remove duplicate value and append the rest of the row value
【Posted】: 2020-12-07 22:09:53
【Question description】:

I am using the code below to crawl multiple links on a page and to collect a list of data from each linked page:

carspider.py:

def parse_item(self, response):
    # XPath prefix for the key-details section; every field is looked up inside it.
    section = '//div[@class="listing__section  listing__section--key-details  listing__key-details  portable-one-whole  push--bottom"]'
    # Text nodes of the right-aligned value spans, in page order.
    values = response.xpath(section + '//span[@class="float--right"]//text()')

    item = CarscrapeItem()

    item['carType'] = response.xpath(section + '//span[@itemprop="manufacturer"]//text()').get()
    item['model'] = response.xpath(section + '//span[@itemprop="model"]//text()').get()
    item['variant'] = values[3].get()
    item['year'] = values[4].get()
    item['engineCapacity'] = values[5].get()
    item['transmission'] = values[6].get()
    item['seatCapacity'] = values[7].get()

    yield item

pipelines.py:

def __init__(self):
    # Run the handlers below when the spider opens and closes.
    dispatcher.connect(self.spider_opened, signals.spider_opened)
    dispatcher.connect(self.spider_closed, signals.spider_closed)
    self.files = {}

def spider_opened(self, spider):
    # Remember the file per spider so spider_closed can look it up and close it.
    file = open('%s_dataset.json' % spider.name, 'w+b')
    self.files[spider] = file
    self.exporter = JsonLinesItemExporter(file)
    self.exporter.start_exporting()

def spider_closed(self, spider):
    self.exporter.finish_exporting()
    file = self.files.pop(spider)
    file.close()

def process_item(self, item, spider):
    # Write each item as one JSON object per line.
    self.exporter.export_item(item)
    return item

I export the items to a JSON file, and the output looks like this:

{"carType": "Honda", "model": "Civic", "variant": "TC VTEC Premium", "year": "2020", "engineCapacity": "1498 cc", "transmission": "Automatic", "seatCapacity": "5"}
{"carType": "Honda", "model": "Accord", "variant": "TC", "year": "2020", "engineCapacity": "1498 cc", "transmission": "Automatic", "seatCapacity": "5"}

I am trying to produce output like this instead:

{"carType": "Honda", "model": "Civic", "variant": "TC VTEC Premium", "year": "2020", "engineCapacity": "1498 cc", "transmission": "Automatic", "seatCapacity": "5"
                     "model": "Accord", "variant": "TC", "year": "2020", "engineCapacity": "1498 cc", "transmission": "Automatic", "seatCapacity": "5"}

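I want to remove the duplicate car type and append the rest of the row values to the existing car type. I think a recommender system would be easier to build on data shaped this way. Is this possible with Scrapy? I searched for answers about duplicate values; most of them are about duplicate filters, and the others did not work for me.

Edit:

Because the output I wanted is impossible to achieve, I followed the suggestion from Akshay Jain, which gives a structure very close to what I wanted. I finally got this output: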

{
"BMW" : [
{ 
  "colour" : "White", 
  "engineCapacity" : "1998 cc", 
  "model" : "530e", 
  "seatCapacity" : "5", 
  "transmission" : "Automatic", 
  "variant" : "M Sport", 
  "warranty" : "5 years", 
  "year" : "2020"
}
], 
"Subaru" : [
{ 
  "colour" : "Silver", 
  "engineCapacity" : "1998 cc", 
  "model" : "WRX", 
  "seatCapacity" : "5", 
  "transmission" : "Automatic", 
  "variant" : "EyeSight", 
  "warranty" : "5 years", 
  "year" : "2020"
}, 
{ 
  "colour" : "Blue", 
  "engineCapacity" : "1995 cc",
  "model" : "XV", 
  "seatCapacity" : "5", 
  "transmission" : "Automatic",
  "variant" : "GT Edition", 
  "warranty" : "5 years", 
  "year" : "2019"
}, 
{ 
  "colour" : "Grey", 
  "engineCapacity" : "1995 cc", 
  "model" : "XV", 
  "seatCapacity" : "5", 
  "transmission" : "Automatic", 
  "variant" : "GT Edition", 
  "warranty" : "5 years", 
  "year" : "2019"
}, 
{ 
  "colour" : "Silver", 
  "engineCapacity" : "1995 cc", 
  "model" : "Forester", 
  "seatCapacity" : "5", 
  "transmission" : "Automatic", 
  "variant" : "S EyeSight", 
  "warranty" : "5 years", 
  "year" : "2019"
}
]
}

I added a Python file with the code below to build this structure:

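import json

# Open the JSON Lines file for reading and writing so it can be rewritten in place.
with open("dataset.json", "r+") as json_data:
    car = {}
    for line in json_data:
        element = json.loads(line)
        brand = element.get("carType")
        # Start a new list for an unseen brand, otherwise append to the existing one.
        if brand not in car:
            car[brand] = [element]
        else:
            car[brand].append(element)

    # Rewind, overwrite the file with the grouped data, and cut off leftover bytes.
    json_data.seek(0)
    json.dump(car, json_data, sort_keys=True, indent=2, separators=(", ", " : "))
    json_data.truncate()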

I referred to some documentation and tutorials, including https://www.w3schools.com/python/python_json.asp and http://www.compciv.org/guides/python/fundamentals/dictionaries-overview/

I hope this helps someone!
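
On the original question of whether Scrapy itself can do this: the grouping in the script above could probably also happen inside an item pipeline, so no second pass over the file is needed. The following is only a sketch under assumptions. The class name GroupByBrandPipeline and the output file name are hypothetical and not part of the original project; open_spider, process_item and close_spider are the standard pipeline hook names. It keeps every item in memory until the spider closes, which is fine for a small crawl but not for a very large one.

```python
import json
from collections import defaultdict


class GroupByBrandPipeline:
    """Hypothetical pipeline: collect items per carType, write one JSON file at the end."""

    def open_spider(self, spider):
        # brand -> list of records without the carType field
        self.cars = defaultdict(list)

    def process_item(self, item, spider):
        record = dict(item)  # works for scrapy.Item objects and plain dicts
        # Fall back to a string so that sort_keys below never compares None with str.
        brand = record.pop("carType", None) or "unknown"
        self.cars[brand].append(record)
        return item

    def close_spider(self, spider):
        # Assumed output name; adjust to taste.
        path = "%s_grouped.json" % spider.name
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.cars, f, sort_keys=True, indent=2)
```

To try it, the class would be registered under ITEM_PIPELINES in settings.py with whatever module path the project actually uses.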

【Question comments】:

Tags: python json scrapy

【Solution 1】:

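Just for your information: dictionary keys must be unique in Python, so the output you expect is not possible.
Suggestion: you could store the data in the following way: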

car = {
  "Honda": [
    {
      "model": "Civic",
      "variant": "TC VTEC Premium",
      "year": "2020",
      "engineCapacity": "1498cc",
      "transmission": "Automatic",
      "seatCapacity": "5"
    },
    {
      "model": "Accord",
      "variant": "TC",
      "year": "2020",
      "engineCapacity": "1498 cc",
      "transmission": "Automatic",
      "seatCapacity": "5"
    }
  ],
  "BMW": [
    {
      "model": "XYZ",
      "year": "2020",
      "transmission": "Automatic",
      "seatCapacity": "5"
    },
    {
      "model": "ABC",
      "year": "2020",
      "engineCapacity": "1498 cc",
      "transmission": "Automatic",
      "seatCapacity": "5"
    }
  ]
}
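
The uniqueness point can be checked directly. A small illustration (the text value is made up for the demo): when a JSON object repeats a key, Python's json module keeps only the last value, so the first "model" is silently lost.

```python
import json

# A JSON object that repeats the "model" key, like the desired output in the question.
text = '{"carType": "Honda", "model": "Civic", "model": "Accord"}'

parsed = json.loads(text)
print(parsed)  # only the last "model" value survives: Accord
```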

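You can use the partial code below to read the data from the file line by line, then write your own code to store it in the format shown above: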

import json

with open('PATH_TO_FILE/FILE_NAME.json') as f:
  # The file holds one JSON object per line, so parse each line separately.
  for line in f:
    record = json.loads(line)
    # YOUR CODE HERE
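
One possible way to fill in that placeholder, as a sketch rather than the answerer's own code: group the records with collections.defaultdict and drop the carType field from each record, since it becomes the dictionary key. The helper name group_by_brand and the sample data are assumptions made for illustration; an in-memory io.StringIO stands in for the real file.

```python
import io
import json
from collections import defaultdict


def group_by_brand(lines, key="carType"):
    """Group JSON Lines records into {brand: [record, ...]}."""
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on them
        record = json.loads(line)
        brand = record.pop(key, None)  # the brand becomes the key, not a field
        grouped[brand].append(record)
    return dict(grouped)


# Stand-in for the exported file: one JSON object per line.
sample = io.StringIO(
    '{"carType": "Honda", "model": "Civic", "year": "2020"}\n'
    '{"carType": "Honda", "model": "Accord", "year": "2020"}\n'
    '{"carType": "BMW", "model": "530e", "year": "2020"}\n'
)
cars = group_by_brand(sample)
print(json.dumps(cars, indent=2))
```

Passing an open file object instead of sample works the same way, because the function only iterates over lines.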

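【Comments】:

Where and how should I modify my code to get this structure? I am new to Python, so any tutorial or documentation would help. Thank you anyway.
You can refer to docs.python.org/3/tutorial/datastructures.html#dictionaries
After a few hours of work, I finally got this output! I was a bit slow because I ran into some problems with dictionaries. Thank you very much for your help!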