【Question Title】: Remove duplicate value and append the rest of the row values
【Posted】: 2020-12-07 22:09:53
【Question Description】:

I am using the code below to crawl multiple links on a page and scrape a list of data from each of those links:

carspider.py:

def parse_item(self, response):
    sel = Selector(response)

    item = CarscrapeItem()

    # All fields live in the same key-details section; factor the long XPath out once
    details = '//div[@class="listing__section  listing__section--key-details  listing__key-details  portable-one-whole  push--bottom"]'

    item['carType'] = sel.xpath(details + '//span[@itemprop="manufacturer"]//text()').get()
    item['model'] = sel.xpath(details + '//span[@itemprop="model"]//text()').get()
    item['variant'] = sel.xpath(details + '//span[@class="float--right"]//text()')[3].get()
    item['year'] = sel.xpath(details + '//span[@class="float--right"]//text()')[4].get()
    item['engineCapacity'] = sel.xpath(details + '//span[@class="float--right"]//text()')[5].get()
    item['transmission'] = sel.xpath(details + '//span[@class="float--right"]//text()')[6].get()
    item['seatCapacity'] = sel.xpath(details + '//span[@class="float--right"]//text()')[7].get()

    yield item

pipelines.py:

def __init__(self):
    dispatcher.connect(self.spider_opened, signals.spider_opened)
    dispatcher.connect(self.spider_closed, signals.spider_closed)
    self.files = {}

def spider_opened(self, spider):
    file = open('%s_dataset.json' % spider.name, 'w+b')
    self.files[spider] = file  # register the file so spider_closed can pop and close it
    self.exporter = JsonLinesItemExporter(file)
    self.exporter.start_exporting()

def spider_closed(self, spider):
    self.exporter.finish_exporting()
    file = self.files.pop(spider)
    file.close()

def process_item(self, item, spider):
    self.exporter.export_item(item)
    return item

I export the items to a JSON file, and the output looks like this:

{"carType": "Honda", "model": "Civic", "variant": "TC VTEC Premium", "year": "2020", "engineCapacity": "1498 cc", "transmission": "Automatic", "seatCapacity": "5"}
{"carType": "Honda", "model": "Accord", "variant": "TC", "year": "2020", "engineCapacity": "1498 cc", "transmission": "Automatic", "seatCapacity": "5"}

I am trying to produce output like this instead:

{"carType": "Honda", "model": "Civic", "variant": "TC VTEC Premium", "year": "2020", "engineCapacity": "1498 cc", "transmission": "Automatic", "seatCapacity": "5"
                     "model": "Accord", "variant": "TC", "year": "2020", "engineCapacity": "1498 cc", "transmission": "Automatic", "seatCapacity": "5"}

I want to remove the duplicated car type and append the remaining row values under the existing car type; I think data in that shape would make it easier to build a recommendation system. Is this possible with Scrapy? I have searched for answers about duplicate values, but most of them are about duplicate filters, and the rest did not work for me.
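Grouping by brand can also be done inside Scrapy itself, by buffering items in a pipeline and writing one nested JSON object when the spider closes. This is only a sketch under assumptions: the class name `GroupedJsonPipeline` and the `*_grouped.json` filename are illustrative, not part of the question's code, and the pipeline would still have to be registered in `ITEM_PIPELINES` in settings.py.

```python
import json
from collections import defaultdict


class GroupedJsonPipeline:
    """Hypothetical pipeline: collect items per carType, then dump one
    nested JSON object instead of one JSON line per item."""

    def open_spider(self, spider):
        # brand -> list of item dicts (the list is created on first use)
        self.cars = defaultdict(list)

    def process_item(self, item, spider):
        record = dict(item)
        brand = record.pop("carType")  # grouping key is dropped from the row
        self.cars[brand].append(record)
        return item

    def close_spider(self, spider):
        # one write at the end, since a nested dict cannot be streamed line by line
        with open("%s_grouped.json" % spider.name, "w") as f:
            json.dump(self.cars, f, sort_keys=True, indent=2)
```

Note the trade-off: unlike `JsonLinesItemExporter`, this keeps every item in memory until the spider finishes, which is fine for a few thousand cars but not for very large crawls.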

Edit:

Since my desired output turned out to be impossible, I tried the suggestion given by Akshay Jain, which is very close to what I wanted. I finally got this output:

{
"BMW" : [
{ 
  "colour" : "White", 
  "engineCapacity" : "1998 cc", 
  "model" : "530e", 
  "seatCapacity" : "5", 
  "transmission" : "Automatic", 
  "variant" : "M Sport", 
  "warranty" : "5 years", 
  "year" : "2020"
}
], 
"Subaru" : [
{ 
  "colour" : "Silver", 
  "engineCapacity" : "1998 cc", 
  "model" : "WRX", 
  "seatCapacity" : "5", 
  "transmission" : "Automatic", 
  "variant" : "EyeSight", 
  "warranty" : "5 years", 
  "year" : "2020"
}, 
{ 
  "colour" : "Blue", 
  "engineCapacity" : "1995 cc",
  "model" : "XV", 
  "seatCapacity" : "5", 
  "transmission" : "Automatic",
  "variant" : "GT Edition", 
  "warranty" : "5 years", 
  "year" : "2019"
}, 
{ 
  "colour" : "Grey", 
  "engineCapacity" : "1995 cc", 
  "model" : "XV", 
  "seatCapacity" : "5", 
  "transmission" : "Automatic", 
  "variant" : "GT Edition", 
  "warranty" : "5 years", 
  "year" : "2019"
}, 
{ 
  "colour" : "Silver", 
  "engineCapacity" : "1995 cc", 
  "model" : "Forester", 
  "seatCapacity" : "5", 
  "transmission" : "Automatic", 
  "variant" : "S EyeSight", 
  "warranty" : "5 years", 
  "year" : "2019"
}
]
}

I added a Python file with the code below to produce that structure:

import json

with open("dataset.json", "r+") as json_data:
    car = {}
    # the exporter writes JSON Lines: one JSON object per line
    for line in json_data:
        element = json.loads(line)
        brand = element.get("carType")
        if brand not in car:
            car[brand] = [element]
        else:
            car[brand].append(element)

    # rewrite the same file in place with the grouped structure
    json_data.seek(0)
    json.dump(car, json_data, sort_keys=True, indent=2, separators=(", ", " : "))
    json_data.truncate()
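The if/else branch above can be written more compactly with `collections.defaultdict`, which creates the list for a brand the first time it is seen. A small sketch (the helper name `group_by_brand` is my own, not from the question):

```python
import json
from collections import defaultdict


def group_by_brand(lines):
    """Group JSON-lines records under their carType value."""
    car = defaultdict(list)
    for line in lines:
        element = json.loads(line)
        # no membership test needed: a missing key yields a fresh empty list
        car[element.get("carType")].append(element)
    return dict(car)
```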

I referred to some documentation and tutorials, including https://www.w3schools.com/python/python_json.asp and http://www.compciv.org/guides/python/fundamentals/dictionaries-overview/

Hope this helps someone!

【Discussion】:

    Tags: python json scrapy


    【Solution 1】:
    • Just for your information: dictionary keys must be unique in Python, so the output you expect is not possible.

    • Suggestion: you could store the data in the following shape instead:

    car = {
      "Honda": [
        {
          "model": "Civic",
          "variant": "TC VTEC Premium",
          "year": "2020",
          "engineCapacity": "1498 cc",
          "transmission": "Automatic",
          "seatCapacity": "5"
        },
        {
          "model": "Accord",
          "variant": "TC",
          "year": "2020",
          "engineCapacity": "1498 cc",
          "transmission": "Automatic",
          "seatCapacity": "5"
        }
      ],
      "BMW": [
        {
          "model": "XYZ",
          "year": "2020",
          "transmission": "Automatic",
          "seatCapacity": "5"
        },
        {
          "model": "ABC",
          "year": "2020",
          "engineCapacity": "1498 cc",
          "transmission": "Automatic",
          "seatCapacity": "5"
        }
      ]
    }
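    As a quick check that this shape is convenient to query, all models of one brand can be pulled out with a single list comprehension. A minimal demo over a trimmed-down `car` dict:

```python
car = {
    "Honda": [
        {"model": "Civic", "variant": "TC VTEC Premium", "year": "2020"},
        {"model": "Accord", "variant": "TC", "year": "2020"},
    ],
}

# every model stored under one brand
honda_models = [entry["model"] for entry in car["Honda"]]
# honda_models == ["Civic", "Accord"]
```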
    

    You can use the snippet below to read the data from the file line by line; from there you can write your own code to store the data in the format above:

    import json

    with open('PATH_TO_FILE/FILE_NAME.json') as f:
      for line in f:
        line = json.loads(line)
        # YOUR CODE HERE
    

    【Discussion】:

    • Where and how should I modify my code to achieve this structure? I am new to Python, so any tutorial or documentation would help. Thank you anyway.
    • After a few hours of work I finally got this output! I was a bit slow because I ran into some problems with dictionaries. By the way, thank you so much! I really appreciate your help!