【Posted】: 2020-12-07 22:09:53
【Question】:
I am using the code below to crawl multiple links on a page and extract a list of data from each linked page:
carspider.py:
def parse_item(self, response):
    sel = Selector(response)
    # Every field lives inside the same key-details container.
    details = '//div[@class="listing__section listing__section--key-details listing__key-details portable-one-whole push--bottom"]'
    # Text nodes of the right-aligned value spans; the fields below are picked by position.
    values = sel.xpath(details + '//span[@class="float--right"]//text()')
    item = CarscrapeItem()
    item['carType'] = sel.xpath(details + '//span[@itemprop="manufacturer"]//text()').get()
    item['model'] = sel.xpath(details + '//span[@itemprop="model"]//text()').get()
    item['variant'] = values[3].get()
    item['year'] = values[4].get()
    item['engineCapacity'] = values[5].get()
    item['transmission'] = values[6].get()
    item['seatCapacity'] = values[7].get()
    yield item
pipelines.py:
def __init__(self):
    dispatcher.connect(self.spider_opened, signals.spider_opened)
    dispatcher.connect(self.spider_closed, signals.spider_closed)
    self.files = {}

def spider_opened(self, spider):
    file = open('%s_dataset.json' % spider.name, 'w+b')
    # Track the file per spider so spider_closed can find and close it.
    self.files[spider] = file
    self.exporter = JsonLinesItemExporter(file)
    self.exporter.start_exporting()

def spider_closed(self, spider):
    self.exporter.finish_exporting()
    file = self.files.pop(spider)
    file.close()

def process_item(self, item, spider):
    self.exporter.export_item(item)
    return item
I export the items to a JSON file, and the output looks like this:
{"carType": "Honda", "model": "Civic", "variant": "TC VTEC Premium", "year": "2020", "engineCapacity": "1498 cc", "transmission": "Automatic", "seatCapacity": "5"}
{"carType": "Honda", "model": "Accord", "variant": "TC", "year": "2020", "engineCapacity": "1498 cc", "transmission": "Automatic", "seatCapacity": "5"}
I am trying to produce output like this instead:
{"carType": "Honda", "model": "Civic", "variant": "TC VTEC Premium", "year": "2020", "engineCapacity": "1498 cc", "transmission": "Automatic", "seatCapacity": "5"
"model": "Accord", "variant": "TC", "year": "2020", "engineCapacity": "1498 cc", "transmission": "Automatic", "seatCapacity": "5"}
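I want to drop the repeated carType and attach the remaining values of each row to the existing carType entry. I think this structure would work better for building a recommender system. Can this be done with Scrapy? I searched for answers about duplicate values, but most of them are about the duplicates filter, and the rest did not work for me.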
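One reason the target layout above cannot be produced as written: it repeats keys such as "model" inside a single JSON object. Here is a minimal sketch of what Python's standard json module does with such an object (the sample string is made up for illustration):

```python
import json

# Hypothetical sample that repeats the "model" key, as in the desired output.
raw = '{"carType": "Honda", "model": "Civic", "model": "Accord"}'

parsed = json.loads(raw)
# The json module keeps only the last value for a repeated key,
# so the "Civic" entry is silently lost.
print(parsed)
```

Storing the records in a list under each brand keeps every entry and stays valid JSON.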
Edit:
Because my desired output is impossible to achieve, I tried the suggestion given by Akshay Jain, which is close to what I wanted. I finally got this output:
{
  "BMW" : [
    {
      "colour" : "White",
      "engineCapacity" : "1998 cc",
      "model" : "530e",
      "seatCapacity" : "5",
      "transmission" : "Automatic",
      "variant" : "M Sport",
      "warranty" : "5 years",
      "year" : "2020"
    }
  ],
  "Subaru" : [
    {
      "colour" : "Silver",
      "engineCapacity" : "1998 cc",
      "model" : "WRX",
      "seatCapacity" : "5",
      "transmission" : "Automatic",
      "variant" : "EyeSight",
      "warranty" : "5 years",
      "year" : "2020"
    },
    {
      "colour" : "Blue",
      "engineCapacity" : "1995 cc",
      "model" : "XV",
      "seatCapacity" : "5",
      "transmission" : "Automatic",
      "variant" : "GT Edition",
      "warranty" : "5 years",
      "year" : "2019"
    },
    {
      "colour" : "Grey",
      "engineCapacity" : "1995 cc",
      "model" : "XV",
      "seatCapacity" : "5",
      "transmission" : "Automatic",
      "variant" : "GT Edition",
      "warranty" : "5 years",
      "year" : "2019"
    },
    {
      "colour" : "Silver",
      "engineCapacity" : "1995 cc",
      "model" : "Forester",
      "seatCapacity" : "5",
      "transmission" : "Automatic",
      "variant" : "S EyeSight",
      "warranty" : "5 years",
      "year" : "2019"
    }
  ]
}
I added a separate Python file with the code below to build this structure:
import json

# Read the JSON Lines export, group the records by brand, then overwrite the file in place.
with open("dataset.json", "r+") as json_data:
    car = {}
    for line in json_data:
        element = json.loads(line)
        brand = element.get("carType")
        if brand not in car:
            car[brand] = [element]
        else:
            car[brand].append(element)
    # Rewind, write the grouped structure, and drop any leftover bytes from the old content.
    json_data.seek(0)
    json.dump(car, json_data, sort_keys=True, indent=2, separators=(", ", " : "))
    json_data.truncate()
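The script above groups the records after the crawl has finished. The same grouping could probably be done inside Scrapy itself, in an item pipeline that buffers items per brand and writes one file when the spider closes. The following is only a sketch under assumptions: the class name `GroupByBrandPipeline` and the `<spider name>_grouped.json` file name are hypothetical, and the pipeline would still have to be enabled in the project's `ITEM_PIPELINES` setting. `open_spider`, `process_item` and `close_spider` are the standard pipeline hook names.

```python
import json
from collections import defaultdict


class GroupByBrandPipeline:
    """Hypothetical pipeline: buffers items per brand and writes one JSON file at close."""

    def open_spider(self, spider):
        # brand -> list of records scraped for that brand
        self.cars = defaultdict(list)

    def process_item(self, item, spider):
        record = dict(item)
        # carType becomes the grouping key, so it is removed from the record itself.
        # A missing or empty brand falls back to the assumed label "unknown".
        brand = record.pop("carType", None) or "unknown"
        self.cars[brand].append(record)
        return item

    def close_spider(self, spider):
        # Assumed output name; all items are held in memory until this point.
        path = "%s_grouped.json" % spider.name
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(self.cars, fh, sort_keys=True, indent=2)
```

Because every item is held in memory until the spider closes, this approach suits small to medium crawls; for very large crawls the post-processing script may be the safer choice.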
I referred to some documentation and tutorials, including https://www.w3schools.com/python/python_json.asp and http://www.compciv.org/guides/python/fundamentals/dictionaries-overview/
I hope this helps someone!
【Comments】: