[Posted]: 2020-04-02 19:01:15
[Question]:
I am new to pyspark. I need to convert a large CSV file at an HDFS location into multiple nested JSON files, one per distinct PrimaryId.
Sample input: data.csv
**PrimaryId,FirstName,LastName,City,CarName,DogName**
100,John,Smith,NewYork,Toyota,Spike
100,John,Smith,NewYork,BMW,Spike
100,John,Smith,NewYork,Toyota,Rusty
100,John,Smith,NewYork,BMW,Rusty
101,Ben,Swan,Sydney,Volkswagen,Buddy
101,Ben,Swan,Sydney,Ford,Buddy
101,Ben,Swan,Sydney,Audi,Buddy
101,Ben,Swan,Sydney,Volkswagen,Max
101,Ben,Swan,Sydney,Ford,Max
101,Ben,Swan,Sydney,Audi,Max
102,Julia,Brown,London,Mini,Lucy
Sample output files:
File 1: Output_100.json
{
"100": [
{
"City": "NewYork",
"FirstName": "John",
"LastName": "Smith",
"CarName": [
"Toyota",
"BMW"
],
"DogName": [
"Spike",
"Rusty"
]
}
]
}
File 2: Output_101.json
{
"101": [
{
"City": "Sydney",
"FirstName": "Ben",
"LastName": "Swan",
"CarName": [
"Volkswagen",
"Ford",
"Audi"
],
"DogName": [
"Buddy",
"Max"
]
}
]
}
File 3: Output_102.json
{
"102": [
{
"City": "London",
"FirstName": "Julia",
"LastName": "Brown",
"CarName": [
"Mini"
],
"DogName": [
"Lucy"
]
}
]
}
Any help would be greatly appreciated.
[Discussion]:
-
Have you tried anything yet? You could collect the set of primary IDs, then iterate over the entries for each one, building an array of dictionaries.
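The iteration idea from this comment could look roughly like the following plain-Python sketch (no Spark), assuming the file is available locally (with HDFS you would fetch it first, e.g. via `hdfs dfs -get`). The function name and layout are illustrative, not from the original post.

```python
import csv
import json

def csv_to_json_files(csv_path):
    """Read the CSV and write one nested JSON file per PrimaryId."""
    people = {}  # PrimaryId -> accumulated record
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            rec = people.setdefault(row["PrimaryId"], {
                "City": row["City"],
                "FirstName": row["FirstName"],
                "LastName": row["LastName"],
                "CarName": [],
                "DogName": [],
            })
            for key in ("CarName", "DogName"):
                if row[key] not in rec[key]:  # keep first-seen order, no dups
                    rec[key].append(row[key])
    for pid, rec in people.items():
        with open(f"Output_{pid}.json", "w") as out:
            json.dump({pid: [rec]}, out, indent=2)

# Usage: csv_to_json_files("data.csv")
```

This preserves the first-seen order of cars and dogs, matching the sample output, but it runs on a single machine and so only suits files that fit the driver's memory and disk.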
-
I don't really come from a programming background; I tried a few solutions I found on Google, but they didn't meet my requirement. That's why I'm asking for help!
Tags: python json csv hadoop pyspark