[Posted]: 2017-07-28 08:02:55
[Problem description]:
I am trying to convert a JSON file into a flattened CSV file. This is what I have tried so far.
I can't work out how to correctly manipulate the exploded qualify column in Spark SQL and get the right values back.
from pyspark.sql.functions import *
dummy = spark.read.json('dummy-3.json')
qualify = dummy.select("user_id", "rec_id", "uut", "hash", explode("qualify").alias("qualify"))
qualify.show()
+-------+------+---+------+--------------------+
|user_id|rec_id|uut| hash| qualify|
+-------+------+---+------+--------------------+
| 1| 2| 12|abc123|[cab321,test-1,of...|
| 1| 2| 12|abc123|[cab123,test-2,of...|
+-------+------+---+------+--------------------+
Sample JSON:
{
  "user_id": 1,
  "rec_id": 2,
  "uut": 12,
  "hash": "abc123",
  "qualify": [{
      "offer": "offer-1",
      "name": "test-1",
      "hash": "cab321",
      "qualified": false,
      "rules": [{
          "name": "name of rule 1",
          "approved": true,
          "details": {}
      },
      {
          "name": "name of rule 2",
          "approved": false,
          "details": {}
      }]
  },{
      "offer": "offer-2",
      "name": "test-2",
      "hash": "cab123",
      "qualified": true,
      "rules": [{
          "name": "name of rule 1",
          "approved": true,
          "details": {}
      },
      {
          "name": "name of rule 2",
          "approved": false,
          "details": {}
      }]
  }]
}
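To make the target shape concrete, here is a minimal stdlib sketch (no Spark) that walks the sample record above and emits one CSV row per qualify entry per rule. The column names are assumptions chosen for illustration, and the corrected JSON (commas added, stray quote removed) is assumed.

```python
import csv
import io
import json

# The sample record from the question, with its syntax errors fixed.
record = json.loads("""
{
  "user_id": 1, "rec_id": 2, "uut": 12, "hash": "abc123",
  "qualify": [
    {"offer": "offer-1", "name": "test-1", "hash": "cab321", "qualified": false,
     "rules": [{"name": "name of rule 1", "approved": true, "details": {}},
               {"name": "name of rule 2", "approved": false, "details": {}}]},
    {"offer": "offer-2", "name": "test-2", "hash": "cab123", "qualified": true,
     "rules": [{"name": "name of rule 1", "approved": true, "details": {}},
               {"name": "name of rule 2", "approved": false, "details": {}}]}
  ]
}
""")

# One CSV row per (qualify entry, rule) pair -- the fully flattened shape.
header = ["user_id", "rec_id", "uut", "hash",
          "offer", "offer_name", "offer_hash", "qualified",
          "rule_name", "rule_approved"]
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
for q in record["qualify"]:
    for rule in q["rules"]:
        writer.writerow([record["user_id"], record["rec_id"], record["uut"],
                         record["hash"], q["offer"], q["name"], q["hash"],
                         q["qualified"], rule["name"], rule["approved"]])
rows = buf.getvalue().splitlines()
```

For this record that yields a header plus four data rows, e.g. the first is `1,2,12,abc123,offer-1,test-1,cab321,False,name of rule 1,True`.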
JSON schema:
root
|-- hash: string (nullable = true)
|-- qualify: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- hash: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- offer: string (nullable = true)
| | |-- qualified: boolean (nullable = true)
| | |-- rules: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- approved: boolean (nullable = true)
| | | | |-- name: string (nullable = true)
|-- rec_id: long (nullable = true)
|-- user_id: long (nullable = true)
|-- uut: long (nullable = true)
I tried converting the DataFrame to an RDD and writing a map function to return the values, but I don't think that is a good approach. Am I wrong?
Has anyone worked through a similar problem?
Thanks for your help.
[Discussion]:
-
Have you tried putting qualified.* instead of explode in your select query?
Tags: json csv apache-spark pyspark