[Posted]: 2021-01-24 22:52:02
[Question]:
I am trying to flatten two JSON files (call them JSON1 and JSON2). Samples of both are shown below.
In one file a column's data type can be a struct, while in the other it is a string. The end goal is to flatten these files and combine/join/merge the data into a single CSV file. How can I do this in Spark with Python?
JSON1:
{
"result": [
{
"promoted_by": "",
"parent": "",
"number": "310346",
"closed_by": {
"link": "https://abcdev.service-now.com/api/now/table/sys_user/e4b0dd",
"value": "e4b0dd"
}
}
]
}
root
|-- result: struct (nullable = true)
| |-- closed_by: struct (nullable = true)
| | |-- link: string (nullable = true)
| | |-- value: string (nullable = true)
| |-- number: string (nullable = true)
| |-- parent: string (nullable = true)
| |-- promoted_by: string (nullable = true)
JSON2:
{
"result": [
{
"promoted_by": "",
"parent": {
"link": "https://abcdev.service-now.com/api/now/table/sys_user/ab00f1",
"value": "ab00f1"
},
"number": "310348",
"closed_by": ""
}
]
}
root
|-- result: struct (nullable = true)
| |-- closed_by: string (nullable = true)
| |-- number: string (nullable = true)
| |-- parent: struct (nullable = true)
| | |-- link: string (nullable = true)
| | |-- value: string (nullable = true)
| |-- promoted_by: string (nullable = true)
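The core difficulty above is that `parent` and `closed_by` are a struct in one file and an empty string in the other, so the two schemas conflict. As a minimal illustration of the flattening step itself (plain Python rather than Spark; the `flatten` helper and the dotted key names are my own convention, not something from the question), one could do:

```python
def flatten(record, prefix=""):
    """Recursively flatten nested dicts into dotted keys.

    A field that is a plain string in one file and a struct in the
    other simply yields different keys per record (e.g. "closed_by"
    vs. "closed_by.link"/"closed_by.value").
    """
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


# Sample record from JSON1 (closed_by is a struct, parent is a string)
record1 = {
    "promoted_by": "",
    "parent": "",
    "number": "310346",
    "closed_by": {
        "link": "https://abcdev.service-now.com/api/now/table/sys_user/e4b0dd",
        "value": "e4b0dd",
    },
}

print(flatten(record1))
# produces keys: promoted_by, parent, number, closed_by.link, closed_by.value
```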
[Comments]:
-
Have you tried reading both files into a single DataFrame? You should get the schema merged by Spark.
-
Wouldn't it be overwritten? I read the files into df1 like this: df1 = spark.read.json("dbfs:/mnt/json1.json"), and then I do df1 = spark.read.json("dbfs:/mnt/json2.json").
-
@blackbishop, please post this as an answer. Thanks!
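As the comments suggest, the second read above replaces `df1` rather than merging it; in Spark the fix is to pass both paths to a single read, e.g. `spark.read.json(["dbfs:/mnt/json1.json", "dbfs:/mnt/json2.json"])`, which yields one DataFrame with the schemas reconciled. The remaining step, combining flattened records with different key sets into one CSV, can be sketched in plain Python with the `csv` module (the sample rows below are abbreviated, hypothetical flattened versions of the two records, not output from the question):

```python
import csv
import io

# Abbreviated flattened rows: one per file, with differing key sets
rows = [
    {"number": "310346", "closed_by.value": "e4b0dd", "parent": ""},
    {"number": "310348", "parent.value": "ab00f1", "closed_by": ""},
]

# The union of all keys gives a single header covering both schemas;
# restval="" fills the columns a given row does not have.
fieldnames = sorted({key for row in rows for key in row})
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)

print(buf.getvalue())
```

The same idea (union of columns, blanks for missing fields) is what Spark's merged schema gives you for free when both files are read together.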
Tags: python apache-spark pyspark pyspark-dataframes