Title: Flattening two JSON files with different data types and joining them
Posted: 2021-01-24 22:52:02
Question:

I am trying to flatten two JSON files (let's call them JSON1 and JSON2). Below are examples of what they look like.

In one file a column's data type can be a struct, while in the other file it is a string. The end goal is to flatten these files and combine/join/merge the data into a CSV file. How can I achieve this with Python in Spark?

JSON1:

{
    "result": [
        {
            "promoted_by": "",
            "parent": "",
            "number": "310346",
            "closed_by": {
                "link": "https://abcdev.service-now.com/api/now/table/sys_user/e4b0dd",
                "value": "e4b0dd"
            }
        }
    ]
}

root
 |-- result: struct (nullable = true)
 |    |-- closed_by: struct (nullable = true)
 |    |    |-- link: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |    |-- number: string (nullable = true)
 |    |-- parent: string (nullable = true)
 |    |-- promoted_by: string (nullable = true)

JSON2:

{
    "result": [
        {
            "promoted_by": "",
            "parent": {
                "link": "https://abcdev.service-now.com/api/now/table/sys_user/ab00f1",
                "value": "ab00f1"
            },
            "number": "310348",
            "closed_by": ""
        }
    ]
}

root
 |-- result: struct (nullable = true)
 |    |-- closed_by: string (nullable = true)
 |    |-- number: string (nullable = true)
 |    |-- parent: struct (nullable = true)
 |    |    |-- link: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |    |-- promoted_by: string (nullable = true)

Comments:

  • Have you tried reading both files into a single DataFrame? You should get the schema merged by Spark.
  • Wouldn't it get overwritten? For instance, I read the files into df1 like this: df1 = spark.read.json("dbfs:/mnt/json1.json") and then I do df1 = spark.read.json("dbfs:/mnt/json2.json")
  • @blackbishop, please post this as an answer. Thanks!

Tags: python apache-spark pyspark pyspark-dataframes


Solution 1:

Simply read the 2 JSON files into the same DataFrame. Spark will merge the schemas automatically, and the columns closed_by and parent will both be of struct type:

df = spark.read.json("dbfs:/mnt/{json1.json,json2.json}", multiLine=True)

df.printSchema()

#root
# |-- result: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- closed_by: struct (nullable = true)
# |    |    |    |-- link: string (nullable = true)
# |    |    |    |-- value: string (nullable = true)
# |    |    |-- number: string (nullable = true)
# |    |    |-- parent: struct (nullable = true)
# |    |    |    |-- link: string (nullable = true)
# |    |    |    |-- value: string (nullable = true)
# |    |    |-- promoted_by: string (nullable = true)

To flatten the structs, use explode and then star-expand the nested fields:

from pyspark.sql import functions as F

df1 = df.select(F.explode("result").alias("results")) \
    .select("results.*") \
    .select(
        F.col("number"),
        F.col("closed_by.value").alias("closed_by_value"),
        F.col("closed_by.link").alias("closed_by_link"),
        F.col("parent.value").alias("parent_value"),
        F.col("parent.link").alias("parent_link"),
        F.col("promoted_by")
    )

df1.printSchema()

#root
# |-- number: string (nullable = true)
# |-- closed_by_value: string (nullable = true)
# |-- closed_by_link: string (nullable = true)
# |-- parent_value: string (nullable = true)
# |-- parent_link: string (nullable = true)
# |-- promoted_by: string (nullable = true)
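Since the question's end goal is a CSV file, the flattened DataFrame can then be written out directly. A minimal sketch, assuming the same df1 as above; the output path is hypothetical:

```python
# Write the flattened DataFrame to a single CSV file with a header row.
# "dbfs:/mnt/flattened_output" is an assumed path, not from the question.
df1.coalesce(1) \
   .write \
   .mode("overwrite") \
   .option("header", True) \
   .csv("dbfs:/mnt/flattened_output")
```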

Comments:

  • What if the JSON structure changes in the future, say the promoted_by column becomes struct type or the parent column becomes string type?
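If the structure may keep drifting like this, one defensive alternative (a sketch of my own, not from the answer) is to skip Spark's schema inference for these fields and flatten in plain Python, treating every field as either a string or a {link, value} object. The helper name flatten and the column layout below are assumptions:

```python
import csv
import io
import json

# Inline samples mirroring JSON1 and JSON2 from the question.
json1 = '''{"result": [{"promoted_by": "", "parent": "", "number": "310346",
 "closed_by": {"link": "https://abcdev.service-now.com/api/now/table/sys_user/e4b0dd",
 "value": "e4b0dd"}}]}'''
json2 = '''{"result": [{"promoted_by": "",
 "parent": {"link": "https://abcdev.service-now.com/api/now/table/sys_user/ab00f1",
 "value": "ab00f1"}, "number": "310348", "closed_by": ""}]}'''

def flatten(record):
    """Expand any dict-valued field into <field>_link / <field>_value columns;
    a plain-string field lands in <field>_value with an empty <field>_link."""
    row = {}
    for key, val in record.items():
        if isinstance(val, dict):
            row[f"{key}_link"] = val.get("link", "")
            row[f"{key}_value"] = val.get("value", "")
        else:
            row[f"{key}_link"] = ""
            row[f"{key}_value"] = val
    return row

rows = []
for payload in (json1, json2):
    for rec in json.loads(payload)["result"]:
        rows.append(flatten(rec))

# Write the flattened rows as CSV; extra per-row keys are ignored.
fieldnames = ["number_value", "closed_by_link", "closed_by_value",
              "parent_link", "parent_value", "promoted_by_value"]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Because each field is checked with isinstance at read time, a future swap between string and struct for any column is absorbed without a schema change.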