pyspark - read json files: answers

【Question title】: pyspark - read json files
【Posted】: 2020-05-02 16:13:50
【Question description】:

I am trying to read this JSON file.

{
    "data": [{
            "id": "c1",
            "type": "corporate",
            "tenor": "10.3 years",
            "yield": "5.30%",
            "amount_outstanding": 1200000
        },
        {
            "id": "g1",
            "type": "government",
            "tenor": "9.4 years",
            "yield": "3.70%",
            "amount_outstanding": 2500000
        }
]}

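Code:

from pyspark.sql.functions import col

df = spark.read.option("multiline", True).json("sample_input.json")
df.select(col("data")).show()

However, this reads everything into a single column. Is there a way to apply a schema so that id, type, tenor, and the other fields become separate columns?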

【Comments】:

After loading the data with your existing code, just run df = df.selectExpr('inline(data)')
@jxc Thanks. That solved the problem.

Tags: json pyspark

【Solution 1】:

If you load the multi-line JSON in PERMISSIVE mode, the DataFrame comes out with the correct nested schema:

df = spark.read.option("multiline", "true").option("mode", "PERMISSIVE").json("sample_input.json")
df.printSchema()

root
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- amount_outstanding: long (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- tenor: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- yield: string (nullable = true)

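Alternatively, if you remove the outer data element from the JSON so that the file is a top-level array, it loads directly into flat columns. JSON content: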

[{
            "id": "c1",
            "type": "corporate",
            "tenor": "10.3 years",
            "yield": "5.30%",
            "amount_outstanding": 1200000
        },
        {
            "id": "g1",
            "type": "government",
            "tenor": "9.4 years",
            "yield": "3.70%",
            "amount_outstanding": 2500000
        }]

Output schema:

df = spark.read.option("multiline", "true").option("mode", "PERMISSIVE").json("sample_input.json")
df.printSchema()

root
 |-- amount_outstanding: long (nullable = true)
 |-- id: string (nullable = true)
 |-- tenor: string (nullable = true)
 |-- type: string (nullable = true)
 |-- yield: string (nullable = true)

With either option, once the DataFrame is loaded correctly you can extract and transform the data however you need.

【Comments】: