【Question Title】: Convert column with string json string to column with dictionary in pyspark
【Posted】: 2020-05-29 08:13:00
【Question】:

One column of my DataFrame has the following structure.

+--------------------+
|                data|
+--------------------+
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
+--------------------+
only showing top 5 rows

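The data in the column is a JSON string. I want to convert the column to another type (map, struct, ...). How can I do this with a UDF? I wrote a function like the one below, but I cannot work out what the return type should be. I tried StructType and MapType, and both threw errors. Here is my code.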

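import json
from pyspark.sql import functions as F  # F is used below but was never imported
from pyspark.sql.types import MapType, StructType

# Failing attempt: the return type is passed as the StructType class itself
udf_getDict = F.udf(lambda x: json.loads(x), StructType)

subset.select(udf_getDict(F.col('data'))).printSchema()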
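
One likely cause of the error is that `F.udf` expects its return type to be a `DataType` instance (or a DDL string), not the class itself. The sketch below assumes a `map<string,string>` result is acceptable: nested objects are re-serialized to JSON strings so that every value fits the declared type. The helper names `json_to_str_map` and `make_json_map_udf` are hypothetical and not part of the original post.

```python
import json


def json_to_str_map(raw):
    """Parse a JSON object string into a dict of str -> str.

    Nested values are dumped back to JSON text so they fit map<string,string>.
    Returns None for null input, malformed JSON, or non-object JSON.
    """
    if raw is None:
        return None
    try:
        parsed = json.loads(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(parsed, dict):
        return None
    return {
        key: value if isinstance(value, str) else json.dumps(value)
        for key, value in parsed.items()
    }


def make_json_map_udf():
    """Wrap the helper as a Spark UDF (hypothetical helper name)."""
    # Imported lazily so json_to_str_map can be exercised without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import MapType, StringType

    # The return type is an *instance*: MapType(StringType(), StringType())
    return F.udf(json_to_str_map, MapType(StringType(), StringType()))
```

With the `subset` DataFrame from the question, something like `subset.select(make_json_map_udf()(F.col('data')).alias('data_map'))` should then produce a map column whose nested objects remain JSON strings.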

【Comments】:

    Tags: json pyspark user-defined-functions


    【Solution 1】:

    You can combine spark.read.json with df.rdd.map, for example:

    json_string = """
    {
        "glossary": {
            "title": "example glossary",
            "GlossDiv": {
                "title": "S",
                "GlossList": {
                    "GlossEntry": {
                        "ID": "SGML",
                        "SortAs": "SGML",
                        "GlossTerm": "Standard Generalized Markup Language",
                        "Acronym": "SGML",
                        "Abbrev": "ISO 8879:1986",
                        "GlossDef": {
                            "para": "A meta-markup language, used to create markup languages such as DocBook.",
                            "GlossSeeAlso": ["GML", "XML"]
                        },
                        "GlossSee": "markup"
                    }
                }
            }
        }
    }
    """
    df2 = spark.createDataFrame(
        [
            (1, json_string), 
        ],
        ['id', 'txt'] 
    )
    df2.dtypes
    # [('id', 'bigint'), ('txt', 'string')]

    # Feed the raw JSON strings to the JSON reader; Spark infers a nested schema
    new_df = spark.read.json(df2.rdd.map(lambda r: r.txt))
    new_df.printSchema()
    # root
    #  |-- glossary: struct (nullable = true)
    #  |    |-- GlossDiv: struct (nullable = true)
    #  |    |    |-- GlossList: struct (nullable = true)
    #  |    |    |    |-- GlossEntry: struct (nullable = true)
    #  |    |    |    |    |-- Abbrev: string (nullable = true)
    #  |    |    |    |    |-- Acronym: string (nullable = true)
    #  |    |    |    |    |-- GlossDef: struct (nullable = true)
    #  |    |    |    |    |    |-- GlossSeeAlso: array (nullable = true)
    #  |    |    |    |    |    |    |-- element: string (containsNull = true)
    #  |    |    |    |    |    |-- para: string (nullable = true)
    #  |    |    |    |    |-- GlossSee: string (nullable = true)
    #  |    |    |    |    |-- GlossTerm: string (nullable = true)
    #  |    |    |    |    |-- ID: string (nullable = true)
    #  |    |    |    |    |-- SortAs: string (nullable = true)
    #  |    |    |-- title: string (nullable = true)
    #  |    |-- title: string (nullable = true)
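
Note that `spark.read.json` builds a new DataFrame from the JSON column alone, so other columns such as `id` are not carried along. If they need to be kept, a possible alternative is `from_json` with a schema inferred by `schema_of_json` (both available in Spark 2.4+). This is only a sketch: the helper name `parse_json_column` is hypothetical, and it assumes the first non-null row is representative of the whole column.

```python
def parse_json_column(df, col_name, out_name="parsed"):
    """Add a struct column parsed from the JSON strings in `col_name`.

    Assumption: every row shares the structure of the first non-null row,
    because the schema is inferred from that single sample.
    """
    # Imported lazily so the module can be loaded without Spark installed.
    from pyspark.sql import functions as F

    # Take one sample JSON string from the column
    sample = (
        df.select(col_name)
        .filter(F.col(col_name).isNotNull())
        .first()[0]
    )
    # Infer the schema from the sample, then parse every row with it
    schema = F.schema_of_json(F.lit(sample))
    return df.withColumn(out_name, F.from_json(F.col(col_name), schema))
```

For the example above, `parse_json_column(df2, 'txt')` would keep `id` and `txt` and add a `parsed` struct column, so nested fields can be selected with expressions such as `parsed.glossary.title`. Fields missing from the sampled row would be dropped, so an explicit schema is safer when the structure varies between rows.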
    

    【Comments】:
