【Question Title】: Pyspark ArrayType element of nested StructType complex Json
【Posted】: 2021-05-24 22:29:47
【Question Description】:

I am creating a pyspark DataFrame by reading messages from a Kafka topic; each message is a complex JSON. Part of the JSON message looks like this -

{
  "paymentEntity": {
    "id": 3081458,
    "details": {
      "values": [
        {
          "CardType": "VisaDebit"
        },
        {
          "CardNumber": "********8759"
        },
        {
          "WorldPayMasterId": "c670b980c50eb50373f66a1fe2bf8e70d320a0f7"
        }
      ]
    }
  }
}
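To make the target shape concrete: since each entry under `details.values` is a single-key object, flattening the record amounts to merging those objects into one. A plain-Python illustration using only the standard library (this is just to show the intended transformation, not the Spark solution):

```python
# Illustration only: merge the single-key dicts under details.values
# into one flat record. The sample JSON is the one from the question.
import json

msg = json.loads("""
{"paymentEntity": {"id": 3081458, "details": {"values": [
    {"CardType": "VisaDebit"},
    {"CardNumber": "********8759"},
    {"WorldPayMasterId": "c670b980c50eb50373f66a1fe2bf8e70d320a0f7"}
]}}}
""")

entity = msg["paymentEntity"]
flat = {"id": entity["id"]}
for item in entity["details"]["values"]:
    flat.update(item)  # each dict carries exactly one key

# flat now holds id, CardType, CardNumber and WorldPayMasterId together
```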

After reading it into a DataFrame, its schema and data look like this -

root
 |-- details: struct (nullable = true)
 |    |-- values: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- CardNumber: string (nullable = true)
 |    |    |    |-- CardType: string (nullable = true)
 |    |    |    |-- WorldPayMasterId: string (nullable = true)
 |-- id: long (nullable = true)

+-----------------------------------------------------------------------------------+-------+
|details                                                                            |id     |
+-----------------------------------------------------------------------------------+-------+
|[[[, VisaDebit,], [********8759,,], [,, c670b980c50eb50373f66a1fe2bf8e70d320a0f7]]]|3081458|
+-----------------------------------------------------------------------------------+-------+

If I transform it with the following code

from pyspark.sql.functions import col, explode

jsonDF = jsonDF.withColumn("paymentEntity-details-values", explode(col('paymentEntity.details.values'))) \
            .withColumn('id', col('paymentEntity.id')).drop('paymentEntity')

then the output looks like this

root
 |-- paymentEntity-details-values: struct (nullable = true)
 |    |-- CardNumber: string (nullable = true)
 |    |-- CardType: string (nullable = true)
 |    |-- WorldPayMasterId: string (nullable = true)
 |-- id: long (nullable = true)

+---------------------------------------------+-------+
|paymentEntity-details-values                 |id     |
+---------------------------------------------+-------+
|[, VisaDebit,]                               |3081458|
|[********8759,,]                             |3081458|
|[,, c670b980c50eb50373f66a1fe2bf8e70d320a0f7]|3081458|
+---------------------------------------------+-------+
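Each exploded row carries one populated field and nulls elsewhere, so collapsing the three rows back into one is a per-column "first non-null" pick. In plain Python (rows mirror the table above; this only illustrates the merge logic, not Spark code):

```python
# Each exploded row has one populated field; merging them means taking
# the first non-None value per column.
rows = [
    {"CardNumber": None, "CardType": "VisaDebit", "WorldPayMasterId": None, "id": 3081458},
    {"CardNumber": "********8759", "CardType": None, "WorldPayMasterId": None, "id": 3081458},
    {"CardNumber": None, "CardType": None,
     "WorldPayMasterId": "c670b980c50eb50373f66a1fe2bf8e70d320a0f7", "id": 3081458},
]

def first_non_null(values):
    """Return the first value that is not None, or None if all are."""
    return next((v for v in values if v is not None), None)

merged = {col: first_non_null([r[col] for r in rows]) for col in rows[0]}
```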

I want to process it and produce the DataFrame output below, without exploding the array field -

+------------+---------+---------------------------------------------------+-------+
|cardnumber  |CardType |WorldPayMasterId                                   |id     |
+------------+---------+---------------------------------------------------+-------+
|********8759|VisaDebit|c670b980c50eb50373f66a1fe2bf8e70d320a0f7           |3081458|
+------------+---------+---------------------------------------------------+-------+

Can anyone suggest how to achieve this? Any help is appreciated.

【Question Discussion】:

Tags: apache-spark pyspark apache-spark-sql


【Solution 1】:

    Since there is an ArrayType in your struct, exploding makes sense. Afterwards you can select the individual fields and do some aggregation to format the result the way you want:

    from pyspark.sql import functions as F
    
    (df
        .withColumn('id', F.col('paymentEntity.id'))
        .withColumn('tmp', F.explode(F.col('paymentEntity.details.values')))
        .withColumn('CardNumber', F.col('tmp.CardNumber'))
        .withColumn('CardType', F.col('tmp.CardType'))
        .withColumn('WorldPayMasterId', F.col('tmp.WorldPayMasterId'))
        .drop('paymentEntity', 'tmp')
        .groupBy('id')
        .agg(
            F.first('CardNumber', ignorenulls=True).alias('CardNumber'),
            F.first('CardType', ignorenulls=True).alias('CardType'),
            F.first('WorldPayMasterId', ignorenulls=True).alias('WorldPayMasterId'),
        )
        .show(10, False)
    )
    
    +-------+------------+---------+----------------------------------------+
    |id     |CardNumber  |CardType |WorldPayMasterId                        |
    +-------+------------+---------+----------------------------------------+
    |3081458|********8759|VisaDebit|c670b980c50eb50373f66a1fe2bf8e70d320a0f7|
    +-------+------------+---------+----------------------------------------+
    

    【Discussion】:

    • Thanks for your help.