[Posted]: 2021-05-24 22:29:47
[Question]:
I am creating a PySpark DataFrame by reading messages from a Kafka topic; each message is a complex JSON document. Part of the JSON message looks like this:
{
  "paymentEntity": {
    "id": 3081458,
    "details": {
      "values": [
        { "CardType": "VisaDebit" },
        { "CardNumber": "********8759" },
        { "WorldPayMasterId": "c670b980c50eb50373f66a1fe2bf8e70d320a0f7" }
      ]
    }
  }
}
After reading it into a DataFrame, its schema and data look like this:
root
|-- details: struct (nullable = true)
| |-- values: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- CardNumber: string (nullable = true)
| | | |-- CardType: string (nullable = true)
| | | |-- WorldPayMasterId: string (nullable = true)
|-- id: long (nullable = true)
+-----------------------------------------------------------------------------------+-------+
|details |id |
+-----------------------------------------------------------------------------------+-------+
|[[[, VisaDebit,], [********8759,,], [,, c670b980c50eb50373f66a1fe2bf8e70d320a0f7]]]|3081458|
+-----------------------------------------------------------------------------------+-------+
If I transform it with the following code
from pyspark.sql.functions import col, explode

jsonDF = jsonDF.withColumn("paymentEntity-details-values",
                           explode(col("paymentEntity.details.values"))) \
               .withColumn("id", col("paymentEntity.id")) \
               .drop("paymentEntity")
then the output looks like this:
root
|-- paymentEntity-details-values: struct (nullable = true)
| |-- CardNumber: string (nullable = true)
| |-- CardType: string (nullable = true)
| |-- WorldPayMasterId: string (nullable = true)
|-- id: long (nullable = true)
+---------------------------------------------+-------+
|paymentEntity-details-values |id |
+---------------------------------------------+-------+
|[, VisaDebit,] |3081458|
|[********8759,,] |3081458|
|[,, c670b980c50eb50373f66a1fe2bf8e70d320a0f7]|3081458|
+---------------------------------------------+-------+
I want to process it and produce the DataFrame output below without exploding the array field:
+------------+---------+----------------------------------------+-------+
|CardNumber  |CardType |WorldPayMasterId                        |id     |
+------------+---------+----------------------------------------+-------+
|********8759|VisaDebit|c670b980c50eb50373f66a1fe2bf8e70d320a0f7|3081458|
+------------+---------+----------------------------------------+-------+
Could anyone suggest how to achieve this? Any help is appreciated.
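The key observation is that each struct in the `values` array carries exactly one non-null field, so flattening the record amounts to merging the single-key objects into one. As a minimal pure-Python sketch of that merge logic (using the sample message from the question; this illustrates the transformation, not the Spark API itself):

```python
import json

# Sample Kafka message body from the question.
message = """
{
  "paymentEntity": {
    "id": 3081458,
    "details": {
      "values": [
        { "CardType": "VisaDebit" },
        { "CardNumber": "********8759" },
        { "WorldPayMasterId": "c670b980c50eb50373f66a1fe2bf8e70d320a0f7" }
      ]
    }
  }
}
"""

entity = json.loads(message)["paymentEntity"]

# Merge the single-key structs into one flat record, then attach id.
row = {}
for item in entity["details"]["values"]:
    row.update(item)
row["id"] = entity["id"]

print(row)
```

Inside Spark, the same collapse can be expressed without `explode` via higher-order array functions (available since Spark 2.4), e.g. one column at a time with something like `F.expr("filter(paymentEntity.details.values, v -> v.CardNumber is not null)[0].CardNumber")`; treat the exact expression as a sketch to verify against your Spark version.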
[Comments]:
Tags: apache-spark pyspark apache-spark-sql