PySpark RDD to dataframe with list of tuple and dictionary

[Question title]: PySpark RDD to dataframe with list of tuple and dictionary
[Posted]: 2018-08-10 00:32:16
[Question]:

I processed some data in PySpark, and the result is an RDD with this structure:

[(u'991', {'location': 'Australia', 'Age': '27', 'Colour': 'Pink'}),
 (u'993', {'location': 'Singapore', 'Age': '55', 'Colour': 'Black'}),
 (u'993', {'location': 'Mexico', 'Age': '12', 'Colour': 'Blue'}),
 (u'994', {'location': 'USA', 'Age': '24', 'Colour': 'Red'})]

How can I convert this structure to a DataFrame? My end goal is to store a Hive table with 4 columns: ID (i.e. 991), location, Age, and Colour.

Since the dictionary sits inside a tuple, the Row solution doesn't seem to work.

[Question comments]:

Tags: python dictionary apache-spark dataframe rdd

[Solution 1]:

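Convert each tuple to a Row object, then call the toDF method. Row(ID=t[0], **t[1]) passes the dictionary inside the tuple to Row as keyword arguments, and ID=t[0] adds one more field, named ID, holding the tuple's first element: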

from pyspark.sql import Row
# Flatten each (id, dict) tuple into a single Row, then build and print the DataFrame.
rdd.map(lambda t: Row(ID=t[0], **t[1])).toDF().show()
+---+------+---+---------+
|Age|Colour| ID| location|
+---+------+---+---------+
| 27|  Pink|991|Australia|
| 55| Black|993|Singapore|
| 12|  Blue|993|   Mexico|
| 24|   Red|994|      USA|
+---+------+---+---------+
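The keyword-unpacking step can be tried without a cluster. Below is a minimal sketch, not part of the original answer: `flatten_record`, `to_dataframe`, and `COLUMNS` are hypothetical names introduced for illustration, and the sample uses two of the question's records. The explicit `select` is there because the table above shows columns sorted alphabetically, while newer Spark releases (3.0 and later, on Python 3.6+) keep the keyword order of Row instead, so the column order may differ by version.

```python
# Sketch only: flatten_record, to_dataframe and COLUMNS are illustrative
# names, not part of the original answer.

COLUMNS = ["ID", "location", "Age", "Colour"]


def flatten_record(record):
    """Turn (key, {field: value, ...}) into one flat dict with an ID field."""
    key, attrs = record
    # Same keyword unpacking as Row(ID=t[0], **t[1]) in the answer.
    # Raises TypeError if attrs already contains an 'ID' key.
    return dict(ID=key, **attrs)


def to_dataframe(rdd):
    """Build a DataFrame with a fixed column order from the RDD of tuples."""
    # Imported here so flatten_record can be tried without Spark installed.
    from pyspark.sql import Row

    df = rdd.map(lambda t: Row(**flatten_record(t))).toDF()
    # Row's field ordering differs between Spark versions, so select explicitly.
    return df.select(*COLUMNS)


# Pure-Python check of the flattening step on two records from the question.
sample = [
    (u'991', {'location': 'Australia', 'Age': '27', 'Colour': 'Pink'}),
    (u'993', {'location': 'Singapore', 'Age': '55', 'Colour': 'Black'}),
]
flat = [flatten_record(t) for t in sample]
print(flat[0])
```

For the Hive goal in the question, one possible follow-up, assuming a session created with Hive support enabled, would be `to_dataframe(rdd).write.saveAsTable("some_table")`, where the table name is a placeholder.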

[Discussion]:

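This doesn't seem to work unless I run it on only one of the data points, i.e. t[1] or t[0].
@shecode, try it now. I edited the answer, since the swap was needed for the code to work. :)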