Pyspark: Rename a dictionary key which is within a DataFrame column

【Question title】: Pyspark: Rename a dictionary key which is within a DataFrame column
【Posted】: 2016-09-23 10:52:41
【Question】:

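After some processing I end up with a DataFrame in which one column holds a dictionary. Now I want to rename the keys of that dictionary: "_1" should become "product_id" and "_2" should become "timestamp".

Here is the processing code: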

df1 = data.select("user_id","product_id","timestamp_gmt").rdd.map(lambda x: (x[0], (x[1],x[2]))).groupByKey()\
.map(lambda x:(x[0], list(x[1]))).toDF()\
.withColumnRenamed('_1', 'user_id')\
.withColumnRenamed('_2', 'purchase_info')

The result looks like this:

【Comments on the question】:

Tags: python dictionary apache-spark dataframe pyspark

【Solution 1】:

Spark 2.0+

Use collect_list and struct:

from pyspark.sql.functions import collect_list, struct, col

# `sc` is the SparkContext (already defined in the pyspark shell)
df = sc.parallelize([
    (1, 100, "2012-01-01 00:00:00"),
    (1, 200, "2016-04-04 00:00:01")
]).toDF(["user_id","product_id","timestamp_gmt"])

pi = (collect_list(struct(col("product_id"), col("timestamp_gmt")))
    .alias("purchase_info"))

df.groupBy("user_id").agg(pi)

Spark < 2.0

Use Rows:

(df
    .select("user_id", struct(col("product_id"), col("timestamp_gmt")))
    .rdd.groupByKey()
    .mapValues(list)  # materialize each group as a list so toDF infers an array of structs
    .toDF(["user_id", "purchase_info"]))

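This is arguably more elegant, but it should have much the same effect as replacing the function passed to map in the question's code with: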

lambda x: (x[0], Row(product_id=x[1], timestamp_gmt=x[2]))

As a side note, these are not dictionaries (MapType) but structs (StructType).
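Because the values are structs rather than maps, another way to rename the fields of an already built column is to cast it to a struct type with the new names. The sketch below is not part of the original answer. The helper names `array_of_struct_ddl` and `rename_struct_fields` are hypothetical, the field types (`bigint`, `string`) are assumptions that should be checked against `df1.printSchema()`, and it assumes a Spark version whose `Column.cast` accepts a DDL type string for nested types.

```python
def array_of_struct_ddl(fields):
    """Build a DDL type string such as 'array<struct<a:bigint,b:string>>'.

    `fields` is a list of (name, spark_sql_type) pairs, given in the same
    order as the fields of the existing struct.
    """
    inner = ",".join("{}:{}".format(name, dtype) for name, dtype in fields)
    return "array<struct<{}>>".format(inner)


def rename_struct_fields(df, column, fields):
    """Return `df` with the struct fields inside the array `column` renamed.

    A struct-to-struct cast matches fields by position, so the number and
    order of `fields` must match the existing struct.
    """
    # Imported lazily so the string helper above works without Spark installed.
    from pyspark.sql.functions import col

    return df.withColumn(column, col(column).cast(array_of_struct_ddl(fields)))


# Assumed field types; adjust them to the actual schema of the column.
new_fields = [("product_id", "bigint"), ("timestamp", "string")]
print(array_of_struct_ddl(new_fields))
# prints: array<struct<product_id:bigint,timestamp:string>>

# Usage against the question's DataFrame (assumes it is bound to df1):
# df1 = rename_struct_fields(df1, "purchase_info", new_fields)
```

The cast only changes field names when the declared types equal the existing ones. If they differ, Spark will also try to convert the values, so it is worth comparing the schema before and after.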

【Comments】: