[Posted]: 2018-02-21 23:35:17
[Question]:
I have a huge pyspark dataframe. I have to perform a group by, but I am running into serious performance problems. I need to optimize the code, and I have read that reduceByKey is more efficient.
Here is a sample of the dataframe.
a = [("Bob", 562, "Food", "12 May 2018"),
     ("Bob", 880, "Food", "01 June 2018"),
     ("Bob", 380, "Household", " 16 June 2018"),
     ("Sue", 85, "Household", " 16 July 2018"),
     ("Sue", 963, "Household", " 16 Sept 2018")]
df = spark.createDataFrame(a, ["Person", "Amount", "Budget", "Date"])
Output:
+------+------+---------+-------------+
|Person|Amount| Budget| Date|
+------+------+---------+-------------+
| Bob| 562| Food| 12 May 2018|
| Bob| 880| Food| 01 June 2018|
| Bob| 380|Household| 16 June 2018|
| Sue| 85|Household| 16 July 2018|
| Sue| 963|Household| 16 Sept 2018|
+------+------+---------+-------------+
I have implemented the following code, but as mentioned, the actual dataframe is huge.
from pyspark.sql import functions as F

df_grouped = df.groupby("person").agg(
    F.collect_list(F.struct("Amount", "Budget", "Date")).alias("data")
)
Output:
+------+--------------------------------------------------------------------------------+
|person|data |
+------+--------------------------------------------------------------------------------+
|Sue |[[85,Household, 16 July 2018], [963,Household, 16 Sept 2018]] |
|Bob |[[562,Food,12 May 2018], [880,Food,01 June 2018], [380,Household, 16 June 2018]]|
+------+--------------------------------------------------------------------------------+
The schema is:
root
|-- person: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Amount: long (nullable = true)
| | |-- Budget: string (nullable = true)
| | |-- Date: string (nullable = true)
I need to convert the group by to a reduce by key, so that I end up with the same schema as above.
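For reference, here is a minimal, untested sketch of what a reduceByKey version could look like (my own assumption about the conversion, not something from the original post). Note that reducing by concatenating Python lists shuffles just as much data as collect_list does, and dropping from the DataFrame API to the RDD API also gives up the Catalyst/Tungsten optimizations, so this is unlikely to be faster.

from pyspark.sql import Row

# Sketch only: map each row to (person, [Row]), then concatenate the
# per-person lists. List concatenation inside reduceByKey moves the
# same volume of data through the shuffle as collect_list would.
pairs = df.rdd.map(
    lambda r: (r["Person"],
               [Row(Amount=r["Amount"], Budget=r["Budget"], Date=r["Date"])])
)
df_reduced = pairs.reduceByKey(lambda a, b: a + b).toDF(["person", "data"])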
[Discussion]:
- Where did you read that reduceByKey() is more efficient? Can you share a link?
- groupBy will be your best option here
Tags: python group-by pyspark spark-dataframe