How to count unique IDs after groupBy in PySpark: answers

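【Question title】: How to count unique ID after groupBy in pyspark
【Posted】: 2018-03-07 09:29:21
【Question description】:

I am using the following code to aggregate students per year. The goal is to find the total number of students for each year.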

from pyspark.sql.functions import col
import pyspark.sql.functions as fn

gr = Df2.groupby(['Year'])
# count() counts every non-null row, so a repeated Student_ID is counted each time it appears
df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))

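The problem is that many IDs are repeated, so the result is wrong and far too large.

I want to aggregate students by year and get the total number of students per year, counting each ID only once.

【Question discussion】:

I loaded the data from a Hive table.

Tags: python pyspark apache-spark-sql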

【Solution 1】:

Use the countDistinct function:

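from pyspark.sql.functions import countDistinct

# Sample (year, id) rows with repeated IDs; `spark` is assumed to be an existing SparkSession
x = [("2001","id1"),("2002","id1"),("2002","id1"),("2001","id1"),("2001","id2"),("2001","id2"),("2002","id2")]
y = spark.createDataFrame(x,["year","id"])

# Count each distinct id once per year
gr = y.groupBy("year").agg(countDistinct("id"))
gr.show()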

Output:

+----+------------------+
|year|count(DISTINCT id)|
+----+------------------+
|2002|                 2|
|2001|                 2|
+----+------------------+

【Discussion】:

For completeness, you can also rename the column with .alias().
Note that countDistinct does not count null as a distinct value!
Which version is this based on?

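【Solution 2】:

You can also do it this way (here y is the original, ungrouped DataFrame, as in Solution 1):

y.groupBy("year", "id").count().groupBy("year").count()

The first groupBy collapses the data to one row per (year, id) pair; the second counts those rows per year, so the query returns the number of unique students for each year.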

【Discussion】:

【Solution 3】:

Note that neither countDistinct() nor multiple chained aggregations are supported on streaming DataFrames.

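【Discussion】:

【Solution 4】:

If you are on an older version of Spark that does not have the countDistinct function, you can replicate it by combining the size and collect_set functions, like this:

import pyspark.sql.functions as fn
# y is the original, ungrouped DataFrame (as in Solution 1)
gr = y.groupBy("year").agg(fn.size(fn.collect_set("id")).alias("distinct_count"))

If you need a distinct count over multiple columns, concatenate those columns into a single new column with concat and then apply the same approach.

【Discussion】: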