【Title】: How to get the most frequent value in an array column in PySpark?
【Posted】: 2021-11-19 17:11:20
【Question】:

I have a PySpark dataframe like the one below:

columns = ["id", "values"]
data = [("sample1", ["a", "b", "a"]), ("sample2", ["b", "b", "a", "c"])]
# parallelize() would only give an RDD; createDataFrame builds the DataFrame directly
df = spark.createDataFrame(data, columns)

Source:

+-------+--------------------+
|     id|              values|
+-------+--------------------+
|sample1|       ["a","b","a"]|
|sample2|   ["b","b","a","c"]|
+-------+--------------------+

I want to build a column containing the most common value in each array, producing a dataframe like this:

+-------+--------------------+---------+
|     id|              values|   common|
+-------+--------------------+---------+
|sample1|       ["a","b","a"]|      "a"|
|sample2|   ["b","b","a","c"]|      "b"|
+-------+--------------------+---------+

【Comments】:

Tags: python apache-spark pyspark apache-spark-sql


【Solution 1】:

You can explode the array `values`, group by to count the occurrences of each value, and use a Window to keep only the value with the highest count:

    from pyspark.sql import Window
    import pyspark.sql.functions as F
    
    df1 = df.withColumn(
        "common",
        F.explode("values")
    ).groupBy("id", "values", "common").count().withColumn(
        "rn",
        F.row_number().over(Window.partitionBy("id", "values").orderBy(F.col("count").desc()))
    ).filter("rn = 1").drop("rn", "count")
    
    df1.show()
    #+-------+------------+------+
    #|id     |values      |common|
    #+-------+------------+------+
    #|sample1|[a, b, a]   |a     |
    #|sample2|[b, b, a, c]|b     |
    #+-------+------------+------+
    

Another approach that avoids `explode` is to use the higher-order functions `transform` and `filter` together with a few array functions:

    df1 = df.withColumn(
        "common",
        F.array_max(
            F.expr("""transform(
                        array_distinct(values), 
                        x -> struct(
                                size(filter(values, y -> y = x)) as count, 
                                x as value
                            )
                    )""")
        )["value"]
    )
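For small arrays, the same "most common element" logic can also be sanity-checked (or wrapped as a UDF) in plain Python with `collections.Counter` — a sketch, not part of the original answer; note that `most_common(1)` breaks ties by first occurrence in the list:

```python
from collections import Counter

def most_common(values):
    # Returns the element with the highest count in the list,
    # or None for an empty/missing array. Ties go to the element
    # encountered first.
    if not values:
        return None
    return Counter(values).most_common(1)[0][0]

# Hypothetical Spark usage (requires an active SparkSession):
# from pyspark.sql import functions as F
# from pyspark.sql.types import StringType
# most_common_udf = F.udf(most_common, StringType())
# df.withColumn("common", most_common_udf("values")).show()

print(most_common(["a", "b", "a"]))       # a
print(most_common(["b", "b", "a", "c"]))  # b
```

A Python UDF is slower than the built-in expressions above because each row crosses the JVM/Python boundary, but it can be easier to read and debug.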
    

【Discussion】:
