Top n elements by column: answers

【Question title】: Top n elements by column
【Posted】: 2026-01-01 00:30:01
【Question】:

Suppose I have this Spark DataFrame:

col1 | col2 | col3 | col4
   a |    g |    h |    p
   r |    i |    h |    l
   f |    j |    z |    d
   a |    j |    m |    l
   f |    g |    h |    q
   f |    z |    z |    a
 ...

I want to unpivot the columns and, for each one, get an array of its top n elements by number of occurrences. For example, with n=3:

columnName |   content
      col1 | [f, a, r]
      col2 | [g, j, i]
      col3 | [h, z, m]
      col4 | [l, a, d]
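
To pin down the expected result, here is a minimal plain-Python sketch (not Spark) that counts the values in each column of the six rows shown above and keeps the n most frequent ones. The helper name `top_n_per_column` is hypothetical, and breaking ties alphabetically is an assumption, chosen only because it reproduces the table above.

```python
from collections import Counter

# The six sample rows from the question, in column order col1..col4
rows = [
    ("a", "g", "h", "p"),
    ("r", "i", "h", "l"),
    ("f", "j", "z", "d"),
    ("a", "j", "m", "l"),
    ("f", "g", "h", "q"),
    ("f", "z", "z", "a"),
]
columns = ["col1", "col2", "col3", "col4"]


def top_n_per_column(rows, columns, n):
    """Return {column name: the n most frequent values in that column}."""
    result = {}
    for idx, name in enumerate(columns):
        counts = Counter(row[idx] for row in rows)
        # Sort by descending count, then alphabetically (assumed tie-break)
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        result[name] = [value for value, _ in ranked[:n]]
    return result


top3 = top_n_per_column(rows, columns, 3)
for name in columns:
    print(name, top3[name])
```

Without an explicit tie-break, values with equal counts (such as the single occurrences in col4) could come back in any order, so the exact arrays depend on that choice.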

I managed to put the column names into a single column with this code:

columnNames = output_df.columns
output_df = output_df.withColumn("columns", F.array([F.lit(x) for x in columnNames]))

I think I could use the explode function, but I'm not sure it is the most efficient way.

Any suggestions?

Thanks

【Comments】:

What do you mean by top n elements? By number of occurrences?
@BlueSheepToken Yes, the n elements that occur most often. I'll update the question.

Tags: python hive pyspark apache-spark-sql

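【Solution 1】:

I can't see any way other than counting all the occurrences manually, which isn't very efficient. I'd be happy to hear about other approaches.

However, if performance isn't a concern, this does the job!

Note that I wrote this in Scala. I'll try to translate it to PySpark, but since I've never done that before, it will be hard.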

import org.apache.spark.sql.functions.{desc, lit, sum}
import spark.implicits._ // needed for toDF; assumes a SparkSession named `spark`

// Let's create a DataFrame for reproducibility
val data = Seq(("a", "g", "h", "p"),
("r", "i", "h", "l"),
("f", "j", "z", "d"),
("a", "j", "m", "l"),
("f", "g", "h", "q"),
("f", "z", "z", "a"))

val df = data.toDF("col1", "col2", "col3", "col4")

// Add a constant 1; summing it per group gives the number of occurrences
val dfWithFuturOccurences = df.withColumn("futur_occurences", lit(1))

// Your n value
val n = 3

// Here goes the magic
df.columns // For each column
    .map(x => 
        (x, dfWithFuturOccurences
            .groupBy(x)
            .agg(sum("futur_occurences").alias("occurences")) // Count occurrences here
            .orderBy(desc("occurences"))
            .select(x)
            .limit(n) // Select the top n elements
            .rdd.map(r => r(0).toString).collect().toSeq) // Collect them and store them as a Seq of strings
        )
    .toSeq
    .toDF("col", "top_elements")

In PySpark it might look like this:

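import pyspark.sql.functions as F

# Assumes `df` is the PySpark equivalent of the Scala DataFrame above
dfWithFuturOccurences = df.withColumn("futur_occurences", F.lit(1))
n = 3

# For each column, count the occurrences and collect the n most frequent values
data = [
    (x,
     [str(row[x]) for row in
      dfWithFuturOccurences
      .groupBy(x)
      .agg(F.sum("futur_occurences").alias("occurences"))  # F.sum, not Python's built-in sum
      .orderBy(F.desc("occurences"))
      .select(x)
      .limit(n)
      .collect()])
    for x in df.columns
]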

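Then convert `data` to a DataFrame, for example with spark.createDataFrame(data, ["col", "top_elements"]) (assuming a SparkSession named spark), and you're done!

Output: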

+----+------------+
| col|top_elements|
+----+------------+
|col1|   [f, a, r]|
|col2|   [g, j, z]|
|col3|   [h, z, m]|
|col4|   [l, p, d]|
+----+------------+

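【Discussion】:

Thanks for your reply. Shouldn't you use dfWithFuturOccurences instead of dfWithOccurences at the start of the map? I tried your code (pyspark) and got this error: in lambda TypeError: unsupported operand type(s) for +: 'int' and 'str'
@Maxbester You're right, I got mixed up! Can you tell me which line raises the error?
The aggregation line, .agg(sum("futur_occurences").alias("occurences")). I don't know whether it's a cast problem with F.lit(1) or a problem with the format of the data in the DataFrame.
@Maxbester, this should work now. I think the sum being called was not the one from the Spark SQL functions but Python's built-in sum (sum('hello world') raises the same error).
@Maxbester, did this help you?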
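
The diagnosis in the comments can be checked without Spark. The short sketch below calls Python's built-in `sum` on the column name string, the way an unqualified `sum("futur_occurences")` would if `sum` had not been imported from `pyspark.sql.functions`. The helper name `builtin_sum_error` is hypothetical.

```python
def builtin_sum_error(column_name):
    """Call the built-in sum on a string and return the error message, if any."""
    try:
        # The built-in sum starts from the int 0 and adds each character,
        # so the first addition is 0 + "f", which is int + str
        sum(column_name)
    except TypeError as exc:
        return str(exc)
    return None


message = builtin_sum_error("futur_occurences")
print(message)
```

Qualifying the call as `F.sum(...)` after `import pyspark.sql.functions as F` avoids the name clash, because the built-in is then never involved.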