【Posted】: 2021-11-19 17:11:20
【Question】:
I have a PySpark DataFrame like the following:
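from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
columns = ["id", "values"]
data = [("sample1", ["a", "b", "a"]), ("sample2", ["b", "b", "a", "c"])]
# createDataFrame returns a DataFrame; sparkContext.parallelize would only return an RDD
dataframe = spark.createDataFrame(data, columns)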
Source:
+-------+--------------------+
|     id|              values|
+-------+--------------------+
|sample1|       ["a","b","a"]|
|sample2|   ["b","b","a","c"]|
+-------+--------------------+
I want to build a column containing the most common value in each array, to get a DataFrame like this:
+-------+--------------------+---------+
|     id|              values|   common|
+-------+--------------------+---------+
|sample1|       ["a","b","a"]|      "a"|
|sample2|   ["b","b","a","c"]|      "b"|
+-------+--------------------+---------+
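
One possible way to get the `common` column is a Python UDF built on `collections.Counter`. This is only a minimal sketch, not the only approach and not necessarily the fastest: the helper names `most_common` and `add_common_column`, and the parameters `source_col` and `target_col`, are hypothetical and introduced here for illustration. It assumes the `values` column is an array of strings, and it resolves ties in favour of the element that appears first in the array.

```python
from collections import Counter


def most_common(values):
    """Return the most frequent element of a list, or None for an empty or null list.

    Hypothetical helper, not part of PySpark.
    """
    if not values:
        return None
    # most_common(1) returns [(element, count)]; on a tie, the element
    # seen first in the list is returned.
    return Counter(values).most_common(1)[0][0]


def add_common_column(df, source_col="values", target_col="common"):
    """Append target_col with the most frequent element of the array in source_col.

    Hypothetical helper; assumes source_col is an array of strings.
    """
    # Imported lazily so most_common stays usable without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    most_common_udf = F.udf(most_common, StringType())
    return df.withColumn(target_col, most_common_udf(F.col(source_col)))
```

Given a DataFrame `df` with an `id` column and an array-of-strings `values` column (for example one built with `spark.createDataFrame(data, columns)`), `add_common_column(df).show()` would display the extra column. A Python UDF serializes each row between the JVM and Python, so on large data a solution based on built-in Spark SQL functions may perform better.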
【Comments】:
Tags: python apache-spark pyspark apache-spark-sql