Sorting a PySpark DataFrame by column values

【Question title】: Sorting a PySpark DataFrame according to column values
【Posted】: 2020-05-15 08:16:54
【Question】:

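I am a beginner with Spark and I am looking for a solution to my problem. I am trying to reorder the columns of a DataFrame in ascending order of the number of null values each column contains.

For example, given this data: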

Column1    Column2     Column3
a          d           h
b          null        null
null       e           i
null       f           h
null       null        k
c          g           l

The sorted DataFrame should have its columns in this order:

Column3     Column2     Column1

All I have managed to do so far is count the null values in each column.

from pyspark.sql.functions import col, count, when
data.select([count(when(col(c).isNull(), c)).alias(c) for c in data.columns])

Now I don't know how to continue. I hope you can help me.
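The expected order in the question can be checked by hand. The following is only a plain-Python sketch with no Spark involved: the `rows` list retypes the sample data, `None` stands in for null, and the variable names are chosen for illustration.

```python
# Plain-Python check of the example; no Spark involved.
# None stands in for null; the rows retype the sample data from the question.
columns = ["Column1", "Column2", "Column3"]
rows = [
    ("a", "d", "h"),
    ("b", None, None),
    (None, "e", "i"),
    (None, "f", "h"),
    (None, None, "k"),
    ("c", "g", "l"),
]

# Count the None entries in each column position.
null_counts = {
    name: sum(1 for row in rows if row[i] is None)
    for i, name in enumerate(columns)
}

# Sort the column names by ascending null count.
expected_order = sorted(columns, key=lambda name: null_counts[name])

print(null_counts)     # {'Column1': 3, 'Column2': 2, 'Column3': 1}
print(expected_order)  # ['Column3', 'Column2', 'Column1']
```

Column3 has one null, Column2 has two and Column1 has three, which gives the order stated in the question.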

【Comments】:

Does this answer your question? Python/pyspark data frame rearrange columns
Related: *.com/q/44627386/11301900

Tags: python dataframe pyspark pyspark-sql

【Solution 1】:

My solution, use it as you see fit:

from pyspark.sql.functions import col, count, when

# Based on your code: count the nulls per column (kept in its own variable so df is not overwritten)
null_counts_df = df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns])

# Convert the single result row to a dictionary (Python 3.x)
null_counts = null_counts_df.first().asDict()

# Create a dictionary sorted by value, i.e. by null count (relies on dict insertion order, guaranteed in Python 3.7+)
sorted_dict = {k: v for k, v in sorted(null_counts.items(), key=lambda item: item[1])}

# Create a sorted list with the column names
sorted_cols = list(sorted_dict.keys())

# With the .select() method we re-order the columns of the original dataframe
df.select(sorted_cols).show()

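【Discussion】:

Thank you very much for your reply. However, it raises an error on the dictionary line: "Unsupported class file major version 55". I will try to fix it. Thanks a lot.
@Mus Are you using Python 2.x? My implementation targets Python 3.x.
For Python 2.x, take a look at this post: *.com/questions/9001509/…
If my answer works for you, you can accept it :-)
Yes, I was using Python 2.7. I tried Python 3 and your answer works 100%. Thanks again.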
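
On the Python 2.x point raised in the discussion: the ordering step does not have to depend on dictionaries keeping their order, which is only guaranteed from Python 3.7. A minimal sketch, assuming the null counts have already been collected into a plain dict (the values below are the ones for the question's sample data, and `columns_by_null_count` is a hypothetical helper, not part of PySpark), sorts the column names directly into a list:

```python
def columns_by_null_count(null_counts):
    """Return column names ordered by ascending null count.

    `null_counts` maps column name -> number of nulls, for example the
    result of calling .asDict() on the single row of counts.
    (Hypothetical helper for illustration, not part of PySpark.)
    """
    # sorted() over (name, count) pairs returns a list, so the result does
    # not depend on dict ordering; ties are broken by column name.
    pairs = sorted(null_counts.items(), key=lambda item: (item[1], item[0]))
    return [name for name, _ in pairs]


# Null counts for the sample data in the question (assumed already computed).
null_counts = {"Column1": 3, "Column2": 2, "Column3": 1}
sorted_cols = columns_by_null_count(null_counts)
print(sorted_cols)  # ['Column3', 'Column2', 'Column1']
```

The resulting list can then be passed to `df.select(sorted_cols)` in the same way as in the answer.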