OP 想要将所有数组/列表聚合到第一行。
values = [(['Justin','Lee'],),(['Chatbots','were'],),(['Our','hopes','were'],),
(['And','why','wouldn'],),(['At','the','Mobile'],)]
df = sqlContext.createDataFrame(values,['author',])
df.show()
+------------------+
| author|
+------------------+
| [Justin, Lee]|
| [Chatbots, were]|
|[Our, hopes, were]|
|[And, why, wouldn]|
| [At, the, Mobile]|
+------------------+
这一步就够了。
from pyspark.sql import functions as F
df = df.groupby().agg(F.collect_list('author').alias('list_of_authors'))
df.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|list_of_authors |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|[WrappedArray(Justin, Lee), WrappedArray(Chatbots, were), WrappedArray(Our, hopes, were), WrappedArray(And, why, wouldn), WrappedArray(At, the, Mobile)]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------+