Concatenating multiple string rows for each unique ID in a particular order: answers

【Question title】: Concat multiple string rows for each unique ID by a particular order
【Posted】: 2019-11-07 21:02:12
【Question description】:

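I want to create a table in which each row is a unique ID and the Place and City columns contain all the places and cities a person has visited, ordered by date of visit, using either PySpark or Hive.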

   df.groupby("ID").agg(F.concat_ws("|",F.collect_list("Place")))

performs the concatenation, but I cannot order the result by date. I would also need to repeat this step separately for each column.

I also tried a window function, as described in this post (collect_list by preserving order based on another variable), but it throws an error: java.lang.UnsupportedOperationException: 'collect_list(') is not supported in a window operation. I want to:

1- Sort the concatenated columns by travel date

2- Perform this step for multiple columns

Data

| ID | Date | Place | City |

| 1  | 2017 | UK    | Birm |
| 2  | 2014 | US    | LA   |
| 1  | 2018 | SIN   | Sin  |
| 1  | 2019 | MAL   | KL   |
| 2  | 2015 | US    | SF   |
| 3  | 2019 | UK    | Lon  |

Expected

| ID | Place       | City          | 

| 1  |  UK,SIN,MAL |  Birm,Sin,KL  |
| 2  |  US,US      |  LA,SF        |
| 3  |  UK         |  Lon          |

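【Discussion】:

Possible duplicate of "collect_list by preserving order based on another variable"
Thanks. The first solution cannot be applied to multiple columns, and the solution that uses window functions throws an error: java.lang.UnsupportedOperationException: 'collect_list('Place) is not supported in a window operation.
Which version of Spark are you using? Window functions do not work in versions that are too old (for example: stackoverflow.com/questions/46628459/…).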

Tags: python apache-spark hive pyspark apache-spark-sql

【Solution 1】:

>>> from pyspark.sql import functions as F
>>> from pyspark.sql import Window
>>> w = Window.partitionBy('ID').orderBy('Date')

# Input data frame
>>> df.show()
+---+----+-----+----+
| ID|Date|Place|City|
+---+----+-----+----+
|  1|2017|   UK|Birm|
|  2|2014|   US|  LA|
|  1|2018|  SIN| Sin|
|  1|2019|  MAL|  KL|
|  2|2015|   US|  SF|
|  3|2019|   UK| Lon|
+---+----+-----+----+

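>>> # Collect running lists per ID in Date order; max() then keeps the complete list,
>>> # because every shorter list is a prefix of it.
>>> df2 = (df.withColumn("Place", F.collect_list("Place").over(w))
...          .withColumn("City", F.collect_list("City").over(w))
...          .groupBy("ID")
...          .agg(F.max("Place").alias("Place"), F.max("City").alias("City")))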

# Data values as lists
>>> df2.show()
+---+--------------+---------------+
| ID|         Place|           City|
+---+--------------+---------------+
|  3|          [UK]|          [Lon]|
|  1|[UK, SIN, MAL]|[Birm, Sin, KL]|
|  2|      [US, US]|       [LA, SF]|
+---+--------------+---------------+


# If you want the values as strings
>>> df2.withColumn("Place", F.concat_ws(" ", "Place")).withColumn("City", F.concat_ws(" ", "City")).show()
+---+----------+-----------+
| ID|     Place|       City|
+---+----------+-----------+
|  3|        UK|        Lon|
|  1|UK SIN MAL|Birm Sin KL|
|  2|     US US|      LA SF|
+---+----------+-----------+
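
A note on why the max() step in this solution returns the full list. Because w has an orderBy and no explicit frame, the default window frame runs from the start of the partition to the current row, so collect_list(...).over(w) produces a growing list on each row. As far as I understand, Spark compares arrays element by element, with a shorter prefix sorting first, so max() per ID picks the longest, complete list. The following is only a minimal pure-Python sketch of that behaviour (no Spark involved), assuming dates are unique within an ID; running_lists and max_per_id are hypothetical helper names, not Spark APIs.

```python
from itertools import groupby
from operator import itemgetter

# Same (ID, Date, Place) values as the input data frame above.
ROWS = [
    (1, 2017, "UK"), (2, 2014, "US"), (1, 2018, "SIN"),
    (1, 2019, "MAL"), (2, 2015, "US"), (3, 2019, "UK"),
]


def running_lists(rows):
    """Mimic collect_list("Place").over(w): each row sees the places
    of its own ID up to and including itself, in Date order."""
    ordered = sorted(rows, key=itemgetter(0, 1))  # partition by ID, order by Date
    out = []
    for id_, partition in groupby(ordered, key=itemgetter(0)):
        seen = []
        for _, _, place in partition:
            seen.append(place)
            out.append((id_, list(seen)))  # copy: one list per row
    return out


def max_per_id(pairs):
    """Mimic groupBy("ID").agg(max(...)): Python lists also compare
    element by element, and a prefix is smaller than its extension."""
    best = {}
    for id_, places in pairs:
        if id_ not in best or places > best[id_]:
            best[id_] = places
    return best


windowed = running_lists(ROWS)
complete = max_per_id(windowed)
print(windowed)
print(complete)
```

If relying on array ordering feels fragile, one option is to give the window an explicit full frame, for example with rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing), so that every row already carries the complete list before the groupBy.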

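【Discussion】:

I use orderBy in the window, which takes care of the ordering within each ID. Please check w in my code (line 3).
It throws an error: java.lang.UnsupportedOperationException: 'collect_list('Place) is not supported in a window operation.
Which Spark version are you using? Did you import all the packages I mentioned? Are you using collect_list("Place") or collect_list(Place)?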
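
If collect_list cannot be used over a window in a given Spark version, the ordering can also be obtained without a window. The sketch below is an assumption on my part, not the answer author's method: it collects a struct of (Date, Place, City) per ID, sorts the resulting array with sort_array (structs compare field by field, so Date is placed first), and joins each field with concat_ws. It assumes a Spark version in which collect_list accepts struct columns (2.x or later, to my knowledge) and string-typed value columns. ordered_concat and ordered_concat_rows are hypothetical names; the second is a plain-Python reference of the same logic that runs without Spark.

```python
from collections import defaultdict


def ordered_concat(df, id_col, order_col, value_cols, sep=","):
    """Spark version: one row per id_col, each value column joined in
    order_col order. Handles several columns in a single pass."""
    # Imported lazily so the plain-Python reference below runs without Spark.
    from pyspark.sql import functions as F

    # The ordering column goes first so sort_array sorts the structs by it.
    visits = F.sort_array(F.collect_list(F.struct(order_col, *value_cols)))
    grouped = df.groupBy(id_col).agg(visits.alias("_visits"))
    # Selecting a field of an array of structs yields an array of that field.
    return grouped.select(
        id_col,
        *[F.concat_ws(sep, F.col("_visits." + c)).alias(c) for c in value_cols]
    )


def ordered_concat_rows(rows, sep=","):
    """Plain-Python reference of the same logic.
    rows: iterable of (id, date, place, city) tuples."""
    grouped = defaultdict(list)
    for id_, date, place, city in rows:
        grouped[id_].append((date, place, city))
    result = {}
    for id_, visits in grouped.items():
        visits.sort()  # tuples sort by date first, like the struct above
        result[id_] = (
            sep.join(v[1] for v in visits),
            sep.join(v[2] for v in visits),
        )
    return result


ROWS = [
    (1, 2017, "UK", "Birm"), (2, 2014, "US", "LA"), (1, 2018, "SIN", "Sin"),
    (1, 2019, "MAL", "KL"), (2, 2015, "US", "SF"), (3, 2019, "UK", "Lon"),
]
print(ordered_concat_rows(ROWS))
```

With the data frame from the question, a call would look like ordered_concat(df, "ID", "Date", ["Place", "City"]). The "," separator is chosen here to match the expected output in the question.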