How to concatenate complex data type columns with other type columns in a PySpark DataFrame? (Answers)

【Question title】: How to concatenate complex data type columns with other type columns in a PySpark DataFrame?
【Posted】: 2020-11-27 17:17:03
【Question description】:

I am trying to concatenate columns of type string, int, array<string>, array<array<string>> and an array of structs:

|-- components: array (nullable = true)
 |    |-- element: struct (containsNull = true)

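But calling concat_ws on these columns raises the following error (quoted as it appears in the post): "array type. Argument 28 requires (array or string) type"

It seems concat_ws() does not work with complex data types. Is there an alternative to concat_ws() that meets this requirement? It should also work dynamically: column names must not be hardcoded, and it should work for any set of columns.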

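【Comments】:

Just curious, what is your motivation for doing this?
I want to compute an md5 over the whole row, but the md5 function takes a single argument, so I am trying to concatenate all the columns and then compute the md5.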

Tags: arrays python-3.x pyspark complex-data-types

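【Solution 1】:

I tried combining the whole row into a single array column and then serializing that array to JSON. After that, md5 works on the resulting string.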

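# Assumes an active SparkSession named `spark`
from pyspark.sql import functions as F

# Two columns, both of type array<array<string>>
df = spark.createDataFrame([[1]]).selectExpr("array(array('1','2'),array('3','4')) as col", "array(array('1','2'),array('3','4')) as col2")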

df.show()
+----------------+----------------+
|             col|            col2|
+----------------+----------------+
|[[1, 2], [3, 4]]|[[1, 2], [3, 4]]|
+----------------+----------------+

df.select(F.md5(F.to_json(F.array(df.columns))).alias('md5')).show(truncate=False)
+--------------------------------+
|md5                             |
+--------------------------------+
|ae5cf1132240349bdc100d9f6ff4dd8b|
+--------------------------------+

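【Discussion】:

Yes, this works, but array() only accepts columns of the same type. You can instead convert the row directly to JSON and compute the md5 on that, without putting the row into an array at all.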