【Question Title】: Split the Array column in pyspark dataframe
【Posted】: 2023-03-05 02:09:01
【Question】:

I have a dataframe that looks like this:

 col1
 -----------------
 [a1_b1_c1, a2_b2_c2, a3_b3_c3]
 [aa1_bb1_cc1, aa2_bb2_cc2, aa3_bb3]
 [aaa2_bbb2_ccc1, aaa2_bbb2_cc2, aaa3_bbb3]

Now I want to split the elements to produce the dataframe below:

newcol1
--------
[c1,c2,c3]
[cc1,cc2,null]
[ccc1,ccc2,null]

Please suggest the best way to achieve this.
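For reference, a minimal sketch that reproduces the sample dataframe (assuming an active SparkSession named spark; the column name col1 comes from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each row holds a single array column, mirroring the sample data above.
df = spark.createDataFrame(
    [
        (["a1_b1_c1", "a2_b2_c2", "a3_b3_c3"],),
        (["aa1_bb1_cc1", "aa2_bb2_cc2", "aa3_bb3"],),
        (["aaa2_bbb2_ccc1", "aaa2_bbb2_cc2", "aaa3_bbb3"],),
    ],
    ["col1"],
)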

【Comments】:

  • Which version of Spark are you using?
  • spark 2.4.4 and python3
  • @Yeskay Glad you found my solution helpful! Besides accepting it, please also consider upvoting it :)

Tags: arrays dataframe apache-spark pyspark split


【Solution 1】:

You can use the built-in higher-order function TRANSFORM.

df.createTempView("tab")

# TRANSFORM applies the lambda to every array element; split(v, '_')[2]
# yields null when an element has fewer than three '_'-separated tokens.
spark.sql(
    "SELECT col1, TRANSFORM(col1, v -> split(v, '_')[2]) AS newcol1 FROM tab"
).show(truncate=False)

+------------------------------------------+------------+
|col1                                      |newcol1     |
+------------------------------------------+------------+
|[a1_b1_c1, a2_b2_c2, a3_b3_c3]            |[c1, c2, c3]|
|[aa1_bb1_cc1, aa2_bb2_cc2, aa3_bb3]       |[cc1, cc2,] |
|[aaa2_bbb2_ccc1, aaa2_bbb2_cc2, aaa3_bbb3]|[ccc1, cc2,]|
+------------------------------------------+------------+
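The same transform also works without a temp view by wrapping the SQL expression in F.expr. The dedicated pyspark.sql.functions.transform helper only arrived in Spark 3.1, but the SQL TRANSFORM used above is available from Spark 2.4, so this fits the asker's 2.4.4. A sketch, assuming df is the dataframe from the question:

import pyspark.sql.functions as F

# Same higher-order TRANSFORM, expressed through expr() on the DataFrame API.
df.withColumn("newcol1", F.expr("transform(col1, v -> split(v, '_')[2])")) \
  .show(truncate=False)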

【Comments】:

【Solution 2】:

You can achieve the output you want with a udf:

import pyspark.sql.functions as F
import pyspark.sql.types as T

def my_split(l):
    # Split each element on '_', pad the result to three tokens with None,
    # then keep the third token of every element.
    ll = [e.split('_') + [None] * (3 - len(e.split('_'))) for e in l]
    return [e[2] for e in ll]

my_udf = F.udf(my_split, T.ArrayType(T.StringType()))

df = df.withColumn('newcol1', my_udf('col1'))

df.show(truncate=False)
    
+------------------------------------------+-----------------+
|col1                                      |newcol1          |
+------------------------------------------+-----------------+
|[a1_b1_c1, a2_b2_c2, a3_b3_c3]            |[c1, c2, c3]     |
|[aa1_bb1_cc1, aa2_bb2_cc2, aa3_bb3]       |[cc1, cc2, null] |
|[aaa2_bbb2_ccc1, aaa2_bbb2_cc2, aaa3_bbb3]|[ccc1, cc2, null]|
+------------------------------------------+-----------------+
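Because my_split is plain Python, it can be sanity-checked locally before being registered as a udf (the input below is made up for illustration):

# 'a2_b2' splits into two tokens and is padded with None, so index 2 is None.
assert my_split(["a1_b1_c1", "a2_b2"]) == ["c1", None]

Note that a Python udf serializes rows between the JVM and the Python worker, so on larger data the built-in TRANSFORM from Solution 1 will generally be faster.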
    

【Comments】:
