【Title】: Join multiple spark dfs, combine array column with union of all values
【Posted】: 2020-02-07 06:34:05
【Question】:

Suppose my dfs each have two columns: id (int) and names (array[string])

df1: 
1 []
3 ['alice']
4 ['bob']

df2: 
1 ['jack']
2 ['breanna']
3 []

df3: 
1 ['anna']
3 ['rob', 'jerry']
4 []

I want to combine them all into:

df_union: 
1 ['jack', 'anna']
2 ['breanna']
3 ['alice','rob','jerry']
4 ['bob']

Here is a udf I made to help with this:

def appendReasonUdf =
  udf((names: Seq[String], newNames: Seq[String]) => names ++ newNames)

Not sure what the best next course of action is.

df1.union(df2) * insert code to special handle the names col ??? *
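Setting Spark aside for a moment, the merge semantics being asked for can be modeled with plain Scala collections (a hypothetical sketch, not the Spark answer itself): pool all (id, names) rows, group by id, and concatenate the lists.

```scala
// Plain-Scala model of the desired merge (illustration only).
val df1 = Seq(1 -> Nil, 3 -> List("alice"), 4 -> List("bob"))
val df2 = Seq(1 -> List("jack"), 2 -> List("breanna"), 3 -> Nil)
val df3 = Seq(1 -> List("anna"), 3 -> List("rob", "jerry"), 4 -> Nil)

// Pool every row, group by id, and flatten each group's name lists.
val merged = (df1 ++ df2 ++ df3)
  .groupBy(_._1)
  .map { case (id, rows) => id -> rows.flatMap(_._2) }
  .toList
  .sortBy(_._1)
// List((1, List(jack, anna)), (2, List(breanna)),
//      (3, List(alice, rob, jerry)), (4, List(bob)))
```

This is the same group-and-concatenate operation the answers below express with Spark's API.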

【Discussion】:

    Tags: scala dataframe apache-spark user-defined-functions


    【Solution 1】:

    Basically we need to explode "names", union the exploded tables, and then group by the "id" column while collecting the names into a list. Here we go:

    scala> :pa
    // Entering paste mode (ctrl-D to finish)
    
    import org.apache.spark.sql.DataFrame
    val df1 = Seq((1, Nil), (3, List("alice")), (4, List("bob"))).toDF("id", "names")
    val df2 = Seq((1, List("jack")), (2, List("breanna")), (3, Nil)).toDF("id", "names")
    val df3 = Seq((1, List("anna")), (3, List("rob", "jerry")), (4, Nil)).toDF("id", "names")
    
    def expl(df: DataFrame) = df.select($"id", explode($"names").as("name"))
    
    val dfUnion = expl(df1).union(expl(df2)).union(expl(df3))
    dfUnion.groupBy("id").agg(collect_list($"name").as("names")).select("id", "names").orderBy("id").show
    
    
    // Exiting paste mode, now interpreting.
    
    +---+-------------------+
    | id|              names|
    +---+-------------------+
    |  1|       [jack, anna]|
    |  2|          [breanna]|
    |  3|[alice, rob, jerry]|
    |  4|              [bob]|
    +---+-------------------+
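    One caveat with the explode approach: an id whose arrays are all empty disappears entirely, because explode emits no rows for an empty array (in the sample data every id has at least one name somewhere, so the output above is complete). Assuming Spark 2.4+ and the same df1/df2/df3, a sketch that avoids exploding unions the frames as-is and flattens the collected arrays, which also keeps ids whose arrays are all empty:

    ```scala
    // Spark 2.4+ sketch: union the raw frames, then flatten the
    // array<array<string>> produced by collect_list into one array.
    val dfUnion2 = df1.union(df2).union(df3)
      .groupBy("id")
      .agg(flatten(collect_list($"names")).as("names"))
      .orderBy("id")
    dfUnion2.show
    ```

    An id present only with empty arrays would appear here with an empty names column instead of being dropped.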
    

    【Discussion】:
