【问题标题】:merge two dataframes in scala spark在scala spark中合并两个数据框
【发布时间】:2020-03-20 00:17:17
【问题描述】:

我有两个数据框:

数据框1:

+-----++-----++-------------+
| id  || name| has_bank_acc |
+-----++-----++-------------+
|    0||  qwe||  true       |
|    1||  asd||  false      |
|    2||  rty||  false      |
|    3||  tyu||  true       |
+-----++-----++-------------+

数据框2:

+-----++-----++--------------+
| id  || name| has_email_acc |
+-----++-----++--------------+
|    0||  qwe||  true        |
|    5||  hjk||  false       |
|    8||  oiu||  false       |
|    7||  nmb||  true        |
+-----++-----++--------------+

我必须合并这些数据框以获得以下内容:

+-----++-----++-------------++---------------+
| id  || name| has_bank_acc || has_email_acc |
+-----++-----++-------------++---------------+
|    0||  qwe||  true       |    null        |
|    1||  asd||  false      |    null        |
|    2||  rty||  false      |    null        |
|    3||  tyu||  true       |    null        |
|    0||  qwe||  null       |    true        |
|    5||  hjk||  null       |    false       |
|    8||  oiu||  null       |    false       |
|    7||  nmb||  null       |    true        |
+-----++-----++-------------+----------------+ 

我尝试过联合加入,但没有成功

【问题讨论】:

  • 合并和加入时遇到什么错误

标签: scala apache-spark


【解决方案1】:

您不能对不同的列执行Union。如果您添加缺少的列并传递 null,那么它将给出数据类型错误。 所以唯一的解决方案是加入。

scala> df1.show()
+---+----+------------+
| id|name|has_bank_acc|
+---+----+------------+
|  0| qwe|        true|
|  1| asd|       false|
|  2| rty|       false|
|  3| tyu|        true|
+---+----+------------+


scala> df2.show()
+---+----+-------------+
| id|name|has_email_acc|
+---+----+-------------+
|  0| qwe|       true  |
|  5| hjk|       false |
|  8| oiu|       false |
|  7| nmb|       true  |
+---+----+-------------+


scala> val df11 = df1.withColumn("fid", lit(1))

scala> val df22 = df1.withColumn("fid", lit(2))

scala> df11.alias("1").join(df22.alias("2"), List("fid", "id", "name"),"full").drop("fid").show()
+---+----+------------+------------+
| id|name|has_bank_acc|has_bank_acc|
+---+----+------------+------------+
|  0| qwe|        true|        null|
|  1| asd|       false|        null|
|  2| rty|       false|        null|
|  3| tyu|        true|        null|
|  0| qwe|        null|        true|
|  1| asd|        null|       false|
|  2| rty|        null|       false|
|  3| tyu|        null|        true|
+---+----+------------+------------+

【讨论】:

    【解决方案2】:

    解决方案可能是:

    scala> df1.show
    +---+----+------------+
    | id|name|has_bank_acc|
    +---+----+------------+
    |  0| qwe|        true|
    |  1| asd|       false|
    |  2| rty|       false|
    |  3| tyu|        true|
    +---+----+------------+
    
    
    scala> df2.show
    +---+----+-------------+
    | id|name|has_email_acc|
    +---+----+-------------+
    |  0| qwe|         true|
    |  5| hjk|        false|
    |  8| oiu|        false|
    |  7| nmb|         true|
    +---+----+-------------+
    
    
    scala> val cols1 = df1.columns.toSet
    cols1: scala.collection.immutable.Set[String] = Set(id, name, has_bank_acc)
    
    scala> val cols2 = df2.columns.toSet
    cols2: scala.collection.immutable.Set[String] = Set(id, name, has_email_acc)
    
    scala> val total = cols1 ++ cols2
    total: scala.collection.immutable.Set[String] = Set(id, name, has_bank_acc, has_email_acc)
    
    scala> def expr(myCols: Set[String], allCols: Set[String]) = {
         | allCols.toList.map(x => x match {
         | case x if myCols.contains(x) => col(x)
         | case _ => lit(null).as(x)
         | })
         | }
    expr: (myCols: Set[String], allCols: Set[String])List[org.apache.spark.sql.Column]
    
    scala> df1.select(expr(cols1, total): _*).unionAll(df2.select(expr(cols2,total): _*)).show
    warning: there was one deprecation warning; re-run with -deprecation for details
    +---+----+------------+-------------+
    | id|name|has_bank_acc|has_email_acc|
    +---+----+------------+-------------+
    |  0| qwe|        true|         null|
    |  1| asd|       false|         null|
    |  2| rty|       false|         null|
    |  3| tyu|        true|         null|
    |  0| qwe|        null|         true|
    |  5| hjk|        null|        false|
    |  8| oiu|        null|        false|
    |  7| nmb|        null|         true|
    +---+----+------------+-------------+
    

    如果有帮助请告诉我!!

    【讨论】:

      【解决方案3】:

      添加遗漏列的“UnionAll”可以提供帮助:

      dataframe1
        .withColumn("has_email_acc", lit(null).cast(BooleanType))
          .unionByName(dataframe2.withColumn("has_bank_acc", lit(null).cast(BooleanType)))
      

      【讨论】:

        【解决方案4】:
        val data = Seq((0,"qwe","true"),(1,"asd","false"),(2,"rty","false"),(3,"tyu","true")).toDF("id","name","has_bank_acc")
        scala> data.show
        +---+----+------------+
        | id|name|has_bank_acc|
        +---+----+------------+
        |  0| qwe|        true|
        |  1| asd|       false|
        |  2| rty|       false|
        |  3| tyu|        true|
        +---+----+------------+
        
        val data2 = Seq((0,"qwe","true"),(5,"hjk","false"),(8,"oiu","false"),(7,"nmb","true")).toDF("id","name","has_email_acc")
        
        scala> data2.show
        +---+----+-------------+
        | id|name|has_email_acc|
        +---+----+-------------+
        |  0| qwe|         true|
        |  5| hjk|        false|
        |  8| oiu|        false|
        |  7| nmb|         true|
        +---+----+-------------+
        
        val data_cols = data.columns
        val data2_cols = data2.columns
        
        val transformedData = data2_cols.diff(data_cols).foldLeft(data) {
              case (df, (newCols)) =>
                df.withColumn(newCols, lit("null"))
            }
        
        val transformedData2 = data_cols.diff(data2_cols).foldLeft(data2) {
              case (df, (newCols)) =>
                df.withColumn(newCols, lit("null"))
            }
        
        val finalData = transformedData2.unionByName(transformedData)
        finalData.show
        scala> finalData.show
        +---+----+-------------+------------+
        | id|name|has_email_acc|has_bank_acc|
        +---+----+-------------+------------+
        |  0| qwe|         true|        null|
        |  5| hjk|        false|        null|
        |  8| oiu|        false|        null|
        |  7| nmb|         true|        null|
        |  0| qwe|         null|        true|
        |  1| asd|         null|       false|
        |  2| rty|         null|       false|
        |  3| tyu|         null|        true|
        +---+----+-------------+------------+
        

        【讨论】:

          猜你喜欢
          • 1970-01-01
          • 1970-01-01
          • 1970-01-01
          • 2015-10-18
          • 1970-01-01
          • 2018-10-19
          • 1970-01-01
          • 2018-04-19
          • 1970-01-01
          相关资源
          最近更新 更多