【Question title】: Explode multiple columns into separate rows in Spark Scala
【Posted】: 2021-06-04 20:51:20
【Question description】:

I have a DataFrame with the following structure:

Col1.                   Col2                    Col3
Data1Col1,Data2Col1.    Data1Col2,Data2Col2.    Data1Col3,Data2Col3

I want the resulting dataset to look like this:

Col1          Col2          Col3
Data1Col1.    Data1Col2.    Data1Col3
Data2Col1.    Data2Col2     Data2Col3

Please suggest how I can approach this. I tried explode, but that produces duplicated rows.

【Comments】:

  • Why are there stray dots in the data? Are they relevant?

Tags: scala apache-spark apache-spark-sql


【Solution 1】:
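// Assumes a SparkSession named `spark` is in scope, as in spark-shell
import spark.implicits._
import org.apache.spark.sql.functions.{col, explode, split, udf}

val df = Seq(("C,D,E,F", "M,N,O,P", "K,P,B,P")).toDF("Col1", "Col2", "Col3")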
   
df.show
+-------+-------+-------+
|   Col1|   Col2|   Col3|
+-------+-------+-------+
|C,D,E,F|M,N,O,P|K,P,B,P|
+-------+-------+-------+
           
// Split each comma-separated string into an array of values
val res1 = df
  .withColumn("Col1", split(col("Col1"), ","))
  .withColumn("Col2", split(col("Col2"), ","))
  .withColumn("Col3", split(col("Col3"), ","))
           
res1.show
+------------+------------+------------+
|        Col1|        Col2|        Col3|
+------------+------------+------------+
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|
+------------+------------+------------+
           
           
// Zip the three arrays element-wise; each element becomes (Col3, (Col1, Col2))
val zip = udf((x: Seq[String], y: Seq[String], z: Seq[String]) => z.zip(x.zip(y)))
           
// Explode the zipped array so that each zipped element becomes its own row
val res14 = res1.withColumn("test", explode(zip(col("Col1"), col("Col2"), col("Col3"))))

res14.show
+------------+------------+------------+-----------+
|        Col1|        Col2|        Col3|       test|
+------------+------------+------------+-----------+
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[K, [C, M]]|
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[P, [D, N]]|
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[B, [E, O]]|
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[P, [F, P]]|
+------------+------------+------------+-----------+
           
       
// Pull the three values back out of the nested struct
res14.select(
  col("test._2._1").as("t1"),
  col("test._2._2").as("t2"),
  col("test._1").as("t3")
).show
+---+---+---+
| t1| t2| t3|
+---+---+---+
|  C|  M|  K|
|  D|  N|  P|
|  E|  O|  B|
|  F|  P|  P|
+---+---+---+

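res1 - the initial DataFrame, with each column split into an array

res14 - the intermediate DataFrame, with one row per zipped element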
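
As an alternative to the UDF, here is a minimal sketch that uses the built-in `arrays_zip` function, assuming Spark 2.4 or later. The object name `ArraysZipSketch`, the helper `explodeTogether`, and the local `SparkSession` settings are hypothetical and only there to make the example self-contained. Zipping the arrays before exploding keeps the columns aligned, whereas exploding each column separately yields every combination of values, which is where the duplicated rows come from.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{arrays_zip, col, explode, split}

object ArraysZipSketch {
  // Hypothetical helper: splits every column on "," and explodes them in lockstep.
  def explodeTogether(df: DataFrame): DataFrame = {
    val names = df.columns
    val arrays = names.map(c => split(col(c), ","))
    df.select(explode(arrays_zip(arrays: _*)).as("zipped")) // one struct per position
      .select("zipped.*")                                   // flatten the struct into columns
      .toDF(names: _*)                                      // restore the names by position
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("arrays-zip-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("C,D,E,F", "M,N,O,P", "K,P,B,P")).toDF("Col1", "Col2", "Col3")
    explodeTogether(df).show()

    spark.stop()
  }
}
```

For the sample input this should produce the same four rows as the UDF version, under the original column names. The columns are renamed by position with `toDF` rather than by struct field name, because the field names `arrays_zip` generates depend on how its inputs are expressed. If the arrays differ in length, `arrays_zip` pads the shorter ones with null.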
