Column split in a Spark Scala DataFrame: answers

【Question title】: column split in Spark Scala dataframe
【Posted】: 2020-07-07 20:00:24
【Question description】:

I have the following DataFrame:

scala> val df1=Seq(
     | ("1_10","2_20","3_30"),
     | ("7_70","8_80","9_90")
     | ).toDF("c1","c2","c3")

scala> df1.show

+----+----+----+
|  c1|  c2|  c3|
+----+----+----+
|1_10|2_20|3_30|
|7_70|8_80|9_90|
+----+----+----+

How can I split each column into separate columns on the delimiter "_"?

Expected output:

+----+----+----+----+----+----+
|  c1|  c2|  c3|c1_1|c2_1|c3_1|
+----+----+----+----+----+----+
|1   |2   |3   |  10|  20|  30|
|7   |8   |9   |  70|  80|  90|
+----+----+----+----+----+----+

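I have more than 50 columns in the DataFrame. Thanks in advance.

【Question comments】:

Tags: scala apache-spark

【Solution 1】:

This is a good use case for foldLeft. Split each column and create a new column for each split value.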

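import org.apache.spark.sql.functions.{col, split}

val cols = df1.columns
// For each column: replace it with the split array, then pull out both parts
cols.foldLeft(df1) { (acc, name) =>
  acc.withColumn(name, split(col(name), "_"))
    .withColumn(s"${name}_1", col(name).getItem(0))
    .withColumn(s"${name}_2", col(name).getItem(1))
}.drop(cols: _*) // drop the intermediate array columns
  .show(false)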

If you need exactly the column names from the question, filter the columns ending in _1 and rename them with another foldLeft.

Output:

+----+----+----+----+----+----+
|c1_1|c1_2|c2_1|c2_2|c3_1|c3_2|
+----+----+----+----+----+----+
|1   |10  |2   |20  |3   |30  |
|7   |70  |8   |80  |9   |90  |
+----+----+----+----+----+----+
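
The renaming step mentioned above is not spelled out in the answer. A minimal sketch of one way to do it follows, assuming a local SparkSession and the sample data from the question. The names `splitDf`, `renamed` and `ordered` are illustrative and not part of the original answer. It renames both parts (c1_1 to c1, then c1_2 to c1_1) and reorders the columns with a final select.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

// Assumption: a local session; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[1]").appName("split-rename-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq(("1_10", "2_20", "3_30"), ("7_70", "8_80", "9_90")).toDF("c1", "c2", "c3")
val cols = df1.columns

// Same split as in the answer: c1 becomes c1_1 and c1_2, and so on
val splitDf = cols.foldLeft(df1) { (acc, name) =>
  acc.withColumn(name, split(col(name), "_"))
    .withColumn(s"${name}_1", col(name).getItem(0))
    .withColumn(s"${name}_2", col(name).getItem(1))
}.drop(cols: _*)

// Rename c1_1 -> c1 first, so that c1_2 -> c1_1 does not collide with it
val renamed = cols.foldLeft(splitDf) { (acc, name) =>
  acc.withColumnRenamed(s"${name}_1", name)
    .withColumnRenamed(s"${name}_2", s"${name}_1")
}

// Put the first parts before the second parts, as in the expected output
val ordered = renamed.select((cols ++ cols.map(c => s"${c}_1")).map(col): _*)
ordered.show(false)
```

The order of the two renames matters: doing them the other way round would briefly produce two columns named c1_1.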

【Comments】:

【Solution 2】:

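You can use the split function:

split(col("c1"), "_")

This returns an ArrayType(StringType) column, whose items you can then access with .getItem(index). That works as long as the number of elements after the split is stable. If it is not, you will get nulls wherever the index does not exist in the array after the split.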

Code example:

df.select(
  split(col("c1"), "_").alias("c1_items"),
  split(col("c2"), "_").alias("c2_items"),
  split(col("c3"), "_").alias("c3_items")
).select(
  col("c1_items").getItem(0).alias("c1"),
  col("c1_items").getItem(1).alias("c1_1"),
  col("c2_items").getItem(0).alias("c2"),
  col("c2_items").getItem(1).alias("c2_1"),
  col("c3_items").getItem(0).alias("c3"),
  col("c3_items").getItem(1).alias("c3_1")
)
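
To illustrate the remark about missing indices, here is a minimal sketch. It assumes a local SparkSession and a hypothetical one-column DataFrame `ragged` in which one value has no delimiter. ANSI mode is switched off explicitly, on the assumption that with ANSI mode on an out-of-range index raises an error instead of returning null.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

// Assumption: local session with ANSI mode off, so missing indices yield null
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("split-null-sketch")
  .config("spark.sql.ansi.enabled", "false")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: the second value contains no "_"
val ragged = Seq("1_10", "7").toDF("c1")

val parts = ragged.select(
  split(col("c1"), "_").getItem(0).alias("c1"),
  split(col("c1"), "_").getItem(1).alias("c1_1") // no second element for "7"
)

val rows = parts.collect()
parts.show(false)
```

If rows like "7" can occur in real data, it may be worth filtering or filling those nulls before further processing.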

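Since you need to do this for more than 50 columns, I would suggest wrapping it in a method that handles a single column with withColumn statements, like this: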

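import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, split}

// Splits column `name` on "_": part 0 overwrites `name`, part 1 goes to `${name}_1`
def splitMyCol(df: Dataset[_], name: String): DataFrame = {
  df.withColumn(
    s"${name}_items", split(col(name), "_")
  ).withColumn(
    name, col(s"${name}_items").getItem(0)
  ).withColumn(
    s"${name}_1", col(s"${name}_items").getItem(1)
  ).drop(s"${name}_items")
}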

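Note that I assume you don't need to keep the items column, so I drop it. Also note that in an s"" string, when there is a _ between two variables in the name, the first variable must be wrapped in {}, because otherwise the underscore would be parsed as part of the identifier. The second one does not need the {} wrapping; $ alone is enough.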
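
The interpolation rule can be checked in plain Scala without Spark. This is a small sketch; the variable names `suffix`, `withBraces` and `both` are illustrative only.

```scala
val name = "c1"
val suffix = "items"

// Braces end the identifier before the underscore
val withBraces = s"${name}_items"

// Two variables joined by "_": only the first needs braces
val both = s"${name}_$suffix"

// s"$name_items" would not compile, because the interpolator
// would look for a variable called `name_items`
println(withBraces)
println(both)
```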

You can then apply splitMyCol to every column with a fold:

val result = columnsToExpand.foldLeft(df)(
  (acc, next) => splitMyCol(acc, next)
)

【Comments】:

【Solution 3】:

A PySpark solution:

import pyspark.sql.functions as F
df1=sqlContext.createDataFrame([("1_10","2_20","3_30"),("7_70","8_80","9_90")]).toDF("c1","c2","c3")
expr = [F.split(coln,"_") for coln in df1.columns]
df2=df1.select(*expr)
#%%
df3= df2.withColumn("clctn",F.flatten(F.array(df2.columns)))
#%%  assuming every column holds data in the same x_y format
arr_size = len(df1.columns)*2
# x//2 (integer division) keeps the column index an int under Python 3
df_fin= df3.select([F.expr("clctn["+str(x)+"]").alias("c"+str(x//2)+'_'+str(x%2)) for x in range(arr_size)])

Result:

+----+----+----+----+----+----+
|c0_0|c0_1|c1_0|c1_1|c2_0|c2_1|
+----+----+----+----+----+----+
|   1|  10|   2|  20|   3|  30|
|   7|  70|   8|  80|   9|  90|
+----+----+----+----+----+----+

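【Comments】:

Updated the answer to get better column names. Check it out, and if it helps, please mark the answer as accepted.
Actually I was looking for a Scala implementation, but this is helpful for Python.
Ah, OK. I think you can carry the same idea directly over to Scala. Good luck :-)

【Solution 4】:

Try select instead of foldLeft for better performance. Each withColumn call adds another projection to the query plan, so a foldLeft over 50+ columns can take noticeably longer than a single select.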

Check this post - foldLeft,select

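import org.apache.spark.sql.functions.{col, split}

// Build every output column up front so a single select does all the work
val expr = df.columns
  .flatMap(c => Seq(
    split(col(c), "_")(0).as(s"${c}_1"),
    split(col(c), "_")(1).as(s"${c}_2")
  ))
  .toSeq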

Result:

df.select(expr:_*).show(false)
    
+----+----+----+----+----+----+
|c1_1|c1_2|c2_1|c2_2|c3_1|c3_2|
+----+----+----+----+----+----+
|1   |10  |2   |20  |3   |30  |
|7   |70  |8   |80  |9   |90  |
+----+----+----+----+----+----+

【Comments】:

【Solution 5】:

You can do it like this.

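import org.apache.spark.sql.functions.split

var df = Seq(("1_10","2_20","3_30"),("7_70","8_80","9_90")).toDF("c1","c2","c3")

for (cl <- df.columns) {
  // Hold the first part in a temporary column
  df = df.withColumn(cl + "_temp", split(df.col(cl), "_")(0))
  // cl.substring(1) gives the names c1_1, c2_2, c3_3; use cl + "_1" for the question's names
  df = df.withColumn(cl + "_" + cl.substring(1), split(df.col(cl), "_")(1))
  // Overwrite the original column with the first part and drop the temporary one
  df = df.withColumn(cl, df.col(cl + "_temp")).drop(cl + "_temp")
}
df.show(false)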

//Sample output
    +---+---+---+----+----+----+
    |c1 |c2 |c3 |c1_1|c2_2|c3_3|
    +---+---+---+----+----+----+
    |1  |2  |3  |10  |20  |30  |
    |7  |8  |9  |70  |80  |90  |
    +---+---+---+----+----+----+

【Comments】: