[Question title]: Replace spaces with null values using regexp_replace across multiple columns
[Posted]: 2019-05-31 16:09:01
[Question]:

If several columns contain only whitespace, how can I replace those whitespace values with null?

Input dataset:

+---+-----+-----+
| Id|col_1|col_2|
+---+-----+-----+
|  0|104  |     |
|  1|     |     |
+---+-----+-----+

What I tried:

import org.apache.spark.sql.functions._

val test = df.withColumn("col_1","col_2", regexp_replace(df("col_1","col_1"), "^\\s*", lit(Null)))
test.filter("col_1,col_2 is null").show()

Expected output dataset:

+---+-----+-----+
| Id|col_1|col_2|
+---+-----+-----+
|  0|104  | Null|
|  1|Null | Null|
+---+-----+-----+

[Question comments]:

    Tags: scala


    [Solution 1]:

    Use one withColumn per column:

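    import org.apache.spark.sql.functions._
    import spark.implicits._ // needed for toDF outside spark-shell
    val df = List(("0", "104", "    "), ("1", " ", "")).toDF("Id", "col_1", "col_2")

    // Values that are empty once all whitespace is stripped become null
    val test = df
      .withColumn("col_1", when(regexp_replace(col("col_1"), "\\s+", "") === "", lit(null)).otherwise(col("col_1")))
      .withColumn("col_2", when(regexp_replace(col("col_2"), "\\s+", "") === "", lit(null)).otherwise(col("col_2")))
    test.show() // show() returns Unit, so call it separately to keep test a DataFrame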
    

    Result:

    +---+-----+-----+
    | Id|col_1|col_2|
    +---+-----+-----+
    |  0|  104| null|
    |  1| null| null|
    +---+-----+-----+
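
    With more than a couple of columns, repeating withColumn by hand gets tedious. The following is a minimal sketch, not part of the original answer, that folds the same expression over a list of column names. The object name `BlankToNullFold`, the helper `blankToNull`, and the local `SparkSession` setup are assumptions made for illustration; in spark-shell or a notebook the existing `spark` session would be used instead.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, regexp_replace, when}

object BlankToNullFold {
  // Hypothetical helper: applies the answer's rule to every listed column.
  // A value that is empty after removing all whitespace is replaced by null.
  def blankToNull(df: DataFrame, columns: Seq[String]): DataFrame =
    columns.foldLeft(df) { (acc, name) =>
      acc.withColumn(
        name,
        when(regexp_replace(col(name), "\\s+", "") === "", lit(null)).otherwise(col(name))
      )
    }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder().master("local[*]").appName("blank-to-null").getOrCreate()
    import spark.implicits._

    val df = List(("0", "104", "    "), ("1", " ", "")).toDF("Id", "col_1", "col_2")
    val cleaned = blankToNull(df, Seq("col_1", "col_2"))
    cleaned.show()

    // Keep only the rows where both columns ended up null.
    cleaned.filter(col("col_1").isNull && col("col_2").isNull).show()

    spark.stop()
  }
}
```

    The filter at the end expresses what the question's `filter("col_1,col_2 is null")` appears to be aiming for: each column needs its own null check, combined with `&&`.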
    

    [Comments]:

    • notebook:2: warning: Adapting argument list by inserting () has been deprecated: leaky (Object-receiving) target makes this especially dangerous. signature: Column.getItem(key: Any): org.apache.spark.sql.Column given arguments: after adaptation: Column.getItem((): Unit) .withColumn("Name", if(col(" Name").getItem().toString().replaceAll(" ", "").equals("")) lit(null) else col("Name") )
    • I get this error: java.lang.RuntimeException: Unsupported literal type class scala.runtime.BoxedUnit ()
    • @PraveenSaini It is fixed now.
    [Solution 2]:

    Hi, you can do it like this:

    scala> val someDFWithName = Seq((1, "anurag", ""), (5, "", "")).toDF("id", "name", "age")
    someDFWithName: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
    
    scala> someDFWithName.show
    +---+------+---+
    | id|  name|age|
    +---+------+---+
    |  1|anurag|   |
    |  5|      |   |
    +---+------+---+
    scala> someDFWithName.na.replace(Seq("name","age"),Map(""-> null)).show
    +---+------+----+
    | id|  name| age|
    +---+------+----+
    |  1|anurag|null|
    |  5|  null|null|
    +---+------+----+
    

    Or try this as well:

    scala> someDFWithName.withColumn("Name", when(col("Name") === "", null).otherwise(col("Name"))).withColumn("Age", when(col("Age") === "", null).otherwise(col("Age"))).show
    +---+------+----+
    | id|  name| age|
    +---+------+----+
    |  1|anurag|null|
    |  5|  null|null|
    +---+------+----+
    

    Or, for values made up of multiple spaces, try this:

    scala> val someDFWithName = Seq(("n", "a"), ( "", "n"), ("         ", ""), ("  ", "a"), ("   ",""), ("        ","   "), ("c"," ")).toDF("name", "place")
    someDFWithName: org.apache.spark.sql.DataFrame = [name: string, place: string]
    
    scala> someDFWithName.withColumn("Name", when(regexp_replace(col("name"),"\\s+","") === "", null).otherwise(col("Name"))).withColumn("Place", when(regexp_replace(col("place"),"\\s+","") === "", null).otherwise(col("place"))).show
    +----+-----+
    |Name|Place|
    +----+-----+
    |   n|    a|
    |null|    n|
    |null| null|
    |null|    a|
    |null| null|
    |null| null|
    |   c| null|
    +----+-----+
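
    The condition can also be written as a single regular-expression match instead of a replace followed by a comparison. The sketch below is an illustrative alternative, not the answer's own code: it assumes a hypothetical helper `blankToNull` built on `Column.rlike`, and it creates a local `SparkSession` purely so the example is self-contained.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object BlankToNullRlike {
  // Hypothetical helper: null when the value is empty or whitespace-only,
  // otherwise the value unchanged. A null input stays null, because the
  // rlike condition evaluates to null and the otherwise branch is taken.
  def blankToNull(c: Column): Column =
    when(c.rlike("^\\s*$"), lit(null)).otherwise(c)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder().master("local[*]").appName("blank-rlike").getOrCreate()
    import spark.implicits._

    val df = Seq(("n", "a"), ("", "n"), ("         ", ""), ("c", " ")).toDF("name", "place")

    df.select(
      blankToNull(col("name")).as("name"),
      blankToNull(col("place")).as("place")
    ).show()

    spark.stop()
  }
}
```

    Wrapping the rule in a function that takes and returns a `Column` keeps the pattern in one place, so it can be reused in `select` or `withColumn` alike.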
    

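    I hope this helps. Thanks.

    [Comments]:

    • I tried this in Databricks, but it does not work. See the code and its output below. odr.na.replace(Seq("name","age"),Map(""-> null)).show +----+-----+ |Name|Place| +----+-----+ | N|a| |a| | |a| | |a| | | | | |B| | | c| | | c| | | | | | d| | +----+-----+
    • @PraveenSaini Please try this: odr.na.replace(Seq("Name","Place"),Map(""-> null)).show Adapt the query to your own requirements. Do not blindly copy and paste.
    • It does not work; I tried it as you instructed. Is there anything I am missing?
    • Could you share all of your code, for example how you created the DF? What do you want to filter out?
    • @Learner I have an Excel file that I imported from DBFS and created a DataFrame from. The file has various columns that contain spaces, and I want to remove all of the "empty" spaces.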
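
    One possible reason for the result in this thread, which the comments do not confirm, is that `na.replace` with the key `""` only replaces values that are exactly the empty string, so cells holding one or more spaces are left untouched. For a file with many such columns, the following minimal sketch applies the whitespace rule to every string column found in the schema. The names `BlankStringsToNull` and `blankStringsToNull` and the sample data are hypothetical, and the code assumes column names without dots or other characters that `col` would interpret specially.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, regexp_replace, when}
import org.apache.spark.sql.types.StringType

object BlankStringsToNull {
  // Hypothetical helper: rewrites every StringType column so that empty or
  // whitespace-only values become null. Other column types pass through.
  def blankStringsToNull(df: DataFrame): DataFrame = {
    val projected = df.schema.fields.map { field =>
      if (field.dataType == StringType) {
        when(regexp_replace(col(field.name), "\\s+", "") === "", lit(null))
          .otherwise(col(field.name))
          .as(field.name)
      } else {
        col(field.name)
      }
    }
    df.select(projected: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session and sample rows for illustration only (assumption);
    // the DataFrame loaded from the Excel file would be passed in instead.
    val spark = SparkSession.builder().master("local[*]").appName("blank-strings").getOrCreate()
    import spark.implicits._

    val odr = Seq((1, "N", "a"), (2, "   ", ""), (3, "c", " ")).toDF("id", "Name", "Place")
    blankStringsToNull(odr).show()

    spark.stop()
  }
}
```

    Building one `select` from the schema avoids listing column names by hand and keeps the column order of the input.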