【Question Title】: Update a column value in a Spark DataFrame based on another column
【Posted】: 2020-12-07 07:24:21
【Question】:

I have a Spark DataFrame as described below.

val data = spark.sparkContext.parallelize(Seq(
    (1,"", "SNACKS", "BISCUITS - AMBIENT", "BISCUITS - AMBIENT", "", "REFLETS DE FRANCE CROQUANT", "UNCOATED  BISCUIT", "NO PROMOTION", "", "", "400G","",""),
    (2,"GROCERY", "BISCUITS", "SWEET BISCUITS ", "BISCUITS - AMBIENT", "", "", "AMBIENT BISCUIT", "NO PROMOTION", "", "", "400G","","CHOCOS")
  ))
  .toDF("id", "c4", "c1001", "c1002", "c1003", "c1008", "c1008_unmasked", "c1009", "c1011", "c1012", "c1013", "c1015", "c1016", "c1016_unmasked")

data.show(false)

Sample input:

+---+-------+--------+------------------+------------------+-----+--------------------------+-----------------+------------+-----+-----+-----+-----+--------------+
|id |c4     |c1001   |c1002             |c1003             |c1008|c1008_unmasked            |c1009            |c1011       |c1012|c1013|c1015|c1016|c1016_unmasked|
+---+-------+--------+------------------+------------------+-----+--------------------------+-----------------+------------+-----+-----+-----+-----+--------------+
|1  |       |SNACKS  |BISCUITS - AMBIENT|BISCUITS - AMBIENT|     |REFLETS DE FRANCE CROQUANT|UNCOATED  BISCUIT|NO PROMOTION|     |     |400G |     |              |
|2  |GROCERY|BISCUITS|SWEET BISCUITS    |BISCUITS - AMBIENT|     |                          |AMBIENT BISCUIT  |NO PROMOTION|     |     |400G |     |CHOCOS        |
+---+-------+--------+------------------+------------------+-----+--------------------------+-----------------+------------+-----+-----+-----+-----+--------------+

A column cXXXX needs to be filled with the value "MASKED" only when the corresponding cXXXX_unmasked column has a value. Please check the sample output below for a better understanding.

+---+-------+--------+------------------+------------------+------+--------------------------+-----------------+------------+-----+-----+-----+------+--------------+
|id |c4     |c1001   |c1002             |c1003             |c1008 |c1008_unmasked            |c1009            |c1011       |c1012|c1013|c1015|c1016 |c1016_unmasked|
+---+-------+--------+------------------+------------------+------+--------------------------+-----------------+------------+-----+-----+-----+------+--------------+
|1  |       |SNACKS  |BISCUITS - AMBIENT|BISCUITS - AMBIENT|MASKED|REFLETS DE FRANCE CROQUANT|UNCOATED  BISCUIT|NO PROMOTION|     |     |400G |      |              |
|2  |GROCERY|BISCUITS|SWEET BISCUITS    |BISCUITS - AMBIENT|      |                          |AMBIENT BISCUIT  |NO PROMOTION|     |     |400G |MASKED|CHOCOS        |
+---+-------+--------+------------------+------------------+------+--------------------------+-----------------+------------+-----+-----+-----+------+--------------+

Thanks in advance.

【Comments】:

    Tags: scala dataframe apache-spark apache-spark-sql user-defined-functions


    【Solution 1】:

    Here is my attempt.

    import org.apache.spark.sql.functions.{col, lit, when}

    val cols = data.columns.filter(_.endsWith("_unmasked"))

    val new_data = cols.foldLeft(data) { (df, c) =>
        // stripSuffix is safer than split("_").head if a column name has extra underscores
        val target = c.stripSuffix("_unmasked")
        df.withColumn(target, when(col(c).isNotNull && col(c) =!= "", lit("MASKED"))
            .otherwise(col(target))) // keep the original value, not the unmasked one
    }
    new_data.show
    
    +---+-------+--------+------------------+------------------+------+--------------------+-----------------+------------+-----+-----+-----+------+--------------+
    | id|     c4|   c1001|             c1002|             c1003| c1008|      c1008_unmasked|            c1009|       c1011|c1012|c1013|c1015| c1016|c1016_unmasked|
    +---+-------+--------+------------------+------------------+------+--------------------+-----------------+------------+-----+-----+-----+------+--------------+
    |  1|       |  SNACKS|BISCUITS - AMBIENT|BISCUITS - AMBIENT|MASKED|REFLETS DE FRANCE...|UNCOATED  BISCUIT|NO PROMOTION|     |     | 400G|      |              |
    |  2|GROCERY|BISCUITS|   SWEET BISCUITS |BISCUITS - AMBIENT|      |                    |  AMBIENT BISCUIT|NO PROMOTION|     |     | 400G|MASKED|        CHOCOS|
    +---+-------+--------+------------------+------------------+------+--------------------+-----------------+------------+-----+-----+-----+------+--------------+
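    The same logic can also be sketched as a single `select`, which builds every output column in one projection instead of chaining one `withColumn` per masked column (each `withColumn` adds a projection to the logical plan, which can get expensive with many columns). This is a variant not present in the original answer:

    ```scala
    import org.apache.spark.sql.functions.{col, lit, when}

    // All columns that carry an unmasked counterpart
    val unmaskedCols = data.columns.filter(_.endsWith("_unmasked")).toSet

    // For each column, emit either the masking expression or the column as-is
    val projected = data.columns.map { name =>
      val unmasked = s"${name}_unmasked"
      if (unmaskedCols.contains(unmasked))
        when(col(unmasked).isNotNull && col(unmasked) =!= "", lit("MASKED"))
          .otherwise(col(name))
          .as(name)
      else
        col(name)
    }

    val new_data = data.select(projected: _*)
    new_data.show(false)
    ```

    The output should match the table above; only the way the plan is built differs.
    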
    

    【Comments】:
