【Posted】: 2020-12-07 07:24:21
【Question】:
I have a Spark DataFrame, built as shown below.
// Assumes an existing SparkSession named `spark` (as in spark-shell).
import spark.implicits._ // brings .toDF into scope for RDDs and Seqs of tuples

val data = spark.sparkContext.parallelize(Seq(
  (1, "", "SNACKS", "BISCUITS - AMBIENT", "BISCUITS - AMBIENT", "", "REFLETS DE FRANCE CROQUANT", "UNCOATED BISCUIT", "NO PROMOTION", "", "", "400G", "", ""),
  (2, "GROCERY", "BISCUITS", "SWEET BISCUITS ", "BISCUITS - AMBIENT", "", "", "AMBIENT BISCUIT", "NO PROMOTION", "", "", "400G", "", "CHOCOS")
)).toDF("id", "c4", "c1001", "c1002", "c1003", "c1008", "c1008_unmasked", "c1009", "c1011", "c1012", "c1013", "c1015", "c1016", "c1016_unmasked")

// Print all rows without truncating long cell values.
data.show(false)
Sample input:
+---+-------+--------+------------------+------------------+-----+--------------------------+-----------------+------------+-----+-----+-----+-----+--------------+
|id |c4 |c1001 |c1002 |c1003 |c1008|c1008_unmasked |c1009 |c1011 |c1012|c1013|c1015|c1016|c1016_unmasked|
+---+-------+--------+------------------+------------------+-----+--------------------------+-----------------+------------+-----+-----+-----+-----+--------------+
|1 | |SNACKS |BISCUITS - AMBIENT|BISCUITS - AMBIENT| |REFLETS DE FRANCE CROQUANT|UNCOATED BISCUIT|NO PROMOTION| | |400G | | |
|2 |GROCERY|BISCUITS|SWEET BISCUITS |BISCUITS - AMBIENT| | |AMBIENT BISCUIT |NO PROMOTION| | |400G | |CHOCOS |
+---+-------+--------+------------------+------------------+-----+--------------------------+-----------------+------------+-----+-----+-----+-----+--------------+
I need to fill a column cXXXX with the value "MASKED" only in rows where the matching cXXXX_unmasked column has a value. The expected output should make this clearer:
+---+-------+--------+------------------+------------------+------+--------------------------+-----------------+------------+-----+-----+-----+------+--------------+
|id |c4 |c1001 |c1002 |c1003 |c1008 |c1008_unmasked |c1009 |c1011 |c1012|c1013|c1015|c1016 |c1016_unmasked|
+---+-------+--------+------------------+------------------+------+--------------------------+-----------------+------------+-----+-----+-----+------+--------------+
|1 | |SNACKS |BISCUITS - AMBIENT|BISCUITS - AMBIENT|MASKED|REFLETS DE FRANCE CROQUANT|UNCOATED BISCUIT|NO PROMOTION| | |400G | | |
|2 |GROCERY|BISCUITS|SWEET BISCUITS |BISCUITS - AMBIENT| | |AMBIENT BISCUIT |NO PROMOTION| | |400G |MASKED|CHOCOS |
+---+-------+--------+------------------+------------------+------+--------------------------+-----------------+------------+-----+-----+-----+------+--------------+
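To make the rule concrete, here is a minimal sketch of one way it could be written with built-in column functions rather than a UDF. The helper name `maskColumns`, the `suffix` parameter, and the choice to treat blank or null strings as "no value" are my own assumptions for illustration, not something I have confirmed as the right approach.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, trim, when}

object MaskSketch {
  // Hypothetical helper: for every column named X_unmasked that has a sibling
  // column X, set X to "MASKED" on rows where X_unmasked is non-null and non-blank.
  def maskColumns(df: DataFrame, suffix: String = "_unmasked"): DataFrame = {
    val existing = df.columns.toSet
    val targets = df.columns
      .filter(_.endsWith(suffix))
      .map(_.stripSuffix(suffix))
      .filter(existing.contains)

    targets.foldLeft(df) { (acc, name) =>
      val unmasked = col(name + suffix)
      // Assumption: an empty or whitespace-only string counts as "no value".
      val hasValue = unmasked.isNotNull && (trim(unmasked) =!= "")
      // withColumn on an existing name replaces it in place, so column order is kept.
      acc.withColumn(name, when(hasValue, lit("MASKED")).otherwise(col(name)))
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("mask-sketch").getOrCreate()
    import spark.implicits._

    // Cut-down version of the sample data: only one masked/unmasked pair.
    val sample = Seq(
      (1, "", "REFLETS DE FRANCE CROQUANT"),
      (2, "", "")
    ).toDF("id", "c1008", "c1008_unmasked")

    // Row 1 should end up with MASKED in c1008; row 2 should keep its empty value.
    maskColumns(sample).show(false)
    spark.stop()
  }
}
```

Applied to the full DataFrame above, this would be called as `MaskSketch.maskColumns(data)`. Columns such as c1009 that have no `_unmasked` sibling are left untouched.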
Thanks in advance.
【Comments】:
Tags: scala dataframe apache-spark apache-spark-sql user-defined-functions