[Posted]: 2020-11-23 07:47:20
[Question]:
I have a scenario where, for each record, I need to find the columns holding null values and write all of those column names into a separate column.
Example: I have this dataframe:
+---------+---+------------+-----------+------+-------+
|firstName|age|jobStartDate|isGraduated|gender| salary|
+---------+---+------------+-----------+------+-------+
|     null|se3|  2006-01-01|          8|     M|      F|
|     null| a3|        null|       True|     F|   null|
|   Robert| 37|  1992-01-01|       null|     M|5000.50|
+---------+---+------------+-----------+------+-------+
The expected result should look like this:
+---------+---+------------+-----------+------+-------+-----------------------------+
|firstName|age|jobStartDate|isGraduated|gender| salary|              Missing Columns|
+---------+---+------------+-----------+------+-------+-----------------------------+
|     null|se3|  2006-01-01|          8|     M|      F|                    firstName|
|     null| a3|        null|       True|     F|   null|firstName,jobStartDate,salary|
|   Robert| 37|  1992-01-01|       null|     M|5000.50|                  isGraduated|
+---------+---+------------+-----------+------+-------+-----------------------------+
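In other words, the new column should hold a comma-separated list of the names of the null columns in that row. For reference, a minimal sketch of how such a column can be built with Spark's built-in functions alone, assuming the input DataFrame above is bound to sourceDF (when() without otherwise() yields null for non-missing columns, and concat_ws skips nulls):

from pyspark.sql import functions as F

# Emit each column's name when its value is null; concat_ws drops the
# nulls, leaving a comma-separated list of the missing column names.
missing = F.concat_ws(
    ",", *[F.when(F.col(c).isNull(), F.lit(c)) for c in sourceDF.columns]
)
result = sourceDF.withColumn("Missing Columns", missing)
result.show(truncate=False)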
I wrote code that gets me halfway to the expected result:
def find_exceptions(df, mand_cols=['firstName', 'jobStartDate', 'salary']):
    # each element of the RDD is a Row, so df[column] is a field lookup
    miss = "Missing: "
    for column in mand_cols:
        if df[column] is None:
            miss = miss + column + ","
    return miss
I was able to collect the missing values as a list:
temp = sourceDF.rdd.map(find_exceptions)
temp.collect()
#result:
['Missing: firstName,', 'Missing: firstName,jobStartDate,salary,', 'Missing: ']
I'm finding it hard to actually write this out as a new column. I'm fairly new to Spark, so I'd appreciate any help with this.
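One option for reusing the per-row function as a column expression, sketched under the assumption that the DataFrame is sourceDF: wrap the same logic in a UDF that receives a struct of all columns, so the string lands in a new column via withColumn instead of a separate collect(). (missing_cols here is a hypothetical wrapper name, and mand_cols mirrors the list in find_exceptions.)

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

mand_cols = ['firstName', 'jobStartDate', 'salary']

@F.udf(returnType=StringType())
def missing_cols(row):
    # the struct arrives as a Row, so fields are looked up by name
    return ",".join(c for c in mand_cols if row[c] is None)

result = sourceDF.withColumn(
    "Missing Columns", missing_cols(F.struct(*sourceDF.columns))
)
result.show(truncate=False)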
[Comments]:
Tags: python scala apache-spark pyspark databricks