PySpark: Subtract Dataframe Ignoring Some Columns (answers)

【Question title】: PySpark: Subtract Dataframe Ignoring Some Columns
【Posted】: 2017-09-07 02:27:54
【Question】:

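I want to perform a subtract operation between two dataframes in pyspark. The challenge is that I have to ignore some columns while subtracting, but the final dataframe should still contain all columns, including the ignored ones.

Here is an example: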

from pyspark.sql import Row

# `sc` is an existing SparkContext, as provided by the pyspark shell
userLeft = sc.parallelize([
    Row(id=u'1', 
        first_name=u'Steve', 
        last_name=u'Kent', 
        email=u's.kent@email.com',
        date1=u'2017-02-08'),
    Row(id=u'2', 
        first_name=u'Margaret', 
        last_name=u'Peace', 
        email=u'marge.peace@email.com',
        date1=u'2017-02-09'),
    Row(id=u'3', 
        first_name=None, 
        last_name=u'hh', 
        email=u'marge.hh@email.com',
        date1=u'2017-02-10')
]).toDF()

userRight = sc.parallelize([
    Row(id=u'2', 
        first_name=u'Margaret', 
        last_name=u'Peace', 
        email=u'marge.peace@email.com',
        date1=u'2017-02-11'),
    Row(id=u'3', 
        first_name=None, 
        last_name=u'hh', 
        email=u'marge.hh@email.com',
        date1=u'2017-02-12')
]).toDF()

Expected:

ActiveDF = userLeft.subtract(userRight)  # desired: ignore the "date1" column while subtracting

The final result should look like this, including the "date1" column:

+----------+--------------------+----------+---+---------+
|     date1|               email|first_name| id|last_name|
+----------+--------------------+----------+---+---------+
|2017-02-08|    s.kent@email.com|     Steve|  1|     Kent|
+----------+--------------------+----------+---+---------+

【Comments】:

Tags: apache-spark pyspark spark-dataframe

【Solution 1】:

It looks like you need an anti-join:

userLeft.join(userRight, ["id"], "leftanti").show()
+----------+----------------+----------+---+---------+  
|     date1|           email|first_name| id|last_name|
+----------+----------------+----------+---+---------+
|2017-02-08|s.kent@email.com|     Steve|  1|     Kent|
+----------+----------------+----------+---+---------+

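【Comments】:

"leftanti" is not available in pyspark 1.6. I don't have a specific primary key for these dataframes. They are generated at runtime, so I don't know their column details in advance. What I always know is which columns I do not want to consider in the join.
If you want to join on all columns except date1, one option is userLeft.join(userRight, [col for col in userLeft.columns if col != 'date1'], "leftanti"). This is not null-safe, though: rows with a null in any join column never match, so you may need to fill nulls with empty strings before doing this.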

【Solution 2】:

You can also use a full join and keep only the rows where one side is null:

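import pyspark.sql.functions as psf

# Full outer join on every column except date1; unmatched rows have a
# null date1 on one side.
userLeft.join(
    userRight,
    [c for c in userLeft.columns if c != "date1"],
    "full"
).filter(psf.isnull(userLeft.date1) | psf.isnull(userRight.date1)).show()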

    +------------------+----------+---+---------+----------+----------+
    |             email|first_name| id|last_name|     date1|     date1|
    +------------------+----------+---+---------+----------+----------+
    |marge.hh@email.com|      null|  3|       hh|2017-02-10|      null|
    |marge.hh@email.com|      null|  3|       hh|      null|2017-02-12|
    |  s.kent@email.com|     Steve|  1|     Kent|2017-02-08|      null|
    +------------------+----------+---+---------+----------+----------+

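If you want to use a join, whether leftanti or full, you will need to choose a default value for null in the join columns (I think we discussed this in a previous thread). That is why the id 3 rows show up as unmatched above: their first_name is null.

You can also simply drop the column that bothers you, then subtract and join back: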

df = userLeft.drop("date1").subtract(userRight.drop("date1"))
userLeft.join(df, df.columns).show()

    +----------------+----------+---+---------+----------+
    |           email|first_name| id|last_name|     date1|
    +----------------+----------+---+---------+----------+
    |s.kent@email.com|     Steve|  1|     Kent|2017-02-08|
    +----------------+----------+---+---------+----------+

【Comments】:

This is production data. I can't touch the NULLs and assign default values to them.