Pyspark, joins, sql, outer join: answers

【Question title】: Pyspark, joins, sql, outer join
【Posted】: 2021-01-31 21:40:30
【Question description】:

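I have the following requirement: I want to outer join two tables, A and B (for example), so that when the keys match, the output takes the value from table B rather than the value from table A. For example: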

A
a b
1 abc
2 fgh
3 xyz

B
a b
1 wer
6 uio

Output

a b
1 wer
2 fgh
3 xyz
6 uio

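【Question comments】:

Please review "on topic" and "how to ask" from the intro tour. "Show me how to solve this coding problem" is off-topic for Stack Overflow. You are expected to make an honest attempt at a solution and then ask a specific question about your implementation. Stack Overflow is not intended to replace existing tutorials and documentation. We also expect a clear problem specification.

Tags: python sql join pyspark outer-join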

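【Solution 1】:

This is a prioritization query. You seem to want all rows from b, followed by the rows from a that have no match in b on the first column.

One approach is union all: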

select b.*
from b
union all
select a.*
from a
where not exists (select 1 from b where b.a = a.a);
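
As a quick sanity check, the query above can be run against the sample data from the question. The following is a minimal sketch that assumes an in-memory SQLite database from Python's standard library rather than Spark SQL; the table definitions and the added ORDER BY (used only to make the row order deterministic) are assumptions and not part of the original answer.

```python
import sqlite3

# In-memory database holding the two sample tables from the question
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (a INTEGER, b TEXT)")
cur.execute("CREATE TABLE b (a INTEGER, b TEXT)")
cur.executemany("INSERT INTO a VALUES (?, ?)", [(1, "abc"), (2, "fgh"), (3, "xyz")])
cur.executemany("INSERT INTO b VALUES (?, ?)", [(1, "wer"), (6, "uio")])

# All rows from b, plus the rows from a whose key has no match in b.
# ORDER BY 1 is added here only so the output order is stable.
query = """
SELECT b.a, b.b FROM b
UNION ALL
SELECT a.a, a.b FROM a
WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.a = a.a)
ORDER BY 1
"""
rows = cur.execute(query).fetchall()
conn.close()

print(rows)  # [(1, 'wer'), (2, 'fgh'), (3, 'xyz'), (6, 'uio')]
```

With the same two tables registered as temporary views, the original query text should also be usable through `spark.sql(...)`, though that is not verified here.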

【Discussion】:

【Solution 2】:

The PySpark solution is to use a full join together with coalesce.

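from pyspark.sql import SparkSession, functions as F

# Reuse the active session (e.g. in a notebook or the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()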

# Create dataframes
A = spark.createDataFrame(data=[[1, 'abc'], [2, 'fgh'], [3, 'xyz']], schema=['a', 'b'])
B = spark.createDataFrame(data=[[1, 'wer'], [6, 'uio']], schema=['a', 'b'])

# Rename column `b` to prevent naming collision 
A = A.select('a', F.col('b').alias('b_a'))
B = B.select('a', F.col('b').alias('b_b'))

# Full join on `a` keeps all entries from both dataframes
combined = A.join(B, on='a', how='full')

# Coalesce takes value from `b_b` if not null and `b_a` otherwise
combined = combined.withColumn('b', F.coalesce('b_b', 'b_a'))

# Drop unneeded helper columns
combined = combined.drop('b_b', 'b_a')

combined.show()

Result

+---+---+
|  a|  b|
+---+---+
|  1|wer|
|  2|fgh|
|  3|xyz|
|  6|uio|
+---+---+
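
To make the join-then-coalesce logic easier to follow without a Spark session, here is a rough plain-Python sketch of the same idea. The function name `full_join_coalesce` is hypothetical and exists only for illustration; `None` stands in for SQL NULL, and the keys are sorted only to give a stable output order.

```python
def full_join_coalesce(a_rows, b_rows):
    """Mimic a full outer join on the key followed by coalesce(b_value, a_value)."""
    a_map = dict(a_rows)
    b_map = dict(b_rows)
    result = []
    # The union of keys corresponds to the full join keeping rows from both sides
    for key in sorted(a_map.keys() | b_map.keys()):
        b_val = b_map.get(key)  # None when the key is missing from B
        a_val = a_map.get(key)  # None when the key is missing from A
        # coalesce: prefer B's value, fall back to A's value
        result.append((key, b_val if b_val is not None else a_val))
    return result


A = [(1, "abc"), (2, "fgh"), (3, "xyz")]
B = [(1, "wer"), (6, "uio")]

print(full_join_coalesce(A, B))  # [(1, 'wer'), (2, 'fgh'), (3, 'xyz'), (6, 'uio')]
```

Key 1 exists in both inputs, so B's value wins; keys 2 and 3 exist only in A and key 6 only in B, so each keeps its single available value.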

【Discussion】: