Assign unique ID based on match between two columns in PySpark Dataframe: answers

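【Question title】: Assign unique ID based on match between two columns in PySpark Dataframe
【Posted】: 2021-08-12 02:42:46
【Question description】:

I want to assign an auto-incrementing unique ID to a column in a DataFrame.

If a Column1 value matches a Column2 value, flag is set to True, and all values that match each other must be assigned the same ID. If a Column1 value has no matching Column2 value, flag is False and that Column1 value needs its own unique ID.

Input df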

ID	Column1	Column2	flag
null	1	2	True
null	1	3	True
null	2	1	True
null	2	3	True
null	3	1	True
null	3	2	True
null	4		False
null	5		False
null	6	7	True
null	7	6	True
null	9	2	True
null	1	9	True
null	3	9	True
null	2	9	True
null	8		False

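Output df

Here the Column1 values 1, 2, 3 and 9 match each other, so all four get one unique ID (101). Column1 value 4 has no match, so it gets the next unique ID (102); Column1 value 5 has no match either, so it gets the next one (103). Column1 values 6 and 7 match each other, so both get the same unique ID (104). Column1 value 8 matches no Column2 value, so it gets the next unique ID, 105.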

ID	Column1
101	1
101	2
101	3
102	4
103	5
104	6
104	7
101	9
105	8

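【Comments】:

If I understand correctly, for each unique value in col1 you need to check whether that value exists in col2; if it does, flag is True, otherwise flag is False? How do 1, 2, 3, 9 or 6, 7 end up grouped together?
Hi @pltc, yes, the col1 values are unique in the output df. In the input, each col1 value is mapped to a col2 value, which means the two values share one unique ID. In row 1, 1 maps to 2, so 1 and 2 form a pair and both get the same ID (101). In row 2, 1 maps to 3, so 3 now shares that ID (101) with 1 and 2. In row 3, 2 maps to 1, a pair whose unique ID is already determined. It continues like this... and in row 11, col1 value 9 maps to col2 value 2, so 9 should also share the same ID (101) with 1, 2 and 3.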
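
The grouping rule described in this comment is transitive (1 pairs with 2, 2 pairs with 9, so 1 and 9 share an ID), which is the connected-components problem. Below is a minimal sketch in plain Python, using a union-find structure, that reproduces the expected output on the sample data. It is not taken from the thread: the names `assign_group_ids`, `rows` and `start_id` are hypothetical, and it assumes the data is small enough to collect to the driver.

```python
# Sample (Column1, Column2) rows from the question; None marks an empty Column2.
rows = [
    (1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2),
    (4, None), (5, None), (6, 7), (7, 6),
    (9, 2), (1, 9), (3, 9), (2, 9), (8, None),
]


def assign_group_ids(pairs, start_id=101):
    """Map each Column1 value to an ID shared by its whole matched group."""
    parent = {}

    def find(x):
        # Register unseen values as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    # Merge the groups of every matched pair.
    for c1, c2 in pairs:
        root1 = find(c1)
        if c2 is not None:
            root2 = find(c2)
            if root1 != root2:
                parent[root2] = root1

    # Hand out IDs in order of first appearance in Column1.
    id_of_root = {}
    result = {}
    next_id = start_id
    for c1, _ in pairs:
        if c1 in result:
            continue
        root = find(c1)
        if root not in id_of_root:
            id_of_root[root] = next_id
            next_id += 1
        result[c1] = id_of_root[root]
    return result


result = assign_group_ids(rows)
for value, group_id in result.items():
    print(group_id, value)
```

Because groups are merged before any ID is handed out, the result does not depend on the order of the rows, only on which values are linked.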

Tags: apache-spark pyspark apache-spark-sql

【Solution 1】:

I put my explanation in the code comments. Note that using a Window without partitioning your data moves all of your data to a single node.

from pyspark.sql.window import Window
import pyspark.sql.functions as f

# [...] Your dataframe initialization

# Creating an index to retrieve original dataframe at the end
df = df.withColumn('index', f.monotonically_increasing_id())

w = Window.orderBy('least')

# Creating a column with least value from `Column1` and `Column2`. This will be used to "group" the values that must have the same ID
df = df.withColumn('least', f.least(f.col('Column1'), f.col('Column2')))

# Check if the current or previous `flag` is false to increase the id
df = df.withColumn('increase', ((~f.col('flag')) | (~f.lag('flag', default=True).over(w))).cast('int'))

# Generating incremental id
df = df.withColumn('ID', f.lit(101) + f.sum('increase').over(w))

(df
 .select('ID', 'Column1')
 .drop_duplicates()
 .sort('index')
 .show(truncate=False))

Output

+---+-------+
|ID |Column1|
+---+-------+
|101|1      |
|101|2      |
|101|3      |
|102|4      |
|103|5      |
|104|6      |
|104|7      |
|101|9      |
|105|8      |
+---+-------+

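【Comments】:

Hi @Kafels, thanks for your answer. The values in column1 will not be in sequence, so we need to consider all the matches against the numbers in column2 and then assign one unique ID to that group. I have edited my question.
@jvr I made an update. I can't think of a way to achieve your result without sorting the dataframe. Check whether it solves your problem.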
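
One way to consider all the matches without relying on sort order is to start every value with its own label and keep copying the smaller label across each matched pair until nothing changes. The following is only a sketch of that idea in plain Python, not the answerer's method. The function name `propagate_min_labels` is hypothetical, and in Spark each pass would presumably become a join plus an aggregation rather than a Python loop.

```python
# Sample (Column1, Column2) pairs from the question; None marks an empty Column2.
pairs = [
    (1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2),
    (4, None), (5, None), (6, 7), (7, 6),
    (9, 2), (1, 9), (3, 9), (2, 9), (8, None),
]


def propagate_min_labels(pairs):
    """Label every value with the smallest value reachable through matches."""
    # Every value starts out as its own label.
    labels = {}
    for c1, c2 in pairs:
        labels.setdefault(c1, c1)
        if c2 is not None:
            labels.setdefault(c2, c2)

    # Repeat full passes until a pass changes nothing.
    changed = True
    while changed:
        changed = False
        for c1, c2 in pairs:
            if c2 is None:
                continue
            low = min(labels[c1], labels[c2])
            if labels[c1] != low or labels[c2] != low:
                labels[c1] = labels[c2] = low
                changed = True
    return labels


labels = propagate_min_labels(pairs)
print(labels)
```

Values that end up with the same label belong to the same group; the labels can then be renumbered to 101, 102, ... in whatever order is required.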