You can join the two DataFrames on a regex condition using regexp_replace, as shown below:
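import org.apache.spark.sql.functions.{concat, lit, regexp_replace}
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

val df1 = Seq(
  (1, "Test1", "Amount paid to User1 dt"),
  (2, "Test2", "Amount paid to User1 dt"),
  (3, "Test3", "Amount paid to balamurugan dt"),
  (4, "Test4", "Amount paid to final dt"),
  (5, "Test5", "Amount paid to John less dt")
).toDF("ID", "Name", "Text")
val df2 = Seq("User1", "murugan", "Amo").toDF("Text")

// Wrap each df2 value in word boundaries so that only whole words match
val pattern = concat(lit("\\b"), df2("Text"), lit("\\b"))
// A pair of rows matches when removing the pattern changes df1's text
df1.join(df2, regexp_replace(df1("Text"), pattern, lit("")) =!= df1("Text")).show(false)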
// +---+-----+-----------------------+-----+
// |ID |Name |Text                   |Text |
// +---+-----+-----------------------+-----+
// |1  |Test1|Amount paid to User1 dt|User1|
// |2  |Test2|Amount paid to User1 dt|User1|
// +---+-----+-----------------------+-----+
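Note that \b denotes a word boundary, so the regex only matches whole words. That is why "murugan" does not match inside "balamurugan" and "Amo" does not match inside "Amount".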
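To see the effect of \b in isolation, here is a minimal sketch in plain Scala using java.util.regex, which is the regex syntax Spark's regexp_replace follows. The helper names matchesWholeWord and matchesAnywhere are hypothetical, and Pattern.quote is an extra safeguard that the join above does not apply:

```scala
import java.util.regex.Pattern

// Hypothetical helper: true if `word` occurs in `text` as a whole word.
// Pattern.quote escapes regex metacharacters in `word` (not done in the join above).
def matchesWholeWord(text: String, word: String): Boolean =
  Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text).find()

// Hypothetical helper: true if `word` occurs anywhere in `text`, even inside a longer word.
def matchesAnywhere(text: String, word: String): Boolean =
  Pattern.compile(Pattern.quote(word)).matcher(text).find()

val sample = "Amount paid to balamurugan dt"
println(matchesAnywhere(sample, "murugan"))   // true: substring of "balamurugan"
println(matchesWholeWord(sample, "murugan"))  // false: no word boundary before "m"
println(matchesWholeWord("Amount paid to User1 dt", "User1"))  // true
```

Without the \b wrappers, rows 3 and all "Amount ..." rows of df1 would also match, since "murugan" and "Amo" occur as substrings.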
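Update:
As suggested in another answer, a left_semi join may work better, because it avoids duplicated rows when a df1 row has multiple matches. The default inner join is the right choice when df2's columns need to be included in the result.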
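The difference can be sketched as follows. This is only an illustration under assumptions: the local SparkSession setup, the smaller sample data, and the names cond, innerRows and semiRows are all made up here to force a row with two matches.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{concat, lit, regexp_replace}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("regex-semi-join").getOrCreate()
import spark.implicits._

val df1 = Seq(
  (1, "Test1", "Amount paid to User1 dt"),
  (3, "Test3", "Amount paid to balamurugan dt")
).toDF("ID", "Name", "Text")

// Both "User1" and "paid" match row 1 as whole words; "paid" also matches row 3.
val df2 = Seq("User1", "paid").toDF("Text")

val pattern = concat(lit("\\b"), df2("Text"), lit("\\b"))
val cond = regexp_replace(df1("Text"), pattern, lit("")) =!= df1("Text")

// Inner join: one output row per matching (df1 row, df2 row) pair, so row 1 appears twice.
val innerRows = df1.join(df2, cond)

// left_semi join: each matching df1 row appears once, with df1's columns only.
val semiRows = df1.join(df2, cond, "left_semi")

innerRows.show(false)
semiRows.show(false)
```

Here the inner join yields three rows (two for ID 1, one for ID 3), while the left_semi join yields two rows and drops df2's Text column.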