pyspark 字符串匹配 - 选择第一个匹配项答案

【问题标题】：pyspark string matching - pick first matchpyspark 字符串匹配 - 选择第一个匹配项
【发布时间】：2021-04-11 11:46:54
【问题描述】：

我有两张桌子。

comment_df

| Date | Comment | 
|:---- |:------:| 
| 20/01/2020 | Transfer from Euro Account to HSBC account done on Monday but AMEX payment was on Tue. |
| 20/01/2020 | Brian initiated a Transfer from Euro Account to Natwest last Tuesday. |
| 21/01/2020 | AMEX payment to Natwest was delayed for second time in a row. |
| 21/01/2020 | AMEX receipts from Euro Account delayed. |

code_df

| Tag | Comment | 
|:---- |:------:| 
| EURO | Euro Account to HSBC |
| Natwest | Euro Account to Natwest |
| AMEX | AMEX payment |

想要的表

| Date | Comment | Tag |
|:---- |:------:| ----:|
| 20/01/2020 | Transfer from Euro Account to HSBC account done on Monday but AMEX payment was on Tue.| EURO |
| 20/01/2020 | Brian initiated a Transfer from Euro Account to Natwest last Tuesday. | Natwest |
| 21/01/2020 | AMEX payment to Natwest was delayed for second time in a row. | AMEX | 
| 21/01/2020 | AMEX receipts from Euro Account delayed. | |

所以第一条评论有两个标签（欧元账户到汇丰银行和美国运通支付），但我希望结果显示它遇到的第一个标签，而不是重复行。以下是之前的建议。

code_df = code_df.withColumnRenamed('Comment', 'Commentcode')

result = comment_df.join(code_df, comment_df.Comment.contains(code_df.Commentcode), 'left').drop('Commentcode')

result.show(truncate=False)

+----------+---------------------------------------------------------------------+-------+
|Date      |Comment                                                              |Tag    |
+----------+---------------------------------------------------------------------+-------+
|20/01/2020|Transfer from Euro Account to HSBC account done on Monday but AMEX payment was on Tue. |EURO|
|20/01/2020|Brian initiated a Transfer from Euro Account to Natwest last Tuesday.|Natwest|
|21/01/2020|AMEX payment to Natwest was delayed for second time in a row.        |AMEX|
|21/01/2020|AMEX receipts from Euro Account delayed.                             |null|
+----------+---------------------------------------------------------------------+-------+

【问题讨论】：

你能不能提一下预期的输出？

标签： string apache-spark pyspark apache-spark-sql string-matching

【解决方案1】：

您可以根据匹配字符串的位置对结果进行排序，通过过滤位置只得到第一个匹配。

from pyspark.sql import functions as F, Window

result = comment_df.join(
    code_df,
    comment_df.Comment.contains(code_df.Commentcode),
    'left'
).withColumn(
    'rn',
    F.row_number().over(
        Window.partitionBy('Date', 'Comment')
              .orderBy(F.expr('instr(Comment, Commentcode)'))
    )
).filter('rn = 1')

【讨论】：