【Posted】: 2021-04-11 11:46:54
【Question】:
I have two tables (PySpark DataFrames).
comment_df
| Date | Comment |
|:---- |:------:|
| 20/01/2020 | Transfer from Euro Account to HSBC account done on Monday but AMEX payment was on Tue. |
| 20/01/2020 | Brian initiated a Transfer from Euro Account to Natwest last Tuesday. |
| 21/01/2020 | AMEX payment to Natwest was delayed for second time in a row. |
| 21/01/2020 | AMEX receipts from Euro Account delayed. |
code_df
| Tag | Comment |
|:---- |:------:|
| EURO | Euro Account to HSBC |
| Natwest | Euro Account to Natwest |
| AMEX | AMEX payment |
Desired table
| Date | Comment | Tag |
|:---- |:------:| ----:|
| 20/01/2020 | Transfer from Euro Account to HSBC account done on Monday but AMEX payment was on Tue. | EURO |
| 20/01/2020 | Brian initiated a Transfer from Euro Account to Natwest last Tuesday. | Natwest |
| 21/01/2020 | AMEX payment to Natwest was delayed for second time in a row. | AMEX |
| 21/01/2020 | AMEX receipts from Euro Account delayed. | |
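So the first comment matches two tags (Euro Account to HSBC and AMEX payment), but I want the result to show only the first tag it encounters instead of duplicating the row. Below is what was suggested earlier.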
code_df = code_df.withColumnRenamed('Comment', 'Commentcode')
result = comment_df.join(code_df, comment_df.Comment.contains(code_df.Commentcode), 'left').drop('Commentcode')
result.show(truncate=False)
+----------+---------------------------------------------------------------------+-------+
|Date |Comment |Tag |
+----------+---------------------------------------------------------------------+-------+
|20/01/2020|Transfer from Euro Account to HSBC account done on Monday but AMEX payment was on Tue. |EURO|
|20/01/2020|Brian initiated a Transfer from Euro Account to Natwest last Tuesday.|Natwest|
|21/01/2020|AMEX payment to Natwest was delayed for second time in a row. |AMEX|
|21/01/2020|AMEX receipts from Euro Account delayed. |null|
+----------+---------------------------------------------------------------------+-------+
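One possible way to keep a single tag per comment, offered as a rough sketch rather than a confirmed answer: a left join on `contains` returns one row per matching phrase, so the first comment would come back twice (EURO and AMEX). The sketch below assumes "first tag" means the phrase that starts earliest in the comment text, and that each (Date, Comment) pair occurs only once in `comment_df`. `first_tag`, `tag_first_match`, and `CODES` are hypothetical names introduced here; `first_tag` only pins down the rule in plain Python, and `tag_first_match` applies the same rule in PySpark with `instr` and a `row_number` window.

```python
from typing import Optional, Sequence, Tuple

# Sample (Tag, phrase) pairs copied from code_df in the question.
CODES = [
    ("EURO", "Euro Account to HSBC"),
    ("Natwest", "Euro Account to Natwest"),
    ("AMEX", "AMEX payment"),
]


def first_tag(comment: str, codes: Sequence[Tuple[str, str]]) -> Optional[str]:
    """Plain-Python reference: tag of the phrase that starts earliest in comment."""
    hits = [(comment.find(phrase), tag) for tag, phrase in codes if phrase in comment]
    # min() orders by position first, then by tag name to break ties.
    return min(hits)[1] if hits else None


def tag_first_match(comment_df, code_df):
    """PySpark sketch of the same rule (assumes unique (Date, Comment) rows)."""
    # Imported here so first_tag stays usable without a Spark installation.
    from pyspark.sql import Window, functions as F

    codes = code_df.withColumnRenamed("Comment", "Commentcode")
    joined = comment_df.join(
        codes, comment_df.Comment.contains(codes.Commentcode), "left"
    ).withColumn("pos", F.expr("instr(Comment, Commentcode)"))  # 1-based position, null if no match

    # Rank the matches of each comment by where the phrase starts; keep the earliest.
    w = Window.partitionBy("Date", "Comment").orderBy("pos", "Tag")
    return (
        joined.withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("Commentcode", "pos", "rn")
    )
```

Calling `tag_first_match(comment_df, code_df).show(truncate=False)` should then give one row per comment, with `null` in `Tag` where nothing matched. If "first" is meant as the order of rows in `code_df` instead, the window would need to order by an explicit priority column, since Spark does not guarantee row order on its own.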
【Discussion】:
- Could you state the expected output?
Tags: string apache-spark pyspark apache-spark-sql string-matching