Match two data frames by substring in Python: answers

【Question title】: Match two data frames by substring in Python
【Posted】: 2021-08-12 02:52:07
【Question description】:

I have two large data frames (1000 rows) that I need to match by substring, for example:

df1:

Id    Title
1     The house of pump
2     Where is Andijan
3     The Joker
4     Good bars in Andijan
5     What a beautiful house

df2:

Keyword
house
andijan
joker

The expected output is:

Id    Title                    Keyword
1     The house of pump        house
2     Where is Andijan         andijan
3     The Joker                joker
4     Good bars in Andijan     andijan
5     What a beautiful house   house

So far I have written a very inefficient way to do the matching, and at the real size of the data frames it runs for a very long time:

import numpy as np
df1['Keyword'] = None  # no match yet
for keyword in df2['Keyword']:
    df1['Keyword'] = np.where(df1['Title'].str.contains(keyword, case=False), keyword, df1['Keyword'])

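I am sure there is a more pandas-idiomatic and more efficient way to do the same thing that also runs in a reasonable amount of time.

【Question discussion】:

Tags: python pandas performance optimization string-matching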

【Solution 1】:

Let's try findall

import re
df1['new'] = df1.Title.str.findall('|'.join(df2.Keyword.tolist()), flags=re.IGNORECASE).str[0]
df1
   Id                   Title      new
0   1       The house of pump    house
1   2        Where is Andijan  Andijan
2   3               The Joker    Joker
3   4    Good bars in Andijan  Andijan
4   5  What a beautiful house    house
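Note that the output above returns the text as it appears in the title ("Andijan", "Joker"), while the expected output in the question uses the spelling from df2 ("andijan", "joker"). A minimal sketch of one way to close that gap, assuming every keyword in df2 is already lower case; it uses `str.extract` instead of `findall` and escapes the keywords with `re.escape`, which is an addition not present in the original answer:

```python
import re
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Title": ["The house of pump", "Where is Andijan", "The Joker",
              "Good bars in Andijan", "What a beautiful house"],
})
df2 = pd.DataFrame({"Keyword": ["house", "andijan", "joker"]})

# Escape each keyword so characters such as "." or "+" are matched literally,
# and wrap the alternation in one capture group for str.extract.
pattern = "(" + "|".join(re.escape(k) for k in df2["Keyword"]) + ")"

# extract() keeps the first match per title (NaN when nothing matches).
# Lower-casing maps the match back to the df2 spelling; this only works
# under the assumption that all keywords are lower case.
df1["Keyword"] = (
    df1["Title"]
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)
    .str.lower()
)
print(df1)
```

If the keywords are not all lower case, a lookup from the lower-cased match to the original keyword would be needed instead of the plain `.str.lower()` call.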

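【Discussion】:

Nice. I believe we can drop ".tolist()" here, since "join" gives the same result on a pandas Series.
In terms of clean pandas syntax, I like this solution! In terms of performance, however, it still ran for quite a long time without finishing. For reference, there are about 65,000 keywords. Any idea how to make it more efficient?
I ended up going with this solution. To get a better runtime, I reduced the keyword count to ~1000 and ran my process in batches. That is probably the best approach.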
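For a keyword list of that size, one alternative worth trying is to avoid the large regex alternation altogether. The sketch below is not from the thread and changes the matching rule: it assumes every keyword is a single word and that a title should match only when one of its whitespace-separated tokens equals a keyword, so "houses" or "house," would not match "house". Under that assumption the lookup becomes a set membership test:

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Title": ["The house of pump", "Where is Andijan", "The Joker",
              "Good bars in Andijan", "What a beautiful house"],
})
df2 = pd.DataFrame({"Keyword": ["house", "andijan", "joker"]})

# One row per word of each title; the index still points at the title row.
tokens = df1["Title"].str.lower().str.split().explode().rename("Keyword")

# Keep only the tokens that are keywords (hash lookup per token).
keyword_set = set(df2["Keyword"].str.lower())
hits = tokens[tokens.isin(keyword_set)]

# Join the hits back onto the titles by index. A title with several
# matching words yields several rows; a title with none gets NaN.
result = df1.join(hits, how="left")
print(result)
```

Whether this is acceptable depends on the data: it is a whole-word match rather than a true substring match, and punctuation attached to a word would have to be stripped first.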

【Solution 2】:

Building on @BENY's solution so that several keywords can be retrieved for each title:

import re
import pandas as pd

regex = '|'.join(df2['Keyword'])
# one list of matches per title, then one row per individual match
matches = df1['Title'].str.findall(regex, flags=re.IGNORECASE)
matches_exploded = matches.explode().dropna().rename('Keyword').to_frame()
df1.merge(matches_exploded, left_index=True, right_index=True)
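If one row per title is preferred over one row per match, the list returned by `findall` can be collapsed into a single string instead of being exploded. A minimal sketch, using made-up sample titles (the first one is hypothetical and chosen so that it contains two keywords) and a hypothetical output column name `Keywords`:

```python
import re
import pandas as pd

# Hypothetical sample data; the first title contains two keywords.
df1 = pd.DataFrame({
    "Id": [1, 2],
    "Title": ["The Joker house", "Where is Andijan"],
})
df2 = pd.DataFrame({"Keyword": ["house", "andijan", "joker"]})

# Escaped alternation of all keywords.
regex = "|".join(re.escape(k) for k in df2["Keyword"])

# One list of matches per title.
matches = df1["Title"].str.findall(regex, flags=re.IGNORECASE)

# Lower-case each match, drop duplicates while keeping first-seen order
# (dict.fromkeys preserves insertion order), and join into one string.
df1["Keywords"] = matches.apply(
    lambda found: ", ".join(dict.fromkeys(m.lower() for m in found))
)
print(df1)
```

Titles without any match end up with an empty string in this sketch rather than NaN.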

【Discussion】: