How to extract exact matches with list from a dataframe column? Answers

【Question title】: How to extract exact matches with list from a dataframe column?
【Posted】: 2019-11-30 12:35:41
【Question】:

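I have a large dataframe with text, and I want to search it for matches from a list of words (about 1k words).

I have managed to get the absence/presence of the list words in the dataframe, but it is also important for me to know which word matched. Sometimes a row exactly matches more than one word from the list, and I would like to have all of them.

I tried the code below, but it gives me partial matches: syllables instead of whole words.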

# Code to recreate the initial DF

import pandas as pd

df_data = [['orange', '0'],
           ['apple and lemon', '1'],
           ['lemon and orange', '1']]

# Each row has two values, so only two column names are given here;
# 'exact word' is the column to be computed.
df = pd.DataFrame(df_data, columns=['text', 'match'])

Initial DF:

 text                 match
 orange               0
 apple and lemon      1
 lemon and orange     1

This is the list of words I need to match:

 exactmatch = ['apple', 'lemon']

Expected result:

 text                    match  exact words
 orange                    0         0 
 apple and lemon           1        'apple','lemon'
 lemon and orange          1        'lemon'

This is what I tried:

# for some rows it gives me words I want, 
#and for some it gives me parts of the word

#regex attempt 1, gives me partial matches (syllables or single letters)

pattern1 = '|'.join(exactmatch)
df['contains'] = df['text'].str.extract("(" + "|".join(exactmatch) 
+")", expand=False)

#regex attempt 2 - this gives me an error - unexpected EOL

df['contains'] = df['text'].str.extractall
("(" + "|".join(exactmatch) +")").unstack().apply(','.join, 1)

#TypeError: ('sequence item 1: expected str instance, float found', 
#'occurred at index 2')

#no regex attempt, does not give me matches if the word is in there

lst = list(df['text'])
match = []
for w in lst:
 if w in exactmatch:
    match.append(w)
    break

【Comments】:

Can you post your expected output?
@harvpan The expected output is in the df, in the 'exact words' column. I will edit the question now.

Tags: python regex pandas dataframe

【Solution 1】:

Use str.findall

For example:

import pandas as pd

exactmatch = ['apple', 'lemon']
df_data = [['orange'], ['apple and lemon'], ['lemon and orange']]

df = pd.DataFrame(df_data, columns=['text'])
# findall returns a list of all matches per row; join each list into one string
df['exact word'] = df["text"].str.findall("|".join(exactmatch)).apply(", ".join)
print(df)

Output:

               text    exact word
0            orange              
1   apple and lemon  apple, lemon
2  lemon and orange         lemon

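【Comments】:

Thanks! It works, but besides the exact matches it also gives me syllable matches in my larger dataset. For example, one of the matches looks like this: "a, la, et, identify, la, are, la, ideology, ...". I need the words "identify" and "ideology" because they are in my list, but I am not sure how to eliminate the partial matches (letter combinations).
Looks like you need the regex word boundary \b
Thanks :) Could you show me where I should put them?
Ex: str.findall(r"\b"+"|".join(exactmatch) + r"\b")
@Rakesh It seems the regex boundary still gives the same results as the ones alinaz mentioned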
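
A likely reason the boundary pattern from the comments still returns partial matches is that alternation binds more loosely than concatenation, so \bapple|lemon\b is read as (\bapple)|(lemon\b) and each boundary applies to one alternative only. The sketch below illustrates this under stated assumptions: the words 'art' and 'ear' and the two sample sentences are made up for the demonstration and do not come from the thread.

```python
import re
import pandas as pd

# Hypothetical word list and texts, chosen so that partial matches are visible
exactmatch = ['art', 'ear']
s = pd.Series(['the artist can hear', 'art by ear'])

# No boundaries: plain substring search
no_boundary = "|".join(exactmatch)

# Boundaries without a group: parsed as (\bart)|(ear\b), so 'art' only needs
# a boundary before it and 'ear' only needs a boundary after it
ungrouped = r"\b" + "|".join(exactmatch) + r"\b"

# Boundaries around a non-capturing group: every alternative needs a
# boundary on both sides; re.escape guards against regex metacharacters
grouped = r"\b(?:" + "|".join(map(re.escape, exactmatch)) + r")\b"

for name, pat in [("no_boundary", no_boundary),
                  ("ungrouped", ungrouped),
                  ("grouped", grouped)]:
    print(name, s.str.findall(pat).tolist())
# no_boundary [['art', 'ear'], ['art', 'ear']]
# ungrouped [['art', 'ear'], ['art', 'ear']]
# grouped [[], ['art', 'ear']]
```

Only the grouped pattern rejects 'art' inside 'artist' and 'ear' inside 'hear', which matches the behaviour the asker wants.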

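【Solution 2】:

Matching certain words as "exact" words is not a trivial regex task. The final solution depends on your specific use case, i.e. on what "exact" means in each particular scenario.

You need to build the pattern dynamically from the word list, using one of the approaches described in "Match a whole word in a string using dynamic regex" or "Word boundary with words starting or ending with special characters gives unexpected results".

Then you can simply use Series.str.findall and not worry about whether your pattern contains a capturing group: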

import pandas as pd

df = pd.DataFrame({'text':['orange','apple and lemon', 'lemon and orange'], 'match':['0','1','1']})
exactmatch = ['apple', 'lemon']
pattern = fr'\b({"|".join(exactmatch)})\b' # This works for words consisting of letters, digits or underscores
df['exact word'] = df['text'].str.findall(pattern).str.join(", ")
# => >>> df
# =>                text match    exact word
# => 0            orange     0              
# => 1   apple and lemon     1  apple, lemon
# => 2  lemon and orange     1         lemon
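
To illustrate the remark about capturing groups, here is a minimal sketch, assuming the same sample data as above. It shows that Series.str.findall returns the same lists whether the alternation sits in a capturing or a non-capturing group, while Series.str.extract rejects a pattern with no capturing group. The names found_cap, found_noncap and extract_error, and the idea of recomputing the match flag from the result, are illustrative additions and not part of the original answer.

```python
import re
import pandas as pd

df = pd.DataFrame({'text': ['orange', 'apple and lemon', 'lemon and orange']})
exactmatch = ['apple', 'lemon']
alternation = "|".join(re.escape(w) for w in exactmatch)

capturing = fr'\b({alternation})\b'        # one capturing group
non_capturing = fr'\b(?:{alternation})\b'  # no capturing group

# With zero or one group, findall yields the matched words either way
found_cap = df['text'].str.findall(capturing)
found_noncap = df['text'].str.findall(non_capturing)
print(found_cap.tolist())     # [[], ['apple', 'lemon'], ['lemon']]
print(found_noncap.tolist())  # [[], ['apple', 'lemon'], ['lemon']]

# str.extract, by contrast, needs at least one capturing group
extract_error = None
try:
    df['text'].str.extract(non_capturing)
except ValueError as exc:
    extract_error = exc
    print("extract failed:", exc)

# The presence flag can be derived from the same result
df['exact word'] = found_cap.str.join(", ")
df['match'] = found_cap.str.len().gt(0).astype(int)
print(df)
```

Deriving the flag from the findall result keeps the match column and the exact word column consistent with each other by construction.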

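If you need exact matching that relies on something other than the \b word boundary (the patterns below call re.escape, so import re is required):

Whole-string match: fr'^({"|".join([re.escape(word) for word in exactmatch])})\Z' (this is the strangest case for .findall; Series.str.extract makes more sense, and even non-regex approaches such as .isin should be considered here)
Word boundaries with longest-match support, for when the words can contain special characters inside the word and the terms can overlap (extracting sour lemon from I have a sour lemon when the words are ['sour', 'lemon', 'sour lemon']): @ 987654333@
Whitespace boundaries (a match occurs only between whitespace characters, or between whitespace and the start/end of the string): pattern = fr'(?<!\S)({"|".join([re.escape(word) for word in sorted(exactmatch, key=len, reverse=True)])})(?!\S)'
Unambiguous word boundaries (no match between word characters, i.e. letters, digits and underscores): pattern = fr'(?<!\w)({"|".join([re.escape(word) for word in sorted(exactmatch, key=len, reverse=True)])})(?!\w)'
Unambiguous word boundaries minus the underscore (no match between letters or digits, but _lemon_ counts as an exact lemon word): pattern = fr'(?<![^\W_])({"|".join([re.escape(word) for word in sorted(exactmatch, key=len, reverse=True)])})(?![^\W_])'
Letter boundaries (no match between letters, but _lemon_ and 0lemon1 both count as an exact lemon word): pattern = fr'(?<![^\W\d_])({"|".join([re.escape(word) for word in sorted(exactmatch, key=len, reverse=True)])})(?![^\W\d_])'
Adaptive dynamic word boundaries, type 1 (when you have no control over the words to match, they can contain special characters anywhere, and there are no special context restrictions for leading and trailing special characters): @987654343 @
Adaptive dynamic word boundaries, type 2 (when you have no control over the words to match, they can contain special characters anywhere, and if a word starts or ends with a special character, no other word character may appear next to it): pattern = fr'(?:\B(?!\w)|\b(?=\w))({"|".join([re.escape(word) for word in sorted(exactmatch, key=len, reverse=True)])})(?:(?<=\w)\b|(?<!\w)\B)'.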
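
The boundary variants listed above differ mainly in which neighbouring characters they tolerate, and that is easier to see side by side. The following is a rough sketch and not the answer author's code: build_pattern is a hypothetical helper, the word 'c++' and both sample sentences are invented for the comparison, and the alternation is sorted longest first (reverse=True) because that is what longest-match behaviour needs.

```python
import re
import pandas as pd

def build_pattern(words, left, right):
    """Hypothetical helper: escape the words, try longer ones first,
    and wrap the alternation in the given left/right boundary."""
    ordered = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in ordered)
    return f"{left}(?:{alternation}){right}"

# Assumed sample data, including a word made of special characters
words = ['sour', 'lemon', 'sour lemon', 'c++']
s = pd.Series(['c++ and a sour lemon', 'c++, _lemon_ and lemon1'])

patterns = {
    'word boundary':    build_pattern(words, r'\b', r'\b'),
    'whitespace':       build_pattern(words, r'(?<!\S)', r'(?!\S)'),
    'unambiguous word': build_pattern(words, r'(?<!\w)', r'(?!\w)'),
    'letter':           build_pattern(words, r'(?<![^\W\d_])', r'(?![^\W\d_])'),
    'adaptive type 2':  build_pattern(words, r'(?:\B(?!\w)|\b(?=\w))',
                                      r'(?:(?<=\w)\b|(?<!\w)\B)'),
}

results = {name: s.str.findall(pat).tolist() for name, pat in patterns.items()}
for name, found in results.items():
    print(f"{name:17} {found}")
# word boundary     [['sour lemon'], []]
# whitespace        [['c++', 'sour lemon'], []]
# unambiguous word  [['c++', 'sour lemon'], ['c++']]
# letter            [['c++', 'sour lemon'], ['c++', 'lemon', 'lemon']]
# adaptive type 2   [['c++', 'sour lemon'], ['c++']]

# Without the longest-first sort, the shorter terms win over 'sour lemon'
unsorted = r'\b(?:' + "|".join(re.escape(w) for w in words) + r')\b'
unsorted_result = s.str.findall(unsorted).tolist()
print(unsorted_result[0])  # ['sour', 'lemon']
```

Note how plain \b never finds 'c++', since there is no word boundary between '+' and a following space or comma, which is the situation the whitespace and adaptive variants are meant to handle.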

【Comments】: