Identifying sentences which contain a word, and displaying that word in a column using str.contains

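【Question title】: Identifying sentences which contain a word, and displaying that word in a column using str.contains
【Posted】: 2021-09-30 13:22:56
【Question description】:

I have a dataframe of sentences. I want to use str.contains to check whether each sentence contains any of a set of words, and then add the words that were found to a column of the dataframe. For example, given this input: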

dataset['sentence']
0    I am using a macbookpro and I like it.
1    I am using dell with windows OS and I love it.
        ........


searchfor = '|'.join(searchfor)
searchfor

0  windows|macbook|love


dataset['Match']=dataset['sentence'].str.contains(searchfor)
dataset['Matchcount']=dataset['sentence'].str.count(searchfor)

**Expected Output:**
    sentence                                             Match            Matched Word      Matchcount
    I am using a macbookpro and I like it.               True             macbook           1
    I am using dell with windows OS and I love it.       True             windows,love       2

How can I get the "Matched Word" column in the output? Thanks.

【Question discussion】:

Tags: python dataframe nlp substring contains

【Solution 1】:

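I would probably start by looking at spaCy's pattern matching together with NER. The pattern-matching rules spaCy provides are very powerful, especially when combined with its statistical NER models. You can even use the patterns you develop to create your own custom NER model. The Doc that spaCy returns gives you the position of each match within the string you supplied, so you can pull the matched text out and display it alongside the input text.

Adding REGEX entities to SpaCy's Matcher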

【Discussion】:

【Solution 2】:

This may not be the most efficient approach, but it should do the job:

import re

regex = re.compile(searchfor)
dataset["MatchedWords"] = dataset.apply(lambda l: set(regex.findall(l["sentence"])), axis=1)

If you want a string or a list instead, wrap the set in a join or list call.

Edit: does this minimal example give you the expected result?

import re
import pandas

dataset = pandas.DataFrame({'sentence': ["I am using a macbookpro and I like it.", "I am using dell with windows OS and I love it."]})
regex = re.compile("windows|macbook|love")
dataset["MatchedWords"] = dataset.apply(lambda l: set(regex.findall(l["sentence"])), axis=1)
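
For reference, here is a minimal sketch of how all three columns from the question's expected output could be built in one pass. It swaps the row-wise apply for pandas' own Series.str.findall, which is a different route from the answer above, not a correction of it. The terms list is an assumption reconstructed from the question's joined pattern, and the column names are copied from the expected output.

```python
import re

import pandas as pd

dataset = pd.DataFrame({
    "sentence": [
        "I am using a macbookpro and I like it.",
        "I am using dell with windows OS and I love it.",
    ]
})

# Assumed search terms, reconstructed from the question's joined pattern.
terms = ["windows", "macbook", "love"]
pattern = "|".join(re.escape(t) for t in terms)

# str.findall returns, for each row, a list of every match in order of appearance.
found = dataset["sentence"].str.findall(pattern)

dataset["Match"] = found.str.len() > 0
# dict.fromkeys drops repeated words while keeping first-seen order.
dataset["Matched Word"] = found.apply(lambda words: ",".join(dict.fromkeys(words)))
dataset["Matchcount"] = found.str.len()

print(dataset)
```

Matchcount here counts every occurrence, the same as str.count in the question, while Matched Word lists each distinct word once.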

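【Discussion】:

Hi, I used the following: regex = re.compile(searchfor) dataset["MatchedWords"] = dataset.apply(lambda l: set(regex.findall(l["sentence"])), axis=1). However, I get an empty {} in every row. Basically it is telling me that no match was found. Am I doing something wrong here?
Are the strings you are searching for more than just words? If they contain special characters, that can break the regex. Does this at least work with the data from your example? [moved code to the answer] Does this minimal example give the expected result?
Yes, they are 2-3 words combined together, such as "infectious disease" and "social distancing". There are no special characters; I stripped those out while processing the data. I have about 20+ search terms. When I reduce the list to 3-5 terms the code above works, but when I add all the terms it gives me an empty list.
Then I'm at a loss. I would try to determine whether a specific subset of those 20+ terms is causing the problem (try the two halves separately, then repeat on whichever half shows the problem until I find the terms responsible). Also, completely unrelated to the problem, I would ignore case by adding the re.IGNORECASE option to the re.compile call.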
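
The debugging advice in this thread can be sketched roughly as follows, using only the standard re module. The terms and sentences are made up for illustration (the multi-word terms echo the kind of phrases mentioned in the comments), and the cause of the empty results in the original data is not known. The sketch escapes every term, ignores case, and checks each term on its own to see which ones never match.

```python
import re

# Hypothetical search terms and sentences, for illustration only.
terms = ["infectious disease", "social distancing", "windows", "quarantine"]
sentences = [
    "Social distancing slows the spread of infectious disease.",
    "I am using dell with windows OS.",
]

# re.escape guards against regex metacharacters hiding inside a term;
# re.IGNORECASE lets "Social" match the term "social".
combined = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
matches = [combined.findall(s) for s in sentences]

# Test each term alone to find the ones that never match any sentence.
unmatched = [
    t for t in terms
    if not any(re.search(re.escape(t), s, re.IGNORECASE) for s in sentences)
]

print(matches)
print(unmatched)
```

Checking terms one at a time costs one pass per term, which is fine for 20 or so terms and simpler than bisecting the list by hand.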