【Question title】: Count positive and negative words in a string considering negatives and negation
【Posted】: 2025-12-28 07:20:09
【Question description】:

The following code matches the positive and negative words in a text and counts them. Consider, for example:

sentences<-c("You are not perfect!", 
            "However, let us not forget what happened across the Atlantic.", 
            "And I can't support you.",
            "No abnormal energy readings",
            "So with gratitude, the universe is abundant forever.")

First we import the positive and negative words from two txt files:

pos = readLines("positive-words.txt")
neg = readLines("negative-words.txt")

In positive-words.txt we find:

abundant
gratitude
perfect
support

and in negative-words.txt:

abnormal

The following commands remove punctuation, control characters and digits:

sentence = gsub("[[:punct:]]", "", sentence)
sentence = gsub("[[:cntrl:]]", "", sentence)
sentence = gsub('\\d+', '', sentence)

Then we split the sentence into words with str_split (stringr package):

word.list = str_split(sentence, "\\s+")
words = unlist(word.list)

and compare the words against the dictionaries of positive and negative terms:

pos.matches = match(words, pos)
neg.matches = match(words, neg)
pos.matches = !is.na(pos.matches)
neg.matches = !is.na(neg.matches)

The variable sentence can be any of sentences[1], sentences[2], sentences[3], sentences[4] or sentences[5]. For example, if sentence=sentences[5], this code correctly returns two positive words; the result is:

> pos.matches
[1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE

The same holds for the other sentences. For example, if sentence=sentences[4]:

> neg.matches
[1] FALSE  TRUE FALSE FALSE
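For reference, the steps so far can be collected into one self-contained sketch. The word lists are inlined here as stand-ins for the two txt files, base R's strsplit is swapped in for str_split so no package is needed, and match_words is a hypothetical helper name that does not appear in the original code.

```r
# Inline stand-ins for positive-words.txt and negative-words.txt (assumption)
pos <- c("abundant", "gratitude", "perfect", "support")
neg <- c("abnormal")

# Hypothetical helper: clean one sentence, split it into words,
# and flag which words appear in each dictionary
match_words <- function(sentence, pos, neg) {
  sentence <- gsub("[[:punct:]]", "", sentence)   # drop punctuation
  sentence <- gsub("[[:cntrl:]]", "", sentence)   # drop control characters
  sentence <- gsub("\\d+", "", sentence)          # drop digits
  words <- unlist(strsplit(sentence, "\\s+"))     # base R instead of str_split
  list(words = words,
       pos.matches = !is.na(match(words, pos)),
       neg.matches = !is.na(match(words, neg)))
}

res5 <- match_words("So with gratitude, the universe is abundant forever.", pos, neg)
res4 <- match_words("No abnormal energy readings", pos, neg)

sum(res5$pos.matches)  # number of positive words in sentences[5]
sum(res4$neg.matches)  # number of negative words in sentences[4]
```

Note that the comparison is case-sensitive and that removing punctuation turns "can't" into "cant", which matters for any negation list applied after this cleaning step.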

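However, I would like to modify this code to handle the cases in sentences[1], sentences[3] and sentences[4]. Specifically: perfect in sentences[1] is a positive word, but it is preceded by not, so I would like to treat the two words together as one negative word; support in sentences[3] is a positive word, but it is preceded by cant, so I would like to treat the two words together as one negative word; abnormal in sentences[4] is a negative word, but it is preceded by no, so I would like to treat the two words together as one positive word. For example, the desired result for sentence=sentences[4] is: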

> pos.matches
[1] TRUE FALSE FALSE

Instead, this code gives me:

> pos.matches
[1] FALSE FALSE FALSE FALSE

I thought of defining a variable with the negatives and negations:

NegativesNegations <- paste0("\\b(", paste(c("no","not","couldnt","cant"), collapse = "|"), ")\\b")

but I don't know how to proceed from here.

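【Question discussion】:

Tags: r regex string

【Solution 1】:

You can do this with plain regular expressions. First, turn the positive and negative word lists into regex strings, just as you did for the list of negatives and negations: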

pos_rgx = paste0("\\b(", paste(pos, collapse="|"), ")\\b")
neg_rgx = paste0("\\b(", paste(neg, collapse="|"), ")\\b")

You can now check whether each sentence contains a positive or a negative word:

grepl(pos_rgx, sentences, ignore.case=TRUE)
grepl(neg_rgx, sentences, ignore.case=TRUE)
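Note that grepl only reports whether a sentence contains at least one match. If you need counts per sentence, one option is to combine gregexpr with regmatches, as in the sketch below. The inlined word lists stand in for the txt files, count_hits is a hypothetical helper name, and perl = TRUE is my own choice of regex engine rather than something the code above requires.

```r
sentences <- c("You are not perfect!",
               "However, let us not forget what happened across the Atlantic.",
               "And I can't support you.",
               "No abnormal energy readings",
               "So with gratitude, the universe is abundant forever.")

# Inline stand-ins for positive-words.txt and negative-words.txt (assumption)
pos <- c("abundant", "gratitude", "perfect", "support")
neg <- c("abnormal")

pos_rgx <- paste0("\\b(", paste(pos, collapse = "|"), ")\\b")
neg_rgx <- paste0("\\b(", paste(neg, collapse = "|"), ")\\b")

# Hypothetical helper: number of regex matches in each element of x
count_hits <- function(rgx, x) {
  hits <- gregexpr(rgx, x, ignore.case = TRUE, perl = TRUE)
  lengths(regmatches(x, hits))   # elements with no match contribute 0
}

pos_counts <- count_hits(pos_rgx, sentences)
neg_counts <- count_hits(neg_rgx, sentences)
pos_counts  # 1 0 1 0 2
neg_counts  # 0 0 0 1 0
```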

To add the negations, proceed in the same way:

pos_neg_rgx = paste0('\\b(no|not|couldn\'t|can\'t)\\s', pos_rgx)
grepl(pos_neg_rgx, sentences)

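Note: '\\s' requires exactly one whitespace character between the negation and the positive word.

Note (2): if you quote the string with single quotes, you have to escape the quote in "can't" (as in the example). Alternatively, you can quote the string with double quotes: "\\b(no|not|couldn't|can't)\\s"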
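To get from these patterns to the counts the question asks for, one possible approach is to count the raw matches, count the negated matches, and move each negated match to the opposite side. The following is only a sketch under some assumptions: the word lists are inlined instead of read from the txt files, count_hits and the other variable names are hypothetical, perl = TRUE is my choice, and it works on the raw sentences, so the apostrophe in "can't" is still present.

```r
sentences <- c("You are not perfect!",
               "However, let us not forget what happened across the Atlantic.",
               "And I can't support you.",
               "No abnormal energy readings",
               "So with gratitude, the universe is abundant forever.")

# Inline stand-ins for positive-words.txt and negative-words.txt (assumption)
pos <- c("abundant", "gratitude", "perfect", "support")
neg <- c("abnormal")

pos_rgx <- paste0("\\b(", paste(pos, collapse = "|"), ")\\b")
neg_rgx <- paste0("\\b(", paste(neg, collapse = "|"), ")\\b")

# A negator followed by one whitespace character and a dictionary word
negator_rgx <- "\\b(no|not|couldn't|can't)\\s"
negated_pos_rgx <- paste0(negator_rgx, pos_rgx)
negated_neg_rgx <- paste0(negator_rgx, neg_rgx)

# Hypothetical helper: number of regex matches in each element of x
count_hits <- function(rgx, x) {
  lengths(regmatches(x, gregexpr(rgx, x, ignore.case = TRUE, perl = TRUE)))
}

raw_pos  <- count_hits(pos_rgx, sentences)
raw_neg  <- count_hits(neg_rgx, sentences)
flip_pos <- count_hits(negated_pos_rgx, sentences)  # positives that count as negative
flip_neg <- count_hits(negated_neg_rgx, sentences)  # negatives that count as positive

pos_count <- raw_pos - flip_pos + flip_neg
neg_count <- raw_neg - flip_neg + flip_pos
data.frame(sentence = sentences, pos_count, neg_count)
```

With the five example sentences this treats "not perfect" and "can't support" as negative and "No abnormal" as positive. It only handles a negator that sits directly before the dictionary word; constructions such as "not very perfect" are not covered.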

If you want to dig deeper into text mining, have a look at the tidytext package.

【Discussion】: