【Question title】: String matching where strings contain punctuation
【Posted】: 2019-03-13 22:58:26
【Question】:

I want to use grepl() to find case-insensitive matches.

I want to find the following list of keywords in the Text column of my data frame df.

# There is a long list of words, but for simplification I have provided only a subset.
I, I'm, the, and, to, a, of

I want to count these words separately for each row. I define the word list used in my code as:

word_list = c('\\bI\\b','\\bthe\\b','\\band\\b','\\bto\\b','\\ba\\b','\\bof\\b')
# Note that I'm is not currently in this word_list

In my data frame df, I add the following columns to hold the counts for the words above:

df$I    = 0
df$IM   = 0   # this is where I need help
df$THE  = 0
df$AND  = 0
df$TO   = 0
df$A    = 0
df$OF   = 0

Then, for each word in the word list, I use the following for loop to iterate over every row of the relevant column.

# for each word of my word_list
for (i in 1:length(word_list)){ 

  # to search in each row of text response 
  for(j in 1:nrow(df)){

    if(grepl(word_list[i], df$Text[j], ignore.case = TRUE)){   
      df[j,i+4] = (df[j,i+4]) + 1    # 4 is added to go to the specific column

    }#if 
  }#for
}#for

For a reproducible example, dput(df) is as follows:

dput(df)

structure(list(cluster3 = c(2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L), userID = c(3016094L, 3042038L, 3079341L, 3079396L, 3130832L, 3130864L, 3148118L, 3148914L, 3149040L, 3150222L), Text = structure(c(3L, 4L, 2L, 9L, 6L, 10L, 7L, 1L, 5L, 8L), .Label = c("I'm alright","I'm stressed", "I am a good person.", "I don't care", "I have a difficult task", "I like it", "I think it doesn't matter", "Let's not argue about this", "Let's see if I can run", "No, I'm not in a mood"), class = "factor"), I = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), IM = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), AND = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), THE = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), TO = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), OF = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -10L))

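【Comments on the question】:

You can use double quotes " to quote a string that contains a single quote ' (and vice versa). So just add "\\bI'm\\b" to your word list.
Thanks @Gregor, I tried that as well!
Also, \\b is a regex pattern, so if you set fixed = TRUE it is matched literally rather than treated as a word boundary.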
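
The two comments above can be checked with a short sketch. It uses three of the sample sentences from the question's dput(df); the variable names (text, pattern, regex_hits, fixed_hits) are made up for illustration and are not from the original post.

```r
# Sample sentences taken from the question's dput(df)
text <- c("I'm alright", "I am a good person.", "No, I'm not in a mood")

# Double quotes let the pattern contain an apostrophe without escaping it
pattern <- "\\bI'm\\b"

# Regex matching: \\b acts as a word boundary and case is ignored
regex_hits <- grepl(pattern, text, ignore.case = TRUE)

# Fixed matching: the pattern is taken literally, so grepl looks for the
# characters backslash, b, I, ', m, backslash, b and finds nothing here
fixed_hits <- grepl(pattern, text, fixed = TRUE)

print(regex_hits)  # TRUE FALSE TRUE
print(fixed_hits)  # FALSE FALSE FALSE
```

So the double-quoted pattern only behaves as intended when it is interpreted as a regular expression.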

Tags: r string string-matching grepl

【Solution 1】:

I would suggest a more streamlined approach:

## use a named vector for the word patterns
## with the column names you want to add to `df`
word_list = c('I' = '\\bi\\b', 'THE' = '\\bthe\\b', 'AND' = '\\band\\b',
              'TO' = '\\bto\\b', 'A' = '\\ba\\b', 'OF' = '\\bof\\b', 'IM' = "\\bim\\b")

## use `stringr::str_count` instead of `grepl`
## sapply does the looping and result gathering for us
library(stringr)
results = sapply(word_list, str_count,
   string = gsub("[[:punct:]]", "", tolower(df$Text))
)
results
#       I THE AND TO A OF IM
#  [1,] 1   3   2  1 1  1  0
#  [2,] 0   0   1  0 0  0  0
#  [3,] 0   0   0  0 0  0  0
#  [4,] 2   2   3  2 1  1  1
#  [5,] 0   0   0  1 1  0  0
#  [6,] 0   3   2  2 0  0  0
#  [7,] 1   3   0  1 1  0  0
#  [8,] 1   2   0  1 1  1  0
#  [9,] 0   0   0  0 0  0  0
# [10,] 0   0   0  1 2  0  0

## put the results into the data frame based on the names
df[colnames(results)] = data.frame(results)

Because this relies on the vectorized str_count, it should be much faster than the row-by-row approach.

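【Comments】:

Thanks @Gregor. Your solution is simpler; however, the count for "I" is incorrect. It also includes the occurrences of "I'm". I have updated the dput(), maybe you could re-test your script?
Ah, yes, the apostrophe is treated as a word boundary and matches \\b. You could try to anticipate all the combinations, e.g. with a pattern like '\\bI[ .,?:;!-()\\[\\]*]', but it is probably safer to remove all punctuation in an earlier step. Then you can use '\\bim\\b' for I'm and leave everything else as it is. That is what I would recommend, and I will update the answer to match.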
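Could you help me further with this? I now have lists of positive and negative sentiment words from: cs.uic.edu/~liub/FBS/sentiment-analysis.html I want to count how many positive and how many negative words each row contains. That basically means replacing word_list in the example above with positive_emotions_list, and then with negative_emotion_list. I tried your script like this: POS = sapply(positive_list, str_count, string = gsub("[[:punct:]]", "",tolower(surv_motivate$hyp_massage_exp))) but it returns zero for every row.
Well, if the word list has the same format, it should work. Did you make it a named vector like I did? Is the data structure the same? You really haven't given me any clue about what is different. Make sure everything looks similar, and if you still have problems I suggest asking a new question.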
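
The thread does not establish why the sentiment-word attempt returned zeros, so the following is only a sketch of what "the same format" could look like when starting from a plain vector of words. The three words in positive_words are hypothetical stand-ins for the downloaded list, and count_matches is an illustrative base-R helper used here in place of stringr::str_count so the example has no package dependency.

```r
# Hypothetical stand-in for a sentiment list: plain lower-case words
positive_words <- c("good", "like", "alright")

# Wrap each word in word boundaries and name each pattern after its word
positive_patterns <- setNames(paste0("\\b", positive_words, "\\b"),
                              positive_words)

# Sample sentences from the question, cleaned the same way as in the answer
text  <- c("I'm alright", "I am a good person.", "I like it", "I don't care")
clean <- gsub("[[:punct:]]", "", tolower(text))

# Illustrative base-R counter (an assumption, standing in for str_count):
# number of matches of `pattern` in each element of `string`
count_matches <- function(pattern, string) {
  lengths(regmatches(string, gregexpr(pattern, string)))
}

# One column per word, one row per sentence
per_word <- sapply(positive_patterns, count_matches, string = clean)

# Total number of positive words in each row
positive_total <- rowSums(per_word)
print(positive_total)
```

If a real list contains words with regex metacharacters, those words would need escaping before being pasted between the \\b markers.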

【Solution 2】:

I was able to get my code to work by putting the expression in double quotes:

word_list = c('\\bI\\b',"\\bI'm\\b",'\\bthe\\b','\\band\\b','\\bto\\b','\\ba\\b','\\bof\\b')
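
As noted in the comments on Solution 1, the apostrophe counts as a word boundary, so in a list like this one the pattern for the standalone word I would also match rows that only contain I'm. A minimal sketch of that overlap, and of the punctuation-stripping workaround suggested there, using two sentences from the question (the variable names are made up for illustration):

```r
text <- c("I'm alright", "I like it")

# The apostrophe is a non-word character, so \\b matches between I and ',
# and the pattern for the standalone word also fires on "I'm"
raw_hits <- grepl("\\bi\\b", text, ignore.case = TRUE)

# Lower-case and strip punctuation first: "I'm" becomes "im"
clean <- gsub("[[:punct:]]", "", tolower(text))

# After cleaning, the two patterns no longer overlap
i_hits  <- grepl("\\bi\\b",  clean)
im_hits <- grepl("\\bim\\b", clean)

print(raw_hits)  # TRUE TRUE
print(i_hits)    # FALSE TRUE
print(im_hits)   # TRUE FALSE
```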
