[Posted]: 2019-03-13 22:58:26
[Question]:
I want to use grepl() to find case-insensitive matches.
I want to find the following list of keywords in the Text column of my data frame df.
# There is a long list of words, but for simplification I have provided only a subset.
I, I'm, the, and, to, a, of
I want to count these words separately for each data row. I define the list of words to use in my code as:
word_list = c('\\bI\\b','\\bthe\\b','\\band\\b','\\bto\\b','\\ba\\b','\\bof\\b')
# Note that I'm is not currently in this word_list
In my data frame df, I add the following columns to hold the counts of the words above:
df$I = 0
df$IM = 0 # this is where I need help
df$THE = 0
df$AND = 0
df$TO = 0
df$A = 0
df$OF = 0
Then, for each word in the word list, I use the following for loop to iterate over every row of the relevant column.
# for each word of my word_list
for (i in seq_along(word_list)) {
  # search each row of the text responses
  for (j in seq_len(nrow(df))) {
    if (grepl(word_list[i], df$Text[j], ignore.case = TRUE)) {
      df[j, i + 4] = df[j, i + 4] + 1 # 4 is added to go to the specific column
    } # if
  } # for
} # for
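The nested loop above can probably be replaced by one vectorized call per pattern, since grepl() accepts a whole character vector. The following is only a sketch under assumptions: the small df is a stand-in built from a few Text values in the question, the names given to word_list are a hypothetical way to label the output columns, and count_matches is a hypothetical helper, not part of the original code.

```r
# Stand-in data frame; the Text values are copied from the question's example.
df <- data.frame(
  Text = c("I am a good person.", "I don't care",
           "I'm stressed", "Let's not argue about this"),
  stringsAsFactors = FALSE
)

# Named patterns: the names become the column names (assumed naming scheme).
word_list <- c(I = "\\bI\\b", THE = "\\bthe\\b", AND = "\\band\\b",
               TO = "\\bto\\b", A = "\\ba\\b", OF = "\\bof\\b")

# grepl() is vectorized over the text, so one call per pattern covers all rows.
# The result is 1 if the word occurs in the row and 0 otherwise.
df[names(word_list)] <- lapply(word_list, function(p) {
  as.integer(grepl(p, df$Text, ignore.case = TRUE))
})

# Hypothetical helper: number of occurrences per row rather than presence.
count_matches <- function(pattern, text) {
  lengths(regmatches(text, gregexpr(pattern, text, ignore.case = TRUE)))
}

print(df)
```

Assigning columns by name avoids the positional offset used in the loop. Note that grepl() only reports whether a word is present; if a true per-row count is wanted, something like count_matches("\\ba\\b", df$Text) would be needed instead.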
For a reproducible example, dput(df) gives:
dput(df)
structure(list(cluster3 = c(2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L), userID = c(3016094L, 3042038L, 3079341L, 3079396L, 3130832L, 3130864L, 3148118L, 3148914L, 3149040L, 3150222L), Text = structure(c(3L, 4L, 2L, 9L, 6L, 10L, 7L, 1L, 5L, 8L), .Label = c("I'm alright","I'm stressed", "I am a good person.", "I don't care", "I have a difficult task", "I like it", "I think it doesn't matter", "Let's not argue about this", "Let's see if I can run", "No, I'm not in a mood"), class = "factor"), I = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), IM = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), AND = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), THE = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), TO = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), OF = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -10L))
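[Comments]:
- You can use double quotes " to quote a string that contains a single quote ' (and vice versa). So just add "\\bI'm\\b" to your word list.
- Thanks @Gregor, I tried that as well!
- Also, \\b is a regular-expression pattern, so it will be ignored if you set fixed = TRUE.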
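The two suggestions in the comments can be sketched as follows. This is a minimal illustration, not the asker's code: the texts vector reuses a few Text values from the question, and the variable names are made up for the example.

```r
texts <- c("I'm alright", "I am a good person.",
           "No, I'm not in a mood", "Let's see if I can run")

# Double quotes let the pattern contain an apostrophe without escaping it.
pattern_im <- "\\bI'm\\b"

# Regex mode (the default): \\b is a word boundary.
has_im <- grepl(pattern_im, texts, ignore.case = TRUE)

# With fixed = TRUE the pattern is taken literally, so "\\b" no longer means
# a word boundary and the literal backslash-b text is not found anywhere.
has_im_fixed <- grepl(pattern_im, texts, fixed = TRUE)

print(has_im)
print(has_im_fixed)
```

One thing to keep in mind: the existing pattern "\\bI\\b" also matches the I inside "I'm", because the apostrophe counts as a word boundary, so rows containing "I'm" would be flagged under both patterns.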
Tags: r string string-matching grepl