【Posted】:2015-02-23 09:38:56
【Question】:
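I am looking for a simple way to vectorise a for loop in R. I have the following data frame with sentences and two dictionaries of positive and negative words: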
# Create data.frame with sentences
sent <- data.frame(words = c("just right size and i love this notebook", "benefits great laptop",
                             "wouldnt bad notebook", "very good quality", "orgtop",
                             "great improvement for that bad product but overall is not good",
                             "notebook is not good but i love batterytop"),
                   user = c(1,2,3,4,5,6,7),
                   stringsAsFactors=F)
# Create pos/negWords
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
              "extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
              "wouldnt bad")
negWords <- c("hate","bad","not good","horrible")
Now I replicate the original data frame to simulate a big data set:
# Replicate original data.frame - big data simulation (700,000 rows of sentences)
df.expanded <- as.data.frame(replicate(100000,sent$words))
library(zoo) # provides coredata()
sent <- coredata(sent)[rep(seq(nrow(sent)),100000),]
rownames(sent) <- NULL
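Because the simulated data set repeats the same seven sentences, and real data may contain duplicates as well (an assumption), any per-sentence scoring could be run once per distinct sentence and mapped back to all rows with `match()`. A minimal sketch; `toyScore` is a hypothetical stand-in scorer for illustration, not the function from this question:

```r
# Hypothetical stand-in scorer: 1 if the sentence contains the word "good", else 0
toyScore <- function(sentence) {
  as.numeric(grepl("\\bgood\\b", sentence, perl = TRUE))
}

# Small replicated data set for illustration
sentences <- rep(c("very good quality", "orgtop", "good laptop"), times = 1000)

uniq <- unique(sentences)                     # distinct sentences only
uniqScores <- vapply(uniq, toyScore, numeric(1), USE.NAMES = FALSE)

# match() gives, for every row, the position of its sentence in uniq
scores <- uniqScores[match(sentences, uniq)]
```

The scorer is called three times here instead of 3000 times; how much this helps on real data depends on how many sentences actually repeat.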
For my next step, I have to sort the dictionary words, together with their sentiment scores (pos word = 1 and neg word = -1), in descending order of length.
# Ordering words in pos/negWords
wordsDF <- data.frame(words = posWords, value = 1,stringsAsFactors=F)
wordsDF <- rbind(wordsDF,data.frame(words = negWords, value = -1))
wordsDF$lengths <- unlist(lapply(wordsDF$words, nchar))
wordsDF <- wordsDF[order(-wordsDF[,3]),]
rownames(wordsDF) <- NULL
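As a side note on this ordering step, `nchar()` is itself vectorised, so the `lapply`/`unlist` wrapper is not strictly needed. A minimal sketch with a shortened dictionary; `pos`, `neg` and `dict` are illustration names, not objects from the question:

```r
# Shortened dictionaries, for illustration only
pos <- c("great", "great improvement", "good", "very good")
neg <- c("bad", "not good")

dict <- data.frame(words = c(pos, neg),
                   value = rep(c(1, -1), times = c(length(pos), length(neg))),
                   stringsAsFactors = FALSE)

# nchar() works on the whole character vector at once
dict$lengths <- nchar(dict$words)

# Longest entries first, so multi-word phrases are tried before their parts
dict <- dict[order(-dict$lengths), ]
rownames(dict) <- NULL
```

Sorting longest-first is what lets "great improvement" be scored as one entry before "great" and "improvement" get a chance to match separately.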
Then I define the following function with a for loop:
# Sentiment score function
library(qdapRegex) # provides rm_white()
scoreSentence2 <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
    count <- length(grep(matchWords,sentence)) # 1 if the word occurs in the sentence, 0 otherwise
    if(count){
      score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
      sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*', collapse='|'), ' ', sentence) # remove the matched words from the sentence
      sentence <- rm_white(sentence) # clean up leftover whitespace
    }
  }
  score
}
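One detail of the function above: for a single sentence, `length(grep(pattern, sentence))` is 1 if the pattern occurs at all and 0 otherwise, so repeated occurrences are not counted separately. If per-occurrence counts were wanted (an assumption; the question does not say so), `gregexpr()` could be used instead. A small sketch with a hypothetical helper `countMatches`:

```r
# Hypothetical helper: number of matches of `pattern` in each element of `x`
countMatches <- function(pattern, x) {
  m <- gregexpr(pattern, x)
  # gregexpr() reports -1 when a string has no match
  vapply(m, function(pos) sum(pos > 0), integer(1))
}

s <- "good laptop, really good screen"
presence <- length(grep("\\<good\\>", s))     # presence only
occurrences <- countMatches("\\<good\\>", s)  # every occurrence
```

For the sample sentences in the question both approaches give the same scores, because no dictionary entry occurs twice in one sentence.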
Then I call the function above on the sentences in the data frame:
# Apply scoreSentence function to sentences
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
# Time consumption for 700.000 sentences in sent data.frame:
# user system elapsed
# 1054.19 0.09 1056.17
# Add sentiment score to origin sent data.frame
sent <- cbind(sent, SentimentScore2)
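The call above runs the dictionary loop once per sentence. One possible way to vectorise it, sketched here under assumptions and not benchmarked, is to loop over the dictionary entries and let `grepl()` and `gsub()` work on the whole sentence vector at once. `scoreSentences` and `dict` are hypothetical names, the dictionary is shortened, and `perl = TRUE` with `\\b` is used in place of `\\<` and `\\>`:

```r
# Hypothetical vectorised variant: iterate over dictionary entries, not sentences
scoreSentences <- function(sentences, dict) {
  score <- numeric(length(sentences))
  for (i in seq_len(nrow(dict))) {
    word <- dict$words[i]
    hit <- grepl(paste0("\\b", word, "\\b"), sentences, perl = TRUE)
    if (any(hit)) {
      # presence counts once per sentence, as in the original function
      score[hit] <- score[hit] + dict$value[i]
      # remove the matched entry so shorter entries cannot match inside it
      sentences[hit] <- gsub(paste0("\\s*\\b", word, "\\b\\s*"), " ",
                             sentences[hit], perl = TRUE)
    }
  }
  score
}

# Shortened dictionary (illustration only), longest entries first
pos <- c("great", "improvement", "love", "great improvement", "very good", "good",
         "right", "very", "benefits", "top", "benefits great", "wouldnt bad")
neg <- c("bad", "not good")
dict <- data.frame(words = c(pos, neg),
                   value = rep(c(1, -1), times = c(length(pos), length(neg))),
                   stringsAsFactors = FALSE)
dict <- dict[order(-nchar(dict$words)), ]

sentences <- c("just right size and i love this notebook", "benefits great laptop",
               "wouldnt bad notebook", "very good quality", "orgtop",
               "great improvement for that bad product but overall is not good",
               "notebook is not good but i love batterytop")
scores <- scoreSentences(sentences, dict)
```

The number of R-level iterations then equals the dictionary size rather than the number of sentences, which is where any speed-up would come from.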
The desired output is:
Words                                     user  SentimentScore2
just right size and i love this notebook  1     2
benefits great laptop                     2     1
wouldnt bad notebook                      3     1
very good quality                         4     1
orgtop                                    5     0
.
.
.
and so on...
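Could anyone please help me reduce the computing time of my original approach? With my beginner-level programming skills in R, I am at the end of my rope :-) Any help or advice will be greatly appreciated. Thank you very much.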
【Comments】:
-
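From the code I understand that you want to remove the detected words, yet the desired output still contains them. So which part is incorrect, or am I misreading it?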
-
Please elaborate on what you are trying to achieve with the SentimentScore2 function
-
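Removing the words is part of my approach. After sorting the pos/neg words in descending order, I match them against the words in the sentence and then remove them, so that they do not show up again in a later loop iteration. The desired output must still contain them, but it takes a very long time, so that is the problem...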
-
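Basically, for each sentence, search for exact matches of the words in the pos/neg dictionary, then count them and write the result as a score into the original data frame. 1) match exact words 2) count them 3) compute the score (count * sentValue) 4) remove the matched words (the entries from wordsDF) from the sentence. With lapply this is applied to every sentence.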
Tags: r