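【Posted】: 2015-03-04 14:39:18
【Question】:
I have a data.frame sent with sentences in sent$words, and a dictionary of positive/negative words in the data.frame wordsDF (wordsDF[x,1]). Positive words have the value 1 and negative words the value -1 (wordsDF[x,2]). The words in wordsDF are sorted in descending order of their length (number of characters), so that longer entries such as "not good" are processed before shorter ones such as "good". The function below relies on this ordering.
How the function works:
1) For each sentence, count the occurrences of each word stored in wordsDF.
2) Compute the sentiment score: number of occurrences of a given word (from wordsDF) in the sentence * the sentiment value of that word (positive = 1, negative = -1).
3) Remove the matched words from the sentence before the next iteration.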
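To make the three steps concrete, here is a minimal sketch that traces them on a single sentence. The two-entry dictionary dict is hypothetical and only stands in for wordsDF; like wordsDF it is ordered longest entry first, and its entries are padded with spaces.

```r
library(stringr)

# Hypothetical mini dictionary (an assumption for illustration, not the real wordsDF),
# ordered longest entry first and padded with a space on each side.
dict <- data.frame(words = c(" not good ", " good "),
                   value = c(-1, 1),
                   stringsAsFactors = FALSE)

sentence <- " notebook is not good but very good "
score <- 0
for (i in seq_len(nrow(dict))) {
  n <- str_count(sentence, dict$words[i])                    # step 1: count occurrences
  score <- score + n * dict$value[i]                         # step 2: occurrences * sentiment value
  sentence <- str_replace_all(sentence, dict$words[i], " ")  # step 3: remove the matches
}
score
```

Because " not good " is counted and removed first, the later " good " entry only matches the remaining "very good", so the two contributions cancel out. Without step 3, " good " would also be counted inside "not good".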
Original solution using the stringr package:
library(stringr)

scoreSentence_01 <- function(sentence){
  score <- 0
  for(x in seq_len(nrow(wordsDF))){
    count <- str_count(sentence, wordsDF[x,1]) # occurrences of this word in each sentence
    score <- (score + (count * wordsDF[x,2])) # compute score (count * sentValue)
    sentence <- str_replace_all(sentence, wordsDF[x,1], " ") # remove matched words
  }
  score
}
A faster solution: lines 4 and 5 of this function replace line 4 of the original function.
library(stringi)
library(stringr)

scoreSentence_02 <- function(sentence){
  score <- 0
  for(x in seq_len(nrow(wordsDF))){
    sd <- function(text) {stri_count(text, regex=wordsDF[x,1])}
    results <- sapply(sentence, sd, USE.NAMES=F)
    score <- (score + (results * wordsDF[x,2])) # compute score (count * sentValue)
    sentence <- str_replace_all(sentence, wordsDF[x,1], " ") # remove matched words
  }
  score
}
The function is called like this:
scoreSentence_Score <- scoreSentence_01(sent$words)
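In practice I am working with a dataset of 300,000 sentences and a dictionary of positive and negative words containing 7,000 words in total. This approach is very slow, and with my beginner-level knowledge of R programming I have run out of ideas.
Can anyone help me rewrite this function as a vectorized or parallel solution? Any help or advice is much appreciated. Thank you very much.
Dummy data: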
sent <- data.frame(words = c("great just great right size and i love this notebook", "benefits great laptop at the top",
"wouldnt bad notebook and very good", "very good quality", "bad orgtop but great",
"great improvement for that great improvement bad product but overall is not good", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7),
stringsAsFactors=F)
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
"extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
"wouldnt bad")
negWords <- c("hate","bad","not good","horrible")
library(zoo) # coredata() comes from the zoo package
# Replicate original data.frame - big data simulation (7 sentences x 10000 = 70,000 rows of sentences)
df.expanded <- as.data.frame(replicate(10000,sent$words))
sent <- coredata(sent)[rep(seq(nrow(sent)),10000),]
sent$words <- paste(c(""), sent$words, c(""), collapse = NULL)
rownames(sent) <- NULL
# Ordering words in pos/negWords
wordsDF <- data.frame(words = posWords, value = 1,stringsAsFactors=F)
wordsDF <- rbind(wordsDF,data.frame(words = negWords, value = -1))
wordsDF$lengths <- unlist(lapply(wordsDF$words, nchar))
wordsDF <- wordsDF[order(-wordsDF[,3]),]
wordsDF$words <- paste(c(""), wordsDF$words, c(""), collapse = NULL)
rownames(wordsDF) <- NULL
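Both sent$words and wordsDF$words are padded with a space on each side in the code above. This appears to be what restricts matching to whole words, which would explain why "orgtop" and "batterytop" contribute nothing for the dictionary word "top" in the desired output. A small sketch of the difference, using one of the dummy sentences (the variable names are only for illustration):

```r
library(stringr)

# Without padding, "top" is found inside "orgtop".
unpadded <- str_count("bad orgtop but great", "top")

# With padding on both the sentence and the word, only the whole word " top " can match.
padded <- str_count(" bad orgtop but great ", " top ")

c(unpadded = unpadded, padded = padded)
```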
The desired output is:
words                                                                             user scoreSentence_Score
great just great right size and i love this notebook                              1    4
benefits great laptop at the top                                                  2    2
wouldnt bad notebook and very good                                                3    2
very good quality                                                                 4    1
bad orgtop but great                                                              5    0
great improvement for that great improvement bad product but overall is not good  6    0
notebook is not good but i love batterytop                                        7    0
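As a sanity check, the following self-contained sketch rebuilds the seven dummy sentences and the dictionary (without the 10,000-fold replication) and applies the same count, score, remove loop as the original function. The names sentences and dict are chosen here for illustration; paste0() and nchar() are used as shorter equivalents of the padding and length steps in the dummy-data code.

```r
library(stringr)

# The seven dummy sentences, padded with a space on each side.
sentences <- paste0(" ", c(
  "great just great right size and i love this notebook",
  "benefits great laptop at the top",
  "wouldnt bad notebook and very good",
  "very good quality",
  "bad orgtop but great",
  "great improvement for that great improvement bad product but overall is not good",
  "notebook is not good but i love batterytop"), " ")

posWords <- c("great", "improvement", "love", "great improvement", "very good", "good",
              "right", "very", "benefits", "extra", "benefit", "top", "extraordinarily",
              "extraordinary", "super", "benefits super", "good", "benefits great",
              "wouldnt bad")
negWords <- c("hate", "bad", "not good", "horrible")

# Dictionary: value 1 for positive words, -1 for negative words,
# sorted by decreasing length and padded with spaces.
dict <- data.frame(words = c(posWords, negWords),
                   value = rep(c(1, -1), c(length(posWords), length(negWords))),
                   stringsAsFactors = FALSE)
dict <- dict[order(-nchar(dict$words)), ]
dict$words <- paste0(" ", dict$words, " ")

# Same loop as scoreSentence_01, applied to all sentences at once.
score <- numeric(length(sentences))
for (i in seq_len(nrow(dict))) {
  score <- score + str_count(sentences, dict$words[i]) * dict$value[i]
  sentences <- str_replace_all(sentences, dict$words[i], " ")
}
score
```

On this data the scores come out as 4, 2, 2, 1, 0, 0, 0, matching the scoreSentence_Score column above.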
【Discussion】:
Tags: r parallel-processing vectorization