【Question Title】: Vectorizing for loops to speed up a program in R
【Posted】: 2015-02-23 09:38:56
【Question】:

I am looking for some simple ways to vectorize the for loop in my R program. I have the following data frame with sentences and two dictionaries of positive and negative words:

# Create data.frame with sentences
sent <- data.frame(words = c("just right size and i love this notebook", "benefits great laptop",
                         "wouldnt bad notebook", "very good quality", "orgtop",
                         "great improvement for that bad product but overall is not good", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7),
               stringsAsFactors=F)

# Create pos/negWords
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
          "extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
          "wouldnt bad")
negWords <- c("hate","bad","not good","horrible")

Now I replicate the original data frame to simulate a big dataset:

# Replicate the original data.frame - big data simulation (700,000 rows of sentences)
df.expanded <- as.data.frame(replicate(100000,sent$words))
library(zoo) # coredata() below comes from zoo
sent <- coredata(sent)[rep(seq(nrow(sent)),100000),]
rownames(sent) <- NULL

For my next step I have to sort the dictionary words in descending order of length, together with their sentiment scores (pos word = 1 and neg word = -1). Putting longer entries first means that multi-word phrases such as "great improvement" are matched and removed before their single-word components "great" and "improvement":

# Ordering words in pos/negWords
wordsDF <- data.frame(words = posWords, value = 1,stringsAsFactors=F)
wordsDF <- rbind(wordsDF,data.frame(words = negWords, value = -1))
wordsDF$lengths <- unlist(lapply(wordsDF$words, nchar))
wordsDF <- wordsDF[order(-wordsDF[,3]),]
rownames(wordsDF) <- NULL
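
A quick check confirms that the longest phrases now sit at the top (an illustrative snippet, not part of the original question; the tie order relies on R's stable sort):

# Multi-word phrases come first, so they are matched and removed
# before their single-word components get a chance to fire
head(wordsDF$words, 3)
# [1] "great improvement" "extraordinarily"   "benefits super"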

Then I define the following function with a for loop:

# Sentiment score function
scoreSentence2 <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
    count <- length(grep(matchWords,sentence)) # count them
    if(count){
      score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
      sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*', collapse='|'), ' ', sentence) # remove the matched words from the sentence
      library(qdapRegex) # provides rm_white(); ideally loaded once, outside the function
      sentence <- rm_white(sentence) # collapse the whitespace left behind
    }
  }
  score
}
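
Called on a single sentence, the function gives, for example (a small check against the desired output shown further below):

# "very good" is matched first and then removed, so "very" and "good"
# cannot be double-counted afterwards
scoreSentence2("very good quality")
# [1] 1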

Then I call this function on the sentences in my data frame:

# Apply the scoreSentence2 function to the sentences
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
# Time consumption for 700,000 sentences in the sent data.frame:
# user       system    elapsed
# 1054.19    0.09      1056.17
# Add the sentiment score to the original sent data.frame
sent <- cbind(sent, SentimentScore2)

The desired output is:

Words                                             user      SentimentScore2
just right size and i love this notebook          1         2
benefits great laptop                             2         1
wouldnt bad notebook                              3         1
very good quality                                 4         1
orgtop                                            5         0
  .
  .
  .

And so on...

Please, could anyone help me reduce the computation time of my original approach? With my beginner-level programming skills in R, I have hit a dead end :-) Any help or advice from you would be greatly appreciated. Thank you very much in advance.

【Comments】:

  • From the code I understand that you want to remove the detected words, but the desired output still contains them. So which part of it is incorrect, or am I reading it wrong?
  • Please elaborate on what you are trying to achieve with the SentimentScore2 function.
  • Removing the words is part of my approach. After sorting the pos/neg words in descending order, I match them against the words in a sentence and then remove them, so that they cannot match again in a later loop iteration. The desired output has to contain them, but it takes a very long time, and that is the problem...
  • Basically, for each sentence it searches for exact matches of the words in the pos/neg dictionaries, counts them and writes them into the original data frame as a score: 1) match the exact words, 2) count them, 3) compute the score (count * sentValue), 4) remove the matched words from the sentence. With lapply this is applied to every sentence (a small sketch of these four steps follows after this list).
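
For readers who want to see these four steps in isolation, here is a minimal sketch on a single sentence with one hard-coded pattern (illustrative only; the real code loops over all of wordsDF):

s <- "great improvement for that bad product but overall is not good"
pat <- "\\<great improvement\\>"               # longest phrases are tried first
count <- length(grep(pat, s))                  # 1 if the phrase occurs, else 0
score <- count * 1                             # sentValue of this positive entry is 1
s <- gsub('\\s*\\bgreat improvement\\b\\s*', ' ', s) # remove the phrase so that
s                                              # "great" and "improvement" cannot re-match
# [1] " for that bad product but overall is not good"
# (the stray leading space is what rm_white() cleans up in the real code)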

Tags: r


【Solution 1】:

In the spirit of "rather than giving a man a fish, teach him how to fish", I will walk you through it:

  1. Make a copy of your code: you are going to mess things up!

  2. Find the bottleneck:

    1a: Make the problem smaller:

    nRep <- 100
    df.expanded <- as.data.frame(replicate(nRep,sent$words))
    library(zoo)
    sent <- coredata(sent)[rep(seq(nrow(sent)),nRep),]
    

    1b: Keep a reference solution: you are going to change your code, and few activities introduce bugs as surprisingly often as optimizing code!

    sentRef <- sent
    

    Then add the same line, commented out, at the end of your code, so you remember where your reference lives. To make it even easier to check that you are not messing up your code, you can test it automatically at the end of the script:

    library("testthat")
    expect_equal(sent,sentRef)
    

    1c: Wrap the profiler around the code you want to inspect:

    Rprof()
    SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
    Rprof(NULL)
    

    1d: Look at the results, with base R:

    summaryRprof()
    

    There are better tools; you can have a look at the packages profr or lineprof.

    lineprof is my tool of choice, and a real added value here: it lets you narrow the problem down to these two lines:

    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
    count <- length(grep(matchWords,sentence)) # count them
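
    For completeness, a lineprof session might look like the sketch below (my assumption, not shown in the original answer: lineprof is a GitHub-only package that needs source references, so the function must be source()d from a file, and the file name here is hypothetical; profvis is its CRAN successor):

    # devtools::install_github("hadley/lineprof")
    library(lineprof)
    source("scoreSentence2.R") # hypothetical file defining scoreSentence2()
    prof <- lineprof(unlist(lapply(sent$words, scoreSentence2)))
    shine(prof)                # interactive per-line time and memory view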
    
  3. Fix it.

    3.1 Fortunately, the main problem is fairly simple: you do not need the first line to be inside the function, so move it out front. By the way, the same goes for your paste0(). Your code becomes:

    matchWords <- paste("\\<",wordsDF[,1],'\\>', sep="") # matching exact words
    matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')
    
    # Sentiment score function
    scoreSentence2 <- function(sentence){
        score <- 0
        for(x in 1:nrow(wordsDF)){
            count <- length(grep(matchWords[x],sentence)) # count them
            if(count){
                score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
                sentence <- gsub(matchedWords[x],' ', sentence) # remove the matched words from the sentence
                require(qdapRegex)
                # sentence <- rm_white(sentence)
            }
        }
        score
    }
    

    This takes the execution time for 1,000 replications from 5.64 s down to 2.32 s. Not a bad investment!

    3.2 The next bottleneck is the "count" line: you only care whether each pattern matches, not about grep()'s full result, so grepl() will do and the if test can go away entirely:

    matchWords <- paste("\\<",wordsDF[,1],'\\>', sep="") # matching exact words
    matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')
    
    # Sentiment score function
    scoreSentence2 <- function(sentence){
        score <- 0
        for(x in 1:nrow(wordsDF)){
            count <- grepl(matchWords[x],sentence) # count them
            score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
            sentence <- gsub(matchedWords[x],' ', sentence) # remove the matched words from the sentence
            require(qdapRegex)
            # sentence <- rm_white(sentence)
        }
        score
    }
    

That brings us down to 0.18 s, i.e. roughly 31 times faster than where we started...
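
If you want to reproduce such timings yourself, a simple harness around the scoring call is enough (a sketch; set nRep <- 1000 in step 1a to match the replication count quoted above):

system.time(
  SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
)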

【Comments】:

  • Awesome!!! Thank you very much, you helped me a lot. Your approach is the best solution for my task.
【Solution 2】:

You can easily vectorize your scoreSentence2 function, because grepl and gsub are already vectorized over character vectors:

library(qdapRegex) # for rm_white()

scoreSentence <- function(sentence){
  score <- rep(0, length(sentence))
  for(x in 1:nrow(wordsDF)){
    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
    count <- grepl(matchWords, sentence) # count them
    score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
    sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*', collapse='|'), ' ', sentence) # remove the matched words from the sentences
    sentence <- rm_white(sentence) # collapse the whitespace left behind
  }
  return(score)
}
scoreSentence(sent$words)
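
The result can then be attached to the data frame in one assignment, just as in the original code (a small usage sketch):

sent$SentimentScore <- scoreSentence(sent$words)
head(sent)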

Note that count does not actually count how many times the expression appears within one sentence (neither in your version nor in mine); it only tells you whether the expression appears at all. If you want to actually count the occurrences, you can use the following instead:

count <- sapply(gregexpr(matchWords, sentence), function(x) length(x[x>0]))
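
As a quick illustration of why the x > 0 filter is needed: gregexpr() returns -1 for elements without a match, so only genuine match positions are counted (a toy example, not from the original answer):

sapply(gregexpr("\\<good\\>", c("good, very good", "bad")),
       function(x) length(x[x > 0]))
# [1] 2 0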

【Comments】:

  • Great, thank you very much, it is noticeably faster now (3x).