优化缓慢的 for 循环操作答案

【问题标题】：Optimizing a slow for-loop operation优化缓慢的 for 循环操作
【发布时间】：2020-10-22 10:06:37
【问题描述】：

我试图通过迭代data.table 的几列中的元素来创建一个包含全文的列。这是我目前的方法。它按我的预期工作，但是当data.table 变大时，我会浪费大量时间。

library(data.table)

new_df <- data.table(text= c("RT A y...", "RT b...", "XYZ 3...", "RT Ca...", "IO"),
                     full_text= c(NA, NA, "XYZ 378978978", NA, NA),
                     status.text= c("A yes y...", "ball ball", NA, "Call ca...", NA),
                     status.full_text= c("A yes yes yes yes", NA, NA, "Call call call", NA))

#     text     full_text status.text  status.full_text
# 1: RT A y...          <NA>  A yes y... A yes yes yes yes
# 2:   RT b...          <NA>   ball ball              <NA>
# 3:  XYZ 3... XYZ 378978978        <NA>              <NA>
# 4:  RT Ca...          <NA>  Call ca...    Call call call
# 5:        IO          <NA>        <NA>              <NA>
#   

attach_texts_in_df <- function(give_me_df){
  
  #make an empty vector to store texts
  complete_texts <- c()
  
  #loop through each elements of rows
  for(i in seq_along(1:nrow(give_me_df))){
    
    #check if text begins with RT
    if(!grepl('^RT', give_me_df[i, "text"])){
      #check if text is smaller than the full_text, while full text is not NA
      if((nchar(give_me_df[i, "text"]) < nchar(give_me_df[i, "full_text"]))& !is.na(give_me_df[i, "full_text"])){
        complete_texts <- c(complete_texts, give_me_df[i, "full_text"])
        
      }else{
        complete_texts <- c(complete_texts, give_me_df[i, "text"]) # if not, then it's original
      }
      
    }
    else{
      
      if((nchar(give_me_df[i, "status.text"]) < nchar(give_me_df[i, "status.full_text"]))& !is.na(give_me_df[i, "status.full_text"])){
        complete_texts <- c(complete_texts, give_me_df[i, "status.full_text"])
        
      }else{
        complete_texts <- c(complete_texts, give_me_df[i, "status.text"])
      }
      
    }
  }
  
  #attached the proper texts
  give_me_df$complete_text <- complete_texts
  
  #return the vector
  return(give_me_df)
}

new_df <- attach_texts_in_df(new_df)

#this was the what I was looking for and I got it when its small, but big one take a long time!!
#     text     full_text status.text  status.full_text     complete_text
# 1: RT A y...          <NA>  A yes y... A yes yes yes yes A yes yes yes yes
# 2:   RT b...          <NA>   ball ball              <NA>         ball ball
# 3:  XYZ 3... XYZ 378978978        <NA>              <NA>     XYZ 378978978
# 4:  RT Ca...          <NA>  Call ca...    Call call call    Call call call
# 5:        IO          <NA>        <NA>              <NA>                IO

我想知道是否有人可以帮助我优化它。 R对我来说是新的。我知道存在应用函数，但我不知道如何使用这些自定义函数。

感谢您的帮助和提示。谢谢。

编辑：我使用data.table 函数做了以下操作，但是我遗漏了一些数据：

sample_fxn <-  function(t,ft,st,sft){
  if(!grepl('^RT', t)){
    if((nchar(t) < nchar(ft)) & !is.na(ft)){
      return(ft)
    }else{
      return(t)
    }
  }
  else{
    if((nchar(st) < nchar(sft))& !is.na(sft)){
      return(sft)
    }else{
      return(st)
    }
  }
}

new_df <- new_df[ ,complete_texts := sample_fxn(text,
                                                full_text,
                                                status.text,
                                                status.full_text)]

#   text     full_text status.text  status.full_text         complete_texts
# 1: RT A y...          <NA>  A yes y... A yes yes yes yes  A yes yes yes yes 
# 2:   RT b...          <NA>   ball ball              <NA>                <NA>              
# 3:  XYZ 3... XYZ 378978978        <NA>              <NA>                <NA>             
# 4:  RT Ca...          <NA>  Call ca...    Call call call    Call call call    
# 5:        IO          <NA>        <NA>              <NA>                <NA>

这是我阅读了@Henrik 分享的 R Inferno 书中的矢量化版本后的最佳尝试。我想出了：

new_df$complete_texts <- ifelse(!grepl('^RT', new_df$text),
                                yes = ifelse((nchar(new_df$text) < nchar(new_df$full_text))& !is.na(new_df$full_text),
                                             yes = new_df$full_text,
                                             no = new_df$text
                                ),
                                no = ifelse((nchar(new_df$status.text) < nchar(new_df$status.full_text))& !is.na(new_df$status.full_text),
                                            yes = new_df$status.full_text,
                                            no = new_df$status.text
                                )
                          )

这确实使工作的完成速度提高了 3 倍。我想知道是否有人可以向我解释更好的方法。我想学习。

【问题讨论】：

首先阅读R Inferno 中的第 2 章和第 3 章（回复你的 complete_texts <- c()；for(i in seq_along(1:nrow(give_me_df))。）
您好，感谢您与我分享一本好书。目前我只是从 YouTube 视频中学习。
不会让它明显更快，但你可以写seq_along(1:x)而不是seq_len(x)。
感谢康拉德的小费。肯定会根据你所说的做出改变。我正在阅读@Henrik 给我的那本书。我已经犯了很多罪。
@AOE_player 鉴于您最后一次尝试的巨大飞跃，您将被原谅。干杯。

标签： r dataframe optimization parallel-processing data.table

【解决方案1】：

请务必阅读data.table 的一些介绍材料——尤其是Introduction 和Reference Semantics 小插曲。

接下来，我看到的最明显的事情是缺乏矢量化。在低级语言中，你必须一直循环；在 R 中，你应该总是想——我真的需要一个循环吗？在您的代码中，我看到正在使用的几个矢量化函数的标量版本：

grepl 适用于向量
nchar 适用于向量
is.na 适用于向量

只要有可能，您应该使用向量版本——与只调用一次相比，重复调用 C 函数会有一些延迟：

for (i in 1:nrow(DT)) grepl('^RT', DT[i, "text"]) 保证比 grepl('^RT', DT$text) 慢

接下来，data.table 在重复调用[ 时会有一些额外的开销，因为[ 内部有很多事情要处理更复杂的“查询”，所以你应该尽可能地利用它！

最后，与其更新函数中的data.table，不如让函数返回一个我可以指定为列的向量——最终目标是得到类似的东西：

new_df[ , complete_text := my_function(.SD)]

注意my_function(.SD)在这个简单的情况下和my_function(new_df)是一样的——这里使用.SD是为了在更复杂的场景中习惯这种语法；更多信息请参见the .SD vignette。

这是我称之为get_complete_text的更新版本：

get_complete_text = function(DT) {
  DT[ , fifelse(
    grepl('^RT', text),
    fifelse(
      nchar(status.text) < nchar(status.full_text) & !is.na(status.full_text), 
      status.full_text,
      status.text
    ),
    fifelse(
      nchar(text) < nchar(full_text) & !is.na(full_text),
      full_text,
      text
    )
  )]
}
new_df[ , complete_text := get_complete_text(.SD)][]
#         text     full_text status.text  status.full_text     complete_text
# 1: RT A y...          <NA>  A yes y... A yes yes yes yes A yes yes yes yes
# 2:   RT b...          <NA>   ball ball              <NA>         ball ball
# 3:  XYZ 3... XYZ 378978978        <NA>              <NA>     XYZ 378978978
# 4:  RT Ca...          <NA>  Call ca...    Call call call    Call call call
# 5:        IO          <NA>        <NA>              <NA>                IO

嵌套的fifelse 可以通过一个中间列来简化，该列根据^RT 条件存储要使用的text 列：

idx = new_df[grepl('^RT', text), which=TRUE]
new_df[idx, c('rt_text', 'rt_full_text') := .(status.text, status.full_text)]
new_df[-idx, c('rt_text', 'rt_full_text') := .(text, full_text)]

new_df[ , complete_text := 
  fifelse(
    nchar(rt_text) < nchar(rt_full_text) & !is.na(rt_full_text),
    rt_full_text,
    rt_text
  )
]

另外，data.table 的开发版本有fcase，您可能会发现它更具可读性（在这种情况下，我认为嵌套的fifelse 工作正常）：

get_complete_text = function(DT) {
  DT[ , fcase(
    grepl('^RT', text) & nchar(status.text) < nchar(status.full_text) & !is.na(status.full_text),
    status.full_text,
    grepl('^RT', text) & (nchar(status.full_text) >= nchar(status.text) | is.na(status.full_text)),
    status.text,
    # here, we're implicitly using that logically grepl('^RT') is now FALSE
    nchar(text) < nchar(full_text) & !is.na(full_text),
    full_text,
    # there is some ongoing work to make this less ugly,
    #   but for now this is the catchall term -- we could also
    #   explicitly write out the conjugate condition nchar(full_text) >= ...
    rep(TRUE, .N),
    text
  )]
}

【讨论】：

谢谢你的朋友，为我提供了很好的解释并与我分享了一些很棒的资源:) 非常感谢你的帮助。