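【Posted】: 2020-10-22 10:06:37
【Question】:
I am trying to create a column holding the full text by iterating over the elements in several columns of a data.table. This is my current approach. It works as I expect, but it wastes a lot of time once the data.table gets large.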
library(data.table)
new_df <- data.table(text= c("RT A y...", "RT b...", "XYZ 3...", "RT Ca...", "IO"),
full_text= c(NA, NA, "XYZ 378978978", NA, NA),
status.text= c("A yes y...", "ball ball", NA, "Call ca...", NA),
status.full_text= c("A yes yes yes yes", NA, NA, "Call call call", NA))
# text full_text status.text status.full_text
# 1: RT A y... <NA> A yes y... A yes yes yes yes
# 2: RT b... <NA> ball ball <NA>
# 3: XYZ 3... XYZ 378978978 <NA> <NA>
# 4: RT Ca... <NA> Call ca... Call call call
# 5: IO <NA> <NA> <NA>
#
attach_texts_in_df <- function(give_me_df){
  # make an empty vector to store the texts
  complete_texts <- c()
  # loop over the rows
  for(i in seq_along(1:nrow(give_me_df))){
    # text does not begin with RT
    if(!grepl('^RT', give_me_df[i, "text"])){
      # use full_text if it is not NA and is longer than text
      if((nchar(give_me_df[i, "text"]) < nchar(give_me_df[i, "full_text"]))& !is.na(give_me_df[i, "full_text"])){
        complete_texts <- c(complete_texts, give_me_df[i, "full_text"])
      }else{
        complete_texts <- c(complete_texts, give_me_df[i, "text"]) # if not, keep the original text
      }
    }
    else{
      # text begins with RT: apply the same rule to the status.* columns
      if((nchar(give_me_df[i, "status.text"]) < nchar(give_me_df[i, "status.full_text"]))& !is.na(give_me_df[i, "status.full_text"])){
        complete_texts <- c(complete_texts, give_me_df[i, "status.full_text"])
      }else{
        complete_texts <- c(complete_texts, give_me_df[i, "status.text"])
      }
    }
  }
  # attach the selected texts as a new column
  give_me_df$complete_text <- complete_texts
  # return the data.table with the new column
  return(give_me_df)
}
new_df <- attach_texts_in_df(new_df)
# this is what I was looking for, and I get it when the table is small, but a big one takes a long time!!
# text full_text status.text status.full_text complete_text
# 1: RT A y... <NA> A yes y... A yes yes yes yes A yes yes yes yes
# 2: RT b... <NA> ball ball <NA> ball ball
# 3: XYZ 3... XYZ 378978978 <NA> <NA> XYZ 378978978
# 4: RT Ca... <NA> Call ca... Call call call Call call call
# 5: IO <NA> <NA> <NA> IO
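Part of the cost of the loop above probably comes from two things: complete_texts is regrown with c() on every iteration, and each give_me_df[i, "col"] call runs the table's subsetting machinery for a single cell. The sketch below is one way to keep the loop while avoiding both. The name attach_texts_prealloc is hypothetical, and a plain data.frame is used only so the example runs without extra packages (extracting a column with [[ works the same way on a data.table).

```r
# Sample data from the question, as a base data.frame (assumption: the
# same column names and character columns as the original data.table).
new_df <- data.frame(
  text             = c("RT A y...", "RT b...", "XYZ 3...", "RT Ca...", "IO"),
  full_text        = c(NA, NA, "XYZ 378978978", NA, NA),
  status.text      = c("A yes y...", "ball ball", NA, "Call ca...", NA),
  status.full_text = c("A yes yes yes yes", NA, NA, "Call call call", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: same rule as the loop in the question, but the
# result vector is allocated once and columns are pulled out up front.
attach_texts_prealloc <- function(df) {
  n   <- nrow(df)
  out <- character(n)                 # preallocate instead of growing with c()
  t   <- df[["text"]]
  ft  <- df[["full_text"]]
  st  <- df[["status.text"]]
  sft <- df[["status.full_text"]]
  for (i in seq_len(n)) {             # seq_len(n) is empty when n == 0
    if (!grepl("^RT", t[i])) {
      # is.na() is tested first so && never sees an NA comparison
      out[i] <- if (!is.na(ft[i]) && nchar(t[i]) < nchar(ft[i])) ft[i] else t[i]
    } else {
      out[i] <- if (!is.na(sft[i]) && nchar(st[i]) < nchar(sft[i])) sft[i] else st[i]
    }
  }
  df$complete_text <- out
  df
}

result <- attach_texts_prealloc(new_df)
```

This is still a row-by-row loop, so a vectorized expression should remain faster on large tables; the point here is only to show what preallocation looks like.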
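I was wondering if anyone could help me optimize this. I am new to R. I know the apply functions exist, but I don't know how to use them with a custom function like this one.
Any help and tips are appreciated. Thank you.
Edit: I tried the following using data.table functions, but I am losing some data: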
sample_fxn <- function(t,ft,st,sft){
  if(!grepl('^RT', t)){
    if((nchar(t) < nchar(ft)) & !is.na(ft)){
      return(ft)
    }else{
      return(t)
    }
  }
  else{
    if((nchar(st) < nchar(sft))& !is.na(sft)){
      return(sft)
    }else{
      return(st)
    }
  }
}
new_df <- new_df[ ,complete_texts := sample_fxn(text,
                                                full_text,
                                                status.text,
                                                status.full_text)]
# text full_text status.text status.full_text complete_texts
# 1: RT A y... <NA> A yes y... A yes yes yes yes A yes yes yes yes
# 2: RT b... <NA> ball ball <NA> <NA>
# 3: XYZ 3... XYZ 378978978 <NA> <NA> <NA>
# 4: RT Ca... <NA> Call ca... Call call call Call call call
# 5: IO <NA> <NA> <NA> <NA>
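A likely reason for the missing values above: inside j, sample_fxn receives whole columns, but if() only accepts a single TRUE or FALSE. In R versions before 4.2.0 only the first element of the condition was used (with a warning), which would be consistent with every row getting its status.full_text value; from R 4.2.0 on, a condition of length greater than one is an error. One way to keep a scalar function is to call it once per row with mapply. The sketch below assumes the sample data from the question; pick_text is a hypothetical name for a scalar version of the same rule.

```r
library(data.table)

new_df <- data.table(
  text             = c("RT A y...", "RT b...", "XYZ 3...", "RT Ca...", "IO"),
  full_text        = c(NA, NA, "XYZ 378978978", NA, NA),
  status.text      = c("A yes y...", "ball ball", NA, "Call ca...", NA),
  status.full_text = c("A yes yes yes yes", NA, NA, "Call call call", NA)
)

# Hypothetical scalar helper: each argument is a single string (or NA).
# is.na() is checked first and && short-circuits, so the nchar()
# comparison is never evaluated on an NA value.
pick_text <- function(t, ft, st, sft) {
  if (!grepl("^RT", t)) {
    if (!is.na(ft) && nchar(t) < nchar(ft)) ft else t
  } else {
    if (!is.na(sft) && nchar(st) < nchar(sft)) sft else st
  }
}

# mapply() walks the four columns in parallel and calls pick_text once
# per row; USE.NAMES = FALSE keeps the result an unnamed character vector.
new_df[, complete_texts := mapply(pick_text, text, full_text,
                                  status.text, status.full_text,
                                  USE.NAMES = FALSE)]
```

mapply still makes one function call per row, so it mainly fixes correctness; it is not expected to beat a fully vectorized expression on large tables.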
This is my best attempt at a vectorized version after reading The R Inferno, the book @Henrik shared. I came up with:
new_df$complete_texts <- ifelse(!grepl('^RT', new_df$text),
yes = ifelse((nchar(new_df$text) < nchar(new_df$full_text))& !is.na(new_df$full_text),
yes = new_df$full_text,
no = new_df$text
),
no = ifelse((nchar(new_df$status.text) < nchar(new_df$status.full_text))& !is.na(new_df$status.full_text),
yes = new_df$status.full_text,
no = new_df$status.text
)
)
This did make the job finish 3 times faster. I wonder if someone could explain a better way. I want to learn.
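One possible refinement of the nested ifelse approach, sketched under the assumption that data.table 1.12.4 or later is installed (the release that added fifelse): use fifelse inside j and assign with :=, so columns are referenced by name and the new column is added by reference instead of through new_df$... <-. Whether this is faster on a given table is something to measure rather than assume.

```r
library(data.table)

new_df <- data.table(
  text             = c("RT A y...", "RT b...", "XYZ 3...", "RT Ca...", "IO"),
  full_text        = c(NA, NA, "XYZ 378978978", NA, NA),
  status.text      = c("A yes y...", "ball ball", NA, "Call ca...", NA),
  status.full_text = c("A yes yes yes yes", NA, NA, "Call call call", NA)
)

# Same rule as the nested ifelse() version. Putting !is.na(...) first
# makes the test FALSE (not NA) whenever the full text is missing,
# because FALSE & NA evaluates to FALSE.
new_df[, complete_texts := fifelse(
  !grepl("^RT", text),
  fifelse(!is.na(full_text) & nchar(text) < nchar(full_text),
          full_text, text),
  fifelse(!is.na(status.full_text) &
            nchar(status.text) < nchar(status.full_text),
          status.full_text, status.text)
)]
```

fifelse requires yes and no to have the same type, which holds here because all four columns are character.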
【Comments】:
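-
Start by reading Chapters 2 and 3 of The R Inferno (re your
complete_texts <- c();for(i in seq_along(1:nrow(give_me_df)).)
-
Hi, thanks for sharing a good book with me. At the moment I am learning only from YouTube videos.
-
It won't make it noticeably faster, but you can write
seq_len(x) instead of seq_along(1:x).
-
Thanks for the tip, Konrad. I will definitely make the change you suggested. I am reading the book @Henrik gave me. I have already committed many sins.
-
@AOE_player Given the giant leap in your last attempt, you will be forgiven. Cheers.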
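As a small illustration of the seq_len suggestion in the comments: the two forms give the same index for a positive length, but they differ when the length is zero, because 1:0 counts down instead of being empty. A minimal sketch with a hypothetical row count n:

```r
n <- 0L                       # hypothetical: a table with no rows

idx_along <- seq_along(1:n)   # 1:0 is c(1L, 0L), so this is 1 2
idx_len   <- seq_len(n)       # integer(0)

# A for loop over idx_along would run twice on an empty table;
# a loop over idx_len would not run at all.
length(idx_along)             # 2
length(idx_len)               # 0
```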
Tags: r dataframe optimization parallel-processing data.table