【Question title】: How can I make this loop run faster in R?
【Posted】: 2016-04-21 08:20:03
【Question description】:

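I have a dictionary data frame, words.dict, with about 44,000 words. The code below is supposed to replace every word in the dataset dataset.num with its numeric ID from the dictionary.

dataset.num: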

dput(head(dataset.num))
c("rt   breaking  will from here forward be know as", "i hope you like wine and cocktails", "this week we are upgrading our servers  there may be periodic disruptions to the housing application portal  sorry for any inconvenience", "hanging out in  foiachat  anyone have fav  management software on the gov t side  anything from intake to redaction   onwards", "they left out kourtney  instead they let chick from big bang talk", "i  am  encoding  film   like  for the  billionth time already ")

words.dict:

dput(head(words.dict, 20))
structure(list(id = c(10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 3L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), word = structure(1:20, .Label =c("already", "am", "and", "any", "anyone", "anything", "application", "are", "as", "bang", "be", "big", "billionth", "breaking", "chick", "cocktails","disruptions", "encoding", "fav", "film", "foiachat", "for", "forward", "from", "gov", "hanging", "have", "here", "hope", "housing", "i", "in", "inconvenience", "instead", "intake", "know", "kourtney", "left", "let", "like", "management", "may", "on", "onwards", "our", "out", "periodic", "portal", "redaction", "rt", "servers", "side", "software", "sorry", "t", "talk", "the", "there", "they", "this", "time", "to", "upgrading", "we", "week", "will", "wine", "you"), class = "factor")), .Names = c("id", "word"), row.names = c(10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 3L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), class = "data.frame")

The loop:

for (i in 1:nrow(words.dict))
    dataset.num <- gsub(paste0("\\b(", words.dict[i, "word"], ")\\b"), words.dict[i, 1], dataset.num)

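Although I have truncated the data here, dataset.num is a character vector with nearly 40,000 elements (each containing about 20 words on average). The code works fine on small data, but it is too slow on the full data given my limited processing power.

What would you suggest to improve the efficiency and performance of this code?

【Comments】: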

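  • Could you provide a minimal example of your dataset using dput(droplevels(head(dataset.num)))?
  • Have you tried using the apply function? It is essentially a vectorized implementation of a for loop and will be faster.
  • @HanjoJo'burgOdendaal apply is not a "vectorized implementation of a for loop", nor is it "much faster". In fact, it is a wrapper around an R for loop. See the source code of apply. Where did you get this misinformation?
  • A minimal reproducible example would help, but the dplyr package implements some things in C++, which might help you get faster here...?
  • @nicola, thinking that apply is a "vectorized implementation" seems to be a common misconception. Here is a good discussion: stackoverflow.com/questions/28983292/…. You learn something new every day.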

Tags: r performance loops


【Solution 1】:

Here is a different approach that might scale better, although I haven't really tested it.

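sapply(strsplit(dataset.num, "\\s+"), function(y) {
  # position of each token in the dictionary (NA if the word is not listed)
  i <- match(y, words.dict$word)
  # overwrite only the tokens that were found with their numeric ID
  y[!is.na(i)] <- words.dict$id[na.omit(i)]
  paste(y, collapse = " ")
})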
#[1] "rt 22 will from here forward 3 know 18"                                                                           
#[2] "i hope you like wine 12 24"                                                                                       
#[3] "this week we 17 upgrading our servers there may 3 periodic 25 to the housing 16 portal sorry for 13 inconvenience"
#[4] "hanging out in foiachat 14 have 27 management software on the gov t side 15 from intake to redaction onwards"     
#[5] "they left out kourtney instead they let 23 from 20 19 talk"                                                       
#[6] "i 11 26 28 like for the 21 time 10"
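
As a possible further step (a sketch only, not benchmarked on the real data): the approach above calls match() once per line, and the same idea can be written with a single match() call over all tokens. The tiny dataset.num and words.dict below are stand-ins built from values in the question's dput output, and line.id is a helper name introduced here for illustration.

```r
# Small stand-ins for the question's objects (values taken from the dput output)
dataset.num <- c("i hope you like wine and cocktails",
                 "i  am  encoding  film   like  for the  billionth time already ")
words.dict <- data.frame(
  id   = c(10L, 11L, 12L, 21L, 24L, 26L, 28L),
  word = c("already", "am", "and", "billionth", "cocktails", "encoding", "film"),
  stringsAsFactors = TRUE  # the question's word column is a factor
)

# Split every line once and flatten all tokens into one long vector
tokens <- strsplit(dataset.num, "\\s+")
flat   <- unlist(tokens)

# One match() call for all tokens; as.character() makes the factor explicit
idx   <- match(flat, as.character(words.dict$word))
found <- !is.na(idx)
flat[found] <- words.dict$id[idx[found]]

# Reassemble the lines: line.id records which line each token came from.
# Explicit factor levels keep lines that produced no tokens at all.
line.id <- factor(rep(seq_along(tokens), lengths(tokens)),
                  levels = seq_along(tokens))
result  <- unname(vapply(split(flat, line.id), paste, character(1),
                         collapse = " "))
result
```

Whether this is actually faster than the per-line version depends on the data, so it is worth timing both with system.time() on a realistic sample. Note that both variants split on whitespace only, whereas the original loop matched on word boundaries (\\b), so the results can differ if the text contains punctuation.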

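Note that you can use stringi::stri_split to speed up the string splitting.

【Comments】:

  • I am testing this on the data. Which string-splitting mode should I choose (regex, fixed, coll, etc.)?
  • @Nal, you could try stri_split(dataset.num, regex = "\\s+")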
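
A minimal sketch of that suggestion, assuming the stringi package is installed. stri_split_regex is the regex-specific form of stri_split; the fallback to base strsplit is added here only so the snippet still runs without stringi, and the two-element dataset.num is a stand-in for the real vector.

```r
# Stand-in for the real ~40,000-element character vector
dataset.num <- c("i hope you like wine and cocktails",
                 "they left out kourtney  instead they let chick")

if (requireNamespace("stringi", quietly = TRUE)) {
  # Split on runs of whitespace with stringi
  tokens <- stringi::stri_split_regex(dataset.num, "\\s+")
} else {
  # Base R fallback with the same pattern
  tokens <- strsplit(dataset.num, "\\s+")
}

lengths(tokens)  # number of tokens per line
```

The regex mode is used because the data contains runs of several spaces; splitting on a single fixed space would leave empty tokens between them.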