[Posted]: 2016-04-21 08:20:03
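[Question]:
I have a dictionary data frame, words.dict, with about 44,000 words. The code below is meant to replace every word in the data set dataset.num with its numeric ID from the dictionary.
dataset.num: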
dput(head(dataset.num))
c("rt breaking will from here forward be know as", "i hope you like wine and cocktails", "this week we are upgrading our servers there may be periodic disruptions to the housing application portal sorry for any inconvenience", "hanging out in foiachat anyone have fav management software on the gov t side anything from intake to redaction onwards", "they left out kourtney instead they let chick from big bang talk", "i am encoding film like for the billionth time already ")
words.dict:
dput(head(words.dict, 20))
structure(list(id = c(10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 3L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), word = structure(1:20, .Label =c("already", "am", "and", "any", "anyone", "anything", "application", "are", "as", "bang", "be", "big", "billionth", "breaking", "chick", "cocktails","disruptions", "encoding", "fav", "film", "foiachat", "for", "forward", "from", "gov", "hanging", "have", "here", "hope", "housing", "i", "in", "inconvenience", "instead", "intake", "know", "kourtney", "left", "let", "like", "management", "may", "on", "onwards", "our", "out", "periodic", "portal", "redaction", "rt", "servers", "side", "software", "sorry", "t", "talk", "the", "there", "they", "this", "time", "to", "upgrading", "we", "week", "will", "wine", "you"), class = "factor")), .Names = c("id", "word"), row.names = c(10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 3L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), class = "data.frame")
The loop:
for (i in 1:nrow(words.dict)) {
  dataset.num <- gsub(paste0("\\b(", words.dict[i, "word"], ")\\b"), words.dict[i, 1], dataset.num)
}
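Although I have truncated the data here, dataset.num is a character vector of nearly 40,000 lines (each line averages about 20 words). The code works fine on small data, but on the full data set it is too slow.
What would you suggest to improve the code's efficiency and performance?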
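For reference, one way to avoid running gsub once per dictionary entry is to split each line into words and look them all up in a named vector. The sketch below is not from the original post. It assumes words are separated by single spaces, and it uses a three-word toy excerpt of words.dict. The helper name replace_words is hypothetical. Note that, unlike the gsub loop, a trailing space on a line is dropped.

```r
# Toy excerpt of the question's objects (assumption: only three dictionary rows).
words.dict <- data.frame(
  id   = c(10L, 11L, 12L),
  word = c("already", "am", "and")
)
dataset.num <- c(
  "i hope you like wine and cocktails",
  "i am encoding film like for the billionth time already "
)

# Named lookup vector: names are the words, values are the IDs as strings.
# as.character() makes this work whether `word` is a factor or a character column.
lookup <- setNames(as.character(words.dict$id), as.character(words.dict$word))

# Hypothetical helper: replace every token found in `lookup` by its ID and
# leave tokens that are not in the dictionary unchanged.
replace_words <- function(lines, lookup) {
  tokens <- strsplit(lines, " ", fixed = TRUE)
  vapply(tokens, function(tok) {
    ids <- lookup[tok]            # NA for words missing from the dictionary
    found <- !is.na(ids)
    tok[found] <- ids[found]
    paste(tok, collapse = " ")
  }, character(1))
}

result <- replace_words(dataset.num, lookup)
print(result)
```

With this approach each line is tokenised once, so the work grows with the number of words in the data rather than with the number of words times the number of dictionary rows.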
[Comments]:
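- Could you provide a minimal example of the data set using dput(droplevels(head(dataset.num)))?
- Have you tried using the apply function? It is essentially a vectorized implementation of a for loop and will be much faster.
- @HanjoJo'burgOdendaal apply is not "a vectorized implementation of a for loop", nor is it "much faster". In fact, it is a wrapper around an R for loop; see the source code of apply. Where did you get this misinformation?
- A minimal reproducible example would help, but the dplyr package is implemented partly in C++ and might help you speed things up here...?
- @nicola, thinking of apply as a "vectorized implementation" seems to be a common misconception. Here is a good discussion: stackoverflow.com/questions/28983292/…. You learn something new every day.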
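To illustrate the point about apply, here is a small sketch that is not from the thread. It computes row sums three ways: with an explicit for loop, with apply, and with the vectorized rowSums. The matrix size and the function names are arbitrary choices for the illustration. Timings depend on the machine, so none are shown.

```r
set.seed(1)
m <- matrix(runif(2e5), ncol = 20)

# Explicit R-level loop over rows.
row_sums_loop <- function(mat) {
  out <- numeric(nrow(mat))
  for (i in seq_len(nrow(mat))) out[i] <- sum(mat[i, ])
  out
}

# apply() also calls sum() once per row from R code.
row_sums_apply <- function(mat) apply(mat, 1, sum)

# rowSums() does the whole loop in compiled code.
row_sums_vec <- function(mat) rowSums(mat)

# All three give the same answer (up to floating-point tolerance).
same_loop  <- isTRUE(all.equal(row_sums_loop(m),  row_sums_vec(m)))
same_apply <- isTRUE(all.equal(row_sums_apply(m), row_sums_vec(m)))

# Compare the elapsed times on your own machine.
print(system.time(row_sums_loop(m)))
print(system.time(row_sums_apply(m)))
print(system.time(row_sums_vec(m)))
```

One would typically expect the first two timings to be of the same order and the third to be much smaller, which matches the comment that apply is a convenience wrapper and not a speed-up.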
Tags: r performance loops