【Question title】: How do I identify what is causing thrashing in my R function?
【Posted】: 2026-02-19 22:15:01
【Question description】:

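I wrote a function that anonymizes names in a data frame, given a key. Once it has anonymized a lot of names it slows to a crawl, and I can't figure out why.

The data frame in question is a set of 4733 tweets collected through the Twitter API, where each row is one tweet with 32 columns of data. The names should be anonymized wherever they appear in the data frame, so I don't want to restrict the function to only a few of those 32 columns.

The key is a data frame of 211121 pairs of real and fake names; both the real names and the fake names are unique within it. After anonymizing roughly 100k names, the function slows down dramatically.

The function looks like this: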

pseudonymize <- function(df, key) {
  for(name in key$realNames) {
    df <- as.data.frame(apply(df, 2, function(column) gsub(name, key[key$realNames == name, 2], column)))
  }
}

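Is there anything obvious here that would cause the slowdown? I have no experience at all with optimizing code for speed.

Edit 1:

Here are a few rows from the data frame to be anonymized.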

"https://twitter.com/__jgil/statuses/825559753447313408","__jgil",0.000576911235261567,756,4,13,17,7,16,23,10,0.28166915052161,0.390123456790124,0.00271311644806025,0.474529795261862,0.00641025649383664,"@jadahung20 GIRL I am tooooooo salty tonight lolll","lolll","adjoint","anglais","indefini","anglais","anglais","non","iPhone, Twitter",4057,214,241,"Canada","Nouvelle-Ecosse","Middleton","indefini","Shari"
"https://twitter.com/__paigewhite/statuses/827988259573788673","__paigewhite",0,1917,0,8,8,0,9,9,16,0.143476044852192,0.162056634159209,0.000172947386274259,0,0,"@abbytutty_ i miss emily lololol _Ù÷â_Ù÷É","lololol","adjoint","anglais","indefini","anglais","anglais","non","iPhone, Twitter",8366,392,661,"Canada","Nouvelle-Ecosse","indefini","indefini","Shari"
"https://twitter.com/_*ehynes/statuses/821022926287884288","_*ehynes",0,1917,1,6,7,1,7,8,1,1,1,0.000196850793912616,0.00393656926735126,0.200000002980232,"@tdesj3 @belle lol yea doubt it.","lol","adjoint","indefini","anglais","anglais","anglais","non","iPhone, Twitter",1184,87,70,"Canada","Nouvelle-Ecosse","Halifax","indefini","Shari"

And here are a few rows of the key.

"","realNames","fakeNames"
"1","________","Tajid_Pinkley"
"2","____________aho","Monica_Yujiri"
"3","___________ass","Alexander_Garay-Grajeda"

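Edit 2:

I have reduced the DF to just the two columns that need anonymizing, which made things faster, but it still quits after processing about 155k names.

As requested in the comments, here is the dput() output of the first three rows of the DF to be anonymized.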

structure(list(
  utilisateur = c("___Yeliab", "__courtlezz", "__courtlezz"),
  texte = c("@EmilyIsPro ik lol", "@NikkiErica21 there was a sighting in sunset ridge too. Keep Winnie and bob safe lol", "@NikkiErica21 lol yes _Ã\231։")
  ),
  row.names = c(NA, 3L),
  class = "data.frame")

And here is the dput() output of the first three rows of the key.

structure(list(
  realNames = c("________", "____________aho", "___________ass"),
  fakeNames = c("Abhinav_Chang", "Caleb_Dunn-Sparks", "Taryn_Hunzicker")
  ),
  row.names = c(NA, 3L),
  class = "data.frame")

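【Comments】:

  • Please share a small, reproducible (copy/paste-able!) sample input.
  • It's hard to tell without seeing your data structure, but you are doing a lot of conversion inside the loop. apply converts the data frame to a matrix; you probably shouldn't be using it at all. as.data.frame then converts it back to a data frame. Do you really need to convert the object to a matrix and back again on every iteration? If you can move those operations outside the loop, converting everything once, it will be much faster. Once we see the input data, it may turn out you don't need to convert at all.
  • Also, if you are not using regex special characters, the fixed = TRUE argument will make gsub() faster. And there may be a vectorized option, so that you don't need a loop at all...
  • Could you share the data with dput(), so that all the class and structure information is included? dput(df[1:3, ]) and dput(key[1:3, ]) would be great.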
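
The suggestions in the comments above can be sketched roughly as follows. This is only an illustration, not the asker's or any commenter's actual code: the function name pseudonymize_cols and the demo objects df_demo and key_demo are made up for this example. It assumes the names contain no characters that should be treated as regex syntax, so fixed = TRUE is safe, and it skips the matrix round trip by modifying only the character columns in place.

```r
# Hypothetical helper: loop over the key once, touching only character columns.
pseudonymize_cols <- function(df, key) {
  # Identify text columns once, outside the loop.
  is_text <- vapply(df, is.character, logical(1))
  for (i in seq_len(nrow(key))) {
    # gsub() is vectorized over each column; fixed = TRUE avoids regex matching.
    df[is_text] <- lapply(
      df[is_text], gsub,
      pattern = key$realNames[i],
      replacement = key$fakeNames[i],
      fixed = TRUE
    )
  }
  df
}

# Made-up demo data modeled on the dput() output in the question.
df_demo <- data.frame(
  utilisateur = c("___Yeliab", "__courtlezz"),
  texte = c("@EmilyIsPro ik lol", "lol yes"),
  n = c(1L, 2L),
  stringsAsFactors = FALSE
)
key_demo <- data.frame(
  realNames = c("___Yeliab", "__courtlezz", "EmilyIsPro"),
  fakeNames = c("A_A", "B_B", "D_D"),
  stringsAsFactors = FALSE
)

out <- pseudonymize_cols(df_demo, key_demo)
print(out)
```

The loop still runs once per key entry, so with about 211k names the cost remains proportional to the key size times the amount of text; the sketch only removes the repeated conversions and the regex overhead.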

Tags: r performance optimization twitter anonymize


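【Solution 1】:

It is more efficient to work with the data as a vector rather than as a data.frame. I ran into some encoding problems, so I used iconv to convert the text to UTF-8; if the names contain non-ASCII characters, some additional handling will be needed.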

key1 <- data.frame(
    realNames = c("________", "____________aho", "___________ass", 
        "___Yeliab", "__courtlezz", "NikkiErica21", "EmilyIsPro", "aho"),
    fakeNames = c("Abhinav_Chang", "Caleb_Dunn-Sparks", "Taryn_Hunzicker", 
        "A_A", "B_B", "C_C", "D_D", "E_E"),
    stringsAsFactors = FALSE
)

pseudonymize1 <- function(df, key) {
    mat <- as.matrix(df)
    dims <- attr(mat, which = "dim")
    cnam <- colnames(df)
    vec <- iconv(unclass(mat), from = "latin1", to = "UTF-8")
    for (name in split(key, f = seq_len(nrow(key)))) {
        vec <- gsub(
            vec, 
            pattern = name$realNames, 
            replacement = name$fakeNames, 
            fixed = TRUE)
    }
    mat <- vec
    attr(mat, which = "dim") <- dims
    df <- as.data.frame(mat, stringsAsFactors = FALSE)
    colnames(df) <- cnam
    df
}
# df1 is the three-row data frame from the dput() output in the question
pseudonymize1(df1, key1)
# utilisateur                                                                       texte
# 1         A_A                                                                 @D_D ik lol
# 2         B_B @C_C there was a sighting in sunset ridge too. Keep Winnie and bob safe lol
# 3         B_B                               @C_C lol yes _Ã\u0083\u0099Ã\u0083·Ã\u0083¢

library(microbenchmark)    
microbenchmark(
    pseudonymize(df1, key1),
    pseudonymize1(df1, key1)
)
# Unit: microseconds
#                     expr      min        lq     mean   median        uq      max neval cld
#  pseudonymize(df1, key1) 1842.554 1885.6750 2131.089 1994.755 2294.6850 3007.371   100   b
# pseudonymize1(df1, key1)  287.683  306.1905  333.678  314.950  339.8705  497.301   100  a 

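My concern with 155k names is that, whether you search with regular expressions or with fixed strings, you will find names that are contained inside other names. That could be a real name inside another real name (for example, Emily inside EmilyIsPro), or a real name inside a fake name that was substituted earlier. You will need to test for this, and you should consider using random hashes instead of name-like fake names.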
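
To make the nesting problem concrete, here is a small sketch. The helper replace_all and the demo key are invented for this illustration. It shows a short name clobbering a longer one, and one partial mitigation: replacing the longest real names first. That ordering does not protect against real names that happen to occur inside previously inserted fake names, so it is not a complete fix.

```r
# Hypothetical helper: apply every key entry in order, as fixed strings.
replace_all <- function(x, key) {
  for (i in seq_len(nrow(key))) {
    x <- gsub(key$realNames[i], key$fakeNames[i], x, fixed = TRUE)
  }
  x
}

text <- "@EmilyIsPro thanks, Emily"
key_demo <- data.frame(
  realNames = c("Emily", "EmilyIsPro"),
  fakeNames = c("P_Q", "D_D"),
  stringsAsFactors = FALSE
)

# Key order as given: "Emily" is replaced first and breaks "EmilyIsPro".
naive <- replace_all(text, key_demo)
print(naive)   # "@P_QIsPro thanks, P_Q"

# Longest real names first: "EmilyIsPro" is handled before "Emily".
key_sorted <- key_demo[order(nchar(key_demo$realNames), decreasing = TRUE), ]
sorted <- replace_all(text, key_sorted)
print(sorted)  # "@D_D thanks, P_Q"
```

Checking the key up front for real names that are substrings of other real names or of fake names would show how often this actually happens in the 211k-row key.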
