Invalid .internal.selfref in data.table

【Question title】: Invalid .internal.selfref in data.table
【Posted】: 2013-06-12 13:56:10
【Question】:

I need to assign a "second" id that groups some of the values in my original id. Here is my sample data:

dt<-structure(list(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
                   period = c("start", "end", "start", "end", "start", "end"),
                   date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date"))),
              class = c("data.table", "data.frame"),
              .Names = c("id", "period", "date"),
              sorted = "id")
> dt
     id period       date
1: aaaa  start 2012-03-02
2: aaaa    end 2012-03-02
3: aaas  start 2012-08-29
4: aaas    end 2013-02-26
5: bbbb  start 2012-03-31
6: bbbb    end 2013-02-11

The column id needs to be grouped according to this list (ids in the same group get the same value in id2):

> groups
[[1]]
[1] "aaaa" "aaas"

[[2]]
[1] "bbbb"

I used the following code, which seems to work but gives the following warning:

    > dt[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]
    Warning message:
    In `[.data.table`(dt, , `:=`(id2, which(vapply(groups, function(x,  :
      Invalid .internal.selfref detected and fixed by taking a copy of the whole table,
so that := can add this new column by reference. At an earlier point, this data.table has
been copied by R (or been created manually using structure() or similar). Avoid key<-,
names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use
set* syntax instead to avoid copying: setkey(), setnames() and setattr(). Also,
list (DT1,DT2) will copy the entire DT1 and DT2 (R's list() copies named objects),
use reflist() instead if needed (to be implemented). If this message doesn't help,
please report to datatable-help so the root cause can be fixed.
    > dt
         id period       date id2
    1: aaaa  start 2012-03-02   1
    2: aaaa    end 2012-03-02   1
    3: aaas  start 2012-08-29   1
    4: aaas    end 2013-02-26   1
    5: bbbb  start 2012-03-31   2
    6: bbbb    end 2013-02-11   2

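Can someone briefly explain the nature of this warning and what consequences, if any, it has for the final result? Thanks

EDIT:

The following code shows where dt is actually created and how it is passed to the function that gives the warning: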

f.main <- function(){
  f2 <- function(x){
    groups <- list(c("aaaa", "aaas"), "bbbb") # actually generated depending on the similarity between values of x$id
    x <- x[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]
    return(x)
  }
  x <- f1()
  if(!is.null(x[["res"]])){
    x <- f2(x[["res"]])
    return(x)
  } else {
    # something else
  }
}

f1 <- function(){
  dt<-data.table(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
                 period = c("start", "end", "start", "end", "start", "end"),
                 date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date")))
  return(list(res=dt, other_results=""))
}

> f.main()
     id period       date id2
1: aaaa  start 2012-03-02   1
2: aaaa    end 2012-03-02   1
3: aaas  start 2012-08-29   1
4: aaas    end 2013-02-26   1
5: bbbb  start 2012-03-31   2
6: bbbb    end 2013-02-11   2
Warning message:
In `[.data.table`(x, , `:=`(id2, which(vapply(groups, function(x,  :
  Invalid .internal.selfref detected and fixed by taking a copy of the whole table,
so that := can add this new column by reference. At an earlier point, this data.table
has been copied by R (or been created manually using structure() or similar).
Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole
data.table. Use set* syntax instead to avoid copying: setkey(), setnames() and setattr().
Also, list(DT1,DT2) will copy the entire DT1 and DT2 (R's list() copies named objects),
use reflist() instead if needed (to be implemented). If this message doesn't help,
please report to datatable-help so the root cause can be fixed.

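【Question comments】:

The warning says: "created manually using structure() or similar". Use the function data.table to create your data.table. That said, it is only a warning and you shouldn't run into serious problems (apart from slower performance). Also, you can replace .BY[[1]] with id.
@Roland Thanks for your reply, but in the real case the table is not created by structure. That is just the (modified) output of print(dput(x)), which I use to see what is happening to the tables in my program. I double-checked: dt is generated by data.table inside a function, return()ed to the main function, and the main function passes it as an argument to another function, which is where the warning occurs.
Well, make your code representative of your real problem. Show us how you pass the DT between functions.
@Roland Done. Please see the edit.
fwiw, a shorter expression to achieve the above is dt[melt(groups)], using reshape2::melt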
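
The join idea from the last comment can be sketched without reshape2. This is only an illustration, not the commenter's exact code: the lookup table and its column names are assumptions made for the example, and the `on=` argument requires a reasonably recent data.table.

```r
library(data.table)

groups <- list(c("aaaa", "aaas"), "bbbb")
dt <- data.table(id     = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
                 period = rep(c("start", "end"), 3))

# Hypothetical lookup table: one row per id, carrying the index of its group.
lookup <- data.table(id  = unlist(groups),
                     id2 = rep(seq_along(groups), lengths(groups)))

# Update join: add id2 to dt by reference, matching rows on id.
dt[lookup, id2 := i.id2, on = "id"]
```

Because the group index is computed once per id in the lookup table, no per-group vapply over the list is needed.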

Tags: r data.table

【Solution 1】:

Yes, the problem is the list. Here is a simple example:

DT <- data.table(1:5)
mylist1 <- list(DT,"a")
mylist1[[1]][,id:=.I]
#warning

mylist2 <- list(data.table(1:5),"a")
mylist2[[1]][,id:=.I]
#no warning

You should avoid copying a data.table into a list (to be on the safe side, I would avoid having a DT in a list at all). Try this instead:

f1 <- function(){
  mylist <- list(res=data.table(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
                 period = c("start", "end", "start", "end", "start", "end"),
                 date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date"))))
  other_results <- ""
  mylist$other_results <- other_results
  mylist
}

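【Comments】:

Thanks! The warning can also be removed by doing: dt <- copy(mylist1[[1]])
Sure, but with data.tables, which are typically large, the goal is to avoid copies. That is one of the main benefits of the package.
I know, but the dts are no more than 10 rows. My program checks many different data streams (each one in a dt) by subsetting those data.tables and calling different functions (for example, the @987654328@ above is the result of a function that checks for typos) to perform actions, raise alerts and create jobs for each customer, depending on the stream type. I use dt mainly because at some point these small tables get joined with a huge one (50M+ rows), and I love joins with data.table! Especially the roll option, I use it all the time!
Do you mean memory? In each run of main there may be 15 to 20 such copies. But for the next customer there is a new function call (main is called by adply for each row of the cust.detail table), so I think that (apart from the value main returns) everything is discarded and no copies are left lying around. Am I right?
No, I mean speed. Making a copy takes time. Whether that matters depends on your use case.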
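
The copy() workaround mentioned in these comments can be sketched as follows. The names mylist and dt2 are illustrative. Note that in recent R versions list() may no longer copy its arguments, so the original warning may not reproduce; the sketch only shows that copy() returns an independent table that := can extend safely.

```r
library(data.table)

DT <- data.table(a = 1:5)
mylist <- list(DT, "a")

# copy() makes a deep copy with its own valid self-reference,
# at the cost of duplicating the data.
dt2 <- copy(mylist[[1]])
dt2[, id := .I]   # adds the column to dt2 only

names(DT)    # the original table is unchanged
names(dt2)   # the copy has the new column
```

For tables of a few rows the cost of the copy is negligible; for large tables it is exactly the overhead the answer recommends avoiding.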

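【Solution 2】:

You can "shallow copy" while creating the list, so that 1) you don't make a full memory copy (speed is not affected) and 2) you don't get the internal selfref warning (thanks to @mnel for this trick).

Create the data: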

set.seed(45)
ss <- function() {
    tt <- sample(1:10, 1e6, replace=TRUE)
}
tt <- replicate(100, ss(), simplify=FALSE)
tt <- as.data.table(tt)

How you should create the list (shallow copy):

system.time( {
    ll <- list(d1 = { # shallow copy here...
        data.table:::settruelength(tt, 0)
        invisible(alloc.col(tt))
    }, "a")
})
   user  system elapsed
      0       0       0
> system.time(tt[, bla := 2])
   user  system elapsed
  0.012   0.000   0.013
> system.time(ll[[1]][, bla :=2 ])
   user  system elapsed
  0.008   0.000   0.010

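So you don't compromise on speed, and you don't get the warning that comes with taking a full copy. Hope this helps.

【Comments】:

Even though I already had my answer (the cause of the warning), this gives the best and most general way (since I may need to create the dt and *then* put it in a list) of using data.table objects inside and outside of lists. I wish I could upvote more than once :-)

【Solution 3】:

"Invalid .internal.selfref detected and fixed by taking a copy..."

There is no need to take a copy when assigning id2 in f2(); you can add the column directly by changing: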

# From:

      x <- x[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]

# To something along the lines of:
      x$id2 <- findInterval( match( x$id, unlist(groups)), cumsum(c(0,sapply(groups, length)))+1)

You can then keep using your "x" data.table as usual, without the warning being raised.
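
To see why the findInterval/match expression yields the group index, it can help to evaluate it step by step on the example data. A minimal sketch; the intermediate names pos and starts, and the standalone id vector, are only for illustration:

```r
groups <- list(c("aaaa", "aaas"), "bbbb")
id <- c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb")

# Position of each id within the flattened group list.
pos <- match(id, unlist(groups))                      # 1 1 2 2 3 3

# Position at which each group starts in the flattened list,
# plus one sentinel past the end.
starts <- cumsum(c(0, sapply(groups, length))) + 1    # 1 3 4

# The interval a position falls into is the index of its group.
id2 <- findInterval(pos, starts)                      # 1 1 1 1 2 2
```

Everything here is vectorised over the whole id column, which is a plausible reason for the timing difference reported in this answer compared with calling vapply once per group.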

Alternatively, to simply suppress the warning, you can wrap the f2(x[["res"]]) call in suppressWarnings().

There can be a sizeable performance difference even on small tables:

Performance Comparison:
Unit: milliseconds
                       expr      min       lq   median       uq      max neval
                   f.main() 2.896716 2.982045 3.034334 3.137628 7.542367   100
 suppressWarnings(f.main()) 3.005142 3.081811 3.133137 3.210126 5.363575   100
            f.main.direct() 1.279303 1.384521 1.413713 1.486853 5.684363   100

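【Comments】:

Thanks for offering this option. I will check the performance of your approach.
Very interesting, so findInterval+match is 2x faster than vapply+==. This helps, a lot. Thanks!
Credit where credit is due: please upvote the answer where I first learned this: stackoverflow.com/a/11002456/173985
Already upvoted your answer :-) I accepted @Roland's answer because it answers my actual question: why I was getting the warning.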