R data.table: accessing column with variable name

【Question title】: R data.table: accessing column with variable name
【Posted】: 2017-10-24 10:14:12
【Question】:

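I am using the wonderful R data.table package. However, accessing (i.e. manipulating by reference) a column through a variable name is very clumsy. Suppose we are given a data.table dt with two columns x and y, and we want to add the two columns and name the result z. Then the command is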

dt = dt[, z := x + y]

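Now let us write a function add that takes a (reference to a) data.table dt and three column names summand1Name, summand2Name and resultName as parameters, and that should execute exactly the same command as above, only with general column names. The solution I am using right now is reflection, i.e.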

add = function(dt, summand1Name, summand2Name, resultName) {
  cmd = paste0('dt = dt[, ', resultName, ' := ', summand1Name, ' + ', summand2Name, ']')
  eval(parse(text=cmd))
  return(dt) # optional since manipulated  by reference
}

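However, I am absolutely not satisfied with this solution. First, it is clumsy and no fun to write. It is hard to debug, and it just annoys me and wastes my time. Second, it is harder to read and understand. So here is my question:

Can we write this function in a nicer way?

I am aware that one can access a column with a variable name like this: dt[[resultName]]. But when I write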

dt[[resultName]] = dt[[summand1Name]] + dt[[summand2Name]]

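then data.table starts complaining that a copy has been made and that it is no longer working by reference. I do not want that. I also like the syntax dt = dt[<all 'database related operations'>], so that everything I do is kept inside one pair of brackets. Is there not some special symbol, such as backticks, that indicates that the name being used is not an actual column of the data.table but a placeholder for the actual column name?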
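For context on the copy complaint above, here is a minimal sketch of assigning by reference with string column names through data.table's set() function. It does not use the bracket syntax the question prefers, and the helper name add_set is hypothetical, chosen only for this illustration.

```r
library(data.table)

# Hypothetical helper name; set() is data.table's by-reference assignment function.
add_set <- function(dt, summand1Name, summand2Name, resultName) {
  # j takes the target column name as a string; the column is created if missing.
  set(dt, j = resultName, value = dt[[summand1Name]] + dt[[summand2Name]])
  invisible(dt)
}

dt <- data.table(x = 1:3, y = 4:6)
add_set(dt, "x", "y", "z")
dt$z
```

Because set() modifies dt in place, the caller's table gains the new column without reassigning the result.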

【Comments】:

You should take a look at get and mget.
See also this
How about add = function(dt, summand1Name, summand2Name, resultName) dt[, (resultName) := .SD[[summand1Name]] + .SD[[summand2Name]]]? Another option could be add2 = function(dt, summand1Name, summand2Name, resultName) dt[, (resultName) := eval(as.name(summand1Name)) + eval(as.name(summand2Name))], or just use get as suggested above.

Tags: r data.table

【Solution 1】:

Using get():

library(data.table)

add <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := get(summand1Name) + get(summand2Name)]
}

Using mget():

add2 <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := do.call(`+`, mget(c(summand1Name, summand2Name)))]
}

# Let
dt <- data.table(x = 1, y = 2)
# Then
add(dt, 'x', 'y', 'z')
dt[]
#    x y z
# 1: 1 2 3

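【Discussion】:

+ can only take 1 or 2 arguments, so the mget version would need a slight tweak to extend to more columns.
@AccidentalStatistician Oh, thanks. One could use Reduce() instead of do.call(), but I guess rowSums() would be more efficient.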
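The Reduce() tweak mentioned in the comments could look roughly like the sketch below. The function name add_many and the single character-vector argument summandNames are assumptions made for this illustration, not part of the original answer.

```r
library(data.table)

# Hypothetical helper: sums any number of columns given as a character vector.
add_many <- function(dt, summandNames, resultName) {
  # mget() returns the named columns as a list; Reduce() folds `+` over that list.
  dt[, (resultName) := Reduce(`+`, mget(summandNames))]
}

dt <- data.table(a = 1:3, b = 4:6, c = 7:9)
add_many(dt, c("a", "b", "c"), "total")
dt$total
```

Reduce() applies the binary `+` pairwise, which is why it avoids the two-argument limit that do.call(`+`, ...) runs into.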

【Solution 2】:

new_add <- function(dt, summand1Name, summand2Name, resultName) {
    dt[, (resultName) := rowSums(.SD), .SDcols = c(summand1Name, summand2Name)]
}

This simply takes the column names as strings. Adding it to amatsuo_net's speed test, together with sindri's two versions, gives the following results:

microbenchmark::microbenchmark(
  original_add(dt, 'a', 'b', 'c'),
  my_add(dt, 'a', 'b', 'c'),
  list_access_add(dt, 'a', 'b', 'c'),
  david_add(dt, 'a', 'b', 'c'),
  new_add(dt, 'a', 'b', 'c'),
  get_add(dt, 'a', 'b', 'c'),
  mget_add(dt, 'a', 'b', 'c'))

## Unit: microseconds
##                               expr   min      lq     mean median      uq     max neval
##    original_add(dt, "a", "b", "c") 433.3  491.00  635.315  531.4  600.00  6064.0   100
##          my_add(dt, "a", "b", "c") 978.0 1062.35 1310.808 1208.8 1357.80  4157.3   100
## list_access_add(dt, "a", "b", "c") 303.9  331.95  432.939  363.8  434.05  3361.6   100
##       david_add(dt, "a", "b", "c") 401.3  440.65  659.748  474.5  577.75 11623.0   100
##         new_add(dt, "a", "b", "c") 518.9  588.30  765.394  667.1  741.95  5636.5   100
##         get_add(dt, "a", "b", "c") 415.1  454.50  674.699  491.1  546.70  9804.3   100
##        mget_add(dt, "a", "b", "c") 425.4  474.65  596.165  533.2  590.75  3888.0   100

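This is not the fastest of all the versions, but if you are looking for code that is painless to write, it is pretty simple. Because it works through rowSums, it also generalizes more easily to summing an arbitrary number of columns at once.

In addition, since dt is not mentioned inside the square brackets, you can add this column definition to a data.table "pipe" rather than wrapping it in a function: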

dt[, (resultName) := rowSums(.SD), .SDcols = c(summand1Name, summand2Name)
][, lapply(.SD, range), .SDcols = c(summand1Name, summand2Name, resultName)
][... # etc
]
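As a concrete, runnable variant of the chained form above, here is a small sketch. The table contents and the three name variables are made up for illustration only.

```r
library(data.table)

dt <- data.table(a = 1:5, b = 10:14)
summand1Name <- "a"
summand2Name <- "b"
resultName <- "c"

# First step adds the sum column by reference; second step computes each column's range.
ranges <- dt[, (resultName) := rowSums(.SD), .SDcols = c(summand1Name, summand2Name)
][, lapply(.SD, range), .SDcols = c(summand1Name, summand2Name, resultName)]
ranges
```

The first bracket returns dt itself with the new column, so the second bracket can refer to that column by name.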

【Discussion】:

I have added sindri's versions to the benchmark.
Thanks. Interesting results.

【Solution 3】:

You can combine () on the LHS of := with with = FALSE when referring to variables on the RHS.

dt <- data.table(a = 1:5, b = 10:14)
my_add <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := dt[, summand1Name, with = FALSE] + 
       dt[, summand2Name, with = FALSE]]
}
my_add(dt, 'a', 'b', 'c')
dt
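To make the with = FALSE form easier to compare with list-style access, the sketch below shows what each expression returns. The variable names colName, sub_dt and vec are illustrative only.

```r
library(data.table)

dt <- data.table(a = 1:5, b = 10:14)
colName <- "a"

sub_dt <- dt[, colName, with = FALSE]  # a one-column data.table
vec <- dt[[colName]]                   # the column itself, as a plain vector

is.data.table(sub_dt)
is.integer(vec)
```

The with = FALSE form builds a new data.table around the column, while [[ hands back the column vector directly.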

Edit:

I compared the versions suggested so far. Mine is the least efficient... (but I am keeping it for reference).

set.seed(1)
dt <- data.table(a = rnorm(10000), b = rnorm(10000))
original_add <- function(dt, summand1Name, summand2Name, resultName) {
  cmd = paste0('dt = dt[, ', resultName, ' := ', summand1Name, ' + ', summand2Name, ']')
  eval(parse(text=cmd))
  return(dt) # optional since manipulated  by reference
}
my_add <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := dt[, summand1Name, with = FALSE] + 
       dt[, summand2Name, with = FALSE]]
}
list_access_add <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := dt[[summand1Name]] + dt[[summand2Name]]]
}
david_add <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := .SD[[summand1Name]] + .SD[[summand2Name]]]
}

microbenchmark::microbenchmark(
  original_add(dt, 'a', 'b', 'c'),
  my_add(dt, 'a', 'b', 'c'),
  list_access_add(dt, 'a', 'b', 'c'),
  david_add(dt, 'a', 'b', 'c'))

## Unit: microseconds
##                                expr      min        lq      mean    median        uq      max neval
##     original_add(dt, "a", "b", "c")  604.397  659.6395  784.2206  713.0315  776.1295 5070.541   100
##           my_add(dt, "a", "b", "c") 1063.984 1168.6140 1460.5329 1247.7990 1486.9730 6134.959   100
##  list_access_add(dt, "a", "b", "c")  272.822  310.9680  422.6424  334.3110  380.6885 3620.463   100
##        david_add(dt, "a", "b", "c")  389.389  431.9080  542.7955  454.5335  493.4895 3696.992   100

Edit 2:

With one million rows, the results look as follows. As expected, the original method performs well: once the eval is done, the rest runs fast.

## Unit: milliseconds
##                                expr       min        lq      mean    median        uq      max neval
##     original_add(dt, "a", "b", "c")  2.493553  3.499039  6.585651  3.607101  4.390051 114.0612   100
##           my_add(dt, "a", "b", "c") 11.821820 14.512878 28.387841 17.412433 19.642231 117.6359   100
##  list_access_add(dt, "a", "b", "c")  2.161276  3.133110  6.874885  3.218185  3.407776 107.6853   100
##        david_add(dt, "a", "b", "c")  2.237089  3.313133  6.047832  3.381757  3.788558 103.7532   100

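【Discussion】:

You could also do this with substitute and eval, or potentially with Hadley's nicer quo and UQ functions from rlang or the development version of dplyr, without having to call […], which seems less than ideal.
I.e. summand1 = substitute(summand1Name)... at the start, inside the function, and then dt[, (resultName) := eval(summand1) + eval(summand2)]. Here you pass in bare column names for the summands rather than strings.
First, with = FALSE may also make a copy; second, dt[[summand1Name]] will be more efficient than dt[, summand1Name, with = FALSE].
@David_Arenburg. Yes, you are quite right. See my edit.
The different approaches behave very differently when operating by group, and you do not want to use some of them in that case. Compare dt[, sum(dt[["a"]]), by = cut(b, breaks = c(-Inf, 0, Inf))]; dt[, sum(a), by = cut(b, breaks = c(-Inf, 0, Inf))]; dt[, sum(.SD[["a"]]), by = cut(b, breaks = c(-Inf, 0, Inf))]; dt[, sum(dt[, "a", with = FALSE]), by = cut(b, breaks = c(-Inf, 0, Inf))]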
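The by-group difference raised in the last comment can be seen on a tiny table. The data, the grouping column g and the result names below are invented for illustration.

```r
library(data.table)

dt <- data.table(a = c(1, 2, 3, 4), g = c("x", "x", "y", "y"))
colName <- "a"

# dt[[colName]] refers to the whole table, so every group gets the overall sum.
whole <- dt[, .(s = sum(dt[[colName]])), by = g]

# .SD and get() see only the rows of the current group.
by_sd <- dt[, .(s = sum(.SD[[colName]])), by = g]
by_get <- dt[, .(s = sum(get(colName))), by = g]
```

Here whole$s holds the full-column sum for both groups, while by_sd$s and by_get$s hold the per-group sums.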

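【Solution 4】:

Here is another solution, using substitute. I usually try to avoid substitute, but I think it is the only way to use the fast data.table := code rather than native list access.

I kept amatsuo_net's interface.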

set.seed(1)
dt <- data.table(a = rnorm(10000), b = rnorm(10000))

snaut_add <- function(dt, summand1, summand2, resultName){
  eval(substitute(
    dt[, z := x + y],
    list(
      z=as.symbol(resultName),
      x=as.symbol(summand1),
      y=as.symbol(summand2)
    )
  ))
}

snaut_add(dt, "a", "b", "c")
dt
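A closely related sketch builds the same call with base R's bquote() instead of substitute(). This is an alternative phrasing of the idea rather than the answer's own code, and the name bquote_add is hypothetical.

```r
library(data.table)

# Hypothetical name; same interface as snaut_add.
bquote_add <- function(dt, summand1, summand2, resultName) {
  # bquote() replaces each .( ) with its evaluated contents, here column symbols,
  # producing a call such as dt[, c := a + b].
  cmd <- bquote(
    dt[, .(as.name(resultName)) := .(as.name(summand1)) + .(as.name(summand2))]
  )
  eval(cmd)
}

dt <- data.table(a = 1:3, b = 4:6)
bquote_add(dt, "a", "b", "c")
dt$c
```

As with the substitute version, data.table ends up evaluating an ordinary expression with literal column names.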

【Discussion】: