Nested hashes in R: answers

【Question title】: Nested hashes in R
【Posted】: 2021-05-25 17:21:46
【Question】:

How can I efficiently build deeply nested hashes in R, using variables as the keys? For example:

hash <- new.env(hash = TRUE, parent = emptyenv(), size = 100L)
foo <- 'food'
bar <- 'fruits'
var <- 'apple'
count <- 1

# this will work... but it's only one level in
hash[[foo]] <- count

# A deeper nest is needed, but does not work:
hash[[foo]][[bar]][[var]] <- count

# this gets closer, but foo and bar need to be evaluated as their variables
hash$foo$bar$var <- count

# examine the keys
ls(hash)

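I saw an answer in the post here suggesting that this may not be possible in R. Is that true? With $ assignment we can get a level deeper, but the problem here needs variables as keys.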

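I have read that using the environment's hash capabilities is faster than using some packages, and it is probably best to learn the fastest method, but if a package is needed to get the job done, then I suppose I will have to use one.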

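【Comments】:

Nested hashes work fine, assuming a hash or dictionary has "keys" and "values". Your hash[[foo]][[bar]][[var]] makes no sense here, because hash[[foo]] must itself resolve to a hash for hash[[foo]][[bar]] to make sense. Since hash[[foo]] is just a string, 'food'[[bar]] is meaningless. I suggest you expand on what "nested hash" means, perhaps by showing how you intend to use it.
Thanks for the comment. I think you are right that my terminology is poor, but how should I describe it? Lists within lists, nested an unknown number of times, with hashing and the ability to retrieve keys and to overwrite a value if the key already exists. I believe Perl calls this a deeply nested hash, written like this: my %hash; $hash{$foo}{$bar}{$var} = $count
It is possible to make hash[[foo]][[bar]] work, but that relies on hash[[foo]] not existing yet. That is, once it has been assigned 1, the premise that it can change from storing a number to storing a hash is, to me, unintuitive and destructive behavior. Are you suggesting that hash[[foo]] <- 1 should work, that hash[[foo]][[bar]][[var]] <- count should also work, and that afterwards hash[[foo]] should still return just 1? There is a lot of ambiguity in that, of a kind that I think ordinary hash-like interfaces do not tend to imply.
However, if you do not assign hash[[foo]] beforehand and want to be able to do hash[[foo]][[bar]] <- ... and have [[foo]] become a hash (where it did not exist before), then that may be feasible.
In the end, though, I am really curious about the use case. You may not need the premise of nested hashes at all. Or there may be something else I have not thought of.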

Tags: r hash nested-lists

【Solution 1】:

Here is a list of lists that has the properties you want: arbitrary levels can be added without "pre-declaring" them.

l = list()
l[[foo]][[bar]][[var]] = 2
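
When the number of levels is itself variable, the same idea can be wrapped in a small helper that takes a character vector of keys. The sketch below is not part of the original answer: set_path() and get_path() are hypothetical names, and each level is created explicitly as a list, so the code does not depend on how `[[<-` fills in missing intermediate elements.

```r
# Hypothetical helpers (not from the original answer).

# Set `value` at the nested position given by the character vector `keys`,
# creating intermediate lists as needed. Returns the modified list.
set_path <- function(x, keys, value) {
    if (length(keys) == 0L)
        return(value)
    if (!is.list(x))
        x <- list()   # create the level; note this replaces a non-list value
    x[[keys[[1]]]] <- set_path(x[[keys[[1]]]], keys[-1], value)
    x
}

# Look up the nested position given by `keys`; NULL if any key is missing.
get_path <- function(x, keys) {
    for (k in keys) {
        if (!is.list(x))
            return(NULL)
        x <- x[[k]]
    }
    x
}

foo <- "food"; bar <- "fruits"; var <- "apple"

nested <- list()
nested <- set_path(nested, c(foo, bar, var), 1)
nested <- set_path(nested, c(foo, "vegetables", "tomato"), 2)
nested <- set_path(nested, c(foo, bar, var), 3)   # overwrite an existing key

get_path(nested, c(foo, bar, var))          # 3
names(nested[[foo]])                        # "fruits" "vegetables"
get_path(nested, c(foo, "dairy", "milk"))   # NULL
```

Because lists have value semantics, set_path() has to return the modified list and the caller has to reassign it.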

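List-of-lists to hash-of-hashes

In fact, one can convert this from a list of lists into nested environments (which would allow passing the structure to a function and updating leaf nodes, for example, without having to return the structure), like this: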

as_environment = function(x) {
    if (is.list(x)) {
        x <- lapply(x, as_environment)
        x <- as.environment(x)
    }
    x
}

e = as_environment(l)

This shows that nested hashes are possible in R.
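
The practical difference between the two structures is easiest to see by updating a leaf inside a function that returns nothing. This is only a sketch: add_one(), lst and env are hypothetical names, as_environment() is repeated from above so the block runs on its own, and the list is built with explicit list() calls.

```r
# Repeated from the answer so that this block is self-contained.
as_environment <- function(x) {
    if (is.list(x)) {
        x <- lapply(x, as_environment)
        x <- as.environment(x)
    }
    x
}

# Hypothetical helper: increments a leaf and deliberately returns nothing.
add_one <- function(x, k1, k2, k3) {
    x[[k1]][[k2]][[k3]] <- x[[k1]][[k2]][[k3]] + 1
    invisible(NULL)
}

lst <- list(food = list(fruits = list(apple = 2)))
env <- as_environment(lst)

add_one(lst, "food", "fruits", "apple")   # modifies a local copy only
add_one(env, "food", "fruits", "apple")   # modifies the shared environment

lst[["food"]][["fruits"]][["apple"]]   # 2, the caller's list is unchanged
env[["food"]][["fruits"]][["apple"]]   # 3, updated in place
```

With a list the function would have to return x and the caller would have to reassign it; with nested environments the update is visible everywhere the environment is referenced.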

Here is some data: three nesting levels with 50, 100, or 1000 possible values, and 10000 data points in total

m = c(50, 100, 1000)
n = 1000 * 10
d = list(
    a = sample(as.character(seq_len(m[[1]])), n, TRUE),
    b = sample(as.character(seq_len(m[[2]])), n, TRUE),
    c = sample(as.character(seq_len(m[[3]])), n, TRUE)
)

Here are some functions for measuring performance

f0 = function(a, b, c, n) {
    ## data access fixed cost
    for (i in seq_len(n))
        c(a[[i]], b[[i]], c[[i]])
}

f1 = function(x, a, b, c, n) {
    ## creation / assignment
    for (i in seq_len(n))
        x[[ a[[i]] ]][[ b[[i]] ]][[ c[[i]] ]] <- 1
    x
}

f2 = function(x, a, b, c, n) {
    ## update
    for (i in seq_len(n))
        x[[ a[[i]] ]][[ b[[i]] ]][[ c[[i]] ]] <-
            x[[ a[[i]] ]][[ b[[i]] ]][[ c[[i]] ]] + 1
    x
}

Here is the benchmark

library(microbenchmark)

l <- with(d, f1(list(), a, b, c, n))
e <- as_environment(l)
microbenchmark(
    with(d, f0(a, b, c, n)),
    with(d, f1(list(), a, b, c, n)),
    with(d, f2(l, a, b, c, n)),
    with(d, f2(e, a, b, c, n)),
    times = 10
)

with output...

Unit: milliseconds
                            expr      min       lq      mean    median        uq       max neval
         with(d, f0(a, b, c, n)) 16.59220 17.37859  19.36920  18.02578  20.46342  25.21631    10
 with(d, f1(list(), a, b, c, n)) 72.54094 74.24085  83.54071  81.90286  90.75257  98.03838    10
      with(d, f2(l, a, b, c, n)) 86.65550 96.49548 104.69007 101.74540 116.04673 135.76844    10
      with(d, f2(e, a, b, c, n)) 48.53202 52.89202  57.76179  55.37080  64.14356  69.74413    10

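First, the time units are milliseconds. Second, updating the hash of hashes takes a little over half the time of updating the list of lists (median about 55 ms versus about 102 ms), so the saving is less than 50%. If I increase n by a factor of 10, I see these times increase by about a factor of 10: both the list of lists and the hash of hashes scale approximately linearly.

These timings strongly suggest that, as far as performance goes and at least for data of this size, we might as well use the straightforward list-of-lists approach.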
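
If the plain list of lists is used for counting, as in the question, one detail needs care: f2() above assumes that every key already exists, because NULL + 1 is numeric(0) rather than 1. A minimal sketch of a counter that also creates missing keys follows; count_up() and the sample keys are hypothetical, and each level is created explicitly as a list.

```r
# Hypothetical helper (not from the original answer): increment the counter
# at x[[k1]][[k2]][[k3]], starting at 1 when the key does not exist yet.
count_up <- function(x, k1, k2, k3) {
    if (!is.list(x[[k1]]))
        x[[k1]] <- list()
    if (!is.list(x[[k1]][[k2]]))
        x[[k1]][[k2]] <- list()
    current <- x[[k1]][[k2]][[k3]]   # NULL when the key is missing
    x[[k1]][[k2]][[k3]] <- if (is.null(current)) 1 else current + 1
    x
}

# Made-up observations, each a (level 1, level 2, level 3) key triple.
observations <- list(
    c("food", "fruits", "apple"),
    c("food", "fruits", "apple"),
    c("food", "fruits", "pear"),
    c("drink", "hot", "tea")
)

counts <- list()
for (k in observations)
    counts <- count_up(counts, k[[1]], k[[2]], k[[3]])

counts[["food"]][["fruits"]][["apple"]]   # 2
counts[["food"]][["fruits"]][["pear"]]    # 1
names(counts)                             # "food" "drink"
```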

However...

A Hash class that supports nested construction?

Here is a "Hash" class, which is an environment

Hash <- function()
    structure(new.env(parent = emptyenv()), class = "Hash")
h = Hash()

If we try

h[[foo]][[bar]][[var]] <- 1

h is a Hash, but the key it contains holds a list of lists

> h
<environment: 0x7fddbd00c490>
attr(,"class")
[1] "Hash"
> h[[foo]]
$fruits
$fruits$apple
[1] 1

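This is because of the way R evaluates the assignment, essentially from right to left: it creates list(apple = 1), then list(fruits = list(apple = 1)), and only then assigns that to our Hash / environment. I don't really see how to use the existing syntax to force the right-most assignments to create environments, but we can write an update method that first coerces a list of lists to a hash of hashes on assignment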

## like as_environment, above...
as_Hash = function(x) {
    if (is.list(x)) {
        x <- lapply(x, as_Hash)
        x <- structure(as.environment(x), class = "Hash")
    }
    x
}

## re-define assignment of an element to a hash -- if it's a list-of-lists, 
## then coerce to a Hash-of-Hashes
`[[<-.Hash` <- function(x, i, value) {
    if (is.list(value))
        value <- as_Hash(value)
    assign(i, value, x)
    x
}

Once the assignment is complete, the result is always a hash of hashes.

> h = Hash()
> h[[foo]][[bar]][[var]] <- 1
> h[[foo]][["vegetable"]][["tomato"]] <- 2
> h
<environment: 0x7fddbb7102c8>
attr(,"class")
[1] "Hash"
> h[[foo]]
<environment: 0x7fddbb719ca8>
attr(,"class")
[1] "Hash"
> ls(h[[foo]])
[1] "fruits"    "vegetable"
> h[[foo]][[bar]]
<environment: 0x7fddbb718d80>
attr(,"class")
[1] "Hash"
> h[[foo]][[bar]][[var]]
[1] 1
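
An alternative that does not rely on how nested replacement calls are evaluated is to walk the keys and create each missing level as an environment explicitly. This is only a sketch under that assumption: hash_set() and store are hypothetical names, it expects at least one key, and it uses plain environments rather than the Hash class above.

```r
# Hypothetical helper (not from the original answer): set `value` at the
# nested position `keys` inside environment `h`, creating intermediate
# environments as needed. `h` is modified in place.
hash_set <- function(h, keys, value) {
    node <- h
    n <- length(keys)
    for (k in keys[-n]) {
        if (!is.environment(node[[k]]))
            node[[k]] <- new.env(parent = emptyenv())   # replaces any non-environment value
        node <- node[[k]]
    }
    node[[keys[[n]]]] <- value
    invisible(h)
}

foo <- "food"; bar <- "fruits"; var <- "apple"

store <- new.env(parent = emptyenv())
hash_set(store, c(foo, bar, var), 1)
hash_set(store, c(foo, "vegetable", "tomato"), 2)

ls(store[[foo]])                  # "fruits" "vegetable"
is.environment(store[[foo]])      # TRUE
store[[foo]][[bar]][[var]]        # 1
```

No reassignment is needed after hash_set(), because environments are modified by reference.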

But is it necessary?

Going back to our original example

l = list()
l[[foo]][[bar]][[var]] = 2

You can use .Internal(inspect(l)) to see how R organizes things

> .Internal(inspect(l))
@7f886228d6d0 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
  @7f886228d698 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
    @7f886228d660 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
      @7f886228d740 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 2
## ... additional output, dealing with the names (ATTRIB) at each level

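This says that l is represented by memory at a particular address, @7f886228d6d0, holding R's internal list representation (VECSXP). The VECSXP has length 1 and points to another list / VECSXP at @7f886228d698. That in turn points to another list / VECSXP at @7f886228d660, which contains the value you assigned, a REALSXP at @7f886228d740.

What happens if you update an element?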

> l[[foo]][[bar]][[var]] <- 3
> .Internal(inspect(l))
@7f886228d6d0 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
  @7f886228d698 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
    @7f886228d660 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
      @7f886228d5f0 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 3
...

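Note that only the memory location of the REALSXP has changed, so you have not copied the entire structure, only the part that actually changed. Nice.

What about adding another fruit?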

> l[[foo]][[bar]][["pear"]] <- 4
> .Internal(inspect(l))
@7f886228d6d0 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
  @7f886228d698 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
    @7f885dc9ff48 19 VECSXP g0c2 [REF(1),ATT] (len=2, tl=0)
      @7f886228d5f0 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 3
      @7f88622a4200 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 4
...

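We added a REALSXP for the pear, and also changed the VECSXP for fruits. We did not change the REALSXP for apple, nor the other VECSXPs. Again, we changed only (or almost only) the memory that needed to change.

And what about changing an element higher up the food chain ;)?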

> l[[foo]][["vegetables"]][["tomato"]] <- 4
> .Internal(inspect(l))
@7f886228d6d0 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
  @7f885dca0148 19 VECSXP g0c2 [REF(1),ATT] (len=2, tl=0)
    @7f885dc9ff48 19 VECSXP g0c2 [REF(1),ATT] (len=2, tl=0)
      @7f886228d5f0 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 3
      @7f88622a4200 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 4
...
    @7f88622a3f98 19 VECSXP g0c1 [REF(1),ATT] (len=1, tl=0)
      @7f88622a4008 14 REALSXP g0c1 [REF(3)] (len=1, tl=0) 4

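We changed the VECSXP at the level that holds fruits and vegetables (the food list, now of length 2), and of course added our tomato, but the other components of the data structure are unchanged.

This suggests that R makes a minimal number of copies of the data, so we might expect this data structure to be relatively efficient for reasonably sized nested lists. It is worth finding out whether that is the case before investing more (or this much!) effort.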
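
The user-visible side of this behavior can be checked without .Internal(inspect()): an earlier copy of the list stays valid as a snapshot, even though R shares the unchanged parts internally. A small sketch, with the hypothetical names basket and snapshot and a list built with explicit list() calls:

```r
basket <- list(food = list(fruits = list(apple = 3)))
snapshot <- basket   # no deep copy is made at this point

# Modify the original after the snapshot was taken.
basket[["food"]][["fruits"]][["pear"]] <- 4
basket[["food"]][["fruits"]][["apple"]] <- 5

names(basket[["food"]][["fruits"]])          # "apple" "pear"
names(snapshot[["food"]][["fruits"]])        # "apple"
snapshot[["food"]][["fruits"]][["apple"]]    # 3
```

This is the value semantics that separates a list of lists from nested environments, where every reference sees the update.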

【Comments】:

The code looks close to what we covered in discussion.

【Solution 2】:

This will work if you pre-declare the lists:

count <- 1
hash <- new.env(hash = TRUE, parent = emptyenv(), size = 100L)
hash[[foo]] <- list()
hash[[foo]][[bar]] <- list()
hash[[foo]][[bar]][[var]] <- count

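【Comments】:

There are some differences between environments and lists, although in some respects both can be treated as hashes. An important one is that environments are reference objects and lists are not. I don't know which one @Oatmeal wants.
Actually, I think it is better than you say: hash = list(); hash[[foo]][[bar]][[var]] <- 1 creates the nested lists for you. I believe this is probably efficient enough for hashes of moderate size / nesting: updating a list of lists does not cause a complete copy of the data structure, only of the parts that need to change, e.g. a leaf node, or an internal node in the case of adding "vegetables".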