A fairly literal implementation of this problem iterates along the player ids, returning for each position the unique elements in the head of the vector up to that position
f0 <- function(player_ids)
    lapply(seq_along(player_ids), function(i) unique(head(player_ids, i)))
This avoids having to manage allocation of the result list, and it also handles the case where length(player_ids) == 0L. For a more efficient implementation, create the list of "cumulative" sets
uid <- unique(player_ids)
sets <- lapply(seq_along(uid), function(i) uid[seq_len(i)])
and then identify which set belongs to the i-th index
did <- !duplicated(player_ids)
sets[cumsum(did)]
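As a small worked example (with a hypothetical input vector), the pieces fit together like this:

```r
## hypothetical input for illustration
player_ids <- c("a", "b", "a", "c")
uid <- unique(player_ids)          # "a" "b" "c"
sets <- lapply(seq_along(uid), function(i) uid[seq_len(i)])
did <- !duplicated(player_ids)     # TRUE TRUE FALSE TRUE
cumsum(did)                        # 1 2 2 3
sets[cumsum(did)]                  # the 4th element is c("a", "b", "c")
```

The repeated index 2 in cumsum(did) is what maps the duplicate "a" back to the set that was already current at that point.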
Here are the solutions so far
f1 <- function(player_ids) {
    end <- length(player_ids)
    tank <- player_ids[1]
    unique_players_list <- vector("list", end)
    for (i in 1:end) {
        if (!player_ids[i] %in% tank) tank <- c(tank, player_ids[i])
        unique_players_list[[i]] <- tank
    }
    unique_players_list
}
f2 <- function(player_ids) {
    un <- unique(player_ids)
    ma <- match(un, player_ids)
    li <- vector("list", length(player_ids))
    for (i in seq_along(player_ids))
        li[[i]] <- un[ma <= i]
    li
}
f3 <- function(player_ids) {
    uid <- unique(player_ids)
    sets <- lapply(seq_along(uid), function(i) uid[seq_len(i)])
    sets[cumsum(!duplicated(player_ids))]
}
and some basic tests that they produce the same results
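The comparisons below work for any sample vector player_ids; for concreteness, one might use:

```r
## a reproducible sample input for the identical() comparisons
set.seed(123)
player_ids <- sample(5, 20, TRUE)
```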
> identical(f1(player_ids), f2(player_ids))
[1] TRUE
> identical(f1(player_ids), f3(player_ids))
[1] TRUE
and a performance evaluation on a larger dataset
> library(microbenchmark)
> ids <- sample(100, 10000, TRUE)
> microbenchmark(f1(ids), f2(ids), f3(ids), times=10)
Unit: microseconds
expr min lq mean median uq max neval
f1(ids) 24397.193 25820.375 32055.5720 26475.8245 28030.866 56487.781 10
f2(ids) 20607.564 22148.888 34462.5850 24432.4785 51722.208 53473.468 10
f3(ids) 414.649 458.271 772.3738 501.5185 686.383 2163.261 10
f3() does well when the original vector is long relative to the number of unique values. Here is a dataset where the elements of the original vector are mostly unique, and the timings are more comparable
> ids <- sample(1000000, 10000, TRUE)
> microbenchmark(f1(ids), f2(ids), f3(ids), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
f1(ids) 214.2505 232.3902 233.7632 233.4617 237.5509 249.4652 10
f2(ids) 433.5181 443.5987 512.4475 463.8388 467.3710 949.4882 10
f3(ids) 299.2291 301.4931 307.7576 302.9375 316.6055 321.3942 10
Handling edge cases correctly can be important; a common one is the zero-length vector, e.g., f2(integer()). f1() does not handle this case. Interestingly, I believe all of the implementations are agnostic to the type of the input, e.g., f1(sample(letters, 100, TRUE)) works.
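Both properties can be checked directly (restating f2() from above for completeness):

```r
f2 <- function(player_ids) {
    un <- unique(player_ids)
    ma <- match(un, player_ids)
    li <- vector("list", length(player_ids))
    for (i in seq_along(player_ids))
        li[[i]] <- un[ma <= i]
    li
}
f2(integer())                 # list() -- zero-length input is handled
f2(c("x", "y", "x"))          # character input works, too
```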
Some offline discussion led to the suggestion that the return format is neither convenient nor memory-efficient, and that duplicated() and unique() are in some sense similar operations, so we should be able to get away with a single call. This led to the following solution, which returns the vector of unique identifiers plus, for each player_id, the offset of the end of its set within that vector
f5 <- function(player_ids) {
    did <- !duplicated(player_ids)
    list(uid = player_ids[did], end_idx = cumsum(did))
}
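The set of players seen up to any position i can be materialized from f5()'s compact representation on demand; a sketch, where nth_set is a hypothetical helper (f5() restated for completeness):

```r
f5 <- function(player_ids) {
    did <- !duplicated(player_ids)
    list(uid = player_ids[did], end_idx = cumsum(did))
}

## hypothetical helper: reconstruct the i-th cumulative set only when needed
nth_set <- function(res, i)
    res$uid[seq_len(res$end_idx[i])]

res <- f5(c("a", "b", "a", "c"))
nth_set(res, 3)   # "a" "b"
nth_set(res, 4)   # "a" "b" "c"
```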
The result cannot be compared directly to the others with identical() or similar. An f3() updated to use a single duplicated() call is
f3a <- function(player_ids) {
    did <- !duplicated(player_ids)
    uid <- player_ids[did]
    sets <- lapply(seq_along(uid), function(i) uid[seq_len(i)])
    sets[cumsum(did)]
}
Here are some performance measurements
> ids <- sample(100, 10000, TRUE)
> print(object.size(f3(ids)), units="auto")
4.2 Mb
> print(object.size(f5(ids)), units="auto")
39.8 Kb
> microbenchmark(f3(ids), f3a(ids), f5(ids), times=10)
Unit: microseconds
expr min lq mean median uq max neval
f3(ids) 437.663 445.091 450.3965 447.3755 452.629 476.016 10
f3a(ids) 342.378 351.408 385.0844 354.2375 369.861 638.084 10
f5(ids) 125.956 127.684 129.9898 128.5890 130.202 140.521 10
and
> ids <- sample(1000000, 10000, TRUE)
> microbenchmark(f3(ids), f3a(ids), f5(ids), times=10)
Unit: microseconds
     expr        min         lq         mean     median          uq         max neval
  f3(ids) 816317.361 821892.902  911862.5561 831274.596 1107496.984 1112586.295    10
 f3a(ids) 824593.618 827590.130 1009032.9519 829197.863  838559.619 2607916.641    10
  f5(ids)    213.677    270.397     313.1614    282.213     315.683     601.724    10