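【Posted】:2023-03-23 08:07:01
【Question】:
Major edit to clarify, since the answers were wrong.
I have a data.table with group columns (split_by), key columns (key_by) and trait id columns (intersect_by).
Within each split_by group, I want to keep only the rows whose trait ids are shared by all the keys present in that group.
For example: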
dt <- data.table(id = 1:6, key1 = 1, key2 = c(1:2, 2), group_id1= 1, group_id2= c(1:2, 2:1, 1:2), trait_id1 = 1, trait_id2 = 2:1)
setkey(dt, group_id1, group_id2, trait_id1, trait_id2)
dt
   id key1 key2 group_id1 group_id2 trait_id1 trait_id2
1:  4    1    1         1         1         1         1
2:  1    1    1         1         1         1         2
3:  5    1    2         1         1         1         2
4:  2    1    2         1         2         1         1
5:  6    1    2         1         2         1         1
6:  3    1    2         1         2         1         2
res <- intersect_this_by(dt,
                         key_by = c("key1", "key2"),
                         split_by = c("group_id1", "group_id2"),
                         intersect_by = c("trait_id1", "trait_id2"))
I would like res to look like this:
> res[]
   id key1 key2 group_id1 group_id2 trait_id1 trait_id2
1:  1    1    1         1         1         1         2
2:  5    1    2         1         1         1         2
3:  2    1    2         1         2         1         1
4:  6    1    2         1         2         1         1
5:  3    1    2         1         2         1         2
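We can see that id 4 has been dropped. In the group with group_id1 = 1 and group_id2 = 1 (the group id 4 belongs to), only one key combination, (1,1), has the traits (1,1), whereas the group contains two key combinations: (1,1) and (1,2). The traits (1,1) are therefore not shared by all keys in the group, so that trait is dropped from the group, which removes id 4. By contrast, ids 1 and 5 have the same traits but different keys, and together they cover all keys in the group ((1,1) and (1,2)), so the traits of ids 1 and 5 are kept.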
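The reasoning above can be checked by counting keys directly. The following is only a minimal sketch on the example data: it assumes that both key1 and key2 make up the key (as the key combinations (1,1) and (1,2) in the explanation imply), and the names keys_per_group, keys_per_trait and shared_by_all are made up for illustration.

```r
library(data.table)

# Same values as the example dt, with the recycled vectors written out
dt <- data.table(id = 1:6, key1 = 1, key2 = c(1, 2, 2, 1, 2, 2),
                 group_id1 = 1, group_id2 = c(1, 2, 2, 1, 1, 2),
                 trait_id1 = 1, trait_id2 = c(2, 1, 2, 1, 2, 1))

key_by       <- c("key1", "key2")   # assumption: both columns form the key
split_by     <- c("group_id1", "group_id2")
intersect_by <- c("trait_id1", "trait_id2")

# Number of distinct key combinations present in each group
keys_per_group <- dt[, .(n_keys = uniqueN(.SD)),
                     by = split_by, .SDcols = key_by]

# Number of distinct key combinations carrying each trait within each group
keys_per_trait <- dt[, .(n_keys_trait = uniqueN(.SD)),
                     by = c(split_by, intersect_by), .SDcols = key_by]

# A trait is shared by all keys of its group only when the two counts agree
counts <- merge(keys_per_trait, keys_per_group, by = split_by)
counts[, shared_by_all := n_keys_trait == n_keys]
print(counts)
# Only group (1,1) with trait (1,1) has shared_by_all == FALSE (this is id 4)
```

Group (1,1) has two keys but trait (1,1) is carried by just one of them, which is why id 4 is the only row to go.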
A function that achieves this is given here:
intersect_this_by2 <- function(dt,
                               key_by = NULL,
                               split_by = NULL,
                               intersect_by = NULL){
  # copy() so that the := calls below do not add helper columns
  # to the caller's table by reference
  dtc <- copy(as.data.table(dt))
  # compute number of keys in the group
  dtc[, n_keys := uniqueN(.SD), by = split_by, .SDcols = key_by]
  # compute number of keys represented by each trait in each group
  # and keep row only if they represent all keys from the group
  dtc[, keep := n_keys == uniqueN(.SD), by = c(intersect_by, split_by), .SDcols = key_by]
  dtc <- dtc[keep == TRUE][, c("n_keys", "keep") := NULL]
  return(dtc)
}
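But it becomes quite slow for large datasets or complex traits/keys/groups. The real data.table has 10 million rows, and the traits have 30 levels. Is there any way to improve this? Are there any obvious pitfalls? Thanks for your help.
Final edit: Uwe proposed a concise solution that is 40% faster than my initial code (which I removed here because it was confusing). The final function looks like this: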
intersect_this_by_uwe <- function(dt,
                                  key_by = c("key1"),
                                  split_by = c("group_id1", "group_id2"),
                                  intersect_by = c("trait_id1", "trait_id2")){
  dti <- copy(dt)
  dti[, original_order_id__ := 1:.N]
  setkeyv(dti, c(split_by, intersect_by, key_by))
  uni <- unique(dti, by = c(split_by, intersect_by, key_by))
  unique_keys_by_group <-
    unique(uni, by = c(split_by, key_by))[, .N, by = c(split_by)]
  unique_keys_by_group_and_trait <-
    uni[, .N, by = c(split_by, intersect_by)]
  # 1st join to pick group/traits combinations with equal number of unique keys
  selected_groups_and_traits <-
    unique_keys_by_group_and_trait[unique_keys_by_group,
                                   on = c(split_by, "N"), nomatch = 0L]
  # 2nd join to pick records of valid subsets
  dti[selected_groups_and_traits, on = c(split_by, intersect_by)][
    order(original_order_id__), -c("original_order_id__","N")]
}
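To see why the first join works, the two count tables can be built by hand on the small example. This is only a sketch that mirrors the steps of the function above, not a replacement for it. It assumes both key1 and key2 form the key, and the names keys_by_group, keys_by_trait and selected are illustrative.

```r
library(data.table)

# Same values as the example dt, with the recycled vectors written out
dt <- data.table(id = 1:6, key1 = 1, key2 = c(1, 2, 2, 1, 2, 2),
                 group_id1 = 1, group_id2 = c(1, 2, 2, 1, 1, 2),
                 trait_id1 = 1, trait_id2 = c(2, 1, 2, 1, 2, 1))

key_by       <- c("key1", "key2")   # assumption: both columns form the key
split_by     <- c("group_id1", "group_id2")
intersect_by <- c("trait_id1", "trait_id2")

# One row per distinct group / trait / key combination
uni <- unique(dt, by = c(split_by, intersect_by, key_by))

# N = number of distinct keys per group
keys_by_group <- unique(uni, by = c(split_by, key_by))[, .N, by = split_by]

# N = number of distinct keys per group and trait
keys_by_trait <- uni[, .N, by = c(split_by, intersect_by)]

# Joining on the group columns AND N keeps only the group/trait pairs
# whose key count equals the key count of the whole group
selected <- keys_by_trait[keys_by_group, on = c(split_by, "N"), nomatch = 0L]

# Second join: pull the original rows for the surviving group/trait pairs
res <- dt[selected, on = c(split_by, intersect_by), nomatch = 0L]
print(res[order(id)])
```

Because N is part of the join condition, no explicit comparison or filter step is needed: a group/trait pair with fewer keys than its group simply finds no match.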
For the record, a benchmark on a 10M-row dataset:
> microbenchmark::microbenchmark(old_way = {res <- intersect_this_by(dt,
+ key_by = c("key1"),
+ split_by = c("group_id1", "group_id2"),
+ intersect_by = c("trait_id1", "trait_id2"))},
+ new_way = {res <- intersect_this_by2(dt,
+ key_by = c("key1"),
+ split_by = c("group_id1", "group_id2"),
+ intersect_by = c("trait_id1", "trait_id2"))},
+ new_way_uwe = {res <- intersect_this_by_uwe(dt,
+ key_by = c("key1"),
+ split_by = c("group_id1", "group_id2"),
+ intersect_by = c("trait_id1", "trait_id2"))},
+ times = 10)
Unit: seconds
        expr       min        lq      mean    median        uq       max neval cld
     old_way  3.145468  3.530898  3.514020  3.544661  3.577814  3.623707    10   b
     new_way 15.670487 15.792249 15.948385 15.988003 16.097436 16.206044    10   c
 new_way_uwe  1.982503  2.350001  2.320591  2.394206  2.412751  2.436381    10   a
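【Discussion】:
-
Does the order of the traits matter, i.e. would 3,2 and 2,3 both qualify? Or is it directional?
-
Yes, the order matters.
-
Why is 4 missing from res? It is group = (2,1) and trait = (2, 1), just like id=40. If that is a typo, then maybe
dt[id %in% dt[, .SD[, if (.N > 1) id, by=.(trait_id1, trait_id2)], by=.(group_id1, group_id2)]$V1]
-
If the order matters, I would paste the columns together and then group by that column within the grouping ids. Then filter for frequencies greater than 1.
-
4 is also missing because it is trait 2-1 with key 2, but there is no key 1 with the same trait in the same group. Sorry, I think you missed the key idea, but I must admit my explanation was poor!
Tags: r merge data.table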