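This is an attempt to reproduce the expected result according to the OP's rules.
I am still looking for a solution that uses unique() and duplicated() on the wide-format data as well as on data reshaped to long format.
However, here is a solution using a for loop which reproduces the expected result for the given sample dataset: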
library(data.table)
# append row numbers
setDT(DT)[, rn := .I]
# which values appear only once in the `to` column?
single_to <- DT[, .N, by = to][N == 1L, to]
single_to
[1] 2 1 7
DT[, drop := NA]
for (i in seq_len(nrow(DT))) {
print(i)
print(DT[i])
if (isTRUE(DT$drop[i])) next # row already has been eliminated
act_to <- DT$to[i]
# Rule 1: eliminate subsequent rows with repeated value in `to` column
DT[rn > i & (to == act_to), drop := TRUE]
# Rule 1: eliminate subsequent rows with repeated value in `from` column
# Rule 2: but keep rows where value is unique in the `to` column
DT[rn > i & (from == act_to) & !(to %in% single_to), drop := TRUE]
DT[i, drop := FALSE]
print(DT[])
}
[1] 1
from to distance weight rn drop
1: 1 8 1 10 1 NA
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 NA
3: 3 4 1 5 3 NA
4: 4 5 3 9 4 NA
5: 5 6 4 8 5 NA
6: 6 2 5 2 6 NA
7: 7 8 2 1 7 TRUE
8: 4 3 5 6 8 NA
9: 2 1 1 7 9 NA
10: 6 8 4 8 10 TRUE
11: 1 7 5 3 11 NA
12: 8 4 6 7 12 TRUE
13: 9 5 3 9 13 NA
14: 10 3 8 2 14 NA
[1] 2
from to distance weight rn drop
1: 2 6 1 9 2 NA
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 NA
4: 4 5 3 9 4 NA
5: 5 6 4 8 5 TRUE
6: 6 2 5 2 6 NA
7: 7 8 2 1 7 TRUE
8: 4 3 5 6 8 NA
9: 2 1 1 7 9 NA
10: 6 8 4 8 10 TRUE
11: 1 7 5 3 11 NA
12: 8 4 6 7 12 TRUE
13: 9 5 3 9 13 NA
14: 10 3 8 2 14 NA
[1] 3
from to distance weight rn drop
1: 3 4 1 5 3 NA
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 FALSE
4: 4 5 3 9 4 TRUE
5: 5 6 4 8 5 TRUE
6: 6 2 5 2 6 NA
7: 7 8 2 1 7 TRUE
8: 4 3 5 6 8 TRUE
9: 2 1 1 7 9 NA
10: 6 8 4 8 10 TRUE
11: 1 7 5 3 11 NA
12: 8 4 6 7 12 TRUE
13: 9 5 3 9 13 NA
14: 10 3 8 2 14 NA
[1] 4
from to distance weight rn drop
1: 4 5 3 9 4 TRUE
[1] 5
from to distance weight rn drop
1: 5 6 4 8 5 TRUE
[1] 6
from to distance weight rn drop
1: 6 2 5 2 6 NA
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 FALSE
4: 4 5 3 9 4 TRUE
5: 5 6 4 8 5 TRUE
6: 6 2 5 2 6 FALSE
7: 7 8 2 1 7 TRUE
8: 4 3 5 6 8 TRUE
9: 2 1 1 7 9 NA
10: 6 8 4 8 10 TRUE
11: 1 7 5 3 11 NA
12: 8 4 6 7 12 TRUE
13: 9 5 3 9 13 NA
14: 10 3 8 2 14 NA
[1] 7
from to distance weight rn drop
1: 7 8 2 1 7 TRUE
[1] 8
from to distance weight rn drop
1: 4 3 5 6 8 TRUE
[1] 9
from to distance weight rn drop
1: 2 1 1 7 9 NA
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 FALSE
4: 4 5 3 9 4 TRUE
5: 5 6 4 8 5 TRUE
6: 6 2 5 2 6 FALSE
7: 7 8 2 1 7 TRUE
8: 4 3 5 6 8 TRUE
9: 2 1 1 7 9 FALSE
10: 6 8 4 8 10 TRUE
11: 1 7 5 3 11 NA
12: 8 4 6 7 12 TRUE
13: 9 5 3 9 13 NA
14: 10 3 8 2 14 NA
[1] 10
from to distance weight rn drop
1: 6 8 4 8 10 TRUE
[1] 11
from to distance weight rn drop
1: 1 7 5 3 11 NA
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 FALSE
4: 4 5 3 9 4 TRUE
5: 5 6 4 8 5 TRUE
6: 6 2 5 2 6 FALSE
7: 7 8 2 1 7 TRUE
8: 4 3 5 6 8 TRUE
9: 2 1 1 7 9 FALSE
10: 6 8 4 8 10 TRUE
11: 1 7 5 3 11 FALSE
12: 8 4 6 7 12 TRUE
13: 9 5 3 9 13 NA
14: 10 3 8 2 14 NA
[1] 12
from to distance weight rn drop
1: 8 4 6 7 12 TRUE
[1] 13
from to distance weight rn drop
1: 9 5 3 9 13 NA
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 FALSE
4: 4 5 3 9 4 TRUE
5: 5 6 4 8 5 TRUE
6: 6 2 5 2 6 FALSE
7: 7 8 2 1 7 TRUE
8: 4 3 5 6 8 TRUE
9: 2 1 1 7 9 FALSE
10: 6 8 4 8 10 TRUE
11: 1 7 5 3 11 FALSE
12: 8 4 6 7 12 TRUE
13: 9 5 3 9 13 FALSE
14: 10 3 8 2 14 NA
[1] 14
from to distance weight rn drop
1: 10 3 8 2 14 NA
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 FALSE
4: 4 5 3 9 4 TRUE
5: 5 6 4 8 5 TRUE
6: 6 2 5 2 6 FALSE
7: 7 8 2 1 7 TRUE
8: 4 3 5 6 8 TRUE
9: 2 1 1 7 9 FALSE
10: 6 8 4 8 10 TRUE
11: 1 7 5 3 11 FALSE
12: 8 4 6 7 12 TRUE
13: 9 5 3 9 13 FALSE
14: 10 3 8 2 14 FALSE
The current result differs from the expected result:
result <- DT[!(drop)]
result
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 FALSE
4: 6 2 5 2 6 FALSE
5: 2 1 1 7 9 FALSE
6: 1 7 5 3 11 FALSE
7: 9 5 3 9 13 FALSE
8: 10 3 8 2 14 FALSE
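Rows 1 to 3, 11, 13, and 14 agree with the expected result, but rows 6 and 9 are kept here because the values 2 and 1 are unique in the to column.
Obviously, this approach needs improvement because 1 and 2 have already appeared in the from column of rows 1 and 2, respectively. Rows 6 and 9 therefore need to be removed as duplicates.
To remove them, result is reshaped from wide to long format and ordered by row number: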
ldt <- melt(result, "rn", c("to", "from"))[order(rn)]
ldt
rn variable value
1: 1 to 8
2: 1 from 1
3: 2 to 6
4: 2 from 2
5: 3 to 4
6: 3 from 3
7: 6 to 2
8: 6 from 6
9: 9 to 1
10: 9 from 2
11: 11 to 7
12: 11 from 1
13: 13 to 5
14: 13 from 9
15: 14 to 3
16: 14 from 10
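Now, we need to identify the row numbers where a value from single_to in the to column duplicates a value that has already appeared earlier: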
ldt[duplicated(value) & variable == "to" & value %in% single_to]
rn variable value
1: 6 to 2
2: 9 to 1
These rows are removed from result by an anti-join:
result2 <-
result[!ldt[duplicated(value) & variable == "to" & value %in% single_to], on = .(rn)]
result2
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 FALSE
4: 1 7 5 3 11 FALSE
5: 9 5 3 9 13 FALSE
6: 10 3 8 2 14 FALSE
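Now, this almost matches the expected result. Only rule 4 is left to be implemented. For this, the same approach as before is used: reshape to long format, identify the row numbers, and join. This time, however, an update join is used: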
ldt2 <- melt(unique(result2, by = "from"), "rn", c("to", "from"))[order(rn)]
result2[ldt2[duplicated(value)], on = .(rn), c("to", "distance") := NA_integer_]
result2
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 FALSE
4: 1 7 5 3 11 FALSE
5: 9 5 3 9 13 FALSE
6: 10 NA NA 2 14 FALSE
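Discussion
This solution does not claim to be efficient in terms of coding or execution speed. It is merely intended to reproduce the expected result for the given sample dataset.
It needs more testing. For instance, the OP has requested in rule 3:
"I want to repeat this process until all unique values from the from and to columns combined appear at least once in either column."
The code implementing rules 1 and 2 never checks at the end whether this condition is actually met.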
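A minimal sketch of such a final check, assuming rule 3 means that every value occurring in from or to of the original data must still occur in from or to of the reduced data. The helper name missing_values() and the objects original and reduced are hypothetical; the values are copied from the printed tables in this answer.

```r
library(data.table)

# Hypothetical helper: returns the values that occur in `from` or `to` of the
# original data but in neither column of the reduced data.
missing_values <- function(original, reduced) {
  setdiff(union(original$from, original$to), c(reduced$from, reduced$to))
}

# `from` and `to` of the sample data, copied from the printed output
original <- data.table(
  from = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 4L, 2L, 6L, 1L, 8L, 9L, 10L),
  to   = c(8L, 6L, 4L, 5L, 6L, 2L, 8L, 3L, 1L, 8L, 7L, 4L, 5L, 3L)
)

# `from` and `to` of the final result (after rule 4 has set one `to` to NA)
reduced <- data.table(
  from = c(1L, 2L, 3L, 1L, 9L, 10L),
  to   = c(8L, 6L, 4L, 7L, 5L, NA)
)

# An empty vector means the condition holds under the stated assumption
missing_values(original, reduced)
```

For the sample data no value goes missing, but this is not guaranteed for other inputs.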
Besides, I believe there may be other approaches that achieve the same goal.
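For convenience, the steps above can be collected into one function without the diagnostic print() calls. This is only a sketch: the function name reduce_edges() and the loop variable k are hypothetical, the sample data is reconstructed from the printed output, and the logic has only been checked against this one dataset.

```r
library(data.table)

# Sample data, reconstructed from the printed output in this answer
DT <- data.table(
  from     = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 4L, 2L, 6L, 1L, 8L, 9L, 10L),
  to       = c(8L, 6L, 4L, 5L, 6L, 2L, 8L, 3L, 1L, 8L, 7L, 4L, 5L, 3L),
  distance = c(1L, 1L, 1L, 3L, 4L, 5L, 2L, 5L, 1L, 4L, 5L, 6L, 3L, 8L),
  weight   = c(10L, 9L, 5L, 9L, 8L, 2L, 1L, 6L, 7L, 8L, 3L, 7L, 9L, 2L)
)

# Hypothetical wrapper around the steps shown in this answer
reduce_edges <- function(dt) {
  # work on a copy so that the caller's data.table is not modified by reference
  dt <- copy(dt)[, rn := .I][, drop := NA]
  # values which appear only once in the `to` column
  single_to <- dt[, .N, by = to][N == 1L, to]
  for (k in seq_len(nrow(dt))) {
    if (isTRUE(dt$drop[k])) next  # row has already been eliminated
    act_to <- dt$to[k]
    # Rule 1: eliminate subsequent rows with repeated value in `to`
    dt[rn > k & to == act_to, drop := TRUE]
    # Rule 1 on `from`, but Rule 2: keep rows whose `to` value is unique
    dt[rn > k & from == act_to & !(to %in% single_to), drop := TRUE]
    dt[k, drop := FALSE]
  }
  res <- dt[!(drop)]
  # anti-join: remove rows whose unique `to` value has already appeared earlier
  ldt <- melt(res, "rn", c("to", "from"))[order(rn)]
  res <- res[!ldt[duplicated(value) & variable == "to" & value %in% single_to],
             on = .(rn)]
  # Rule 4 via update join: blank out `to` and `distance` for remaining duplicates
  ldt2 <- melt(unique(res, by = "from"), "rn", c("to", "from"))[order(rn)]
  res[ldt2[duplicated(value)], on = .(rn), c("to", "distance") := NA_integer_]
  res[, c("rn", "drop") := NULL][]
}

out <- reduce_edges(DT)
out
```

For the sample data, out should contain the same six rows as the final result2 shown above, just without the helper columns rn and drop.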