Remove duplicates and manipulate a data frame based on conditions from 2 columns: answers

【Question title】: Remove duplicates and manipulate dataframe based on conditions from 2 columns
【Posted】: 2020-05-16 20:06:13
【Question description】:

I have a data frame like the following:

+------+-----+----------+--------+
| from | to  | distance | weight |
+------+-----+----------+--------+
|    1 |   8 |        1 |     10 |
|    2 |   6 |        1 |      9 |
|    3 |   4 |        1 |      5 |
|    4 |   5 |        3 |      9 |
|    5 |   6 |        4 |      8 |
|    6 |   2 |        5 |      2 |
|    7 |   8 |        2 |      1 |
|    4 |   3 |        5 |      6 |
|    2 |   1 |        1 |      7 |
|    6 |   8 |        4 |      8 |
|    1 |   7 |        5 |      3 |
|    8 |   4 |        6 |      7 |
|    9 |   5 |        3 |      9 |
|   10 |   3 |        8 |      2 |
+------+-----+----------+--------+
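For anyone who wants to reproduce this, the table above could be rebuilt roughly as follows. This is only a sketch: the object name DT is chosen to match the code in the answer below, and storing every column as integer is an assumption.

```r
# Sample data from the table above.
# Assumptions: the name `DT` and integer column types.
DT <- data.frame(
  from     = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 4L, 2L, 6L, 1L, 8L, 9L, 10L),
  to       = c(8L, 6L, 4L, 5L, 6L, 2L, 8L, 3L, 1L, 8L, 7L, 4L, 5L, 3L),
  distance = c(1L, 1L, 1L, 3L, 4L, 5L, 2L, 5L, 1L, 4L, 5L, 6L, 3L, 8L),
  weight   = c(10L, 9L, 5L, 9L, 8L, 2L, 1L, 6L, 7L, 8L, 3L, 7L, 9L, 2L)
)
DT
```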

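I want to filter the data based on the following conditions, applied sequentially:

1. If a number appears in the to column, it should not be repeated in either the to or the from column.
2. A number in from may be repeated if its corresponding to is a new value that does not occur in any other cell of the to column.
3. I want to repeat this process until every unique value from from and to combined appears at least once in either column.
4. If the number in the from column is a new number and its corresponding to value is already present in either column, replace the to and distance values with blanks.

So the resulting table should look like this: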

+------+-----+----------+--------+
| from | to  | Distance | weight |
+------+-----+----------+--------+
|    1 |   8 |        1 |     10 |
|    2 |   6 |        1 |      9 |
|    3 |   4 |        1 |      5 |
|    1 |   7 |        5 |      3 |
|    9 |   5 |        3 |      9 |
|   10 |     |          |      2 |
+------+-----+----------+--------+

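【Comments】:

Can you elaborate on your conditions? It is not clear to me how your logic works here. For instance, in your first condition you say, "If a number appears in the to column, it should not be repeated in either the to or the from column." There are two rows containing 5 in to. You have row 13 in your final result. Why not row 4? I have no idea about the second condition. Can you explain your logic in a different way so that people can see what you are trying to do?
@jazzurro, that is because the number 4 in the from column of row 4 has already appeared in the to column, so to prevent repetition I chose row 13.
@jazzurro As for the second condition, consider rows 1 and 11: although the from column already has a cell with the value 1, it is repeated anyway because the corresponding to value (7) is new to the to column.

Tags: r while-loop dplyr grouping cluster-analysis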

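【Solution 1】:

This is an attempt to reproduce the expected result according to the OP's rules.

I am still trying to find a solution that uses unique() and duplicated() on the data in wide format as well as reshaped to long format.

Meanwhile, here is a solution using a for loop that reproduces the expected result for the given sample dataset: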

library(data.table)
# append row numbers
setDT(DT)[, rn := .I]

# which values appear only once in the `to` column?
single_to <- DT[, .N, by = to][N == 1L, to]
single_to

[1] 2 1 7
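As a cross-check, the same set of values can be found in base R by scanning for duplicates from both directions. This is a minimal sketch; the vector below simply restates the to column of the sample data, and the name single_to_base is made up for illustration.

```r
# The `to` column of the question's sample data
to <- c(8L, 6L, 4L, 5L, 6L, 2L, 8L, 3L, 1L, 8L, 7L, 4L, 5L, 3L)

# A value occurs exactly once if it is flagged as a duplicate neither
# when scanning forwards nor when scanning backwards
single_to_base <- to[!(duplicated(to) | duplicated(to, fromLast = TRUE))]
single_to_base
```

The values come back in order of appearance, which matches the order of the grouped count shown above.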

DT[, drop := NA]
for (i in seq_len(nrow(DT))) {
  print(i)
  print(DT[i])
  if (isTRUE(DT$drop[i])) next # row already has been eliminated
  act_to <- DT$to[i]
  # Rule 1: eliminate subsequent rows with repeated value in `to` column  
  DT[rn > i & (to == act_to), drop := TRUE]
  # Rule 1: eliminate subsequent rows with repeated value in `from` column 
  # Rule 2: but keep rows where value is unique in the `to` column  
  DT[rn > i & (from == act_to) & !(to %in% single_to), drop := TRUE]
  DT[i, drop := FALSE]
  print(DT[])
}

[1] 1
   from to distance weight rn drop
1:    1  8        1     10  1   NA
    from to distance weight rn  drop
 1:    1  8        1     10  1 FALSE
 2:    2  6        1      9  2    NA
 3:    3  4        1      5  3    NA
 4:    4  5        3      9  4    NA
 5:    5  6        4      8  5    NA
 6:    6  2        5      2  6    NA
 7:    7  8        2      1  7  TRUE
 8:    4  3        5      6  8    NA
 9:    2  1        1      7  9    NA
10:    6  8        4      8 10  TRUE
11:    1  7        5      3 11    NA
12:    8  4        6      7 12  TRUE
13:    9  5        3      9 13    NA
14:   10  3        8      2 14    NA
[1] 2
   from to distance weight rn drop
1:    2  6        1      9  2   NA
    from to distance weight rn  drop
 1:    1  8        1     10  1 FALSE
 2:    2  6        1      9  2 FALSE
 3:    3  4        1      5  3    NA
 4:    4  5        3      9  4    NA
 5:    5  6        4      8  5  TRUE
 6:    6  2        5      2  6    NA
 7:    7  8        2      1  7  TRUE
 8:    4  3        5      6  8    NA
 9:    2  1        1      7  9    NA
10:    6  8        4      8 10  TRUE
11:    1  7        5      3 11    NA
12:    8  4        6      7 12  TRUE
13:    9  5        3      9 13    NA
14:   10  3        8      2 14    NA
[1] 3
   from to distance weight rn drop
1:    3  4        1      5  3   NA
    from to distance weight rn  drop
 1:    1  8        1     10  1 FALSE
 2:    2  6        1      9  2 FALSE
 3:    3  4        1      5  3 FALSE
 4:    4  5        3      9  4  TRUE
 5:    5  6        4      8  5  TRUE
 6:    6  2        5      2  6    NA
 7:    7  8        2      1  7  TRUE
 8:    4  3        5      6  8  TRUE
 9:    2  1        1      7  9    NA
10:    6  8        4      8 10  TRUE
11:    1  7        5      3 11    NA
12:    8  4        6      7 12  TRUE
13:    9  5        3      9 13    NA
14:   10  3        8      2 14    NA
[1] 4
   from to distance weight rn drop
1:    4  5        3      9  4 TRUE
[1] 5
   from to distance weight rn drop
1:    5  6        4      8  5 TRUE
[1] 6
   from to distance weight rn drop
1:    6  2        5      2  6   NA
    from to distance weight rn  drop
 1:    1  8        1     10  1 FALSE
 2:    2  6        1      9  2 FALSE
 3:    3  4        1      5  3 FALSE
 4:    4  5        3      9  4  TRUE
 5:    5  6        4      8  5  TRUE
 6:    6  2        5      2  6 FALSE
 7:    7  8        2      1  7  TRUE
 8:    4  3        5      6  8  TRUE
 9:    2  1        1      7  9    NA
10:    6  8        4      8 10  TRUE
11:    1  7        5      3 11    NA
12:    8  4        6      7 12  TRUE
13:    9  5        3      9 13    NA
14:   10  3        8      2 14    NA
[1] 7
   from to distance weight rn drop
1:    7  8        2      1  7 TRUE
[1] 8
   from to distance weight rn drop
1:    4  3        5      6  8 TRUE
[1] 9
   from to distance weight rn drop
1:    2  1        1      7  9   NA
    from to distance weight rn  drop
 1:    1  8        1     10  1 FALSE
 2:    2  6        1      9  2 FALSE
 3:    3  4        1      5  3 FALSE
 4:    4  5        3      9  4  TRUE
 5:    5  6        4      8  5  TRUE
 6:    6  2        5      2  6 FALSE
 7:    7  8        2      1  7  TRUE
 8:    4  3        5      6  8  TRUE
 9:    2  1        1      7  9 FALSE
10:    6  8        4      8 10  TRUE
11:    1  7        5      3 11    NA
12:    8  4        6      7 12  TRUE
13:    9  5        3      9 13    NA
14:   10  3        8      2 14    NA
[1] 10
   from to distance weight rn drop
1:    6  8        4      8 10 TRUE
[1] 11
   from to distance weight rn drop
1:    1  7        5      3 11   NA
    from to distance weight rn  drop
 1:    1  8        1     10  1 FALSE
 2:    2  6        1      9  2 FALSE
 3:    3  4        1      5  3 FALSE
 4:    4  5        3      9  4  TRUE
 5:    5  6        4      8  5  TRUE
 6:    6  2        5      2  6 FALSE
 7:    7  8        2      1  7  TRUE
 8:    4  3        5      6  8  TRUE
 9:    2  1        1      7  9 FALSE
10:    6  8        4      8 10  TRUE
11:    1  7        5      3 11 FALSE
12:    8  4        6      7 12  TRUE
13:    9  5        3      9 13    NA
14:   10  3        8      2 14    NA
[1] 12
   from to distance weight rn drop
1:    8  4        6      7 12 TRUE
[1] 13
   from to distance weight rn drop
1:    9  5        3      9 13   NA
    from to distance weight rn  drop
 1:    1  8        1     10  1 FALSE
 2:    2  6        1      9  2 FALSE
 3:    3  4        1      5  3 FALSE
 4:    4  5        3      9  4  TRUE
 5:    5  6        4      8  5  TRUE
 6:    6  2        5      2  6 FALSE
 7:    7  8        2      1  7  TRUE
 8:    4  3        5      6  8  TRUE
 9:    2  1        1      7  9 FALSE
10:    6  8        4      8 10  TRUE
11:    1  7        5      3 11 FALSE
12:    8  4        6      7 12  TRUE
13:    9  5        3      9 13 FALSE
14:   10  3        8      2 14    NA
[1] 14
   from to distance weight rn drop
1:   10  3        8      2 14   NA
    from to distance weight rn  drop
 1:    1  8        1     10  1 FALSE
 2:    2  6        1      9  2 FALSE
 3:    3  4        1      5  3 FALSE
 4:    4  5        3      9  4  TRUE
 5:    5  6        4      8  5  TRUE
 6:    6  2        5      2  6 FALSE
 7:    7  8        2      1  7  TRUE
 8:    4  3        5      6  8  TRUE
 9:    2  1        1      7  9 FALSE
10:    6  8        4      8 10  TRUE
11:    1  7        5      3 11 FALSE
12:    8  4        6      7 12  TRUE
13:    9  5        3      9 13 FALSE
14:   10  3        8      2 14 FALSE
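For reuse, the loop above could be wrapped in a function with the diagnostic print() calls removed. This is only a sketch: the function name filter_rules_1_2 is hypothetical, and the sample data is rebuilt here with integer columns as an assumption.

```r
library(data.table)

# Sample data from the question (integer columns are an assumption)
DT <- data.table(
  from     = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 4L, 2L, 6L, 1L, 8L, 9L, 10L),
  to       = c(8L, 6L, 4L, 5L, 6L, 2L, 8L, 3L, 1L, 8L, 7L, 4L, 5L, 3L),
  distance = c(1L, 1L, 1L, 3L, 4L, 5L, 2L, 5L, 1L, 4L, 5L, 6L, 3L, 8L),
  weight   = c(10L, 9L, 5L, 9L, 8L, 2L, 1L, 6L, 7L, 8L, 3L, 7L, 9L, 2L)
)

# Hypothetical helper: applies rules 1 and 2 exactly as the loop above
# does and returns the rows that survive. It works on a copy, so the
# input table is left unchanged.
filter_rules_1_2 <- function(x) {
  dt <- copy(as.data.table(x))
  dt[, rn := .I]      # row numbers
  dt[, drop := NA]    # NA = not yet decided
  single_to <- dt[, .N, by = to][N == 1L, to]
  for (i in seq_len(nrow(dt))) {
    if (isTRUE(dt$drop[i])) next  # row has already been eliminated
    act_to <- dt$to[i]
    # Rule 1: later rows repeating this value in `to`
    dt[rn > i & to == act_to, drop := TRUE]
    # Rules 1 and 2: later rows repeating it in `from`, unless their
    # own `to` value occurs only once in the `to` column
    dt[rn > i & from == act_to & !(to %in% single_to), drop := TRUE]
    dt[i, drop := FALSE]
  }
  dt[!(drop)]
}

kept <- filter_rules_1_2(DT)
kept$rn
```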

The result so far differs from the expected result:

result <- DT[!(drop)]
result

   from to distance weight rn  drop
1:    1  8        1     10  1 FALSE
2:    2  6        1      9  2 FALSE
3:    3  4        1      5  3 FALSE
4:    6  2        5      2  6 FALSE
5:    2  1        1      7  9 FALSE
6:    1  7        5      3 11 FALSE
7:    9  5        3      9 13 FALSE
8:   10  3        8      2 14 FALSE

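Rows 1 to 3, 11, 13, and 14 match the expected result, but rows 6 and 9 are kept here because the values 2 and 1 are unique in the to column.

Apparently, the approach needs to be refined, since 1 and 2 have already appeared in the from column in rows 1 and 2, respectively. Rows 6 and 9 therefore need to be removed as duplicates.

To remove them, result is reshaped from wide to long format and ordered by row number: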

ldt <- melt(result, "rn", c("to", "from"))[order(rn)]
ldt

    rn variable value
 1:  1       to     8
 2:  1     from     1
 3:  2       to     6
 4:  2     from     2
 5:  3       to     4
 6:  3     from     3
 7:  6       to     2
 8:  6     from     6
 9:  9       to     1
10:  9     from     2
11: 11       to     7
12: 11     from     1
13: 13       to     5
14: 13     from     9
15: 14       to     3
16: 14     from    10

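Now we need to identify the row numbers of the rows whose to value is one of the single_to values and duplicates a value seen earlier: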

ldt[duplicated(value) & variable == "to" & value %in% single_to]

   rn variable value
1:  6       to     2
2:  9       to     1

These rows are removed from result by an anti-join:

result2 <-
  result[!ldt[duplicated(value) & variable == "to" & value %in% single_to], on = .(rn)]
result2

   from to distance weight rn  drop
1:    1  8        1     10  1 FALSE
2:    2  6        1      9  2 FALSE
3:    3  4        1      5  3 FALSE
4:    1  7        5      3 11 FALSE
5:    9  5        3      9 13 FALSE
6:   10  3        8      2 14 FALSE
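To make the anti-join idiom X[!Y, on = ...] more concrete, here is a small sketch that compares it with a plain %in% filter. The two toy tables only carry the row numbers shown in the outputs above; their names are made up for illustration.

```r
library(data.table)

# Row numbers kept after the loop (see `result` above)
result_rn <- data.table(rn = c(1L, 2L, 3L, 6L, 9L, 11L, 13L, 14L))
# Row numbers flagged as duplicates of single_to values
dups <- data.table(rn = c(6L, 9L))

# Anti-join: keep the rows of result_rn that have no match in dups
anti <- result_rn[!dups, on = .(rn)]

# The same selection written as a filter without a join
filt <- result_rn[!(rn %in% dups$rn)]

anti$rn
```

Both forms select the same rows here; the join form tends to be more convenient when matching on several columns at once.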

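This now almost matches the expected result. Only rule 4 remains to be implemented. The same approach as before is used for this: reshape to long format, identify the row numbers, and join. This time, however, an update join is used: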

ldt2 <- melt(unique(result2, by = "from"), "rn", c("to", "from"))[order(rn)]
result2[ldt2[duplicated(value)], on = .(rn), c("to", "distance") := NA_integer_]
result2

   from to distance weight rn  drop
1:    1  8        1     10  1 FALSE
2:    2  6        1      9  2 FALSE
3:    3  4        1      5  3 FALSE
4:    1  7        5      3 11 FALSE
5:    9  5        3      9 13 FALSE
6:   10 NA       NA      2 14 FALSE
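The question asks for blanks rather than NA. If that is only needed for display, one possible last step is to convert the two affected columns to character. This is a sketch under assumptions: result2 is restated here from the output above, and the name out is made up.

```r
library(data.table)

# Final table as printed above, without the helper columns
result2 <- data.table(
  from     = c(1L, 2L, 3L, 1L, 9L, 10L),
  to       = c(8L, 6L, 4L, 7L, 5L, NA),
  distance = c(1L, 1L, 1L, 5L, 3L, NA),
  weight   = c(10L, 9L, 5L, 3L, 9L, 2L)
)

# Work on a new table so that result2 keeps its integer columns
out <- result2[, .(from, to, distance, weight)]
cols <- c("to", "distance")
# Replace NA by an empty string; this turns both columns into character
out[, (cols) := lapply(.SD, function(x) fifelse(is.na(x), "", as.character(x))),
    .SDcols = cols]
out
```

Note that the converted columns are no longer numeric, so this step is best kept at the very end, after all computations are done.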

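Discussion

This solution does not claim to be efficient in terms of coding or execution speed. It merely aims to reproduce the expected result for the given sample dataset.

It needs more testing. For instance, in rule 3 the OP requests

I want to repeat this process until every unique value from from and to combined appears at least once in either column

The implementation of rules 1 and 2 does not check at the end whether this condition is fulfilled.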
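Such a check could be added afterwards. The sketch below restates the input columns and the final result from above as plain vectors (the variable names are made up) and tests whether any value from the input is missing from the result.

```r
# `from` and `to` columns of the question's input
from_in <- c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 4L, 2L, 6L, 1L, 8L, 9L, 10L)
to_in   <- c(8L, 6L, 4L, 5L, 6L, 2L, 8L, 3L, 1L, 8L, 7L, 4L, 5L, 3L)

# `from` and `to` columns of the final result
from_out <- c(1L, 2L, 3L, 1L, 9L, 10L)
to_out   <- c(8L, 6L, 4L, 7L, 5L, NA)

# Values present in the input but absent from both result columns
missing_values <- setdiff(union(from_in, to_in), c(from_out, to_out))

# Rule 3 holds for this data if nothing is missing
length(missing_values) == 0L
```

For the sample data nothing is missing, but that says nothing about other inputs; a general solution would have to repeat the filtering until this check passes.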

Also, I believe there may be other ways to achieve the same goal.
