Keep first row by multiple columns in an R data.table

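【Question title】: Keep first row by multiple columns in an R data.table
【Posted】: 2014-09-14 04:19:27
【Question】:

I'd like to keep only the first row of a data.table for each group, where the groups are defined by multiple columns.

This is straightforward with a single column, e.g.: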

library(data.table)

(dt <- data.table(x = c(1, 1, 1, 2),
                  y = c(1, 1, 2, 2),
                  z = c(1, 2, 1, 2)))
#    x y z
# 1: 1 1 1
# 2: 1 1 2
# 3: 1 2 1
# 4: 2 2 2
dt[!duplicated(x)] # Removes rows 2-3
#    x y z
# 1: 1 1 1
# 2: 2 2 2

But none of the following work when trying to remove duplicates based on two columns, i.e. in this case removing only row 2:

dt[!duplicated(x, y)] # Keeps the entire original data set
#    x y z
# 1: 1 1 1
# 2: 1 1 2
# 3: 1 2 1
# 4: 2 2 2
dt[!duplicated(list(x, y))] # Same as above
dt[!duplicated(c("x", "y"))] # Same as above
dt[!duplicated(list("x", "y"))] # Same as above
dt[!duplicated(c(x, y))] # Only removes duplicates from first column
#    x y z
# 1: 1 1 1
# 2: 2 2 2
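
For what it is worth, base R alone can show why these attempts fail: `duplicated()` inspects a single object, and its second positional argument is `incomparables`, not a second column. A minimal sketch on plain vectors (no data.table needed); the names `dup_xy`, `dup_list`, `dup_names` and `dup_concat` are only illustrative:

```r
x <- c(1, 1, 1, 2)
y <- c(1, 1, 2, 2)

# y is matched to `incomparables`: every value of x appears in y,
# so no element of x can ever be flagged as a duplicate
dup_xy <- duplicated(x, y)

# A list of two different vectors has length 2, and neither element repeats
dup_list <- duplicated(list(x, y))

# A character vector of two distinct names: again length 2, no duplicates
dup_names <- duplicated(c("x", "y"))

# c(x, y) is one vector of length 8, not a set of 4 (x, y) pairs
dup_concat <- duplicated(c(x, y))
```

None of these results describes the four (x, y) pairs row by row, which is why the subsets above do not drop row 2 alone.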

The only exception is this, which works only in some cases:

dt[!duplicated(paste0(x, y))]
#    x y z
# 1: 1 1 1
# 2: 1 2 1
# 3: 2 2 2
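
The `paste0()` approach is fragile because pasting values with no separator can make two different pairs look identical. A small base R sketch with made-up values (they are not taken from the data above) that shows such a collision:

```r
# Two different (x, y) pairs: (1, 11) and (11, 1)
x <- c(1, 11)
y <- c(11, 1)

# Without a separator both pairs collapse to the same string
keys <- paste0(x, y)

# The second pair is wrongly flagged as a duplicate of the first
collides <- duplicated(keys)

# Comparing the columns as a pair keeps the two rows distinct
by_cols <- duplicated(data.frame(x, y))
```

Adding a separator that cannot occur in the data would reduce the risk, but comparing the columns directly avoids the problem altogether.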

【Question comments】:

Tags: r duplicates data.table

【Solution 1】:

data.table provides S3 methods for unique, duplicated, and anyDuplicated.

unique(dt, by = c('x','y'))

will give you what you want.
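
As a self-contained sketch of that call on the data from the question, alongside two equivalent formulations that I would expect to return the same rows (the result names `first_rows`, `same_rows` and `grouped` are just for illustration):

```r
library(data.table)

dt <- data.table(x = c(1, 1, 1, 2),
                 y = c(1, 1, 2, 2),
                 z = c(1, 2, 1, 2))

# First row for each (x, y) combination, keeping all columns
first_rows <- unique(dt, by = c("x", "y"))

# Same idea through the duplicated() method, using the same `by` argument
same_rows <- dt[!duplicated(dt, by = c("x", "y"))]

# Grouped alternative: take the first row of .SD within each (x, y) group
grouped <- dt[, .SD[1], by = .(x, y)]
```

All three drop only row 2 of the original table here.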

【Comments】:

【Solution 2】:

data.table's duplicated method works by key. From ?duplicated.data.table:

 ‘duplicated’ returns a logical vector indicating which rows of a
 ‘data.table’ have duplicate rows (by key).

setkey(dt, x, y)
dt[!duplicated(dt)]
##    x y z
## 1: 1 1 1
## 2: 1 2 1
## 3: 2 2 2
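
A variant that does not rely on the default: as far as I know, data.table releases from 1.9.8 onward changed the default `by` of `duplicated()` and `unique()` from the key to all columns, so on a recent version the call above may return all four rows. Passing the key columns explicitly should behave the same either way. A sketch under that assumption (the name `res` is illustrative):

```r
library(data.table)

dt <- data.table(x = c(1, 1, 1, 2),
                 y = c(1, 1, 2, 2),
                 z = c(1, 2, 1, 2))

# Sort by, and mark, x and y as the key columns
setkey(dt, x, y)

# Pass the key columns explicitly instead of relying on the default `by`
res <- dt[!duplicated(dt, by = key(dt))]
```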

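【Comments】:

It is by key by default; you can also specify the by variables.
@mnel Yes, I upvoted your answer. I just thought this might explain why the behavior makes sense, even though it may look odd.
dt[!duplicated(dt[, c("x", "y"), with = FALSE])] # seems to work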