Remove duplicates based on a set of conditions in two columns (time data): answers

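【Question title】: Remove duplicates based on set of conditions in two columns (time data)
【Posted】: 2018-05-23 16:44:12
【Question】:

Below is a sample of a dataset containing attendance time records. I want to keep, for each person, the record with the earliest punch_in and the latest punch_out (i.e. id 1, name sam, punch_in 8/6/2015 8:00:00 and punch_out 8/6/2015 16:05:00). How can I remove the other duplicate entries in R?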

id<-c(1,1,1,1,2,3,4)
name<-c("sam","sam","sam","sam","jack","john","jude")
sex<-c("M","M","M","M","M","M","F")
punch_in<-c("8/6/2015 8:00:00","8/6/2015 8:05:00","8/6/2015 8:00:00","8/6/2015 8:05:00","8/6/2015 8:06:00","8/6/2015 7:59:00","8/6/2015 8:00:00")
punch_out<-c("8/6/2015 16:00:00","8/6/2015 16:00:00","8/6/2015 16:05:00","8/6/2015 16:05:00","8/6/2015 16:00:00","8/6/2015 16:05:00","8/6/2015 16:05:00")
data<-as.data.frame(cbind(id,name,sex,punch_in,punch_out))

【Comments on the question】:

As a side note, you can simply write data.frame(id,name,sex,punch_in,punch_out) instead of as.data.frame(cbind(id,name,sex,punch_in,punch_out)).
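
A small sketch of why that side note matters: cbind() on mixed vectors first builds a character matrix, so a numeric column such as id loses its type, while data.frame() keeps each vector as it is. The shortened vectors here are illustrative only.

```r
id   <- c(1, 1, 2)
name <- c("sam", "sam", "jack")

# cbind() builds a character matrix first, so the numeric id is coerced to text
via_cbind <- as.data.frame(cbind(id, name))

# data.frame() keeps each vector's own type
direct <- data.frame(id, name)

is.numeric(via_cbind$id)  # FALSE: id is character (a factor in R before 4.0)
is.numeric(direct$id)     # TRUE
```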

Tags: r datetime duplicates

【Solution 1】:

id<-c(1,1,1,1,2,3,4)
name<-c("sam","sam","sam","sam","jack","john","jude")
sex<-c("M","M","M","M","M","M","F")
punch_in<-c("8/6/2015 8:00:00","8/6/2015 8:05:00","8/6/2015 8:00:00","8/6/2015 8:05:00","8/6/2015 8:06:00","8/6/2015 7:59:00","8/6/2015 8:00:00")
punch_out<-c("8/6/2015 16:00:00","8/6/2015 16:00:00","8/6/2015 16:05:00","8/6/2015 16:05:00","8/6/2015 16:00:00","8/6/2015 16:05:00","8/6/2015 16:05:00")
data<-as.data.frame(cbind(id,name,sex,punch_in,punch_out))

library(dplyr)

data %>%
  group_by(id, name, sex) %>%                 # for each combination of id, name, sex
  summarise(punch_in = first(punch_in),       # keep the first punch in
            punch_out = last(punch_out)) %>%  # keep the last punch out
  ungroup()                                   # forget the grouping

# # A tibble: 4 x 5
#   id    name  sex   punch_in         punch_out        
#   <fct> <fct> <fct> <fct>            <fct>            
# 1 1     sam   M     8/6/2015 8:00:00 8/6/2015 16:05:00
# 2 2     jack  M     8/6/2015 8:06:00 8/6/2015 16:00:00
# 3 3     john  M     8/6/2015 7:59:00 8/6/2015 16:05:00
# 4 4     jude  F     8/6/2015 8:00:00 8/6/2015 16:05:00

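This assumes the rows are sorted by date, so that for each id the first row is the earliest and the last row is the most recent. Note also that punch_in and punch_out are still text (factors in the output above), so first() and last() pick values by row position, not by comparing times.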
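
If the rows cannot be assumed to be sorted, one possible variant is to parse the timestamps and take min() and max() per group. This is only a sketch: it assumes the strings are in day/month/year order (the sample is ambiguous, so switch to "%m/%d/%Y" if the data is month-first) and uses UTC as a placeholder time zone. The rows are reversed on purpose to show that the order no longer matters.

```r
library(dplyr)

data <- data.frame(
  id        = c(1, 1, 1, 1, 2, 3, 4),
  name      = c("sam", "sam", "sam", "sam", "jack", "john", "jude"),
  sex       = c("M", "M", "M", "M", "M", "M", "F"),
  punch_in  = c("8/6/2015 8:00:00", "8/6/2015 8:05:00", "8/6/2015 8:00:00",
                "8/6/2015 8:05:00", "8/6/2015 8:06:00", "8/6/2015 7:59:00",
                "8/6/2015 8:00:00"),
  punch_out = c("8/6/2015 16:00:00", "8/6/2015 16:00:00", "8/6/2015 16:05:00",
                "8/6/2015 16:05:00", "8/6/2015 16:00:00", "8/6/2015 16:05:00",
                "8/6/2015 16:05:00"),
  stringsAsFactors = FALSE
)

# Assumption: day/month/year; use "%m/%d/%Y %H:%M:%S" for month-first data
fmt <- "%d/%m/%Y %H:%M:%S"

# Reverse the rows so the result cannot depend on the original ordering
shuffled <- data[rev(seq_len(nrow(data))), ]

result <- shuffled %>%
  mutate(punch_in  = as.POSIXct(as.character(punch_in),  format = fmt, tz = "UTC"),
         punch_out = as.POSIXct(as.character(punch_out), format = fmt, tz = "UTC")) %>%
  group_by(id, name, sex) %>%
  summarise(punch_in  = min(punch_in),     # earliest punch in
            punch_out = max(punch_out)) %>% # latest punch out
  ungroup()

result
```

Because the comparison is done on real date-times, this also avoids the pitfall of comparing timestamps as text, where "8/6/2015 16:00:00" would sort before "8/6/2015 8:00:00".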

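【Discussion】:

This works if I group by the three variables id, name and sex, but I want to keep the other variables in the dataset without grouping by them. The original dataset has other variables that also need to be retained.
That depends on how many unique values the other variables have. You would need to update your example to include those other variables.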
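
As a rough illustration of the simplest case in that exchange: if each extra variable holds a single value per person, it can be carried through summarise() without being added to the grouping. The column dept below is hypothetical and is not part of the original dataset; if such a column varied within a person, you would first have to decide which row's value to keep.

```r
library(dplyr)

# Hypothetical extra column `dept`, assumed constant for each person
data <- data.frame(
  id        = c(1, 1, 2),
  name      = c("sam", "sam", "jack"),
  dept      = c("ops", "ops", "hr"),
  punch_in  = c("8/6/2015 8:00:00", "8/6/2015 8:05:00", "8/6/2015 8:06:00"),
  punch_out = c("8/6/2015 16:00:00", "8/6/2015 16:05:00", "8/6/2015 16:00:00"),
  stringsAsFactors = FALSE
)

result <- data %>%
  group_by(id, name) %>%
  summarise(dept      = first(dept),      # carried along, not grouped on
            punch_in  = first(punch_in),  # relies on rows sorted by time
            punch_out = last(punch_out)) %>%
  ungroup()

result
```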