【Question title】: Remove duplicates based on set of conditions in two columns (time data)
【Posted】: 2018-05-23 16:44:12
【Question】:

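Below is a sample dataset containing attendance timesheet records. For each person I want to keep a single record with the earliest punch_in and the latest punch_out (i.e. id 1, name sam, punch_in 8/6/2015 8:00:00 and punch_out 8/6/2015 16:05:00). How can I remove the other duplicate entries in R?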

id<-c(1,1,1,1,2,3,4)
name<-c("sam","sam","sam","sam","jack","john","jude")
sex<-c("M","M","M","M","M","M","F")
punch_in<-c("8/6/2015 8:00:00","8/6/2015 8:05:00","8/6/2015 8:00:00","8/6/2015 8:05:00","8/6/2015 8:06:00","8/6/2015 7:59:00","8/6/2015 8:00:00")
punch_out<-c("8/6/2015 16:00:00","8/6/2015 16:00:00","8/6/2015 16:05:00","8/6/2015 16:05:00","8/6/2015 16:00:00","8/6/2015 16:05:00","8/6/2015 16:05:00")
data<-as.data.frame(cbind(id,name,sex,punch_in,punch_out))

【Comments】:

  • Side note: you can just write data.frame(id,name,sex,punch_in,punch_out) instead of as.data.frame(cbind(id,name,sex,punch_in,punch_out)).

Tags: r datetime duplicates


【Solution 1】:
id<-c(1,1,1,1,2,3,4)
name<-c("sam","sam","sam","sam","jack","john","jude")
sex<-c("M","M","M","M","M","M","F")
punch_in<-c("8/6/2015 8:00:00","8/6/2015 8:05:00","8/6/2015 8:00:00","8/6/2015 8:05:00","8/6/2015 8:06:00","8/6/2015 7:59:00","8/6/2015 8:00:00")
punch_out<-c("8/6/2015 16:00:00","8/6/2015 16:00:00","8/6/2015 16:05:00","8/6/2015 16:05:00","8/6/2015 16:00:00","8/6/2015 16:05:00","8/6/2015 16:05:00")
data<-as.data.frame(cbind(id,name,sex,punch_in,punch_out))

library(dplyr)

data %>%
  group_by(id, name, sex) %>%                 # for each combination of id, name, sex
  summarise(punch_in = first(punch_in),       # keep the first punch in
            punch_out = last(punch_out)) %>%  # keep the last punch out
  ungroup()                                   # forget the grouping

# # A tibble: 4 x 5
#   id    name  sex   punch_in         punch_out        
#   <fct> <fct> <fct> <fct>            <fct>            
# 1 1     sam   M     8/6/2015 8:00:00 8/6/2015 16:05:00
# 2 2     jack  M     8/6/2015 8:06:00 8/6/2015 16:00:00
# 3 3     john  M     8/6/2015 7:59:00 8/6/2015 16:05:00
# 4 4     jude  F     8/6/2015 8:00:00 8/6/2015 16:05:00

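This assumes the rows are sorted by date within each id, so that the first row holds the earliest punch_in and the last row holds the latest punch_out. If the rows are not in that order, first() and last() will return the wrong values.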
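If the row order cannot be relied on, one possible variation is to parse the timestamps and take min() and max() instead of first() and last(). The sketch below is not part of the original answer. It assumes the strings are month/day/year and uses UTC as the time zone; both are guesses about the data, so adjust `fmt` and `tz` as needed.

```r
library(dplyr)

# Same sample data as in the question, built with data.frame() directly
data <- data.frame(
  id = c(1, 1, 1, 1, 2, 3, 4),
  name = c("sam", "sam", "sam", "sam", "jack", "john", "jude"),
  sex = c("M", "M", "M", "M", "M", "M", "F"),
  punch_in = c("8/6/2015 8:00:00", "8/6/2015 8:05:00", "8/6/2015 8:00:00",
               "8/6/2015 8:05:00", "8/6/2015 8:06:00", "8/6/2015 7:59:00",
               "8/6/2015 8:00:00"),
  punch_out = c("8/6/2015 16:00:00", "8/6/2015 16:00:00", "8/6/2015 16:05:00",
                "8/6/2015 16:05:00", "8/6/2015 16:00:00", "8/6/2015 16:05:00",
                "8/6/2015 16:05:00"),
  stringsAsFactors = FALSE
)

# Assumption: month/day/year order and UTC; change these if the data differs
fmt <- "%m/%d/%Y %H:%M:%S"

result <- data %>%
  mutate(punch_in  = as.POSIXct(punch_in,  format = fmt, tz = "UTC"),
         punch_out = as.POSIXct(punch_out, format = fmt, tz = "UTC")) %>%
  group_by(id, name, sex) %>%                # one group per person
  summarise(punch_in  = min(punch_in),       # earliest punch in, any row order
            punch_out = max(punch_out)) %>%  # latest punch out, any row order
  ungroup()

result
```

Comparing parsed date-times rather than strings also avoids text ordering, where "8/6/2015 16:00:00" sorts before "8/6/2015 8:00:00".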

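【Comments】:

  • This works if I group the three variables id, name and sex together, but I want to keep the other variables in the dataset without grouping by them. The original dataset has other variables that also need to be retained.
  • That depends on how many unique values the other variables have. You would have to update your example and include the other variables.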
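Since the extra variables are not shown in the question, the following is only a sketch of one way to keep them. The column `dept` and the data frame `data2` are hypothetical. It overwrites punch_in and punch_out with the group minimum and maximum using mutate(), then keeps one row per id, so any other column takes its value from the first row of that group. This is only safe if those columns are constant within each id.

```r
library(dplyr)

# Hypothetical data: `dept` stands in for the extra, ungrouped variables
data2 <- data.frame(
  id = c(1, 1, 2),
  dept = c("ops", "ops", "lab"),
  punch_in = as.POSIXct(c("2015-08-06 08:05:00", "2015-08-06 08:00:00",
                          "2015-08-06 08:06:00"), tz = "UTC"),
  punch_out = as.POSIXct(c("2015-08-06 16:05:00", "2015-08-06 16:00:00",
                           "2015-08-06 16:00:00"), tz = "UTC")
)

result2 <- data2 %>%
  group_by(id) %>%
  mutate(punch_in  = min(punch_in),       # earliest punch in within the id
         punch_out = max(punch_out)) %>%  # latest punch out within the id
  slice(1) %>%                            # keep one row per id, all columns
  ungroup()

result2
```

If the extra columns vary within an id, a rule for picking which value to keep has to be decided first, which is the point made in the comment above.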