【Question title】: How to Remove Duplicates based on Timestamp? [closed]
【Posted】: 2019-10-09 21:05:49
【Question】:

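I am trying to remove duplicates based on a timestamp: for each ID number, the entry that came in first should be kept and the other entries for that ID should be removed.

I can't figure out how to approach this.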

structure(list(sample = c(101496859, 101496859, 101496189, 101496189, 
101495613, 101495613, 101486260, 101486260, 101463063, 101463063, 
101461751, 101461751, 101458494, 101458494, 101450202, 101450202, 
101446157, 101446157, 101446089, 101446089), time = c("10/4/2019 6:05:28 PM", 
"10/4/2019 4:57:02 PM", "10/4/2019 7:51:52 PM", "10/4/2019 4:24:14 PM", 
"10/4/2019 7:01:44 PM", "10/4/2019 3:53:41 PM", "10/4/2019 1:24:32 PM", 
"10/4/2019 3:04:04 PM", "10/4/2019 11:07:29 AM", "10/4/2019 10:18:38 AM", 
"10/4/2019 2:05:08 PM", "10/4/2019 12:06:21 PM", "10/4/2019 12:50:33 PM", 
"10/4/2019 9:41:40 AM", "10/4/2019 10:29:09 AM", "10/4/2019 11:48:47 AM", 
"10/4/2019 7:55:10 AM", "10/4/2019 12:19:13 PM", "10/4/2019 11:30:35 AM", 
"10/4/2019 8:54:41 AM")), row.names = c(NA, -20L), class = "data.frame")

【Comments】:

  • Please show us what you have tried and where you got stuck. Also, "ID number" is not a column in your data; what is it?

Tags: r dataframe dplyr tidyr


【Solution 1】:

Using dplyr

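library(dplyr)

df1 %>% 
  # parse the 12-hour timestamps so they sort chronologically
  mutate(time = as.POSIXct(time, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())) %>% 
  group_by(sample) %>% 
  arrange(time) %>% 
  # keep the earliest row of each sample; all other columns are retained
  filter(time == first(time)) %>%
  # convert the parsed time back to the original text format
  mutate(time = format(time, "%m/%d/%Y %I:%M:%S %p"))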

#> # A tibble: 10 x 2
#>       sample time                  
#>        <dbl> <chr>                 
#>  1 101446089 10/04/2019 08:54:41 AM
#>  2 101446157 10/04/2019 07:55:10 AM
#>  3 101450202 10/04/2019 10:29:09 AM
#>  4 101458494 10/04/2019 09:41:40 AM
#>  5 101461751 10/04/2019 12:06:21 PM
#>  6 101463063 10/04/2019 10:18:38 AM
#>  7 101486260 10/04/2019 01:24:32 PM
#>  8 101495613 10/04/2019 03:53:41 PM
#>  9 101496189 10/04/2019 04:24:14 PM
#> 10 101496859 10/04/2019 04:57:02 PM

【Comments】:

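  • When I run this on my full data table, it drops all the other variables. How would I run it while keeping all the other variables?
  • @DannyRamirez See my update. I used filter instead of summarise.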
【Solution 2】:

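Here is an option that converts 'time' to a date-time class. After grouping by 'sample', take the index of the minimum of the converted 'time' column with which.min and use slice to return that row.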

library(dplyr)
library(lubridate)
df1 %>% 
    group_by(sample) %>% 
    slice(which.min(mdy_hms(time)))
# A tibble: 10 x 2
# Groups:   sample [10]
#      sample time                 
#       <dbl> <chr>                
# 1 101446089 10/4/2019 8:54:41 AM 
# 2 101446157 10/4/2019 7:55:10 AM 
# 3 101450202 10/4/2019 10:29:09 AM
# 4 101458494 10/4/2019 9:41:40 AM 
# 5 101461751 10/4/2019 12:06:21 PM
# 6 101463063 10/4/2019 10:18:38 AM
# 7 101486260 10/4/2019 1:24:32 PM 
# 8 101495613 10/4/2019 3:53:41 PM 
# 9 101496189 10/4/2019 4:24:14 PM 
#10 101496859 10/4/2019 4:57:02 PM 
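
The same "earliest row per group" idea can also be sketched with slice_min(), assuming dplyr 1.0.0 or later is installed. The data frame below is a hypothetical four-row subset of the question's data, and the note column is invented purely to show that columns other than sample and time are carried through.

```r
library(dplyr)
library(lubridate)

# Hypothetical subset of the question's data; the note column is an
# invented extra column used to show that other columns are kept
df1 <- data.frame(
  sample = c(101496859, 101496859, 101486260, 101486260),
  time = c("10/4/2019 6:05:28 PM", "10/4/2019 4:57:02 PM",
           "10/4/2019 1:24:32 PM", "10/4/2019 3:04:04 PM"),
  note = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

earliest <- df1 %>%
  group_by(sample) %>%
  # order each group by the parsed timestamp and keep the smallest one;
  # with_ties = FALSE returns a single row even if two timestamps are equal
  slice_min(mdy_hms(time), n = 1, with_ties = FALSE) %>%
  ungroup()

print(earliest)
```

Unlike filter(time == first(time)), with_ties = FALSE keeps exactly one row per sample when two rows share the earliest timestamp, which may or may not be what is wanted.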

【Comments】:

    【Solution 3】:

    In base R, we can order the data by sample and time, and then select the first row for each sample

    aggregate(time~sample, df[do.call(order, transform(df, 
           time = as.POSIXct(time, format = "%m/%d/%Y %I:%M:%S %p"))),], head, 1)
    
    #      sample                  time
    #1  101446089  10/4/2019 8:54:41 AM
    #2  101446157  10/4/2019 7:55:10 AM
    #3  101450202 10/4/2019 10:29:09 AM
    #4  101458494  10/4/2019 9:41:40 AM
    #5  101461751 10/4/2019 12:06:21 PM
    #6  101463063 10/4/2019 10:18:38 AM
    #7  101486260  10/4/2019 1:24:32 PM
    #8  101495613  10/4/2019 3:53:41 PM
    #9  101496189  10/4/2019 4:24:14 PM
    #10 101496859  10/4/2019 4:57:02 PM
    

    To keep all the other columns, we can use ave on the ordered data

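    # Sort by sample, then by parsed time, so the earliest row of each sample comes first
    df_ord <- df[order(df$sample, as.POSIXct(df$time, format = "%m/%d/%Y %I:%M:%S %p")), ]
    # ave() returns "TRUE"/"FALSE" strings because time is character; as.logical() converts them back.
    # The flags must index the ordered data, not the original df.
    df_ord[as.logical(with(df_ord, ave(time, sample, FUN = function(x) seq_along(x) == 1))), ]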
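
As an alternative sketch in base R, duplicated() on the sorted data also keeps every column. The data frame is a hypothetical four-row subset of the question's data with an invented note column, and tz = "UTC" is an assumption made only so the example does not depend on the local time zone.

```r
# Hypothetical subset of the question's data; note is an invented extra column
df <- data.frame(
  sample = c(101496859, 101496859, 101486260, 101486260),
  time = c("10/4/2019 6:05:28 PM", "10/4/2019 4:57:02 PM",
           "10/4/2019 1:24:32 PM", "10/4/2019 3:04:04 PM"),
  note = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Parse the 12-hour timestamps; %p (AM/PM) is locale-dependent, so this
# assumes a locale that recognises "AM" and "PM"
parsed <- as.POSIXct(df$time, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Sort by sample, then by parsed time, so each sample's earliest row is first
df_ord <- df[order(df$sample, parsed), ]

# duplicated() flags every repeat of a sample after its first occurrence,
# so negating it keeps only the earliest row, with all columns intact
first_rows <- df_ord[!duplicated(df_ord$sample), ]

print(first_rows)
```

Sorting explicitly on sample and the parsed time, rather than on every column of the data frame, keeps the result correct when extra columns are present.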
    

    【Comments】:
