【Question title】: R: (un)reduce dataframe
【Posted】: 2021-07-06 07:23:59
【Question】:

I have the following dummy dataset. Over a period of time, the status (status) of every element (id) is recorded for each day (dates).

df <- data.frame( id = c(1, 1, 1, 1, 1,  2, 2, 2, 2, 2,  3, 3, 3, 3, 3,  4, 4, 4, 4, 4),
                  dates = c("2021-01-01",
                           "2021-01-02",
                           "2021-01-03",
                           "2021-01-04",
                           "2021-01-05",
                           
                           "2021-01-01",
                           "2021-01-02",
                           "2021-01-03",
                           "2021-01-04",
                           "2021-01-05",
                           
                           "2021-01-01",
                           "2021-01-02",
                           "2021-01-03",
                           "2021-01-04",
                           "2021-01-05",
                           
                           "2021-01-01",
                           "2021-01-02",
                           "2021-01-03",
                           "2021-01-04",
                           "2021-01-05"),
                 
                 status = c("A", "A", "A", "B", "C",
                            "A", "A", "B", "C", "C",
                            "A", "B", "C", "D", "E",
                            "A", "B", "B", "B", "B")
                 ) 

> df
   id      dates status
1   1 2021-01-01      A
2   1 2021-01-02      A
3   1 2021-01-03      A
4   1 2021-01-04      B
5   1 2021-01-05      C
6   2 2021-01-01      A
7   2 2021-01-02      A
8   2 2021-01-03      B
9   2 2021-01-04      C
10  2 2021-01-05      C
11  3 2021-01-01      A
12  3 2021-01-02      B
13  3 2021-01-03      C
14  3 2021-01-04      D
15  3 2021-01-05      E
16  4 2021-01-01      A
17  4 2021-01-02      B
18  4 2021-01-03      B
19  4 2021-01-04      B
20  4 2021-01-05      B

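Unfortunately, to save space, the data frame was reduced: if the status was the same on two consecutive days, the second entry was dropped. A status is assumed to stay the same until it changes again, so the actual dataset looks like this: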

> df %>% group_by(id) %>%
+   mutate(dupl = duplicated(status, 2)) %>%
+   ungroup() %>%
+   filter(dupl == FALSE) %>%
+   select(-dupl)
# A tibble: 13 x 3
      id dates      status
   <dbl> <chr>      <chr> 
 1     1 2021-01-01 A     
 2     1 2021-01-04 B     
 3     1 2021-01-05 C     
 4     2 2021-01-01 A     
 5     2 2021-01-03 B     
 6     2 2021-01-04 C     
 7     3 2021-01-01 A     
 8     3 2021-01-02 B     
 9     3 2021-01-03 C     
10     3 2021-01-04 D     
11     3 2021-01-05 E     
12     4 2021-01-01 A     
13     4 2021-01-02 B 
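
The reduction rule described above (drop a row when its status equals the previous day's status for the same id) can also be written with `lag()`. Note that `duplicated()` removes every repeat of a status within an id, not only consecutive repeats; for this dataset both give the same 13 rows. This is a minimal sketch, assuming dplyr is installed; `df_full` and `df_reduced_lag` are names made up for illustration.

```r
library(dplyr)

# Rebuild the full dataset from the question (dates kept as character, as in the question)
df_full <- data.frame(
  id     = rep(c(1, 2, 3, 4), each = 5),
  dates  = rep(as.character(seq(as.Date("2021-01-01"), by = "day", length.out = 5)), times = 4),
  status = c("A", "A", "A", "B", "C",
             "A", "A", "B", "C", "C",
             "A", "B", "C", "D", "E",
             "A", "B", "B", "B", "B")
)

# Keep the first row of each id and every row whose status differs from the previous row
df_reduced_lag <- df_full %>%
  group_by(id) %>%
  filter(is.na(lag(status)) | status != lag(status)) %>%
  ungroup()

nrow(df_reduced_lag)
```

Unlike `duplicated()`, this version would keep a status that reappears later for the same id (for example A, B, A), which the fill-based answers need in order to restore the data correctly.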

My question now is: how can I get back to the first (complete) version of the dataset? The time period is always the same for all ids (2021-01-01 to 2021-01-05).

【Comments】:

    Tags: r dplyr tidyverse tidyr


    【Solution 1】:
    library(tidyverse)
    
    # the reduced version can be created like this instead
    df_reduced <- df %>% 
      mutate(dates = lubridate::ymd(dates)) %>% 
      distinct(id, status, .keep_all = TRUE)
    

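    For problems like this, I would look at the functions in tidyr that deal with missing values. We can use expand to generate the complete sequence of id/dates combinations, and then use fill(status, .direction = "down") to fill in the NA values.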

    df_reduced %>% 
      expand(id, dates = full_seq(dates, 1)) %>% 
      left_join(df_reduced) %>% 
      group_by(id) %>% 
      fill(status, .direction = "down")
    
    #> Joining, by = c("id", "dates")
    #> # A tibble: 20 x 3
    #> # Groups:   id [4]
    #>       id dates      status
    #>    <dbl> <date>     <chr> 
    #>  1     1 2021-01-01 A     
    #>  2     1 2021-01-02 A     
    #>  3     1 2021-01-03 A     
    #>  4     1 2021-01-04 B     
    #>  5     1 2021-01-05 C     
    #>  6     2 2021-01-01 A     
    #>  7     2 2021-01-02 A     
    #>  8     2 2021-01-03 B     
    #>  9     2 2021-01-04 C     
    #> 10     2 2021-01-05 C     
    #> 11     3 2021-01-01 A     
    #> 12     3 2021-01-02 B     
    #> 13     3 2021-01-03 C     
    #> 14     3 2021-01-04 D     
    #> 15     3 2021-01-05 E     
    #> 16     4 2021-01-01 A     
    #> 17     4 2021-01-02 B     
    #> 18     4 2021-01-03 B     
    #> 19     4 2021-01-04 B     
    #> 20     4 2021-01-05 B
    

    Created on 2021-07-06 by the reprex package (v1.0.0)
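
As a sanity check, the expand/join/fill round trip can be compared against the original data. The following is a sketch under stated assumptions: dplyr and tidyr are installed, the dates are stored as `Date` from the start, and `df_restored` is an illustrative name. The explicit `by =` only silences the join message.

```r
library(dplyr)
library(tidyr)

# Full dataset from the question, with dates as Date
df <- data.frame(
  id     = rep(c(1, 2, 3, 4), each = 5),
  dates  = rep(seq(as.Date("2021-01-01"), by = "day", length.out = 5), times = 4),
  status = c("A", "A", "A", "B", "C",
             "A", "A", "B", "C", "C",
             "A", "B", "C", "D", "E",
             "A", "B", "B", "B", "B")
)

# Reduce: keep the first row of each id/status combination
df_reduced <- df %>% distinct(id, status, .keep_all = TRUE)

# Restore: build every id/date combination, join the known rows, fill downwards per id
df_restored <- df_reduced %>%
  expand(id, dates = full_seq(dates, 1)) %>%
  left_join(df_reduced, by = c("id", "dates")) %>%
  group_by(id) %>%
  fill(status, .direction = "down") %>%
  ungroup()

# TRUE when the restored statuses match the original ones row by row
identical(df_restored$status, df$status)
```

Grouping by id before `fill()` keeps a status from leaking from one id into the next if an id were missing its first day.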

    【Comments】:

      【Solution 2】:

      I understand you are looking for a tidyverse solution. Just adding a data.table approach for reference:

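      # DT and library(data.table) are set up in the data block below
      dts <- seq(as.IDate("2021-01-01"), as.IDate("2021-01-05"), by="1 day")
      # cross join of all ids and dates; roll=TRUE carries the last known status forward
      DT[CJ(id=id, dates=dts, unique=TRUE), on=.NATURAL, roll=TRUE]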
      

      Data:

      library(data.table)
      DT <- fread("
      id dates      status
      1 2021-01-01 A     
      1 2021-01-04 B     
      1 2021-01-05 C     
      2 2021-01-01 A     
      2 2021-01-03 B     
      2 2021-01-04 C     
      3 2021-01-01 A     
      3 2021-01-02 B     
      3 2021-01-03 C     
      3 2021-01-04 D     
      3 2021-01-05 E     
      4 2021-01-01 A     
      4 2021-01-02 B")
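
The rolling join can also be written in execution order with explicit join columns. This is a sketch, assuming data.table is installed; it builds `DT` with `data.table()` and `as.IDate()` instead of `fread()` so the date column type does not depend on how `fread()` parses it, and `grid` and `restored` are illustrative names.

```r
library(data.table)

# Reduced data, with dates stored explicitly as IDate
DT <- data.table(
  id     = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4),
  dates  = as.IDate(c("2021-01-01", "2021-01-04", "2021-01-05",
                      "2021-01-01", "2021-01-03", "2021-01-04",
                      "2021-01-01", "2021-01-02", "2021-01-03",
                      "2021-01-04", "2021-01-05",
                      "2021-01-01", "2021-01-02")),
  status = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "D", "E", "A", "B")
)

# Every id combined with every day of the period
dts  <- seq(as.IDate("2021-01-01"), as.IDate("2021-01-05"), by = "1 day")
grid <- CJ(id = unique(DT$id), dates = dts)

# Exact match on id; on dates (the last join column) roll = TRUE takes the
# most recent earlier row when there is no exact match
restored <- DT[grid, on = .(id, dates), roll = TRUE]
restored
```

The roll applies only to the last column named in `on`, so dates must come after id.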
      

      【Comments】:

        【Solution 3】:

        Here is another approach:

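        # all dates from a up to, but not including, b
        seq_upto <- function(a, b) {
          head(seq.Date(a, b, by="1 day"), -1)
        }
        
        # df2_collapsed is the reduced data frame from the question; dplyr must be loaded
        df2_collapsed %>% 
          group_by(id) %>% 
          mutate(start = lubridate::ymd(dates)) %>% 
          # a status lasts until the next change, or until the day after the period ends
          mutate(end = lead(start, default=as.Date("2021-01-05") + lubridate::days(1))) %>% 
          rowwise() %>% 
          mutate(dates = list(seq_upto(start, end))) %>% 
          ungroup() %>% 
          select(-start, -end) %>% 
          tidyr::unnest(dates) 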
        

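        Basically, we create a date range for each ID and each of its statuses. Then we build a date sequence from each range and unnest those sequences to expand the rows.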
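
The same interval idea can be sketched in base R without extra packages: each status is repeated for the number of days until the next change. This is only an illustration, assuming that every id has a row on the first day of the period and that the period ends on 2021-01-05; `reduced`, `period_end` and `restore_one` are hypothetical names.

```r
# Reduced data from the question, with dates as Date
reduced <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4),
  dates  = as.Date(c("2021-01-01", "2021-01-04", "2021-01-05",
                     "2021-01-01", "2021-01-03", "2021-01-04",
                     "2021-01-01", "2021-01-02", "2021-01-03",
                     "2021-01-04", "2021-01-05",
                     "2021-01-01", "2021-01-02")),
  status = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "D", "E", "A", "B")
)

period_end <- as.Date("2021-01-05")  # assumed last day of the recording period

# Expand the rows of a single id
restore_one <- function(d) {
  d <- d[order(d$dates), ]
  # a status ends where the next one starts; the last one ends the day after period_end
  ends   <- c(d$dates[-1], period_end + 1)
  n_days <- as.integer(ends - d$dates)
  data.frame(
    id     = rep(d$id, n_days),
    dates  = seq(d$dates[1], by = "day", length.out = sum(n_days)),
    status = rep(d$status, n_days)
  )
}

restored <- do.call(rbind, lapply(split(reduced, reduced$id), restore_one))
rownames(restored) <- NULL
restored
```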

        【Comments】:

          【Solution 4】:

          This can easily be done in two steps:

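          # df_reduced: the reduced data with dates stored as Date (e.g. as built in the first answer)
          df_reduced %>% 
            # step 1: add the missing id/date rows (status is NA there)
            complete(nesting(id), dates = seq.Date(min(.$dates), max(.$dates), 1)) %>%
            # step 2: carry the last status down; safe ungrouped only because every id has a row on the first date
            fill(status)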
          
          # A tibble: 20 x 3
                id dates      status
             <dbl> <date>     <chr> 
           1     1 2021-01-01 A     
           2     1 2021-01-02 A     
           3     1 2021-01-03 A     
           4     1 2021-01-04 B     
           5     1 2021-01-05 C     
           6     2 2021-01-01 A     
           7     2 2021-01-02 A     
           8     2 2021-01-03 B     
           9     2 2021-01-04 C     
          10     2 2021-01-05 C     
          11     3 2021-01-01 A     
          12     3 2021-01-02 B     
          13     3 2021-01-03 C     
          14     3 2021-01-04 D     
          15     3 2021-01-05 E     
          16     4 2021-01-01 A     
          17     4 2021-01-02 B     
          18     4 2021-01-03 B     
          19     4 2021-01-04 B     
          20     4 2021-01-05 B    
          

          【Comments】:

            【Solution 5】:
            library(tidyverse)
            library(lubridate)
            
            # df_reduce is the reduced data frame from the question
            df_reduce %>% 
              mutate(dates = ymd(dates)) %>% 
              complete(dates = seq(from = as.Date("2021-01-01"), by = "day", length.out = 5), nesting(id)) %>% 
              arrange(id) %>% 
              group_by(id) %>% 
              fill(status, .direction = "downup") %>% 
              ungroup()
            
            # A tibble: 20 x 3
               dates         id status
               <date>     <dbl> <chr> 
             1 2021-01-01     1 A     
             2 2021-01-02     1 A     
             3 2021-01-03     1 A     
             4 2021-01-04     1 B     
             5 2021-01-05     1 C     
             6 2021-01-01     2 A     
             7 2021-01-02     2 A     
             8 2021-01-03     2 B     
             9 2021-01-04     2 C     
            10 2021-01-05     2 C     
            11 2021-01-01     3 A     
            12 2021-01-02     3 B     
            13 2021-01-03     3 C     
            14 2021-01-04     3 D     
            15 2021-01-05     3 E     
            16 2021-01-01     4 A     
            17 2021-01-02     4 B     
            18 2021-01-03     4 B     
            19 2021-01-04     4 B     
            20 2021-01-05     4 B 
            

            【Comments】:
