Splitting a multi-year period into several one-year rows (answers)

【Question title】: Splitting multi-year period into several separate lines of one year
【Posted】: 2021-10-08 21:13:27
【Question description】:

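In my current dataset (40,000 people), I have multi-year data on healthcare consumption per person, recorded as the start date and end date of use of a particular package. For example, someone used package A from 20 March 2015 to 5 February 2018.

Because my analysis (in R) needs annual data, I have to cut these periods of package A consumption into one-year pieces. I found these lines of code in a previous post: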

library(dplyr)      # mutate() and bind_rows()
library(lubridate)
library(purrr)

test %>% 
    ungroup() %>% # This isn't necessary if there are no groupings.
    split(rownames(test)) %>% 
    map_dfr(function(df){
        if (year(df$from_date) == year(df$to_date)) return(df)
        bind_rows(mutate(df, to_date = rollback(floor_date(to_date, "y"))),
                  mutate(df, from_date = floor_date(to_date, "y"))
                  )
    }
    )

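However, this only seems to work for periods spanning two consecutive years (2008-2009 in that post's example). In my dataset, I have many cases where someone uses a package for 3-4 years (e.g. 2015-2018).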
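To make the limitation concrete, here is a minimal sketch that applies the same two-way split to a single period running from 2015 to 2018. The column names `from_date` and `to_date` come from the quoted code; `one_row` and `split_once` are names chosen here for illustration only.

```r
library(dplyr)
library(lubridate)

# One period that touches four calendar years
one_row <- tibble(
  from_date = ymd("2015-03-20"),
  to_date   = ymd("2018-02-05")
)

# The same split as in the quoted code: cut once, at the start of the final year
split_once <- bind_rows(
  mutate(one_row, to_date   = rollback(floor_date(to_date, "year"))),
  mutate(one_row, from_date = floor_date(to_date, "year"))
)

split_once
```

The result has only two rows, and the first still runs from 2015-03-20 to 2017-12-31, because the period is cut a single time regardless of how many years it covers.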

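Can anyone help me write code (or a rewritten version of the code I have tried) that splits these rows into separate rows of annual data, one row per calendar year? In the end, it should look like this (for the 2015-03-20 to 2018-02-05 period above):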

Person_ID	Start date	End date	package
001	2015-03-20	2015-12-31	A
001	2016-01-01	2016-12-31	A
001	2017-01-01	2017-12-31	A
001	2018-01-01	2018-02-05	A

【Question discussion】:

Tags: r date

【Solution 1】:

The following may help you.

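Since you will probably expand the dataset per Person_ID, you are iterating over your data frame/tibble row by row. For demonstration purposes, I proceed in small steps and rewrite the function. The main idea is to create a dummy data frame/tibble with one row per year and fill in the correct start and end dates.
It also helps to define the function outside the "pipe".
This should make it easier to modify the code and adapt it to your problem if needed.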
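As a rough illustration of that main idea on a single period, using plain vectors rather than a tibble, the sketch below builds one full calendar year per entry and then overwrites the first start date and the last end date. The variable names (`start`, `end`, `yrs`, `year_starts`, `year_ends`) are chosen here for illustration and are not part of the answer's function.

```r
library(lubridate)

start <- ymd("2015-03-20")
end   <- ymd("2018-02-05")

# One entry per calendar year touched by the period
yrs <- year(start):year(end)

# Full calendar years first: 1 January through 31 December
year_starts <- make_date(yrs, 1, 1)
year_ends   <- make_date(yrs, 12, 31)

# Then overwrite the outer boundaries with the real start and end dates
year_starts[1]         <- start
year_ends[length(yrs)] <- end

data.frame(Start_date = year_starts, End_date = year_ends)
```

A quick sanity check for this kind of split is that the yearly pieces together cover exactly as many days as the original period.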

You did not provide a reproducible example, so I generated a simple use case with 3 people.

library(dplyr)
library(lubridate)
library(purrr)

df <- tibble(
   Person_ID = c("001","002","003")
 , Start_date = ymd(c("2015-03-20", "2016-01-12","2015-05-05"))
 , End_date   = ymd(c("2018-02-05", "2017-05-12","2019-04-17"))
 , Package = c("A","B","A")
)

The first part splits the data into individual cases, one per row:

df %>% split(rownames(df))
$`1`
# A tibble: 1 x 4
  Person_ID Start_date End_date   Package
  <chr>     <date>     <date>     <chr>  
1 001       2015-03-20 2018-02-05 A      

$`2`
# A tibble: 1 x 4
  Person_ID Start_date End_date   Package
  <chr>     <date>     <date>     <chr>  
1 002       2016-01-12 2017-05-12 B      

$`3`
# A tibble: 1 x 4
  Person_ID Start_date End_date   Package
  <chr>     <date>     <date>     <chr>  
1 003       2015-05-05 2019-04-17 A

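With this, we can now construct a function that handles each of these "cases". The function below shows how to write such an expansion. There are more elegant ways to do this, but if you need to adapt the function to your own data, this version should help.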

expand_over_multiple_years <- function(.df){
    # ---- check if we have a same year case and do nothing (aka return .df)
    if(year(.df$Start_date) == year(.df$End_date)) return(.df)
    
    # ----   create dummy tibble over all years
    ## ---   for this we create a tibble with rows per each year, i.e. seq_years
    ## ---   we set the dates to 1. Jan through 31. Dec
    my_df <- tibble(
         seq_years  = year(.df$Start_date):year(.df$End_date)  # sequence of years
        ,Start_date = make_date(seq_years, 1, 1)      # 1 January of each year
        ,End_date   = make_date(seq_years, 12, 31)    # 31 December of each year
           ) %>%
    # ----   we add the additional columns to our dummy table to ensure we return
    ## ---   what is needed and delete the "helper" seq_year column
        mutate( Person_ID = .df$Person_ID
               ,Package   = .df$Package) %>%
        select(-seq_years)            # minus := "unselect" = delete column
    
    # ---- correct for Start- and End-date by overwriting the first and last date
    my_df$Start_date[1]         <- .df$Start_date
    my_df$End_date[nrow(my_df)] <- .df$End_date

    return(my_df %>% select(Person_ID, everything()))    # with the return we reshuffle the columns
}

Let's test the function on a single case:

df[1, ] %>% expand_over_multiple_years()
# A tibble: 4 x 4
  Person_ID Start_date End_date   Package
  <chr>     <date>     <date>     <chr>  
1 001       2015-03-20 2015-12-31 A      
2 001       2016-01-01 2016-12-31 A      
3 001       2017-01-01 2017-12-31 A      
4 001       2018-01-01 2018-02-05 A

Now wrap it all in an iterative call:

df %>% split(rownames(df)) %>% purrr::map(expand_over_multiple_years)

$`1`
# A tibble: 4 x 4
  Person_ID Start_date End_date   Package
  <chr>     <date>     <date>     <chr>  
1 001       2015-03-20 2015-12-31 A      
2 001       2016-01-01 2016-12-31 A      
3 001       2017-01-01 2017-12-31 A      
4 001       2018-01-01 2018-02-05 A      

$`2`
# A tibble: 2 x 4
  Person_ID Start_date End_date   Package
  <chr>     <date>     <date>     <chr>  
1 002       2016-01-12 2016-12-31 B      
2 002       2017-01-01 2017-05-12 B      

$`3`
# A tibble: 5 x 4
  Person_ID Start_date End_date   Package
  <chr>     <date>     <date>     <chr>  
1 003       2015-05-05 2015-12-31 A      
2 003       2016-01-01 2016-12-31 A      
3 003       2017-01-01 2017-12-31 A      
4 003       2018-01-01 2018-12-31 A      
5 003       2019-01-01 2019-04-17 A

If you want/need a data frame/tibble as output:

df %>% split(rownames(df)) %>% purrr::map_dfr(expand_over_multiple_years)
# A tibble: 11 x 4
   Person_ID Start_date End_date   Package
   <chr>     <date>     <date>     <chr>  
 1 001       2015-03-20 2015-12-31 A      
 2 001       2016-01-01 2016-12-31 A      
 3 001       2017-01-01 2017-12-31 A      
 4 001       2018-01-01 2018-02-05 A      
 5 002       2016-01-12 2016-12-31 B      
 6 002       2017-01-01 2017-05-12 B      
 7 003       2015-05-05 2015-12-31 A      
 8 003       2016-01-01 2016-12-31 A      
 9 003       2017-01-01 2017-12-31 A      
10 003       2018-01-01 2018-12-31 A      
11 003       2019-01-01 2019-04-17 A
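The answer notes that more elegant approaches exist. One possible vectorised sketch, which is not part of the original answer and assumes the tidyr package is also installed, expands each row to one row per calendar year and then clips every year to the original period with `pmax()` and `pmin()`. The helper column `yr` and the result name `yearly` are hypothetical names used only for this example.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)

# Same three-person example as above
df <- tibble(
  Person_ID  = c("001", "002", "003"),
  Start_date = ymd(c("2015-03-20", "2016-01-12", "2015-05-05")),
  End_date   = ymd(c("2018-02-05", "2017-05-12", "2019-04-17")),
  Package    = c("A", "B", "A")
)

yearly <- df %>%
  # list-column holding every calendar year touched by the period
  mutate(yr = map2(year(Start_date), year(End_date), seq)) %>%
  # one row per person and calendar year
  unnest(yr) %>%
  # clip each calendar year to the original start and end dates
  mutate(
    Start_date = pmax(Start_date, make_date(yr, 1, 1)),
    End_date   = pmin(End_date,   make_date(yr, 12, 31))
  ) %>%
  select(-yr)

yearly
```

This avoids splitting the data by row name and calling a function once per row, which may matter for a dataset of around 40,000 people, although that has not been benchmarked here.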

【Discussion】:

This works great; thank you very much!