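Based on the latest edit, the OP wants each project's duration broken down by calendar year. This can be done with the foverlaps() function from the data.table package.
Read data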
library(data.table)
projects <- fread(
"start_date end_date entityNo amount
4/1/2001 8/31/2012 1 500
1/1/2005 12/31/2007 2 100")
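fread() is normally used to read CSV files from disk quickly. Here, a convenience feature is used that lets it read the data directly from a character string.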
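As a minimal sketch, the same sample data can also be passed through fread()'s text argument, which makes it explicit that the input is literal data rather than a file name. This assumes a data.table version recent enough to provide the text argument; the sample is written as one string per line here.

```r
library(data.table)

# Same sample data as above, supplied explicitly as lines of text
projects <- fread(text = c(
  "start_date end_date entityNo amount",
  "4/1/2001 8/31/2012 1 500",
  "1/1/2005 12/31/2007 2 100"
))

# Inspect the parsed column types
str(projects)
```

The date columns are not in ISO format, so they still need to be converted to class Date afterwards.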
Prepare data
library(lubridate)
# convert dates from character to Date class
date_cols <- c("start_date", "end_date")
projects[, (date_cols) := lapply(.SD, mdy), .SDcols = date_cols]
# compute duration of project = number of years in which project was active
projects[, years_active := year(end_date) - year(start_date) + 1]
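Note that years_active differs from the "no of years" given by the OP. years_active is the number of rows needed to expand the data, i.e., the number of calendar years each project touches.
Create date ranges for computing overlaps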
date_range <- projects[, .(year = seq(year(min(start_date)),
year(max(end_date))))]
date_range[, start_in_year := ymd(paste0(year, "-01-01"))]
date_range[, end_in_year := ymd(paste0(year, "-12-31"))]
setkey(date_range, start_in_year, end_in_year)
date_range
# year start_in_year end_in_year
# 1: 2001 2001-01-01 2001-12-31
# 2: 2002 2002-01-01 2002-12-31
# 3: 2003 2003-01-01 2003-12-31
# ...
#10: 2010 2010-01-01 2010-12-31
#11: 2011 2011-01-01 2011-12-31
#12: 2012 2012-01-01 2012-12-31
Note that this approach can be extended to break durations down by quarter, month, ISO week, or day.
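For illustration, a monthly lookup table could be built along the same lines. This is only a sketch: the names month_range, start_in_month and end_in_month are hypothetical, and the span is hard-coded from the sample data to keep the block self-contained.

```r
library(data.table)
library(lubridate)

# Overall span of the sample projects (hard-coded for this sketch)
first_day <- ymd("2001-04-01")
last_day  <- ymd("2012-08-31")

# One row per calendar month touched by the span
month_range <- data.table(
  start_in_month = seq(floor_date(first_day, "month"),
                       floor_date(last_day, "month"),
                       by = "month")
)

# Last day of each month = first day of the following month minus one day
month_range[, end_in_month := start_in_month %m+% months(1) - days(1)]

# foverlaps() requires the lookup table to be keyed on its interval columns
setkey(month_range, start_in_month, end_in_month)
month_range
```

The resulting table can then take the place of date_range in the foverlaps() call, with the pmax()/pmin() adjustments applied to the monthly columns instead.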
Compute overlapping intervals
projects_by_year <- foverlaps(projects, date_range, by.x = date_cols)
# adjust start_in_year to coincide with project start date
projects_by_year[, start_in_year := pmax(start_in_year, start_date)]
# adjust end_in_year to coincide with project end date
projects_by_year[, end_in_year := pmin(end_in_year, end_date)]
projects_by_year
# year start_in_year end_in_year start_date end_date entityNo amount years_active
# 1: 2001 2001-04-01 2001-12-31 2001-04-01 2012-08-31 1 500 12
# 2: 2002 2002-01-01 2002-12-31 2001-04-01 2012-08-31 1 500 12
# 3: 2003 2003-01-01 2003-12-31 2001-04-01 2012-08-31 1 500 12
# ...
#10: 2010 2010-01-01 2010-12-31 2001-04-01 2012-08-31 1 500 12
#11: 2011 2011-01-01 2011-12-31 2001-04-01 2012-08-31 1 500 12
#12: 2012 2012-01-01 2012-08-31 2001-04-01 2012-08-31 1 500 12
#13: 2005 2005-01-01 2005-12-31 2005-01-01 2007-12-31 2 100 3
#14: 2006 2006-01-01 2006-12-31 2005-01-01 2007-12-31 2 100 3
#15: 2007 2007-01-01 2007-12-31 2005-01-01 2007-12-31 2 100 3
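Project 1 is split into 12 years (rows) and project 2 into 3. start_in_year and end_in_year have been adjusted so that, in each project's first and last year, they match the project's actual start and end dates.
Hopefully, this is the expected result.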
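With the adjusted interval columns in place, the time a project was active within each calendar year can be derived directly. The following is a sketch on a hand-built excerpt of projects_by_year (three rows for project 1); the column name days_active is an assumption, and both the first and the last day are counted.

```r
library(data.table)

# Hand-built excerpt of projects_by_year (three rows of project 1)
projects_by_year <- data.table(
  year          = c(2001L, 2002L, 2012L),
  start_in_year = as.Date(c("2001-04-01", "2002-01-01", "2012-01-01")),
  end_in_year   = as.Date(c("2001-12-31", "2002-12-31", "2012-08-31")),
  entityNo      = 1L
)

# Inclusive day count: both boundary days belong to the active period
projects_by_year[, days_active := as.integer(end_in_year - start_in_year) + 1L]
projects_by_year
```

Whether an inclusive day count is the right measure depends on how the OP defines duration; dividing by the number of days in the year would give a fractional year instead.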
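Compute aggregates by year
The long format is well suited for computing aggregates per year. For example, the number of projects per year: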
projects_by_year[, .N, by = year]
# year N
# 1: 2001 1
# 2: 2002 1
# 3: 2003 1
# 4: 2004 1
# 5: 2005 2
# 6: 2006 2
# 7: 2007 2
# 8: 2008 1
# 9: 2009 1
#10: 2010 1
#11: 2011 1
#12: 2012 1
Or the total amount per year:
projects_by_year[, sum(amount), by = year]
# year V1
# 1: 2001 500
# 2: 2002 500
# 3: 2003 500
# 4: 2004 500
# 5: 2005 600
# 6: 2006 600
# 7: 2007 600
# 8: 2008 500
# 9: 2009 500
#10: 2010 500
#11: 2011 500
#12: 2012 500
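Both aggregates can also be computed in a single call with named result columns, which avoids the default V1 column name. The sketch below rebuilds a simplified long table from the sample data (year level only, no dates); the names long, yearly, n_projects and total_amount are hypothetical.

```r
library(data.table)

# Sample projects reduced to their first and last active year
projects <- data.table(
  entityNo   = 1:2,
  amount     = c(500L, 100L),
  first_year = c(2001L, 2005L),
  last_year  = c(2012L, 2007L)
)

# One row per project and active year (simplified stand-in for projects_by_year)
long <- projects[, .(year = seq(first_year, last_year), amount), by = entityNo]

# Count projects and sum amounts per year in one pass
yearly <- long[, .(n_projects = .N, total_amount = sum(amount)), by = year][order(year)]
yearly
```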