【Question title】: Join two datasets and fill information for time intervals in R
【Posted】: 2020-11-13 04:37:21
【Question】:

I have two datasets that look like this:

country <- c("Albania","Albania","Albania","Albania","Albania",
             "Belgium","Belgium","Belgium","Belgium","Belgium",
             "Canada","Canada","Canada","Canada","Canada",
             "Denmark","Denmark","Denmark","Denmark","Denmark")
year <- c(1992, 1993, 1994, 1995, 1996, 1992, 1993, 1994, 1995, 1996,1992, 1993, 1994, 1995, 1996,1992, 1993, 1994, 1995, 1996)
country.year <- data.frame(country, year)

    country.year

   country year
1  Albania 1992
2  Albania 1993
3  Albania 1994
4  Albania 1995
5  Albania 1996
6  Belgium 1992
7  Belgium 1993
8  Belgium 1994
9  Belgium 1995
10 Belgium 1996
11  Canada 1992
12  Canada 1993
13  Canada 1994
14  Canada 1995
15  Canada 1996
16 Denmark 1992
17 Denmark 1993
18 Denmark 1994
19 Denmark 1995
20 Denmark 1996
country <- c("Albania","Albania",
             "Belgium","Belgium",
             "Canada","Canada",
             "Denmark","Denmark","Denmark")
cabinet <- c(1200, 1201,
             1560, 1566,
             220, 440,
             880, 819, 870)
cabinet.position2 <- c(12, 10,
                       0, 5,
                       -9, 2,
                       NA, 1, -15)  # NA: the position of cabinet 880 is not given
begining.date <- c("1991-12-01", "1996-01-10",
                   "1991-05-07", "1995-04-23",
                   "1992-01-01", "1996-01-01",
                   "1991-08-03", "1992-07-01", "1996-06-01")
end.date <- c("1996-01-09", "2000-02-01",
                   "1995-04-01", "1999-04-23",
                   "1995-09-01", "1999-11-30",
                   "1992-02-03", "1996-05-20", "2000-04-01")
cabinets <- data.frame(country, cabinet, cabinet.position2, begining.date, end.date)
> cabinets
  country cabinet cabinet.position2 begining.date   end.date
1 Albania    1200                12    1991-12-01 1996-01-09
2 Albania    1201                10    1996-01-10 2000-02-01
3 Belgium    1560                 0    1991-05-07 1995-04-01
4 Belgium    1566                 5    1995-04-23 1999-04-23
5  Canada     220                -9    1992-01-01 1995-09-01
6  Canada     440                 2    1996-01-01 1999-11-30
7 Denmark     880                NA    1991-08-03 1992-02-03
8 Denmark     819                 1    1992-07-01 1996-05-20
9 Denmark     870               -15    1996-06-01 2000-04-01

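What I want is a dataset whose unit of analysis is country*year, as in the data frame "country.year", but which also includes the cabinet and its position variable from the data frame "cabinets". The position variable refers to the cabinet's policy position, so it has no bearing on the data transformation task itself, but it matters later on. So something like this: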

country <- c("Albania","Albania","Albania","Albania","Albania",
             "Belgium","Belgium","Belgium","Belgium","Belgium",
             "Canada","Canada","Canada","Canada","Canada",
             "Denmark","Denmark","Denmark","Denmark","Denmark")
year2 <- c(1992, 1993, 1994, 1995, 1996,
           1992, 1993, 1994, 1995, 1996,
           1992, 1993, 1994, 1995, 1996,
           1992, 1993, 1994, 1995, 1996)
cabinet2 <- c(1200,1200,1200,1200, 1201,
             1560,1560,1560, 1566, 1566,
             220,220,220,220, 440,
             819, 819, 819, 819, 870)
cabinet.position2 <- c(12,12,12,12, 10,
              0,0,0, 5, 5,
              -9,-9,-9,-9, 2,
              1, 1, 1, 1, -15)
desired.df <- data.frame(country, year2, cabinet2,cabinet.position2)
desired.df
   country year2 cabinet2 cabinet.position2
1  Albania  1992     1200                12
2  Albania  1993     1200                12
3  Albania  1994     1200                12
4  Albania  1995     1200                12
5  Albania  1996     1201                10
6  Belgium  1992     1560                 0
7  Belgium  1993     1560                 0
8  Belgium  1994     1560                 0
9  Belgium  1995     1566                 5
10 Belgium  1996     1566                 5
11  Canada  1992      220                -9
12  Canada  1993      220                -9
13  Canada  1994      220                -9
14  Canada  1995      220                -9
15  Canada  1996      440                 2
16 Denmark  1992      819                 1
17 Denmark  1993      819                 1
18 Denmark  1994      819                 1
19 Denmark  1995      819                 1
20 Denmark  1996      870               -15

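My main problem here is assigning cabinets to the different years. As you can see above, each year needs to be assigned one cabinet and its position. What makes this really difficult for me is that some years have more than one cabinet, so I need each year to be assigned the cabinet that spent the most time in office during that year (for example, if in 1995 cabinet A was in office from January to May and cabinet B from June to December, 1995 should be assigned to cabinet B).

Any ideas?

Thanks a lot!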

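【Comments】:

  • @DavidArenburg The OP defined a vector cabinet.position2 that they clearly intended to add to their cabinets <- data.frame(...) call. cabinets has 9 rows, but the vector cabinet.position2 has only 8 elements. The expected output after the join, desired.df, has no row matching cabinet$cabinet == 880, so we don't know the actual value. To make the question answerable, I added NA for this row. It should be clear if you look at the revision history.
  • Sorry, sorry. My mistake. I edited the question and clarified what the variable represents.

Tags: r dplyr merge data.table lubridate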


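【Solution 1】:

With data.table you can perform a non-equi join, compute a new variable, and update the data in place, all in a single and very fast step. Here is one option: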

### Load data.table and convert the data.frames
library(data.table)
setDT(country.year) ; setDT(cabinets)

### Convert date columns to proper dates and create join columns 
date_cols <- grep("date", names(cabinets), value = TRUE)
cabinets[, (date_cols) := lapply(.SD, as.IDate), .SDcols = date_cols]
cabinets[, paste0(c("start", "end"), "_year") := lapply(.SD, year), .SDcols = date_cols]

### Join by year intervals, while calculating the largest time period and updating the data in place
country.year[
             , cabinet.position2 :=
               cabinets[.SD, 
                        cabinet.position2[which.max(end.date - as.IDate(paste0(year, "-01-01")))] 
                        , on = .(country, start_year <= year, end_year >= year)
                        , by = .EACHI]$V1
             ]


country.year
#     country year cabinet.position2
#  1: Albania 1992                12
#  2: Albania 1993                12
#  3: Albania 1994                12
#  4: Albania 1995                12
#  5: Albania 1996                10
#  6: Belgium 1992                 0
#  7: Belgium 1993                 0
#  8: Belgium 1994                 0
#  9: Belgium 1995                 5
# 10: Belgium 1996                 5
# 11:  Canada 1992                -9
# 12:  Canada 1993                -9
# 13:  Canada 1994                -9
# 14:  Canada 1995                -9
# 15:  Canada 1996                 2
# 16: Denmark 1992                 1
# 17: Denmark 1993                 1
# 18: Denmark 1994                 1
# 19: Denmark 1995                 1
# 20: Denmark 1996               -15
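
For reference, the "most time in office within the year" rule from the question can be checked by hand. The following is a minimal base R sketch and not part of the answer above: `days_in_year()` is a hypothetical helper name, and counting both the first and the last day as days in office is an assumption.

```r
# Hypothetical helper: days a cabinet spent in office within a calendar year,
# counting both endpoints (an assumption about how the rule is meant).
days_in_year <- function(start, end, year) {
  yr_start <- as.Date(paste0(year, "-01-01"))
  yr_end   <- as.Date(paste0(year, "-12-31"))
  # Clip the cabinet's term to the year, then count days; never below zero
  days <- as.numeric(pmin(end, yr_end) - pmax(start, yr_start)) + 1
  pmax(days, 0)
}

# Denmark 1996: cabinet 819 ends on 1996-05-20, cabinet 870 starts on 1996-06-01
start <- as.Date(c("1992-07-01", "1996-06-01"))
end   <- as.Date(c("1996-05-20", "2000-04-01"))
days  <- days_in_year(start, end, 1996)
days                          # 141 214
c(819, 870)[which.max(days)]  # 870
```

Cabinet 870 holds office for more of 1996 than cabinet 819, which agrees with row 20 of the output above.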

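【Comments】:

    【Solution 2】:

    Edit: the new version includes the merge and creates a new variable that computes the time spent in office, after I re-read the question (my mistake) and the OP clarified what cabinet position means.

    A tidyverse-style solution (dplyr plus fuzzyjoin) involving a non-equi join.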

    library(dplyr)
    library(fuzzyjoin)
    library(lubridate)
    library(stringr)  # provides str_sub(), used below
    
    # convert the year and date columns to Date
    country.year <- country.year %>%
      mutate(year = paste0(year,"/01","/01"),
             year = as.Date(year, format = "%Y/%m/%d")) 
    cabinets <- cabinets %>%
      mutate(begining.date = as.Date(begining.date),
             end.date = as.Date(end.date))
    
    desired.df <- fuzzy_inner_join(country.year,cabinets,
                                        by=c("country"="country",
                                             "year"="begining.date",
                                             "year"="end.date"),
                                        match_fun = list(`==`, `>=`, `<=`))%>%
      select(country=country.x,everything())%>%
      mutate(year=str_sub(year,1,4),
             time.as.cabinet = end.date - begining.date)%>%
      group_by(country,year)%>%
      filter(time.as.cabinet==max(time.as.cabinet)) %>%
      select(country,year,cabinet,cabinet.position2, -country.y)
    
    desired.df %>%
      head(10)
      country year  cabinet cabinet.position2
       <fct>   <chr>   <dbl>             <dbl>
     1 Albania 1992     1200                12
     2 Albania 1993     1200                12
     3 Albania 1994     1200                12
     4 Albania 1995     1200                12
     5 Albania 1996     1200                12
     6 Belgium 1992     1560                 0
     7 Belgium 1993     1560                 0
     8 Belgium 1994     1560                 0
     9 Belgium 1995     1560                 0
    10 Belgium 1996     1566                 5
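
Note that the join above matches on 1 January of each year, so a year is assigned to whichever cabinet was in office on that date (the printed rows for Albania 1996 and Belgium 1995 reflect this). Below is a minimal sketch of selecting by days in office within the year instead, using a subset of the question's data. The column names `yr.start`, `yr.end`, and `days` are introduced here for illustration, and counting both endpoints is an assumption.

```r
library(dplyr)

# Subset of the question's data (Albania and Belgium, 1995-1996), dates parsed
country.year <- data.frame(
  country = rep(c("Albania", "Belgium"), each = 2),
  year = c(1995, 1996, 1995, 1996)
)
cabinets <- data.frame(
  country = c("Albania", "Albania", "Belgium", "Belgium"),
  cabinet = c(1200, 1201, 1560, 1566),
  cabinet.position2 = c(12, 10, 0, 5),
  begining.date = as.Date(c("1991-12-01", "1996-01-10", "1991-05-07", "1995-04-23")),
  end.date = as.Date(c("1996-01-09", "2000-02-01", "1995-04-01", "1999-04-23"))
)

# Pair every year with every cabinet of the same country, then rank by overlap
by_overlap <- merge(country.year, cabinets, by = "country") %>%
  mutate(
    yr.start = as.Date(paste0(year, "-01-01")),
    yr.end = as.Date(paste0(year, "-12-31")),
    # days in office inside the calendar year; zero or negative means no overlap
    days = as.numeric(pmin(end.date, yr.end) - pmax(begining.date, yr.start)) + 1
  ) %>%
  filter(days > 0) %>%
  group_by(country, year) %>%
  slice_max(days, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(country, year, cabinet, cabinet.position2)

by_overlap
```

On this subset the sketch assigns cabinet 1201 to Albania 1996 and cabinet 1566 to Belgium 1995, as in the question's desired.df. The plain `merge()` on country builds all year/cabinet pairs, which is fine for small data but grows quickly on large tables.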
    

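    【Comments】:

    • Ah sorry, I really wasn't clear. The position has nothing to do with the time spent in office. It refers to the cabinet's policy position!
    • Added an edit. I had misunderstood the question about the merge, apologies @AntVal
    • Works great. Thanks a lot!
    【Solution 3】:

    Here is another option using data.table::foverlaps: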

    library(data.table)
    setDT(country.year)
    setDT(cabinets)
    
    #create start date and end date of the year
    country.year[, paste0("yr.", c("start", "end")) := lapply(c("-01-01", "-12-31"),
        function(x) as.Date(paste0(year, x), format="%Y-%m-%d"))]
    
    setkey(country.year, country, yr.start, yr.end)
    setkey(cabinets, country, beginning.date, end.date)
    foverlaps(country.year, cabinets)[, {
            k <- which.max(pmin(end.date, yr.end) - yr.start)
            .(cabinet2=cabinet[k], cabinet.position2=cabinet.position[k])
        }, .(country, year)]
    

    Output:

        country year cabinet2 cabinet.position2
     1: Albania 1992     1200                12
     2: Albania 1993     1200                12
     3: Albania 1994     1200                12
     4: Albania 1995     1200                12
     5: Albania 1996     1201                10
     6: Belgium 1992     1560                 0
     7: Belgium 1993     1560                 0
     8: Belgium 1994     1560                 0
     9: Belgium 1995     1566                 5
    10: Belgium 1996     1566                 5
    11:  Canada 1992      220                -9
    12:  Canada 1993      220                -9
    13:  Canada 1994      220                -9
    14:  Canada 1995      220                -9
    15:  Canada 1996      440                 2
    16: Denmark 1992      819                 1
    17: Denmark 1993      819                 1
    18: Denmark 1994      819                 1
    19: Denmark 1995      819                 1
    20: Denmark 1996      870               -15
    

    Data (with date conversion, Ian Campbell's data fix, and the small typo in the word "beginning" corrected):

    country <- c("Albania","Albania","Albania","Albania","Albania","Belgium","Belgium","Belgium","Belgium","Belgium","Canada","Canada","Canada","Canada","Canada","Denmark","Denmark","Denmark","Denmark","Denmark")
    year <- c(1992, 1993, 1994, 1995, 1996, 1992, 1993, 1994, 1995, 1996,1992, 1993, 1994, 1995, 1996,1992, 1993, 1994, 1995, 1996)
    country.year <- data.frame(country, year)
    
    country <- c("Albania","Albania","Belgium","Belgium","Canada","Canada","Denmark","Denmark","Denmark")
    cabinet <- c(1200, 1201, 1560, 1566, 220, 440, 880, 819, 870)
    cabinet.position <- c(12, 10, 0, 5, -9, 2, NA, 1,-15)
    beginning.date <- as.Date(c("1991-12-01", "1996-01-10","1991-05-07", "1995-04-23","1992-01-01", "1996-01-01","1991-08-03", "1992-07-01", "1996-06-01"))
    end.date <- as.Date(c("1996-01-09", "2000-02-01","1995-04-01", "1999-04-23","1995-09-01", "1999-11-30","1992-02-03", "1996-05-20", "2000-04-01"))
    cabinets <- data.frame(country, cabinet, cabinet.position, beginning.date, end.date)
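
The grouping step above ranks cabinets by `pmin(end.date, yr.end) - yr.start`, that is, by how far into the year each cabinet lasted, which gives the expected result on the sample data. If a cabinet that only takes office late in the year should not win, the start of the term can be clipped to the year as well. A minimal sketch on hypothetical toy data (country "X" and cabinets "A" and "B" are made up for illustration, and counting both endpoints is an assumption):

```r
library(data.table)

# Hypothetical toy data: A holds office for most of 1995, B only for December
cabinets <- data.table(
  country = "X",
  cabinet = c("A", "B"),
  beginning.date = as.Date(c("1994-01-01", "1995-12-01")),
  end.date = as.Date(c("1995-11-30", "1999-01-01"))
)
country.year <- data.table(country = "X", year = 1995)
country.year[, yr.start := as.Date(paste0(year, "-01-01"))]
country.year[, yr.end := as.Date(paste0(year, "-12-31"))]

# foverlaps() needs keys whose last two columns are the interval bounds
setkey(country.year, country, yr.start, yr.end)
setkey(cabinets, country, beginning.date, end.date)

res <- foverlaps(country.year, cabinets)[, {
  # clip both ends of the term to the year before counting days
  days <- as.numeric(pmin(end.date, yr.end) - pmax(beginning.date, yr.start)) + 1
  .(cabinet = cabinet[which.max(days)], days = max(days))
}, by = .(country, year)]

res
```

Here cabinet A is selected with 334 days in 1995, whereas ranking by the clipped end date alone would pick B.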
    

    【Comments】:
