【Question title】: Spread valued column into binary 'time series' in R
【Posted】: 2020-02-06 06:48:38
【Question】:

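I am trying to first spread a valued column into a set of binary columns and then gather them again into a "time series" format.

As an example, consider locations that were each conquered in a particular year. The data looks like this: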

df1 <- data.frame(locationID = c(1,2,3), conquered_in = c(1931, 1932, 1929))

  locationID conquered_in
1          1         1931
2          2         1932
3          3         1929

I am trying to reshape the data to look like this:

df2 <- data.frame(locationID = c(1,1,1,1,2,2,2,2,3,3,3,3), year = c(1929,1930,1931,1932,1929,1930,1931,1932,1929,1930,1931,1932), conquered = c(0,0,1,1,0,0,0,1,1,1,1,1))

   locationID year conquered
1           1 1929         0
2           1 1930         0
3           1 1931         1
4           1 1932         1
5           2 1929         0
6           2 1930         0
7           2 1931         0
8           2 1932         1
9           3 1929         1
10          3 1930         1
11          3 1931         1
12          3 1932         1

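My initial strategy was to use spread on conquered_in and then try a gather. Another answer I found seemed close, but I can't get it to work with fill, because I also need to fill the later years with 1s.

【Comments】:

    Tags: r spread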


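    【Solution 1】:

    You can use complete() to expand the data frame to every location-year combination, then group by location and apply cumsum() to conquered == 1 so that the 1 is carried down to all later years.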

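    library(tidyr)
    library(dplyr)
    
    df1 %>% 
      # flag each observed (location, year) pair as conquered
      mutate(conquered = 1) %>%
      # add every missing location-year combination, with conquered = 0
      complete(locationID, conquered_in = seq(min(conquered_in), max(conquered_in)), fill = list(conquered = 0)) %>%
      group_by(locationID) %>%
      # running count of conquest events: 0 before the conquest year, 1 from then on
      mutate(conquered = cumsum(conquered == 1))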
    
    # A tibble: 12 x 3
    # Groups:   locationID [3]
       locationID conquered_in conquered
            <dbl>        <dbl>     <int>
     1          1         1929         0
     2          1         1930         0
     3          1         1931         1
     4          1         1932         1
     5          2         1929         0
     6          2         1930         0
     7          2         1931         0
     8          2         1932         1
     9          3         1929         1
    10          3         1930         1
    11          3         1931         1
    12          3         1932         1
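
The output above keeps the column name conquered_in and stays grouped. A minimal sketch of one way to reach the exact shape asked for in the question (columns locationID, year, conquered) follows. It swaps cumsum() for base R's cummax(), which carries the 1 forward in the same way here, and the explicit arrange() is an added assumption to make the row order within each group independent of how complete() orders its result.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(locationID = c(1, 2, 3), conquered_in = c(1931, 1932, 1929))

result <- df1 %>%
  mutate(conquered = 1) %>%
  # expand to every location-year pair, new rows get conquered = 0
  complete(locationID,
           conquered_in = seq(min(conquered_in), max(conquered_in)),
           fill = list(conquered = 0)) %>%
  # make sure years are in ascending order before the cumulative step
  arrange(locationID, conquered_in) %>%
  group_by(locationID) %>%
  # running maximum: stays 0 until the conquest year, then 1 onward
  mutate(conquered = cummax(conquered)) %>%
  ungroup() %>%
  # match the column name used in the desired output
  rename(year = conquered_in)

print(result)
```

Calling ungroup() at the end avoids surprises if the result is passed to later summarise() or mutate() calls.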
    

    【Comments】:

    • This is great, thanks. Is there a way to do this so that other columns in the data frame don't turn into NA?
    • @dmk32 - Wrap the other variables in nesting(), e.g. complete(nesting(locationID, x, y, z), conquered_in = seq(min(conquered_in), max(conquered_in)), fill = list(conquered = 0)).
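
The nesting() suggestion in the comment above can be sketched as follows. The extra column `region` and the data frame name `df_extra` are hypothetical, added only to illustrate the idea: nesting() limits the expansion to location/attribute combinations that already occur in the data, so the attribute is repeated on the new rows instead of becoming NA.

```r
library(dplyr)
library(tidyr)

# `region` is a hypothetical extra column, not part of the original data
df_extra <- data.frame(
  locationID   = c(1, 2, 3),
  region       = c("north", "south", "east"),
  conquered_in = c(1931, 1932, 1929)
)

filled <- df_extra %>%
  mutate(conquered = 1) %>%
  # nesting() keeps locationID and region together while years are expanded
  complete(nesting(locationID, region),
           conquered_in = seq(min(conquered_in), max(conquered_in)),
           fill = list(conquered = 0)) %>%
  arrange(locationID, conquered_in) %>%
  group_by(locationID) %>%
  mutate(conquered = cumsum(conquered == 1)) %>%
  ungroup()

print(filled)
```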
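    【Solution 2】:

    Using complete() from tidyr is the better choice. Keep in mind, though, that the conquest years in the data may not cover every year from the beginning to the end of the war, so the full range of years has to be supplied separately.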

    library(dplyr)
    library(tidyr)
    library(magrittr)
    
    df1 <- data.frame(locationID = c(1,2,3), conquered_in = c(1931, 1932, 1929))
    
    # A data frame holding every year you want to cover
    df2 <- data.frame(year=seq(1929, 1940, by=1))
    
    # Build every combination of year and location, keeping the conquest flag
    df3 <- full_join(df2, df1, by=c("year"="conquered_in")) %>%
      mutate(conquered=if_else(!is.na(locationID), 1, 0)) %>%
      complete(year, locationID) %>%
      arrange(locationID) %>%
      filter(!is.na(locationID))
    
    # Per location, mark every year from the first conquest year onward as conquered
    df3 %<>%
      group_by(locationID) %>%
      # 2000 is a sentinel later than every year in the range, so a location that is never conquered stays 0
      mutate(conquered=if_else(year>=min(2000, year[conquered==1], na.rm=TRUE), 1, 0)) %>%
      ungroup()
    
    df3 %>% filter(year<=1932)
    # A tibble: 12 x 3
        year locationID conquered
       <dbl>      <dbl>     <dbl>
     1  1929          1         0
     2  1930          1         0
     3  1931          1         1
     4  1932          1         1
     5  1929          2         0
     6  1930          2         0
     7  1931          2         0
     8  1932          2         1
     9  1929          3         1
    10  1930          3         1
    11  1931          3         1
    12  1932          3         1
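
As a possible simplification of the approach above, and not part of the original answer, the same table can be built by creating the location-year grid first and then comparing each year with the conquest year directly. This is a sketch under the assumption that the years of interest are 1929 to 1940, as in the code above; the names `years` and `panel` are illustrative. A location whose conquered_in is NA (never conquered) comes out as 0 without needing a sentinel year.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(locationID = c(1, 2, 3), conquered_in = c(1931, 1932, 1929))

# assumed range of years to cover
years <- 1929:1940

panel <- expand_grid(locationID = df1$locationID, year = years) %>%
  # attach each location's conquest year to all of its rows
  left_join(df1, by = "locationID") %>%
  # 1 from the conquest year onward, 0 before it or if never conquered
  mutate(conquered = as.integer(!is.na(conquered_in) & year >= conquered_in)) %>%
  select(locationID, year, conquered)

panel %>% filter(year <= 1932)
```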
    

    【Comments】: