【Question title】: Add rows with missing years by group
【Posted】: 2017-10-03 14:11:04
【Question description】:

I want to create new rows in a data.frame for all missing years within each group (firm and type). The data frame is built as follows:

minimal <- data.frame(firm = c("A","A","A","B","B","B","A","A","A","B","B","B"),
                  type = c("X","X","X","X","X","X","Y","Y","Y","Y","Y","Y"),
                  year = c(2000,2004,2007,2010,2008,2001,2002,2003,2007,2000,2001,2008),
                  value = c(1,3,7,9,9,2,3,3,7,5,9,15)
                  )

The data frame:

firm type year value
A    X    2000     1
A    X    2004     3
A    X    2007     7
B    X    2010     9
B    X    2008     9
B    X    2001     2
A    Y    2002     3
A    Y    2003     3
A    Y    2007     7
B    Y    2000     5
B    Y    2001     9
B    Y    2008    15

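What I want to get is the following: I can see in the data that the minimum year is 2000 and the maximum year is 2010. I want to add a row for every missing year of every firm-type combination. For example, for firm A and type X, I want to add rows so that the group looks like this:

Final output: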

firm type year value
A    X    2000     1
A    X    2004     3
A    X    2007     7
A    X    2001     1
A    X    2002     1
A    X    2003     1
A    X    2005     3
A    X    2006     3
A    X    2008     7
A    X    2009     7
A    X    2010     7

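In addition, I want to write the value of the preceding year into the "value" column of the added rows for all subsequent years, until a new non-missing row appears (as shown in the final output example).

I have not come up with any useful code yet, but so far I have found that the following might be a step in the right direction: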

setDT(minimal)[, .SD[match(2000:2010, year)],
                           by = c("firm","type")]

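I don't really understand the concepts of setDT and .SD, but this does create at least one row for each year of every firm-type combination. However, the added rows have no content, not even in the year column.

Thanks a lot in advance!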

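【Question discussion】:

  • I think there is a dupe for this. Check ?complete from tidyr, ?expand.grid from base R, or CJ from data.table.
  • OK, I came up with min2 <- expand.grid(year = min(minimal$year):max(minimal$year), firm = unique(minimal$firm), type = unique(minimal$type)); merge(min2, minimal, by = c("firm","type","year"), all.x = TRUE). Now I just need to fill in the correct value for each row, and I don't know how to do that yet.
  • Try this: library(dplyr); library(tidyr); minimal %>% group_by(firm, type) %>% complete(year = full_seq(year, 1)) %>% fill(value)
  • Cool, that is very nice code. However, it uses the minimum and maximum year of each group (firm, type). I actually need the overall minimum and maximum, which often differ from the group minimum and maximum.
  • OK, then it is year = full_seq(2000:2010, 1). Thanks!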
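
The expand.grid/merge approach from the comments stops short of filling in the values. A minimal base R sketch of one way to finish it is shown below; the names grid, full and locf are illustrative helpers introduced here, not part of the original thread. The values are carried forward within each firm-type group, so years before a group's first observation stay NA.

```r
minimal <- data.frame(firm = c("A","A","A","B","B","B","A","A","A","B","B","B"),
                      type = c("X","X","X","X","X","X","Y","Y","Y","Y","Y","Y"),
                      year = c(2000,2004,2007,2010,2008,2001,2002,2003,2007,2000,2001,2008),
                      value = c(1,3,7,9,9,2,3,3,7,5,9,15),
                      stringsAsFactors = FALSE)

# Every firm/type/year combination over the overall year range
grid <- expand.grid(firm = unique(minimal$firm),
                    type = unique(minimal$type),
                    year = min(minimal$year):max(minimal$year),
                    stringsAsFactors = FALSE)

# Left join: years missing from the data get value = NA
full <- merge(grid, minimal, by = c("firm", "type", "year"), all.x = TRUE)
full <- full[order(full$firm, full$type, full$year), ]

# Last observation carried forward (hypothetical helper); leading NAs stay NA
locf <- function(x) {
  idx <- cumsum(!is.na(x))       # how many non-NA values have been seen so far
  c(NA, x[!is.na(x)])[idx + 1]   # position 1 is the NA used before the first value
}

# Apply the fill separately within each firm/type group
full$value <- ave(full$value, full$firm, full$type, FUN = locf)
```

Sorting before the fill matters here, because the carry-forward relies on the rows of each group being in year order.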

Tags: r dataframe


【Solution 1】:

I couldn't find an exact dupe, so here is a possible solution:

library(dplyr)
library(tidyr)

minimal %>% 
  group_by(firm, type) %>% 
  complete(year = full_seq(2000:2010, 1)) %>% 
  fill(value)
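
If the year range should not be hard-coded, it can be derived from the data. The following is a sketch along the same lines, assuming the overall range of the whole data frame is wanted; all_years and filled are names introduced here for illustration.

```r
library(dplyr)
library(tidyr)

minimal <- data.frame(firm = c("A","A","A","B","B","B","A","A","A","B","B","B"),
                      type = c("X","X","X","X","X","X","Y","Y","Y","Y","Y","Y"),
                      year = c(2000,2004,2007,2010,2008,2001,2002,2003,2007,2000,2001,2008),
                      value = c(1,3,7,9,9,2,3,3,7,5,9,15))

# Overall year range, computed once outside the grouped pipeline
all_years <- full_seq(range(minimal$year), 1)

filled <- minimal %>%
  group_by(firm, type) %>%
  complete(year = all_years) %>%  # add the missing years within each group
  fill(value) %>%                 # carry the last known value forward
  ungroup()
```

Computing all_years before group_by() is the point of the sketch: inside the grouped complete() call, year would refer only to the current group's years.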

【Discussion】:

  • Nice solution!
【Solution 2】:

I wrote this code to do what you want. It may not be particularly efficient or elegant, but it does work:

# Input dataframe
minimal <- data.frame(firm = c("A","A","A","B","B","B","A","A","A","B","B","B"),
                      type = c("X","X","X","X","X","X","Y","Y","Y","Y","Y","Y"),
                      year = c(2000,2004,2007,2010,2008,2001,2002,2003,2007,2000,2001,2008),
                      value = c(1,3,7,9,9,2,3,3,7,5,9,15)
)

# Sorting is needed
minimal = minimal[order(minimal$firm, minimal$type, minimal$year),]

# Variables used
table = table(minimal$firm, minimal$type)  # rows per firm/type group, indexed in the same order as the sorted data
minYear = min(minimal$year)
maxYear = max(minimal$year)
startPos = 0

# Iterates the dataframe
for(i in 1:2){
  for(j in 1:2){
    prevValue = 0
    currYear = minYear

    # Adds minimum year if needed
    if(minimal$year[1+startPos] != currYear){
      newRow = c(as.character(minimal$firm[1+startPos]), as.character(minimal$type[1+startPos]), currYear, prevValue)
      minimal = rbind(minimal, newRow)
    }

    # Adds years
    for(k in (1+startPos):(table[i,j]+startPos)){
      if(minimal$year[k]!=currYear){
        currYear = currYear + 1
        while(minimal$year[k]!=currYear){
          newRow = c(as.character(minimal$firm[k]), as.character(minimal$type[k]), currYear, prevValue)
          minimal = rbind(minimal, newRow)
          currYear = currYear + 1
        }
      }
      prevValue = minimal$value[k]
    }

    # Adds years from last to maximum
    if(currYear < maxYear){
      for(l in 1:(maxYear - currYear)){
        newRow = c(as.character(minimal$firm[k]), as.character(minimal$type[k]), currYear+l, prevValue)
        minimal = rbind(minimal, newRow)
      }
    }
    startPos = startPos + table[i,j]

  }
}

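# Result (rbind() with character vectors turns year and value into character, so convert them back)
minimal$year = as.numeric(minimal$year)
minimal$value = as.numeric(minimal$value)
minimal = minimal[order(minimal$firm, minimal$type, minimal$year),]
minimal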

【Discussion】:

    【Solution 3】:

    Here is a data.table solution.

    library(data.table)
    
    dt <- setDT(minimal)[CJ(firm=firm, type=type, year=seq(min(year), max(year)), unique=TRUE),
                  on=.(firm, type, year), roll=TRUE]
    

    This returns

    head(dt, 15)
        firm type year value
     1:    A    X 2000     1
     2:    A    X 2001     1
     3:    A    X 2002     1
     4:    A    X 2003     1
     5:    A    X 2004     3
     6:    A    X 2005     3
     7:    A    X 2006     3
     8:    A    X 2007     7
     9:    A    X 2008     7
    10:    A    X 2009     7
    11:    A    X 2010     7
    12:    A    Y 2000    NA
    13:    A    Y 2001    NA
    14:    A    Y 2002     3
    15:    A    Y 2003     3
    

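    Note that the initial rows of the second firm-type combination are NA. If you want to fill these in with the value of the following year, you could set the roll argument to "nearest", though this might also affect values in the middle of your data.

    【Discussion】: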
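
    One way to fill only the leading NA rows without touching the middle of the data is to keep roll=TRUE and back-fill afterwards. The following is a sketch under the assumption that a data.table version providing nafill() (1.12.4 or later) is installed; it is an alternative to the "nearest" setting rather than part of the original answer.

```r
library(data.table)

minimal <- data.table(firm = c("A","A","A","B","B","B","A","A","A","B","B","B"),
                      type = c("X","X","X","X","X","X","Y","Y","Y","Y","Y","Y"),
                      year = c(2000,2004,2007,2010,2008,2001,2002,2003,2007,2000,2001,2008),
                      value = c(1,3,7,9,9,2,3,3,7,5,9,15))

# Rolling join as in the answer: the last observed value is carried forward
dt <- minimal[CJ(firm = firm, type = type,
                 year = seq(min(year), max(year)), unique = TRUE),
              on = .(firm, type, year), roll = TRUE]

# After the forward roll only leading years are still NA;
# fill them with the next observation (carried backward) within each group
dt[, value := nafill(value, type = "nocb"), by = .(firm, type)]
```

    Because the forward roll has already filled every gap after a group's first observation, the backward fill can only change the years before that first observation.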
