【Question title】: Conditional NA filling by group
【Posted】: 2015-02-06 17:52:15
【Question description】:

Edit
This question was originally asked about data.table. Solutions using any package would be interesting.


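I'm a bit stuck on a particular variation of a more general problem. I have panel data that I'm using with data.table, and I'd like to fill in some missing values using data.table's group-by functionality. Unfortunately the values aren't numeric, so I can't simply interpolate; they should only be filled in when a condition holds. Is it possible to perform a kind of conditional na.locf in data.table?

Essentially, I only want to fill in the NAs if the next observation after the NAs is the same as the observation before them, though the more general question is how to fill in NAs conditionally.

For example, in the data below I'd like to fill in the associatedid variable within each id group. So id==1, year==2003 would be filled in as ABC123, because that is the value both before and after the NA, but year 2000 for the same id would not. id==2 would be left unchanged, because the next value differs from the one before the NA. id==3 would be filled in for 2003 and 2004.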

mydf <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), year = c(2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L), associatedid = structure(c(NA, 1L, 1L, NA, 1L, 1L, NA, 1L, 1L, NA, 2L, 2L, NA, 1L, 1L, NA, NA, 1L), .Label = c("ABC123", "DEF456"), class = "factor")), class = "data.frame", row.names = c(NA, -18L))

mydf
#>    id year associatedid
#> 1   1 2000         <NA>
#> 2   1 2001       ABC123
#> 3   1 2002       ABC123
#> 4   1 2003         <NA>
#> 5   1 2004       ABC123
#> 6   1 2005       ABC123
#> 7   2 2000         <NA>
#> 8   2 2001       ABC123
#> 9   2 2002       ABC123
#> 10  2 2003         <NA>
#> 11  2 2004       DEF456
#> 12  2 2005       DEF456
#> 13  3 2000         <NA>
#> 14  3 2001       ABC123
#> 15  3 2002       ABC123
#> 16  3 2003         <NA>
#> 17  3 2004         <NA>
#> 18  3 2005       ABC123

library(data.table)
dt = data.table(mydf, key = c("id"))

Desired output

#>    id year associatedid
#> 1   1 2000         <NA>
#> 2   1 2001       ABC123
#> 3   1 2002       ABC123
#> 4   1 2003       ABC123
#> 5   1 2004       ABC123
#> 6   1 2005       ABC123
#> 7   2 2000         <NA>
#> 8   2 2001       ABC123
#> 9   2 2002       ABC123
#> 10  2 2003         <NA>
#> 11  2 2004       DEF456
#> 12  2 2005       DEF456
#> 13  3 2000         <NA>
#> 14  3 2001       ABC123
#> 15  3 2002       ABC123
#> 16  3 2003       ABC123
#> 17  3 2004       ABC123
#> 18  3 2005       ABC123

【Comments】:

Tags: r dplyr data.table plyr na


【Solution 1】:

It's all about writing a modified na.locf function. After that, you can plug it into data.table like any other function.

new.locf <- function(x){
  # might want to think about the end of this loop
  # this works here but you might need to add another case
  # if there are NA's as the last value.
  #
  # anyway, loop through observations in a vector, x.
  for(i in 2:(length(x)-1)){
    nextval = i
    # find the next, non-NA value
    # again, not tested but might break if there isn't one?
    while(nextval <= length(x)-1 & is.na(x[nextval])){
      nextval = nextval + 1
    }
    # if the current value is not NA, great!
    if(!is.na(x[i])){
      x[i] <- x[i]
    }else{
      # if the current value is NA, and the last value is a value
      # (should given the nature of this loop), and
      # the next value, as calculated above, is the same as the last
      # value, then give us that value. 
      if(is.na(x[i]) & !is.na(x[i-1]) & x[i-1] == x[nextval]){
        x[i] <- x[nextval]
      }else{
        # finally, return NA if neither of these conditions hold
        x[i] <- NA
      }
    }
  }
  # return the new vector
  return(x) 
}

Once we have that function, we can use data.table as usual:

dt2 <- dt[,list(year = year,
                # when I read your data in, associatedid read as factor
                associatedid = new.locf(as.character(associatedid))
                ),
          by = "id"
          ]

This returns:

> dt2
    id year associatedid
 1:  1 2000           NA
 2:  1 2001       ABC123
 3:  1 2002       ABC123
 4:  1 2003       ABC123
 5:  1 2004       ABC123
 6:  1 2005       ABC123
 7:  2 2000           NA
 8:  2 2001       ABC123
 9:  2 2002       ABC123
10:  2 2003           NA
11:  2 2004       DEF456
12:  2 2005       DEF456
13:  3 2000           NA
14:  3 2001       ABC123
15:  3 2002       ABC123
16:  3 2003       ABC123
17:  3 2004       ABC123
18:  3 2005       ABC123

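As far as I can tell, that's what you're looking for.

I've hedged a few things in the comments of the new.locf definition, so you may still have some thinking to do, but this should get you started.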
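The edge cases flagged in the comments of new.locf (groups shorter than three rows, and NAs with no later value to compare against) could be guarded roughly as follows. This is only a sketch under those assumptions; the name new.locf.safe is hypothetical and not part of the original answer.

```r
# Hypothetical guarded variant of new.locf (the name is illustrative only).
new.locf.safe <- function(x) {
  n <- length(x)
  if (n < 3) return(x)               # nothing can sit between two values
  for (i in 2:(n - 1)) {
    if (!is.na(x[i])) next           # only NAs are candidates for filling
    # offsets of the non-NA values that come after position i, if any
    later <- which(!is.na(x[(i + 1):n]))
    if (length(later) == 0) break    # only NAs remain: leave them alone
    nextval <- i + later[1]
    # x[i - 1] is either an original value or one filled on an earlier step
    if (!is.na(x[i - 1]) && x[i - 1] == x[nextval]) x[i] <- x[nextval]
  }
  x
}

new.locf.safe(c(NA, "ABC123", "ABC123", NA, NA, "ABC123"))
new.locf.safe(c("ABC123", NA, NA))   # trailing NAs are left untouched
```

It should drop into the same dt[, ..., by = "id"] call in place of new.locf.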

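【Comments】:

  • This definitely works for the case I posted. As you suggested, once I tried to apply it to a larger data set I found that it breaks when NAs fill out the rows, so I added another condition to the final ifelse to handle that case.
【Solution 2】:

Use na.locf0 if applying it forward and backward gives the same value; otherwise, if the two results differ or either one is NA, use NA.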

library(data.table)
library(zoo)

dt[, associatedid := 
    ifelse(na.locf0(associatedid) == na.locf0(associatedid, fromLast=TRUE), 
      na.locf0(associatedid), NA), by = id]

giving:

> dt
    id year associatedid
 1:  1 2000         <NA>
 2:  1 2001       ABC123
 3:  1 2002       ABC123
 4:  1 2003       ABC123
 5:  1 2004       ABC123
 6:  1 2005       ABC123
 7:  2 2000         <NA>
 8:  2 2001       ABC123
 9:  2 2002       ABC123
10:  2 2003         <NA>
11:  2 2004       DEF456
12:  2 2005       DEF456
13:  3 2000         <NA>
14:  3 2001       ABC123
15:  3 2002       ABC123
16:  3 2003       ABC123
17:  3 2004       ABC123
18:  3 2005       ABC123

【Comments】:

  • Very clean logic and code. You could still drop the zoo dependency with something like nafill_char <- function(x, dir = "locf") x[nafill(replace(seq_along(x), is.na(x), NA), dir)]; dt[, associatedid := as.character(associatedid)][, associatedid := fifelse(nafill_char(associatedid) == nafill_char(associatedid, "nocb"), nafill_char(associatedid), NA_character_), by = id].
  • nafill doesn't exist in the version of data.table I'm using. It must be a recent addition.
  • Yes, maybe 2 months ago, and it only handles numeric vectors. fifelse() is new as well.
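
For a data.table version without nafill(), the same forward/backward comparison could be sketched in base R alone. The helper names below (fill_forward, fill_backward, fill_if_agree) are hypothetical, and the toy vectors are made up for illustration.

```r
# Hypothetical base R helpers; none of these names come from a package.
fill_forward <- function(x) {
  # position of the most recent non-NA value at or before each element (0 if none)
  last_seen <- cummax(seq_along(x) * (!is.na(x)))
  c(NA, x)[last_seen + 1]
}
fill_backward <- function(x) rev(fill_forward(rev(x)))

fill_if_agree <- function(x) {
  x <- as.character(x)
  fwd <- fill_forward(x)
  bwd <- fill_backward(x)
  # keep a value only where both directions exist and agree
  ifelse(!is.na(fwd) & !is.na(bwd) & fwd == bwd, fwd, NA_character_)
}

# Toy data: two groups of four rows each
x  <- c(NA, "A", NA, "A", NA, "A", NA, "B")
id <- c(1, 1, 1, 1, 2, 2, 2, 2)
ave(x, id, FUN = fill_if_agree)
```

On the question's data, something like ave(as.character(mydf$associatedid), mydf$id, FUN = fill_if_agree) should apply it per id.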
【Solution 3】:

Here is a pure tidyverse solution:

library(tidyverse)
mydf %>%
  mutate(up = associatedid, down = associatedid) %>%
  group_by(id) %>%
  fill(up,.direction = "up") %>%
  fill(down) %>%
  mutate_at("associatedid", ~if_else(is.na(.) & up == down, up, .)) %>%
  ungroup() %>%
  select(-up, - down)
#> # A tibble: 18 x 3
#>       id  year associatedid
#>    <int> <int> <fct>       
#>  1     1  2000 <NA>        
#>  2     1  2001 ABC123      
#>  3     1  2002 ABC123      
#>  4     1  2003 ABC123      
#>  5     1  2004 ABC123      
#>  6     1  2005 ABC123      
#>  7     2  2000 <NA>        
#>  8     2  2001 ABC123      
#>  9     2  2002 ABC123      
#> 10     2  2003 <NA>        
#> 11     2  2004 DEF456      
#> 12     2  2005 DEF456      
#> 13     3  2000 <NA>        
#> 14     3  2001 ABC123      
#> 15     3  2002 ABC123      
#> 16     3  2003 ABC123      
#> 17     3  2004 ABC123      
#> 18     3  2005 ABC123

Or using zoo::na.locf:

library(dplyr)
library(zoo)
mydf %>%
  group_by(id) %>%
  mutate_at("associatedid", ~if_else(
    is.na(.) & na.locf(.,F) == na.locf(.,F,fromLast = TRUE), na.locf(.,F), .)) %>%
  ungroup()
#> # A tibble: 18 x 3
#>       id  year associatedid
#>    <int> <int> <fct>       
#>  1     1  2000 <NA>        
#>  2     1  2001 ABC123      
#>  3     1  2002 ABC123      
#>  4     1  2003 ABC123      
#>  5     1  2004 ABC123      
#>  6     1  2005 ABC123      
#>  7     2  2000 <NA>        
#>  8     2  2001 ABC123      
#>  9     2  2002 ABC123      
#> 10     2  2003 <NA>        
#> 11     2  2004 DEF456      
#> 12     2  2005 DEF456      
#> 13     3  2000 <NA>        
#> 14     3  2001 ABC123      
#> 15     3  2002 ABC123      
#> 16     3  2003 ABC123      
#> 17     3  2004 ABC123      
#> 18     3  2005 ABC123

The same idea, but with data.table:

library(zoo)
library(data.table)
setDT(mydf)
mydf[,associatedid := fifelse(
  is.na(associatedid) & na.locf(associatedid,F) == na.locf(associatedid,F,fromLast = TRUE), 
  na.locf(associatedid,F), associatedid),
  by = id]
mydf
#>     id year associatedid
#>  1:  1 2000         <NA>
#>  2:  1 2001       ABC123
#>  3:  1 2002       ABC123
#>  4:  1 2003       ABC123
#>  5:  1 2004       ABC123
#>  6:  1 2005       ABC123
#>  7:  2 2000         <NA>
#>  8:  2 2001       ABC123
#>  9:  2 2002       ABC123
#> 10:  2 2003         <NA>
#> 11:  2 2004       DEF456
#> 12:  2 2005       DEF456
#> 13:  3 2000         <NA>
#> 14:  3 2001       ABC123
#> 15:  3 2002       ABC123
#> 16:  3 2003       ABC123
#> 17:  3 2004       ABC123
#> 18:  3 2005       ABC123

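Finally, a fun idea using base R: note that if this character variable were numeric, you would want to interpolate only where constant interpolation and linear interpolation give the same result: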

i <- ave( as.numeric(factor(mydf$associatedid)), mydf$id,FUN = function(x) ifelse(
  approx(x,xout = seq_along(x))$y == (z<- approx(x,xout = seq_along(x),method = "constant")$y),
  z, x))
mydf$associatedid <- levels(mydf$associatedid)[i]
mydf
#>    id year associatedid
#> 1   1 2000         <NA>
#> 2   1 2001       ABC123
#> 3   1 2002       ABC123
#> 4   1 2003       ABC123
#> 5   1 2004       ABC123
#> 6   1 2005       ABC123
#> 7   2 2000         <NA>
#> 8   2 2001       ABC123
#> 9   2 2002       ABC123
#> 10  2 2003         <NA>
#> 11  2 2004       DEF456
#> 12  2 2005       DEF456
#> 13  3 2000         <NA>
#> 14  3 2001       ABC123
#> 15  3 2002       ABC123
#> 16  3 2003       ABC123
#> 17  3 2004       ABC123
#> 18  3 2005       ABC123
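
To see why comparing the two interpolations works, here is a small sketch on made-up numeric codes (1 and 2 standing in for two factor levels). The function name fill_codes is hypothetical. Linear interpolation between two different codes lands strictly between them, so it can only equal the constant (carry-forward) value when both neighbours are the same.

```r
# Hypothetical helper illustrating the approx() comparison on numeric codes.
fill_codes <- function(codes) {
  pos   <- seq_along(codes)
  known <- !is.na(codes)
  # approx() returns NA outside the range of the known points (rule = 1)
  linear   <- approx(pos[known], codes[known], xout = pos)$y
  constant <- approx(pos[known], codes[known], xout = pos, method = "constant")$y
  # fill only where both interpolations exist and agree
  ifelse(!is.na(linear) & linear == constant, constant, codes)
}

fill_codes(c(NA, 1, 1, NA, NA, 1))  # gap between equal neighbours
fill_codes(c(NA, 1, 1, NA, 2, 2))   # gap between different neighbours
```

One caveat: approx() needs at least two non-NA points for linear interpolation, so a group with fewer observed values would need separate handling.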

【Comments】:

【Solution 4】:

You can roll the missing rows forward and backward, compare the values, and assign wherever they are equal:

library(data.table)
DT = data.table(mydf)

# rows whose associatedid is missing
w  = DT[is.na(associatedid), which=TRUE]
# last observation carried forward / next observation carried backward
dn = DT[w, DT[-w][.SD, on=.(id, year), roll=TRUE, x.associatedid]]
up = DT[w, DT[-w][.SD, on=.(id, year), roll=-Inf, x.associatedid]]
# dn and up are aligned with w, so index them by position within w
keep = which(up == dn)
DT[w[keep], associatedid := dn[keep]]
DT

    id year associatedid
 1:  1 2000         <NA>
 2:  1 2001       ABC123
 3:  1 2002       ABC123
 4:  1 2003       ABC123
 5:  1 2004       ABC123
 6:  1 2005       ABC123
 7:  2 2000         <NA>
 8:  2 2001       ABC123
 9:  2 2002       ABC123
10:  2 2003         <NA>
11:  2 2004       DEF456
12:  2 2005       DEF456
13:  3 2000         <NA>
14:  3 2001       ABC123
15:  3 2002       ABC123
16:  3 2003       ABC123
17:  3 2004       ABC123
18:  3 2005       ABC123
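
For readers unfamiliar with rolling joins, here is a minimal sketch of the two roll directions on a made-up table (obs, gaps, and val are hypothetical names, not from the answer). roll=TRUE carries the last observation forward and roll=-Inf carries the next observation backward.

```r
library(data.table)

# Toy data: two observed years for one id, and two years to look up
obs  <- data.table(id = 1L, year = c(2001L, 2005L), val = c("a", "b"))
gaps <- data.table(id = 1L, year = c(2000L, 2003L))

# last observation carried forward: nothing precedes 2000, so it stays NA
prev <- obs[gaps, on = .(id, year), roll = TRUE, x.val]
# next observation carried backward
nxt  <- obs[gaps, on = .(id, year), roll = -Inf, x.val]

prev  # NA  "a"
nxt   # "a" "b"
```

A gap is fillable under the question's rule exactly when both lookups return the same non-NA value.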
    

【Comments】:

【Solution 5】:

Here is another attempt with dplyr:

      library(dplyr)
      
      mydf %>%
        #Detect NA values in associatedid
        mutate(isReplaced = is.na(associatedid), ans = associatedid) %>%
        group_by(id) %>%
        #Fill all NA values
        tidyr::fill(associatedid) %>%
        #Detect the NA values which were replaced
        mutate(isReplaced = isReplaced & !is.na(associatedid)) %>%
        #Group by id and associatedid 
        group_by(associatedid, .add = TRUE) %>%
        #Add NA values if it was isReplaced and is first or last row of the group
        mutate(ans = replace(associatedid,row_number() %in% c(1, n()) & isReplaced, NA)) %>%
        ungroup() %>%
        select(-isReplaced, -associatedid)
      
      
      # A tibble: 18 x 3
      #      id  year ans   
      #   <int> <int> <fct> 
      # 1     1  2000 NA    
      # 2     1  2001 ABC123
      # 3     1  2002 ABC123
      # 4     1  2003 ABC123
      # 5     1  2004 ABC123
      # 6     1  2005 ABC123
      # 7     2  2000 NA    
      # 8     2  2001 ABC123
      # 9     2  2002 ABC123
      #10     2  2003 NA    
      #11     2  2004 DEF456
      #12     2  2005 DEF456
      #13     3  2000 NA    
      #14     3  2001 ABC123
      #15     3  2002 ABC123
      #16     3  2003 ABC123
      #17     3  2004 ABC123
      #18     3  2005 ABC123
      

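【Comments】:

【Solution 6】:

I've been trying to put together a two-pass method: the first pass would change each NA to the preceding value (within an id) with "p_" pasted in front, and the second pass would check that the last value of such a sequence agrees with the next real value. I'm offering my code as far as it goes; it isn't really an answer, so I'm not expecting any upvotes. (It might be easier to rename associatedid to asid.)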

        lapply( split(df, df$id), 
            function(d){ d$associatedid <- as.character(d$associatedid)
            missloc <- with( d, tapply(is.na(associatedid), id,  which))
            for (n in missloc) if( 
                   d$associatedid[n+1] %in% c(d$associatedid[n-1],
                                           paste0("p_" , d$associatedid[n-1])&
            grepl( gsub("p\\_", "",  d$associatedid[n-1]), d$associatedid[n+1] )
                                { d$associatedid[n] <- d$associatedid[n-1]
                             } else{
                       #tentative NA replacement
                 d$associatedid[n] <- paste0("p_" , d$associatedid[n-1])}
         })
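
One way the two-pass idea described above might be completed is sketched below. This is not the answerer's code: the function name two_pass_fill is hypothetical, and it assumes no real value already starts with "p_".

```r
# Hypothetical completion of the two-pass idea; assumes no real value starts with "p_".
two_pass_fill <- function(x) {
  x <- as.character(x)
  n <- length(x)
  # Pass 1: tag each NA with a provisional "p_" copy of the value before it
  for (i in seq_len(n)[-1]) {
    if (is.na(x[i]) && !is.na(x[i - 1])) {
      x[i] <- paste0("p_", sub("^p_", "", x[i - 1]))
    }
  }
  # Pass 2: walk backwards, confirming a provisional value only when the
  # value right after it is the same real (or already confirmed) value
  for (i in rev(seq_len(n))) {
    if (!is.na(x[i]) && startsWith(x[i], "p_")) {
      candidate <- sub("^p_", "", x[i])
      nxt <- if (i < n) x[i + 1] else NA_character_
      x[i] <- if (!is.na(nxt) && nxt == candidate) candidate else NA_character_
    }
  }
  x
}

two_pass_fill(c(NA, "ABC123", "ABC123", NA, NA, "ABC123"))
two_pass_fill(c(NA, "ABC123", "ABC123", NA, "DEF456", "DEF456"))
```

Applied per id, for example with ave(as.character(mydf$associatedid), mydf$id, FUN = two_pass_fill), it should reproduce the desired output from the question.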
        

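【Comments】:

• Thanks for your input. The "two-pass" approach is something I hadn't really thought of, so I'll see whether I can find a way to make use of it. And you're right, next time I'll use simpler variable names in the example. This is only a guess at this point, but typically this kind of split-apply-combine process in data.table requires referencing the variable name only once.
• I've put a bounty on this old question, maybe you'd like to give it another try :)
• When @G.Grothendieck answers a question involving na.locf, I consider it canonical.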