R - How to fill in NA values, but only when the ending value is the same as the beginning value? - Answers

【Question Title】: R - How to fill in values in NA, but only when ending value is the same as the beginning value?
【Posted】: 2021-11-30 03:01:00
【Question】:

I have the following example data:

Example

col1
1
NA
NA
4
NA
NA
6
NA
NA
NA
6
8
NA
2
NA

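I want to fill the NAs with the value above them, but only when the NAs lie between two identical values. In this example, the first NA gap, between 1 and 4, should not be filled with 1. The gap between the first 6 and the second 6, however, should be filled with 6s. All other values should stay NA. Afterwards it should look like this: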

col1
1
NA
NA
4
NA
NA
6
6
6
6
6
8
NA
2
NA

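In reality I don't have just 15 observations but more than 50,000, so I need an efficient solution, which is proving harder than I expected. I tried using the fill function but couldn't come up with a solution.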

【Question Comments】:

Tags: r na fill

【Solution 1】:

One option using dplyr and zoo:

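library(dplyr)
library(zoo)

# Fill an NA only where the value carried forward equals the value carried backward
df %>%
    mutate(cond = na.locf0(col1) == na.locf0(col1, fromLast = TRUE),
           col1 = ifelse(cond, na.locf0(col1), col1)) %>%
    select(-cond)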

   col1
1     1
2    NA
3    NA
4     4
5    NA
6    NA
7     6
8     6
9     6
10    6
11    6
12    8
13   NA
14    2
15   NA
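
The comparison used above can also be reproduced without any packages, which may help show why it works: carry the last non-NA value forward, carry the next non-NA value backward, and fill only where the two agree. The following is a minimal base R sketch of that idea, not the answerer's code; the helper names `locf`, `nocb` and `fill_between_equal` are hypothetical and made up for illustration.

```r
# Last observation carried forward; leading NAs stay NA.
locf <- function(x) {
  idx <- cumsum(!is.na(x))          # index of the latest non-NA value (0 = none yet)
  c(NA, x[!is.na(x)])[idx + 1]
}

# Next observation carried backward; trailing NAs stay NA.
nocb <- function(x) rev(locf(rev(x)))

# Fill an NA only when the values on both sides of the gap are equal.
fill_between_equal <- function(x) {
  fwd <- locf(x)
  bwd <- nocb(x)
  same <- !is.na(fwd) & !is.na(bwd) & fwd == bwd
  ifelse(same, fwd, x)
}

col1 <- c(1, NA, NA, 4, NA, NA, 6, NA, NA, NA, 6, 8, NA, 2, NA)
print(fill_between_equal(col1))
```

On the example data this should reproduce the column shown above. Every step is vectorised, so it should also cope with 50,000 rows, although that has not been benchmarked here.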

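【Comments】:

Your zoo solution works well. The dplyr one does not, though: it is fine on the example but not on my real data. It seems NAs keep being filled after the matching start and end values, where the filling should stop. For instance, with the data (6, NA, NA, 6, NA, 8) your dplyr solution gives (6, 6, 6, 6, 6, 8). But again, the zoo solution works fine. Thanks.

【Solution 2】: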

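Here is a tidyverse approach using dplyr and tidyr. The logic:

1. Create an id column.
2. Remove all NA rows.
3. Flag each row whose next value is the same.
4. right_join with the original Example df.
5. fill the flag and the corresponding col1.y downward.
6. mutate with ifelse.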

library(dplyr)
library(tidyr)

Example <- Example %>% 
  mutate(id=row_number())

Example %>% 
  na.omit() %>% 
  mutate(flag = ifelse(col1==lead(col1), TRUE, FALSE)) %>% 
  right_join(Example, by="id") %>% 
  arrange(id) %>% 
  fill(col1.y, .direction="down") %>% 
  fill(flag, .direction="down") %>% 
  mutate(col1.x = ifelse(flag==TRUE, col1.y, col1.x), .keep="unused") %>% 
  select(col1 = col1.x)

【Comments】:

【Solution 3】:

df <- data.frame(col1 =c(1, NA, NA, 4, NA, NA, 6, NA, NA, NA, 6, 8, NA, 2, NA))

library(data.table)
library(magrittr)

setDT(df)[!is.na(col1), n := .N, by = col1] %>% 
  .[, n := nafill(n, type = "locf")] %>% 
  .[n == 2, col1 := nafill(col1, type = "locf")] %>% 
  .[, n := NULL] %>% 
  .[]
#>     col1
#>  1:    1
#>  2:   NA
#>  3:   NA
#>  4:    4
#>  5:   NA
#>  6:   NA
#>  7:    6
#>  8:    6
#>  9:    6
#> 10:    6
#> 11:    6
#> 12:    8
#> 13:   NA
#> 14:    2
#> 15:   NA

Created on 2021-10-11 by the reprex package (v2.0.1)
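
One caveat with the approach above: it counts how often each value occurs in the whole column, so it relies on a value appearing exactly twice with those two occurrences bracketing the gap, which holds in the example data. As a hedged alternative, the sketch below compares a forward fill with a backward fill instead, using the same `nafill()` function. The helper columns `fwd` and `bwd` are names made up for illustration, and this is a sketch rather than the answerer's method.

```r
library(data.table)

dt <- data.table(col1 = c(1, NA, NA, 4, NA, NA, 6, NA, NA, NA, 6, 8, NA, 2, NA))

# Helper columns: last value carried forward and next value carried backward.
dt[, fwd := nafill(col1, type = "locf")]
dt[, bwd := nafill(col1, type = "nocb")]

# Fill only the NAs whose neighbours on both sides agree.
# Rows where the comparison is NA (leading or trailing gaps) are not selected.
dt[is.na(col1) & fwd == bwd, col1 := fwd]

# Drop the helper columns again.
dt[, c("fwd", "bwd") := NULL]

print(dt)
```

Because the comparison is made row by row, a value that shows up again later in the column, or more than twice, should not affect gaps it does not bracket.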

【Comments】:

【Solution 4】:

Here is a dplyr solution:

First I create the data as a tibble:

library(dplyr)

df <- tibble(
  x = c(1, NA_real_, NA_real_, 
        4, NA_real_, NA_real_,
        6, NA_real_, NA_real_, NA_real_, 
        6, 8, NA_real_, 2, NA_real_)
)

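Next, I create two grouping variables that help identify the non-NA value before and after each gap. I then save these reference values to ref_start and ref_end. Finally, I overwrite the values of x: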

df %>%
  mutate(gr1 = cumsum(!is.na(x))) %>%
  group_by(gr1) %>%
  mutate(ref_start = first(x)) %>%
  ungroup() %>%
  mutate(gr2 = lag(gr1, default = 1)) %>%
  group_by(gr2) %>%
  mutate(ref_end = last(x)) %>%
  ungroup() %>%
  mutate(x = if_else(is.na(x) & ref_start == ref_end, ref_start, x)) %>%
  select(x)

# A tibble: 15 x 1
       x
   <dbl>
 1     1
 2    NA
 3    NA
 4     4
 5    NA
 6    NA
 7     6
 8     6
 9     6
10     6
11     6
12     8
13    NA
14     2
15    NA
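
To make the two grouping variables easier to follow, the same steps can be traced in base R with the intermediate vectors kept visible. This is only an illustrative sketch of the logic described above, using `ave()` in place of `group_by()`; the vector names mirror the pipeline, while `fill` and `filled` are names made up here.

```r
x <- c(1, NA, NA, 4, NA, NA, 6, NA, NA, NA, 6, 8, NA, 2, NA)

# gr1 starts a new group at every non-NA value, so each group is
# a value followed by its trailing NAs.
gr1 <- cumsum(!is.na(x))
ref_start <- ave(x, gr1, FUN = function(v) v[1])

# gr2 is gr1 shifted down by one row, so each group ends at the
# value that closes a gap.
gr2 <- c(1, head(gr1, -1))
ref_end <- ave(x, gr2, FUN = function(v) v[length(v)])

# Inspect the helper vectors side by side.
print(data.frame(x, gr1, ref_start, gr2, ref_end))

# Fill an NA only when the value before the gap equals the value after it.
fill <- which(is.na(x) & ref_start == ref_end)
filled <- x
filled[fill] <- ref_start[fill]
print(filled)
```

For rows 8 to 10, `ref_start` and `ref_end` both come out as 6, which is why only that gap is filled; the trailing NA in row 15 has no closing value, so its comparison is NA and it is left alone.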

【Comments】: