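【Question title】: Fill gaps using a group mean in R
【Posted】: 2019-09-30 23:02:48
【Question】:

I have a dataset with gaps in one column (temp). I am trying to fill each gap using the temp reading from the other sensor, or the mean of the sensors in the same treatment, taken at the same date stamp. I would like to do this with tidyverse/lubridate.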

date    treatment   sensor  temp
1/01/2019   1   A   30
2/01/2019   1   A   29.1
3/01/2019   1   A   21.2
4/01/2019   1   A   NA
1/01/2019   1   B   20.5
2/01/2019   1   B   19.8
3/01/2019   1   B   35.1
4/01/2019   1   B   23.5
1/01/2019   2   C   31.2
2/01/2019   2   C   32.1
3/01/2019   2   C   28.1
4/01/2019   2   C   31.2
1/01/2019   2   D   NA
2/01/2019   2   D   26.5
3/01/2019   2   D   27.9
4/01/2019   2   D   28

This is what I expect:

date    treatment   sensor  temp
1/01/2019   1   A   30
2/01/2019   1   A   29.1
3/01/2019   1   A   21.2
4/01/2019   1   A   23.5
1/01/2019   1   B   20.5
2/01/2019   1   B   19.8
3/01/2019   1   B   35.1
4/01/2019   1   B   23.5
1/01/2019   2   C   31.2
2/01/2019   2   C   32.1
3/01/2019   2   C   28.1
4/01/2019   2   C   31.2
1/01/2019   2   D   31.2
2/01/2019   2   D   26.5
3/01/2019   2   D   27.9
4/01/2019   2   D   28

Thank you very much for your help.

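【Comments on the question】:

  • Do you mean imputing some values in the temp column? Do you want to carry a value forward or backward in time, or just take a value across from another column?
  • tidyr::fill() is hard to use in this example because, when the data are grouped by date and treatment, one fill goes "down" and the other goes "up".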
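
On the direction problem raised in the last comment: reasonably recent versions of tidyr accept `.direction = "downup"` in `fill()`, which fills downwards first and then upwards within each group. The sketch below rebuilds the question's data by hand and assumes such a tidyr version is installed. Note that it copies a neighbouring value rather than computing a mean, so it only matches the group mean when a group has a single non-missing reading, as in this example.

```r
library(dplyr)
library(tidyr)

# The question's data, rebuilt by hand
df <- data.frame(
  date = rep(c("1/01/2019", "2/01/2019", "3/01/2019", "4/01/2019"), times = 4),
  treatment = rep(1:2, each = 8),
  sensor = rep(c("A", "B", "C", "D"), each = 4),
  temp = c(30, 29.1, 21.2, NA, 20.5, 19.8, 35.1, 23.5,
           31.2, 32.1, 28.1, 31.2, NA, 26.5, 27.9, 28)
)

filled <- df %>%
  group_by(date, treatment) %>%
  fill(temp, .direction = "downup") %>%  # fill down, then up, within each group
  ungroup()

filled$temp[c(4, 13)]  # the two rows that were NA
```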

Tags: r tidyverse


【Solution 1】:

How about this:

library(dplyr)

df <- df %>%
  group_by(date, treatment) %>%
  mutate(
    fill = mean(temp, na.rm = TRUE),  # group mean used to fill the blanks
    temp2 = case_when(!is.na(temp) ~ temp,  # keep existing readings
                      TRUE ~ fill)          # otherwise use the group mean
  )

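【Comments】:

  • This works perfectly for what I am trying to achieve. Thank you very much.
  • You could also use if_else here, which is more compact, but I tend to prefer case_when because it is more flexible.
  • I also prefer using case_when.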
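
For comparison, the `if_else` variant mentioned in the comments might look like the sketch below. It rebuilds the question's data by hand and overwrites `temp` directly instead of adding the helper columns `fill` and `temp2`; that is a choice made for this illustration, not part of the answer above.

```r
library(dplyr)

# The question's data, rebuilt by hand
df <- data.frame(
  date = rep(c("1/01/2019", "2/01/2019", "3/01/2019", "4/01/2019"), times = 4),
  treatment = rep(1:2, each = 8),
  sensor = rep(c("A", "B", "C", "D"), each = 4),
  temp = c(30, 29.1, 21.2, NA, 20.5, 19.8, 35.1, 23.5,
           31.2, 32.1, 28.1, 31.2, NA, 26.5, 27.9, 28)
)

filled <- df %>%
  group_by(date, treatment) %>%
  # replace NA with the mean of the non-missing readings in the same group
  mutate(temp = if_else(is.na(temp), mean(temp, na.rm = TRUE), temp)) %>%
  ungroup()

filled$temp[c(4, 13)]  # the two rows that were NA
```

If every reading in a group is missing, `mean(..., na.rm = TRUE)` returns `NaN`, so such gaps would still need separate handling.
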
【Solution 2】:

Another option is na.aggregate from zoo:

library(dplyr)
library(zoo)
df %>% 
   group_by(date, treatment) %>%
   mutate(temp = na.aggregate(temp))
# A tibble: 16 x 4
# Groups:   date, treatment [8]
#   date      treatment sensor  temp
#   <fct>         <int> <fct>  <dbl>
# 1 1/01/2019         1 A       30  
# 2 2/01/2019         1 A       29.1
# 3 3/01/2019         1 A       21.2
# 4 4/01/2019         1 A       23.5
# 5 1/01/2019         1 B       20.5
# 6 2/01/2019         1 B       19.8
# 7 3/01/2019         1 B       35.1
# 8 4/01/2019         1 B       23.5
# 9 1/01/2019         2 C       31.2
#10 2/01/2019         2 C       32.1
#11 3/01/2019         2 C       28.1
#12 4/01/2019         2 C       31.2
#13 1/01/2019         2 D       31.2
#14 2/01/2019         2 D       26.5
#15 3/01/2019         2 D       27.9
#16 4/01/2019         2 D       28  

Data

df <- structure(list(date = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("1/01/2019", 
"2/01/2019", "3/01/2019", "4/01/2019"), class = "factor"), treatment = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
    sensor = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 
    3L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"
    ), class = "factor"), temp = c(30, 29.1, 21.2, NA, 20.5, 
    19.8, 35.1, 23.5, 31.2, 32.1, 28.1, 31.2, NA, 26.5, 27.9, 
    28)), class = "data.frame", row.names = c(NA, -16L))
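
By default `na.aggregate` replaces each `NA` with the mean of the non-missing values, so a group with more than two sensors gets the average of the remaining readings. The sketch below illustrates this on made-up data: the three-sensor data frame `three` is hypothetical and not part of the question.

```r
library(dplyr)
library(zoo)

# Hypothetical readings: three sensors in one treatment on one date
three <- data.frame(
  date = "1/01/2019",
  treatment = 1L,
  sensor = c("A", "B", "C"),
  temp = c(20, NA, 30)
)

filled <- three %>%
  group_by(date, treatment) %>%
  mutate(temp = na.aggregate(temp)) %>%  # default FUN is mean
  ungroup()

filled$temp  # sensor B receives the mean of sensors A and C
```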

【Solution 3】:

Here is one option using map2_dbl from purrr. We group_by treatment and replace each NA temp with the first non-NA temp in the group that has the same date.

    library(dplyr)
    library(purrr)
    
    df %>%
      group_by(treatment) %>%
      mutate(temp = map2_dbl(temp, date, ~if (is.na(.x)) 
                        temp[which.max(date == .y & !is.na(temp))] else .x))
    
    #   date      treatment sensor  temp
    #   <fct>         <int> <fct>  <dbl>
    # 1 1/01/2019         1 A       30  
    # 2 2/01/2019         1 A       29.1
    # 3 3/01/2019         1 A       21.2
    # 4 4/01/2019         1 A       23.5
    # 5 1/01/2019         1 B       20.5
    # 6 2/01/2019         1 B       19.8
    # 7 3/01/2019         1 B       35.1
    # 8 4/01/2019         1 B       23.5
    # 9 1/01/2019         2 C       31.2
    #10 2/01/2019         2 C       32.1
    #11 3/01/2019         2 C       28.1
    #12 4/01/2019         2 C       31.2
    #13 1/01/2019         2 D       31.2
    #14 2/01/2019         2 D       26.5
    #15 3/01/2019         2 D       27.9
    #16 4/01/2019         2 D       28  
    

Data

    df <- structure(list(date = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 
    4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("1/01/2019", 
    "2/01/2019", "3/01/2019", "4/01/2019"), class = "factor"), treatment = 
    c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
    sensor = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 
    3L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"
    ), class = "factor"), temp = c(30, 29.1, 21.2, NA, 20.5, 
    19.8, 35.1, 23.5, 31.2, 32.1, 28.1, 31.2, NA, 26.5, 27.9, 
    28)), class = "data.frame", row.names = c(NA, -16L))
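
One caveat with the `which.max()` lookup in this answer: on an all-`FALSE` logical vector, `which.max()` still returns 1, so a gap with no non-missing reading on the same date would silently receive the first value in the group. The sketch below shows this on hypothetical vectors (not the question's data) and uses `match(TRUE, ...)` as one possible alternative that yields `NA` instead.

```r
# Hypothetical vectors standing in for one treatment group
date <- c("1/01/2019", "2/01/2019", "1/01/2019", "2/01/2019")
temp <- c(30, NA, 31.2, NA)

# No non-missing reading exists for 2/01/2019, so cond is all FALSE
cond <- date == "2/01/2019" & !is.na(temp)

unsafe <- temp[which.max(cond)]  # which.max() gives index 1, a reading from another date
safe <- temp[match(TRUE, cond)]  # match() gives NA, so the gap stays NA

unsafe
safe
```

In the question's data every gap has a partner reading on the same date, so the original answer is unaffected.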
    
