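Interpolate NA values: answers

【Question title】: Interpolate NA values
【Posted】: 2011-11-03 13:35:24
【Question】:

I have two sets of samples taken at independent times. I would like to merge them and calculate the missing values at the times where I do not have values for both. Simplified example: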

A <- cbind(time=c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
           Avalue=c(1, 2, 3, 2, 1, 2, 3, 2, 1, 2))
B <- cbind(time=c(15, 30, 45, 60), Bvalue=c(100, 200, 300, 400))
C <- merge(A,B, all=TRUE)

   time Avalue Bvalue
1    10      1     NA
2    15     NA    100
3    20      2     NA
4    30      3    200
5    40      2     NA
6    45     NA    300
7    50      1     NA
8    60      2    400
9    70      3     NA
10   80      2     NA
11   90      1     NA
12  100      2     NA

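The missing NA values can be calculated by assuming a linear change between consecutive samples. Intuitively it is easy to see that the A value at times 15 and 45 should be 1.5. For B, however, the proper calculation at time 20, for instance, would be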

100 + (20 - 15) * (200 - 100) / (30 - 15)

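which equals 133.33333. The first parenthesis is the time between the estimation time and the last available sample. The second parenthesis is the difference between the values of the two nearest samples. The third parenthesis is the time between the two nearest samples.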
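The calculation above can be written as a small R function. This is only a sketch of the formula; the name `lerp` and its argument names are made up for illustration and do not come from any package.

```r
# Linear interpolation between two known samples (t0, v0) and (t1, v1),
# evaluated at time t. `lerp` is a hypothetical helper name.
lerp <- function(t, t0, v0, t1, v1) {
  v0 + (t - t0) * (v1 - v0) / (t1 - t0)
}

# B at time 20, from the B samples at time 15 (100) and time 30 (200)
b_at_20 <- lerp(20, t0 = 15, v0 = 100, t1 = 30, v1 = 200)

# A at time 15, from the A samples at time 10 (1) and time 20 (2)
a_at_15 <- lerp(15, t0 = 10, v0 = 1, t1 = 20, v1 = 2)

print(c(b_at_20, a_at_15))
```

The two results correspond to the 133.33333 and 1.5 mentioned in the text.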

How can I use R to calculate the NA values?

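【Comments】:

Should this be renamed to "interpolate" or "impute" ("... missing values")? I don't think "extrapolate" applies here.
Yes, you are right, interpolate is the correct term. I will update.

Tags: r interpolation

【Solution 1】:

An ugly and probably inefficient Base R solution: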

# Data provided:
A <- cbind(time=c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
           Avalue=c(1, 2, 3, 2, 1, 2, 3, 2, 1, 2))
B <- cbind(time=c(15, 30, 45, 60), Bvalue=c(100, 200, 300, 400))
C <- merge(A,B, all=TRUE)

# Scalar valued at the minimum time difference: -> min_time_diff

min_time_diff <- min(diff(C$time))

# Adjust frequency of the series to hold all steps in range: -> df

df <- merge(C, 
            data.frame(time = seq(min_time_diff, 
                                 max(C$time), 
                                 by = min_time_diff)),
           by = "time",
           all = TRUE)



# Linear interpolation function handling ties,
# returns an interpolated vector the same length 
# as the input vector: -> vector

l_interp_vec <- function(na_vec){

  approx(x = na_vec,

         method = "linear",

         ties = "constant",

         n = length(na_vec))$y

}

# Applied to a dataframe, replacing NA values
# in each of the numeric vectors, 
# with interpolated values. 
# input is dataframe: -> dataframe()

interped_df <- data.frame(lapply(df, function(x){

      if(is.numeric(x)){

        # Store a scalar of min row where x isn't NA: -> min_non_na

        min_non_na <- min(which(!(is.na(x))))

        # Store a scalar of max row where x isn't NA: -> max_non_na

        max_non_na <- max(which(!(is.na(x))))

        # Store scalar of the number of rows needed to impute prior 
        # to first NA value: -> ru_lower

        ru_lower <- ifelse(min_non_na > 1, min_non_na - 1, min_non_na)

        # Store scalar of the number of rows needed to impute after
        # the last non-NA value: -> ru_upper

        ru_upper <- ifelse(max_non_na == length(x), 

                           length(x) - 1, 

                           (length(x) - (max_non_na + 1)))

        # Store a vector of the ramp-up values before the first non-NA value: -> ramp_up

        ramp_up <- as.numeric(
          cumsum(rep(x[min_non_na]/(min_non_na), ru_lower))
          )

        # Apply the interpolation function on vector "x": -> y

        y <- as.numeric(l_interp_vec(as.numeric(x[min_non_na:max_non_na])))

        # Create a vector that combines the ramp_up vector 
        # and y if the first non-NA value is not at row 1: -> z
        # (testing min_non_na > 1 rather than length(ramp_up) > 1 also
        # covers the case where exactly one leading row is NA)

        if(min_non_na > 1 && max_non_na != length(x)){

          # Create a vector of interpolations if there are 
          # NA values after the last non-NA value: -> lower_l_int

          lower_l_int <- as.numeric(cumsum(rep(mean(diff(c(ramp_up, y))),
                                               ru_upper+1)) +
                                  as.numeric(x[max_non_na]))

          # Store the linear interpolations in a vector: -> z

          z <- as.numeric(c(ramp_up, y, lower_l_int))

        }else if(min_non_na > 1 && max_non_na == length(x)){

          # Store the linear interpolations in a vector: -> z

          z <- as.numeric(c(ramp_up, y))

        }else if(min_non_na == 1 && max_non_na != length(x)){

          # Create a vector interpolations if there are 
          # multiple NA values after the last value: -> lower_l_int

          lower_l_int <- as.numeric(cumsum(rep(mean(diff(c(ramp_up, y))),
                                               ru_upper+1)) +
                                  as.numeric(x[max_non_na]))


          # Store the linear interpolations in  a vector: -> z

          z <- as.numeric(c(y, lower_l_int))

        }else{

          # Store the linear interpolations in  a vector: -> z

          z <- as.numeric(y)

        }

        # Interpolate between points in x, return new x:

        return(as.numeric(ifelse(is.na(x), z, x)))

      }else{

        x

      }

    }

  )

)

# Subset interped df to only contain 
# the time values in C, store a data frame: -> int_df_subset

int_df_subset <- interped_df[interped_df$time %in% C$time,]
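For comparison, a much shorter base R sketch calls `approx()` with the real time values as `x` and asks for output at those same times. This is a different technique from the code above, not a rewrite of it: it only interpolates between observed samples and, with `rule = 1`, leaves times outside the observed range of a column as NA instead of extrapolating. The helper name `interp_col` is made up for illustration.

```r
# Data from the question
A <- cbind(time = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
           Avalue = c(1, 2, 3, 2, 1, 2, 3, 2, 1, 2))
B <- cbind(time = c(15, 30, 45, 60), Bvalue = c(100, 200, 300, 400))
C <- merge(A, B, all = TRUE)

# Interpolate one column against the time column (hypothetical helper).
# approx() drops (time, value) pairs with NA before fitting; rule = 1
# returns NA for times outside the range of the remaining samples.
interp_col <- function(time, value) {
  approx(x = time, y = value, xout = time, method = "linear", rule = 1)$y
}

C_interp <- C
C_interp$Avalue <- interp_col(C$time, C$Avalue)
C_interp$Bvalue <- interp_col(C$time, C$Bvalue)

print(C_interp)
```

Bvalue at time 10 and at times 70 to 100 stays NA here, because no B sample exists before time 15 or after time 60.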

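【Comments】:

【Solution 2】:

The proper way to do this statistically and still get valid confidence intervals is to use multiple imputation. See Rubin's classic book; there is also an excellent R package for this (mi).

【Comments】:

Care to provide a citation for the Rubin paper?
I could not find the paper. His book is also a classic. If I later find the paper I was thinking of, I will edit further.

【Solution 3】:

Use the zoo package: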

library(zoo)
Cz <- zoo(C)
index(Cz) <- Cz[,1]
Cz_approx <- na.approx(Cz)

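【Comments】:

Awesome. I don't quite understand what the index(Cz) <- Cz[,1] statement is doing. Care to explain?
By default, na.approx() uses index(obj) as the points at which to interpolate each column of the data frame. The default index is 1:12, so I used index() to replace it with the actual time measurements. However, if you want to keep the default index, you can call na.approx(Cz, x=Cz$time).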
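To see why the index matters, here is a small sketch on a plain vector (the four-row excerpt is made up from the question's data, and it assumes the zoo package is installed). Interpolating by row position treats the rows as equally spaced, while passing the times as `x` weights by the actual time gaps.

```r
library(zoo)

# Four rows taken from the merged data in the question
time   <- c(10, 15, 20, 30)
Bvalue <- c(NA, 100, NA, 200)

# Default: x is the position 1:4, so the gap is filled halfway
by_position <- na.approx(Bvalue, na.rm = FALSE)

# With the real times as x, time 20 sits one third of the way from 15 to 30
by_time <- na.approx(Bvalue, x = time, na.rm = FALSE)

print(by_position)
print(by_time)
```

The third element is 150 when interpolating by position and 133.33 when interpolating by time. The leading NA is kept in both cases because of `na.rm = FALSE`.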
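library(zoo); ?index gives "Description: Generic functions for extracting the index of an object and replacing it." You are manipulating parts of a zoo object. Reading the manual (RTFM) before asking a question is always a good idea.
Note that converting the data frame to zoo can also be written as Cz <- read.zoo(C), which automatically assumes that the first column contains the times. Also, zoo's na.approx has a default method that works on plain vectors, so even without converting C to zoo we can do: C$Bvalue <- na.approx(C$Bvalue, C$time, na.rm = FALSE).
Also consider wrapping that command as na.fill(na.approx(Cz), "extend"), so that leading and trailing NAs do not cause additional difficulties.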
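Putting the last two comments together on a plain vector might look like the sketch below. It assumes the zoo package is installed; the variable names `B_interp` and `B_filled` are made up for illustration.

```r
library(zoo)

# Data from the question
A <- cbind(time = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
           Avalue = c(1, 2, 3, 2, 1, 2, 3, 2, 1, 2))
B <- cbind(time = c(15, 30, 45, 60), Bvalue = c(100, 200, 300, 400))
C <- merge(A, B, all = TRUE)

# Interpolate Bvalue against time; na.rm = FALSE keeps the vector at
# full length, so leading and trailing NAs are still present
B_interp <- na.approx(C$Bvalue, C$time, na.rm = FALSE)

# "extend" replaces leading/trailing NAs with the nearest non-NA value
B_filled <- na.fill(B_interp, "extend")

print(cbind(time = C$time, B_interp, B_filled))
```

Extending with the nearest value is a constant fill, not a linear extrapolation, so whether it is appropriate depends on the data.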