Fill gaps in a time series with averages: answers

【Question title】: fill gaps in a timeseries with averages
【Posted】: 2011-09-08 09:39:24
【Question】:

I have a data frame like this:

day         sum_flux  samples mean
2005-10-26     0.02     48    0.02
2005-10-27     0.12     12    0.50

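This is a series of daily readings spanning 5 years, but some days are missing. I want to fill these days with the average of that month from the other years.

That is, if 26-10-2005 is missing, I want to use the average of all Octobers in the data set. If the whole of October is missing, I want to apply this average to each missing day.

I think I need to build a function (possibly using plyr) to evaluate the days. However, I am very inexperienced with the various time series objects in R and with conditionally subsetting data, and I would appreciate some advice, particularly on which type of time series I should use.

Many thanks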

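【Comments】:

By doing this you are assuming that there is no trend, i.e. that each year has values similar to the other years. Are you sure you believe that?
Also, which column do you want to apply the average to, sum_flux or mean?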
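
One rough way to look at the no-trend question is to tabulate the monthly mean separately for each year and see how much a given month moves between years. The sketch below uses made-up data with a deliberate upward drift; the names yr, mon, flux and by_year_month are illustrative only and do not come from the question.

```r
# Hypothetical data: daily values over five years with a small upward drift.
set.seed(1)
day  <- seq(as.Date("2005-10-26"), as.Date("2010-10-26"), by = "day")
yr   <- as.integer(format(day, "%Y"))
mon  <- as.integer(format(day, "%m"))
flux <- runif(length(day)) + 0.1 * (yr - 2005)

# Rows are years, columns are months; each cell is that month's mean in that year.
# Year/month combinations with no data (e.g. January 2005) come out as NA.
by_year_month <- tapply(flux, list(year = yr, month = mon), mean, na.rm = TRUE)
round(by_year_month, 2)

# Spread of each month's mean across the years. A spread that is large
# relative to the values themselves suggests pooling years may be misleading.
month_spread <- apply(by_year_month, 2, sd, na.rm = TRUE)
round(month_spread, 2)
```

If the columns drift steadily from the first year to the last, a pooled monthly average will be too high for early gaps and too low for late ones.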

Tags: r time-series

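【Solution 1】:

Some sample data. I've assumed that sum_flux is the column that contains the missing values, and that it is the one you want to compute values for.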

library(lubridate)
# ymd() returns Date objects in current lubridate, so use the generic seq() rather than seq.POSIXt()
days <- seq(ymd("2005-10-26"), ymd("2010-10-26"), by = "1 day")
n_days <- length(days)
readings <- data.frame(
  day      = days,
  sum_flux = runif(n_days),
  samples  = sample(100, n_days, replace = TRUE),
  mean     = runif(n_days)
)
readings$sum_flux[sample(n_days, floor(n_days / 10))] <- NA

Add a month column.

readings$month <- month(readings$day, label = TRUE)

Use tapply to get the monthly mean flux.

monthly_avg_flux <- with(readings, tapply(sum_flux, month, mean, na.rm = TRUE))

Use this value wherever the flux is missing; otherwise keep the original flux.

readings$sum_flux2 <- with(readings, ifelse(
  is.na(sum_flux), 
  monthly_avg_flux[month], 
  sum_flux
))
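
The example above marks gaps as NA, whereas the question describes days that may have no row at all. A possible way to handle that case, sketched here with base R only, is to join the readings onto a complete daily calendar first and then fill by month. The tiny data set and the names calendar, full and sum_flux_filled are made up for illustration.

```r
# Hypothetical readings: several days have no row at all, and one value is NA.
readings <- data.frame(
  day      = as.Date(c("2005-10-26", "2005-10-27", "2005-10-30", "2006-10-28")),
  sum_flux = c(0.2, 0.4, NA, 0.6)
)

# Build the full daily calendar and left-join the readings onto it,
# so that absent days become rows with sum_flux = NA.
calendar <- data.frame(day = seq(min(readings$day), max(readings$day), by = "day"))
full <- merge(calendar, readings, by = "day", all.x = TRUE)

# Fill each NA with the mean of the same calendar month, pooled across years.
# Months with no observed value at all end up as NaN, since there is nothing to average.
full$month <- format(full$day, "%m")
full$sum_flux_filled <- ave(full$sum_flux, full$month, FUN = function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})

head(full)
```

Here ave() does the group-wise fill in one step, which is an alternative to the tapply plus ifelse combination rather than a replacement for it.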

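【Comments】:

+1 for lubridate, and for pointing out the effect in your comment
Many thanks Richie, and sorry for the delayed reply. RE: assuming no trend, the year-to-year variation is generally larger than any measurable trend (the time series is too short).
Just ran it over the data and it is exactly what I wanted, thanks again.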

【Solution 2】:

Here is one (very fast) way using data.table.

Using the nice sample data from Richie:

require(data.table)
days <- seq(as.IDate("2005-10-26"), as.IDate("2010-10-26"), by = "1 day")
n_days <- length(days)
readings <- data.table(
    day      = days,
    sum_flux = runif(n_days),
    samples  = sample(100, n_days, replace = TRUE),
    mean     = runif(n_days)
)
readings$sum_flux[sample(n_days, floor(n_days / 10))] <- NA
readings
             day   sum_flux samples       mean
 [1,] 2005-10-26 0.32838686      94 0.09647325
 [2,] 2005-10-27 0.14686591      88 0.48728321
 [3,] 2005-10-28 0.25800913      51 0.72776002
 [4,] 2005-10-29 0.09628937      81 0.80954124
 [5,] 2005-10-30 0.70721591      23 0.60165240
 [6,] 2005-10-31 0.59555079       2 0.96849533
 [7,] 2005-11-01         NA      42 0.37566491
 [8,] 2005-11-02 0.01649860      89 0.48866220
 [9,] 2005-11-03 0.46802818      49 0.28920807
[10,] 2005-11-04 0.13024856      30 0.29051080
First 10 rows of 1827 printed.

Create the average for each month, in order of first appearance of each group:

avg = readings[,mean(sum_flux,na.rm=TRUE),by=list(mnth = month(day))]
avg
      mnth        V1
 [1,]   10 0.4915999
 [2,]   11 0.5107873
 [3,]   12 0.4451787
 [4,]    1 0.4966040
 [5,]    2 0.4972244
 [6,]    3 0.4952821
 [7,]    4 0.5106539
 [8,]    5 0.4717122
 [9,]    6 0.5110490
[10,]    7 0.4507383
[11,]    8 0.4680827
[12,]    9 0.5150618

Next, reorder avg so that it starts in January:

avg = avg[order(mnth)]
avg
      mnth        V1
 [1,]    1 0.4966040
 [2,]    2 0.4972244
 [3,]    3 0.4952821
 [4,]    4 0.5106539
 [5,]    5 0.4717122
 [6,]    6 0.5110490
 [7,]    7 0.4507383
 [8,]    8 0.4680827
 [9,]    9 0.5150618
[10,]   10 0.4915999
[11,]   11 0.5107873
[12,]   12 0.4451787
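
The reordering matters because an expression such as avg$V1[month(day)] picks rows by position, so row 1 has to be January. A small base R sketch of the difference, using a made-up lookup table whose V1 values are simply the month number divided by 100, with match() as one possible order-independent alternative:

```r
# Hypothetical lookup table in order of first appearance (October first).
avg <- data.frame(mnth = c(10, 11, 12, 1:9))
avg$V1 <- avg$mnth / 100

m <- c(11, 1)  # months we want to look up

# Positional lookup: returns rows 11 and 1, which hold months 8 and 10.
positional <- avg$V1[m]

# Keyed lookup: finds the rows whose mnth equals 11 and 1, whatever the row order.
keyed <- avg$V1[match(m, avg$mnth)]

positional
keyed
```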

Now update the sum_flux column by reference (:=) where sum_flux is NA, using that month's value from avg.

readings[is.na(sum_flux), sum_flux:=avg$V1[month(day)]]
             day   sum_flux samples       mean
 [1,] 2005-10-26 0.32838686      94 0.09647325
 [2,] 2005-10-27 0.14686591      88 0.48728321
 [3,] 2005-10-28 0.25800913      51 0.72776002
 [4,] 2005-10-29 0.09628937      81 0.80954124
 [5,] 2005-10-30 0.70721591      23 0.60165240
 [6,] 2005-10-31 0.59555079       2 0.96849533
 [7,] 2005-11-01 0.51078729**    42 0.37566491  # ** updated with the Nov avg
 [8,] 2005-11-02 0.01649860      89 0.48866220
 [9,] 2005-11-03 0.46802818      49 0.28920807
[10,] 2005-11-04 0.13024856      30 0.29051080
First 10 rows of 1827 printed.

Done.
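
As a possible variation, the separate avg table and the reordering step can be skipped by doing the fill as a grouped update. This is only a sketch: it assumes a data.table version that provides fifelse(), and the seed and the was_na helper are added purely for illustration.

```r
library(data.table)

# Sample data in the same shape as above, trimmed to the two columns that matter.
set.seed(42)
days <- seq(as.IDate("2005-10-26"), as.IDate("2010-10-26"), by = "1 day")
readings <- data.table(day = days, sum_flux = runif(length(days)))

# Blank out roughly 10% of the values.
idx <- sample(nrow(readings), floor(nrow(readings) / 10))
readings[idx, sum_flux := NA_real_]
was_na <- is.na(readings$sum_flux)  # remember which rows were missing

# Within each calendar month (pooled across years), replace NA by the
# mean of the observed values in that group; other values stay as they are.
readings[, sum_flux := fifelse(is.na(sum_flux),
                               mean(sum_flux, na.rm = TRUE),
                               sum_flux),
         by = .(mnth = month(day))]

head(readings[was_na])
```

Because the mean is computed inside each group, there is no lookup by position and therefore no dependence on the order of the months.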

【Comments】: