How to replace NA values with the previous non-NA value for the same ID

【Question title】: How to replace NA value with previous non NA value given to same ID
【Posted】: 2020-03-22 01:24:52
【Question】:

I am working in R and using data.table. I have a dataset that looks like this:

ID   country_id    weight
1    BGD           56
1    NA            57
1    NA            63
2    SA            12
2    NA            53
2    SA            54

If the value in country_id is NA, I need to replace it with the non-NA country_id value given to the same ID. I would like the dataset to look like this:

ID   country_id    weight
1    BGD           56
1    BGD           57
1    BGD           63
2    SA            12
2    SA            53
2    SA            54

This dataset contains millions of IDs, so fixing each ID manually is not an option.

Thanks for your help!

Edit: Solved!

I used the following code: dt[, country_id := country_id[!is.na(country_id)][1], by = ID]

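【Comments】:

Does this answer your question? Replace missing values (NA) with most recent non-NA by group
Or dt[, country_id := country_id[!is.na(country_id)][1], by = ID] should work
@Andrew Thank you!! This works!
@sindri_baldur That's fair, although the list of 29 questions linking to the dupe I flagged includes several data.table ones. I was trying to find the post that many others link back to. Would you like to flag a more specific data.table one? This is certainly a question that has been discussed before.
You may want to follow this GitHub issue on nafill and setnafill for character columns.

Tags: r data.table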

【Solution 1】:

Another option is to use a join:

DT[is.na(country_id), country_id := 
    DT[!is.na(country_id)][.SD, on=.(ID), mult="first", country_id]]

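Explanation:

DT[is.na(country_id) subsets the dataset to the rows that have NA in the country_id column.
.SD is the subset of data from the previous step (also a data.table).
DT[!is.na(country_id)][.SD, on=.(ID) left-joins .SD with DT[!is.na(country_id)], using ID as the key.
j=country_id returns the country_id column from the right table DT[!is.na(country_id)]; if there are multiple matches, mult="first" returns the first one.
country_id := updates the country_id column, in the rows of DT where is.na(country_id) is TRUE, with the result of the join.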
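
The steps above can be traced on the six-row sample from the question. This is only a sketch: the intermediate names missing_rows, lookup and filled are introduced here for illustration and are not part of the original answer, which does everything in one expression.

```r
library(data.table)

# Sample data from the question
DT <- data.table(
  ID         = c(1L, 1L, 1L, 2L, 2L, 2L),
  country_id = c("BGD", NA, NA, "SA", NA, "SA"),
  weight     = c(56, 57, 63, 12, 53, 54)
)

# Step 1: rows that need filling (what .SD refers to in the one-liner)
missing_rows <- DT[is.na(country_id)]

# Step 2: lookup table made of the rows that do have a country_id
lookup <- DT[!is.na(country_id)]

# Step 3: for each missing row, take the first matching country_id by ID.
# In j, country_id refers to the column of lookup, not of missing_rows.
filled <- lookup[missing_rows, on = .(ID), mult = "first", country_id]
print(filled)

# Step 4: write the joined values back by reference
DT[is.na(country_id), country_id := filled]
print(DT)
```

Because the update only touches rows where country_id is NA, existing non-NA values are left as they are.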

Timing code, following Andrew's but with similar, larger data:

library(data.table)
set.seed(42)

nr <- 1e7
dt <- data.table(ID = rep(1:(nr/4), each = 4),
    country_id = rep(rep(c("BGD", "SA", "USA", "DEN", "THI"), each = 4)),
    weight = sample(10:100, nr, TRUE))
dt[sample(1:nr, nr/2), country_id := NA]
DT <- copy(dt)

microbenchmark::microbenchmark(
    first_nonmissing = dt[, country_id := country_id[!is.na(country_id)][1L], by = ID],
    use_join=DT[is.na(country_id), country_id := DT[!is.na(country_id)][.SD, on=.(ID), mult="first", country_id]],
    times = 1L
)

Timings:

Unit: milliseconds
             expr       min        lq      mean    median        uq       max neval
 first_nonmissing 3282.1373 3282.1373 3282.1373 3282.1373 3282.1373 3282.1373     1
         use_join  554.5314  554.5314  554.5314  554.5314  554.5314  554.5314     1

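【Discussion】:

Chinsoon, would you mind breaking down what the join is doing here :)?

【Solution 2】:

Based on the answers and suggestions in the comments, you have a few options. I simulated a dataset with 1,000,000 rows and 30% of the values missing in your country_id column to see which option scales best for your situation.

The answer that scaled best in this benchmark replaces NA with the first non-missing value that has the same ID: dt[, country_id := country_id[!is.na(country_id)][1], by = ID].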

Unit: milliseconds
             expr       min        lq      mean    median        uq       max neval
 first_nonmissing  253.0039  267.0272  284.3988  271.4015  274.5101  405.2004    10
            tidyr  943.6658  951.9638  970.7185  960.6233  971.0660 1069.3023    10
          na.locf 7173.9556 7218.2757 7267.6968 7271.0279 7325.6820 7344.9142    10

Benchmark code:

microbenchmark::microbenchmark(
  first_nonmissing = dt[, country_id := country_id[!is.na(country_id)][1], by = ID],
  tidyr = tidyr::fill(dplyr::group_by(dt, ID), country_id),
  na.locf = dt[, country_id := zoo::na.locf(country_id, na.rm = FALSE), by = ID],
  times = 10
)

Data:

library(data.table)
set.seed(42)

dt <- data.table(ID = rep(1:250000, each = 4),
                 country_id = rep(rep(c("BGD", "SA", "USA", "DEN", "THI"), each = 4)),
                 weight = sample(10:100, 1e6, replace = T))

dt$country_id[sample(1:1e6, 3e5)] <- NA
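
One caveat worth checking against your own data: the approaches benchmarked above are not strictly interchangeable. Taking the first non-missing value per ID fills every row of the group, whereas zoo::na.locf with na.rm = FALSE (like tidyr::fill with its default downward direction) only carries values forward, so an NA at the start of a group stays NA. A minimal sketch, using a small hypothetical table ex that is not part of the benchmark data:

```r
library(data.table)

# Hypothetical data: ID 1 starts with a missing value
ex <- data.table(
  ID         = c(1L, 1L, 1L, 2L, 2L),
  country_id = c(NA, "BGD", NA, "SA", NA)
)

# First non-missing value per ID: fills every row of the group
a <- copy(ex)[, country_id := country_id[!is.na(country_id)][1L], by = ID]

# Last observation carried forward: a leading NA is left untouched
b <- copy(ex)[, country_id := zoo::na.locf(country_id, na.rm = FALSE), by = ID]

print(a$country_id)
print(b$country_id)
```

If every ID maps to exactly one country, as in the question, the first-non-missing version is the one that matches the desired output.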

【Discussion】:

【Solution 3】:

Hopefully the code below helps you fill in the NAs:

res <- Reduce(rbind,
       lapply(split(df,df$ID), function(v) 
         {v$country_id <- head(v$country_id[!is.na(v$country_id)],1);v}))

which yields

  ID country_id weight
1  1        BGD     56
2  1        BGD     57
3  1        BGD     63
4  2         SA     12
5  2         SA     53
6  2         SA     54

【Discussion】: