R: transfer values from one dataset to another by ID

【Question title】: r transfer values from one dataset to another by ID
【Posted】: 2021-10-11 02:25:52
【Question description】:

I have two datasets. The first one looks like this:

   ID     Weight     State
   1      12.34      NA
   2      11.23      IA
   2      13.12      IN
   3      12.67      MA 
   4      10.89      NA
   5      14.12      NA

The second dataset is a lookup table that gives the State value for each ID:

   ID    State
   1     WY
   2     IA
   3     MA
   4     OR
   4     CA
   5     FL

As you can see, ID 4 has two different State values, which is expected.

What I want to do is replace the NA values in the State column of dataset 1 with the State values from dataset 2. The expected dataset:

  ID     Weight     State
   1      12.34      WY
   2      11.23      IA
   2      13.12      IN
   3      12.67      MA 
   4      10.89      OR,CA
   5      14.12      FL

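Because ID 4 has two State values in dataset2, the two values are collapsed into one string, separated by a comma, and that string replaces the NA in dataset1. Any suggestions on how to achieve this are much appreciated. Thanks in advance.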

Tags: r dplyr replace missing-data transfer

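【Solution 1】:

Collapse the values in df2 to one row per ID and join the result to df1 by 'ID'. Then use coalesce to take the first non-NA value from the two State columns.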

library(dplyr)

df1 %>%
  left_join(df2 %>%
              group_by(ID) %>%
              summarise(State = toString(State)), by = 'ID') %>%
  mutate(State = coalesce(State.x, State.y)) %>%
  select(-State.x, -State.y)

#  ID Weight  State
#1  1   12.3     WY
#2  2   11.2     IA
#3  2   13.1     IN
#4  3   12.7     MA
#5  4   10.9 OR, CA
#6  5   14.1     FL
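
To make the two steps in the pipeline above easier to follow, here is a minimal sketch that runs them separately. The data frame is re-typed from the question's lookup table, and the names lookup, from_df1 and from_lookup are only illustrative, not part of the original answer.

```r
library(dplyr)

# Lookup table from the question (dataset 2)
df2 <- data.frame(
  ID    = c(1L, 2L, 3L, 4L, 4L, 5L),
  State = c("WY", "IA", "MA", "OR", "CA", "FL")
)

# Step 1: one row per ID; toString() joins multiple values with ", "
lookup <- df2 %>%
  group_by(ID) %>%
  summarise(State = toString(State))

# Step 2: coalesce() takes, position by position, the first non-NA value.
# Illustrative vectors: an existing value ("IN") is kept, an NA is filled.
from_df1    <- c(NA, "IA", "IN")
from_lookup <- c("WY", "IA", "IA")
filled <- coalesce(from_df1, from_lookup)
```

Here lookup has five rows, with "OR, CA" for ID 4, and filled is c("WY", "IA", "IN"). Because coalesce prefers its first argument, the "IN" already present in df1 is not overwritten by the lookup's "IA".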

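In base R, with merge and transform (the |> pipe requires R 4.1 or later). all.x = TRUE keeps every row of df1, and the helper columns are dropped at the end:

merge(df1, aggregate(State ~ ID, df2, toString), by = 'ID', all.x = TRUE) |>
  transform(State = ifelse(is.na(State.x), State.y, State.x)) |>
  subset(select = -c(State.x, State.y))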
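
One detail worth knowing about merge(): by default it performs an inner join, so a row of df1 whose ID is absent from the lookup would be dropped, whereas all.x = TRUE keeps it. The sketch below illustrates this with a made-up ID 6 that does not appear in the question's data; small and small_lookup are hypothetical names.

```r
# Hypothetical mini example: ID 6 has no entry in the lookup table
small <- data.frame(ID = c(1L, 6L), Weight = c(12.34, 9.50), State = c(NA, "TX"))
small_lookup <- data.frame(ID = 1L, State = "WY")

inner <- merge(small, small_lookup, by = "ID")                # default: inner join
left  <- merge(small, small_lookup, by = "ID", all.x = TRUE)  # keep all rows of small

nrow(inner)  # the ID 6 row is gone
nrow(left)   # both rows are kept; State.y is NA for ID 6
```

With the question's data every ID is present in the lookup, so both forms return the same rows; the difference only matters when the lookup is incomplete.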

【Solution 2】:

The tidyverse way:

library(tidyverse)
df1 %>%
  left_join(df2 %>%
              group_by(ID) %>%
              summarise(State = toString(State)) %>%
              ungroup(), by = 'ID') %>%
  transmute(ID, Weight, State = coalesce(State.x, State.y))
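
Note that toString() separates values with ", " (comma plus space), while the expected output in the question shows "OR,CA" without a space. If the exact separator matters, paste() with collapse is one option. This is a small base R sketch using the question's lookup table; collapsed is an illustrative name.

```r
# Lookup table from the question (dataset 2)
df2 <- data.frame(
  ID    = c(1L, 2L, 3L, 4L, 4L, 5L),
  State = c("WY", "IA", "MA", "OR", "CA", "FL")
)

# Collapse per ID with a plain comma instead of toString()'s ", "
collapsed <- aggregate(State ~ ID, df2, function(x) paste(x, collapse = ","))

collapsed$State[collapsed$ID == 4]
```

The same anonymous function could be used in place of toString inside summarise().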

Base R alternative:

na_idx <- which(is.na(df1$State))
df1$State[na_idx] <- with(
  aggregate(State ~ ID, df2, toString),
  State[match(df1$ID, ID)]
)[na_idx]
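
The base R version above relies on match() to line up each row of df1 with its row in the collapsed lookup. As a rough illustration, the same logic can be spelled out step by step; the intermediate names lookup, pos and candidate are introduced here for clarity and are not in the original answer.

```r
# Question data, reduced to the columns that matter here
df1 <- data.frame(
  ID    = c(1L, 2L, 2L, 3L, 4L, 5L),
  State = c(NA, "IA", "IN", "MA", NA, NA)
)
df2 <- data.frame(
  ID    = c(1L, 2L, 3L, 4L, 4L, 5L),
  State = c("WY", "IA", "MA", "OR", "CA", "FL")
)

lookup    <- aggregate(State ~ ID, df2, toString)  # one row per ID
pos       <- match(df1$ID, lookup$ID)              # lookup row for each df1 row
candidate <- lookup$State[pos]                     # replacement value per df1 row
na_idx    <- which(is.na(df1$State))               # only these rows get filled

df1$State[na_idx] <- candidate[na_idx]
```

Only the positions in na_idx are assigned, so the existing "IN" in the second ID 2 row is left untouched even though the lookup says "IA".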

Data:

df1 <- structure(list(ID = c(1L, 2L, 2L, 3L, 4L, 5L), Weight = c(12.34, 
11.23, 13.12, 12.67, 10.89, 14.12), State = c(NA, "IA", "IN", 
"MA", NA, NA)), row.names = c(NA, -6L), class = "data.frame")

df2 <- structure(list(ID = c(1L, 2L, 3L, 4L, 4L, 5L), State = c("WY", 
"IA", "MA", "OR", "CA", "FL")), class = "data.frame", row.names = c(NA, 
-6L))
