【Question title】: Search value through multiple columns, return id of row
【Posted】: 2017-08-17 07:19:52
【Question】:

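Suppose I have two data frames:

A = a data frame made up of unique phone numbers and an additional factor column. Assume nrow(A) = 20.

B = a data frame in which each row represents a unique household, with four columns of listed phone numbers and a fifth column holding the unique household ID. The same number may be repeated across several of B's columns. Assume nrow(B) = 100.

After checking whether each phone number in A appears in any of the four columns, I want to return a table containing the unique phone numbers from "A" and the household ID from "B".

For example: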

a <- data.frame(phone=c("12345","12346","12456"),
                factor=c("OK","BAD","BAD"))
b <- data.frame(ph1 = c("12345","","12346","12347",""), 
                ph2 = c("","","12346","","12348"), 
                ph3 = c("","","","12456","67890"), 
                hhid = seq(1121,1125))

How can I return C, shown below?

c <- data.frame(phone = c("12345","12346","12456"),
                factor = c("OK","BAD","BAD"), 
                hhid = c("1121","1123","1124"))

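I'm sure this can be done very elegantly, or with minimal code. I considered a for loop or merge, but suspected that was the wrong track. I'm open to using any package.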

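【Comments】:

  • Update: I received a number of different suggestions using different packages. They helped me learn about those packages and about what base R can do. My need has been met, but feel free to share anything else about this problem.

Tags: r dplyr plyr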


【Solution 1】:
library(dplyr)
library(tidyr)

a <- data.frame(phone=c("12345","12346","12456"),
                factor=c("OK","BAD","BAD"))
b <- data.frame(ph1 = c("12345","","12346","12347",""), 
                ph2 = c("","","12346","","12348"), 
                ph3 = c("","","","12456","67890"), 
                hhid = seq(1121,1125))

# reshape data and keep unique combinations
b2 = b %>% 
  gather(ph, phone, -hhid) %>% 
  select(-ph) %>% 
  distinct()

# join data frames
left_join(a, b2, by = "phone")

#   phone factor hhid
# 1 12345     OK 1121
# 2 12346    BAD 1123
# 3 12456    BAD 1124
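
Since this was posted, tidyr has marked gather() as superseded in favour of pivot_longer(). A minimal sketch of the same reshape-then-join idea with the newer function follows, assuming tidyr 1.0.0 or later. The names b_long and result, the explicit filter on blank phone numbers, and stringsAsFactors = FALSE are additions for illustration, not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Same example data; stringsAsFactors = FALSE is an added assumption
# so that phone numbers stay character on older R versions.
a <- data.frame(phone = c("12345", "12346", "12456"),
                factor = c("OK", "BAD", "BAD"),
                stringsAsFactors = FALSE)
b <- data.frame(ph1 = c("12345", "", "12346", "12347", ""),
                ph2 = c("", "", "12346", "", "12348"),
                ph3 = c("", "", "", "12456", "67890"),
                hhid = seq(1121, 1125),
                stringsAsFactors = FALSE)

# Stack ph1..ph3 into one phone column, one row per (hhid, phone) pair
b_long <- b %>%
  pivot_longer(cols = -hhid, names_to = "ph", values_to = "phone") %>%
  select(-ph) %>%
  filter(phone != "") %>%   # drop blank entries (added step)
  distinct()                # a number repeated within a household counts once

# Attach the household id to each phone number in a
result <- left_join(a, b_long, by = "phone")
result
```

Phone numbers in a that appear nowhere in b would keep their row and get NA for hhid, because this is a left join.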

【Comments】:

  • Ah, this is great and elegant. Thank you very much! I wish I had thought of gather.
【Solution 2】:

Here is an option using data.table:

library(data.table)
setDT(a)[unique(setDT(b)[, .(phone = unlist(.SD)), hhid][phone != ""]),
          hhid := hhid, on = .(phone)]
a
#   phone factor hhid
#1: 12345     OK 1121
#2: 12346    BAD 1123
#3: 12456    BAD 1124
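
The one-liner above can be read as four separate steps. The following is a sketch of one way to unpack it, not the answerer's own explanation; the intermediate name long and the explicit i.hhid prefix are additions for readability.

```r
library(data.table)

a <- data.table(phone = c("12345", "12346", "12456"),
                factor = c("OK", "BAD", "BAD"))
b <- data.table(ph1 = c("12345", "", "12346", "12347", ""),
                ph2 = c("", "", "12346", "", "12348"),
                ph3 = c("", "", "", "12456", "67890"),
                hhid = seq(1121, 1125))

# Step 1: within each hhid, .SD holds the remaining columns (ph1..ph3);
# unlist() stacks them into a single phone column.
long <- b[, .(phone = unlist(.SD)), by = hhid]

# Step 2: drop the blank entries.
long <- long[phone != ""]

# Step 3: a number listed twice for the same household is kept once.
long <- unique(long)

# Step 4: update join. For every row of long that matches a on phone,
# write long's hhid (referred to as i.hhid) into a by reference.
a[long, hhid := i.hhid, on = .(phone)]
a
```

Because := modifies a in place, no reassignment is needed. Rows of a with no match in long are left with NA in hhid.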

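【Comments】:

  • Ah, the infamous data.table package I have heard about. Thank you very much. I tried it and it works well; it will still take me some time to understand what magic you worked, though!
【Solution 3】:

Here is a base R solution, assuming you read the data in as character or set the option: options(stringsAsFactors = FALSE)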

tmp <- unique(reshape(b, 
    direction="long",
    varying = 1:3,
    v.names="phone",
    timevar = "variable")[,c(1, 3)])
# drop rows with an empty phone number (assign the result back to tmp)
tmp <- tmp[tmp$phone!="",]
merge(tmp, a, by="phone")
#  phone hhid factor
#1 12345 1121     OK
#2 12346 1123    BAD
#3 12456 1124    BAD
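
If reshaping feels heavy for this task, a direct lookup with match() is another base R possibility. This is a sketch under stated assumptions rather than part of the answer above: the phone columns are character, and when a number appears under several households only the first occurrence is used. The names cols, phones and hhids are hypothetical.

```r
a <- data.frame(phone = c("12345", "12346", "12456"),
                factor = c("OK", "BAD", "BAD"),
                stringsAsFactors = FALSE)
b <- data.frame(ph1 = c("12345", "", "12346", "12347", ""),
                ph2 = c("", "", "12346", "", "12348"),
                ph3 = c("", "", "", "12456", "67890"),
                hhid = seq(1121, 1125),
                stringsAsFactors = FALSE)

cols <- c("ph1", "ph2", "ph3")

# Flatten the phone columns column by column into one long vector
phones <- unlist(b[cols], use.names = FALSE)

# Repeat the household ids once per phone column so they line up with phones
hhids <- rep(b$hhid, times = length(cols))

# match() gives the position of the first occurrence of each phone in a,
# or NA when the number is not listed anywhere in b
a$hhid <- hhids[match(a$phone, phones)]
a
```

This avoids building an intermediate long table, at the cost of silently picking one household when a number is shared between several.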
