Complex condition with chained joins: answers

【Question title】: Complex condition with chained joins
【Posted】: 2017-01-30 05:15:14
【Question description】:

I need to update a data frame based on values in two other data frames that are joined to the first one in a chain.

The target df is t_offices. Four fields are of interest here:

       administrative_area_level_1 administrative_area_level_2       country     locality
     1                     Arizona             Maricopa County United States      Phoenix
     2        District of Columbia                        <NA> United States   Washington
     3                        <NA>                        <NA>         India         <NA>
     4                    New York               Albany County United States       Albany
     5                     Utrecht                  Nieuwegein   Netherlands   Nieuwegein
     6                 Connecticut            Fairfield County United States     Stamford
   707                    Illinois                        <NA> United States         <NA>
  4241                    Illinois                        <NA> United States West Chicago
999998                     Alabama                        <NA> United States      Altoona
999999                Pennsylvania                        <NA> United States   Washington

For U.S. records, I need to update the NA values in administrative_area_level_2 with the county. The values are in df t_places:

      state_ab           place_name                  county_name place_nameshort
     1      AL           Abanda CDP              Chambers County          Abanda
     2      AL       Abbeville city                 Henry County       Abbeville
     3      AL      Adamsville city             Jefferson County      Adamsville
     4      AL         Addison town               Winston County         Addison 
     5      AL           Akron town                  Hale County           Akron
     6      AL       Alabaster city                Shelby County       Alabaster
    12      AL         Altoona town Blount County, Etowah County         Altoona
  4298      DC      Washington city         District of Columbia      Washington
  7527      IL    West Chicago city                DuPage County    West Chicago
 32611      PA  Washington township             Armstrong County      Washington
 32612      PA  Washington township                 Berks County      Washington

place_nameshort is a truncated version of place_name, without the designation (e.g. "city", "town", etc.).

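I join t_offices and t_places on state and place in order to get the correct county. This may return multiple counties, 1) because county_name can contain several counties separated by commas, and 2) because the truncated place_nameshort may return homonyms within the same state. I only need the cases where the county is unambiguous (a single county is returned).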
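The two sources of ambiguity described above can be flagged separately before any join. The following is only a minimal base R sketch on a hand-typed subset of the sample rows; the helper names `multi_county`, `homonym` and `unambiguous` are made up for illustration and do not appear in the original post.

```r
# Toy subset of t_places, typed in from the sample shown in the question
t_places <- data.frame(
  state_ab        = c("AL", "AL", "DC", "PA", "PA"),
  place_nameshort = c("Abanda", "Altoona", "Washington", "Washington", "Washington"),
  county_name     = c("Chambers County", "Blount County, Etowah County",
                      "District of Columbia", "Armstrong County", "Berks County"),
  stringsAsFactors = FALSE
)

# Case 1: a single county_name value lists several counties
multi_county <- grepl(",", t_places$county_name, fixed = TRUE)

# Case 2: the same truncated name occurs more than once within a state
n_same_name <- ave(seq_along(t_places$place_nameshort),
                   t_places$state_ab, t_places$place_nameshort,
                   FUN = length)
homonym <- n_same_name > 1

# Keep only the rows where the county is unambiguous
unambiguous <- t_places[!multi_county & !homonym, ]
```

On this subset, only Abanda (AL) and Washington (DC) remain: Altoona lists two counties, and Washington occurs twice in PA.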

Since t_places only contains state_ab, I need a third data frame, r_states, for state_name:

   state_ab             state_name
 1       AL                Alabama
 2       AK                 Alaska
 3       AZ                Arizona
 4       AR               Arkansas
 5       CA             California
 6       CO               Colorado
 9       DC   District of Columbia
17       IL               Illinois
42       PA           Pennsylvania

By joining t_places and r_states on state_ab, I can get the state_name that matches t_offices$administrative_area_level_1.
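The chained join described here can be sketched with base R `merge`. This is an illustration under assumptions, not the asker's code: the data frames are small hand-typed subsets, the Alabama/Abbeville office row is invented for the example, and the result name `lookup` is hypothetical.

```r
# Hand-typed subsets of the sample data
r_states <- data.frame(
  state_ab   = c("AL", "DC"),
  state_name = c("Alabama", "District of Columbia"),
  stringsAsFactors = FALSE
)
t_places <- data.frame(
  state_ab        = c("AL", "DC"),
  place_nameshort = c("Abbeville", "Washington"),
  county_name     = c("Henry County", "District of Columbia"),
  stringsAsFactors = FALSE
)
# The Alabama/Abbeville office is made up for this example
t_offices <- data.frame(
  administrative_area_level_1 = c("District of Columbia", "Alabama"),
  locality                    = c("Washington", "Abbeville"),
  stringsAsFactors = FALSE
)

# Link 1: attach the full state name to every place
places_statename <- merge(t_places, r_states, by = "state_ab")

# Link 2: match offices to places on (state name, short place name);
# all.x = TRUE keeps offices that have no matching place
lookup <- merge(t_offices, places_statename,
                by.x = c("administrative_area_level_1", "locality"),
                by.y = c("state_name", "place_nameshort"),
                all.x = TRUE)
```

Note that `merge` sorts the result by the key columns by default, so `lookup` is not guaranteed to keep the row order of t_offices.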

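Here is my attempt. It is incomplete, because it does not handle the multiple counties caused by homonyms within a state, and it does not work anyway.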

no_county <- (!is.na(t_offices$country) 
          & t_offices$country == "United States" 
          & !is.na(t_offices$administrative_area_level_1) 
          & is.na(t_offices$administrative_area_level_2) 
          & !is.na(t_offices$locality))

t_offices$administrative_area_level_2[no_county] <-
  t_places$county_name[!grepl(",", t_places$county_name) 
                       & match(t_places$place_nameshort, t_offices$locality[no_county]) 
                       & match(t_places$state_ab, 
                               r_states$state_ab[match(r_states$state_name, 
                                                       t_offices$administrative_area_level_1[no_county])])]

EDIT: Following @r2evans's suggestion, here is my new coding attempt, which still does not work:

# split multiple counties into columns
library(splitstackshape)
t_places <- cSplit(t_places, "county_name", sep = ", ", drop = F, type.convert = F)

# merge state names into places  
places_statename <- merge(t_places, r_states[,2:3])

# define condition to select t_offices records in U.S. with state and place but no county
no_county <- (
  # country is U.S.
  !is.na(t_offices$country)
  & t_offices$country == "United States"
  # with state
  & !is.na(t_offices$administrative_area_level_1)
  # blank county
  & is.na(t_offices$administrative_area_level_2)
  # with place
  & !is.na(t_offices$locality))

# update blank counties
t_offices$administrative_area_level_2[no_county] <-
  # unambiguous counties
  places_statename$county_name_1[is.na(places_statename$county_name_2)
                                 # locality matches place
                                 & match(t_offices$locality[no_county], places_statename$place_nameshort)
                                 # administrative_area_level_1 matches state
                                 & match(t_offices$administrative_area_level_1[no_county],places_statename$state_name)]

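【Question comments】:

I suggest you modify your data to support a direct join (via merge or dplyr::left_join and friends). This makes everything easier, more robust, and easier to use and troubleshoot. A start: if county_name can contain multiple comma-separated values, split them apart with something like tidyr::separate and tidyr::gather (that makes the join more intuitive and easier). Second suggestion: please make this question reproducible; as it stands, we do not have representative data that covers all of your requirements.
@r2evans Thanks for your suggestions! I have added (real and fictitious) sample data to make the question reproducible. As for your first suggestion, should I merge t_places and r_states, with county_name, into one table and then join that table with t_offices?
@r2evans Not melting, but rather transposing into multiple columns.
"Reproducible" here means "easy to use on my own laptop", which implies not having to type the data in by hand. It would help if you provided the output of dput on a subset of your data.
@r2evans I have posted samples of r_states, t_places and t_offices (all with missing counties; some records do not exist) here.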

Tags: r

【Solution 1】:

Here is my long solution. There may be shorter, more elegant ones.

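# splitstackshape provides cSplit; data.table provides setDT and the joins
library(splitstackshape)
library(data.table)
# split multiple counties into columns (county_name_1, county_name_2, ...)
t_places <- cSplit(t_places, "county_name", sep = ", ", drop = FALSE, type.convert = FALSE)
# subset original places with single county
# (columns selected by name; the subset must come from t_places itself)
places_singlecounty <- t_places[is.na(county_name_2),
                                .(state_ab, place_nameshort, county_name_1)]
# subset truncated places with single county:
# keep only (state, short name) pairs that occur exactly once
setDT(places_singlecounty)
places_singlecounty <- merge(places_singlecounty, 
                             places_singlecounty[, .N, by = c("state_ab", "place_nameshort")][N == 1, 1:2],
                             by = c("state_ab", "place_nameshort"))
# merge state names into single-county truncated places
places_statename <- merge(places_singlecounty, r_states[, c("state_ab", "state_name")], by = "state_ab")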

# define condition to select t_offices records in U.S. with state and place but no county
no_county <- (
  # country is U.S.
  !is.na(t_offices$country) 
  & t_offices$country == "United States" 
  # with state
  & !is.na(t_offices$administrative_area_level_1) 
  # NA county
  & is.na(t_offices$administrative_area_level_2) 
  # with place
  & !is.na(t_offices$locality))

# update t_offices NA counties based on single-county truncated places
# (update join: look up each selected office row, available as .SD, in places_statename)
setDT(t_offices)
t_offices[no_county, administrative_area_level_2 := 
           places_statename[.SD, county_name_1,
                            on = c(state_name = "administrative_area_level_1",  
                                   place_nameshort = "locality")]]
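
For readers who want to try the approach without the full data, here is a compact, self-contained sketch of the same idea on a few hand-typed sample rows. It is an approximation under assumptions rather than the answer's exact code: it uses only data.table, replaces `cSplit` with a simple comma test via `grepl`, and the table name `lookup` is hypothetical.

```r
library(data.table)

# Hand-typed subsets of the sample data from the question
t_places <- data.table(
  state_ab        = c("AL", "AL", "DC", "PA", "PA"),
  place_nameshort = c("Abbeville", "Altoona", "Washington", "Washington", "Washington"),
  county_name     = c("Henry County", "Blount County, Etowah County",
                      "District of Columbia", "Armstrong County", "Berks County")
)
r_states <- data.table(
  state_ab   = c("AL", "DC", "PA"),
  state_name = c("Alabama", "District of Columbia", "Pennsylvania")
)
t_offices <- data.table(
  administrative_area_level_1 = c("District of Columbia", "Alabama", "Pennsylvania", NA),
  administrative_area_level_2 = NA_character_,
  country  = c("United States", "United States", "United States", "India"),
  locality = c("Washington", "Altoona", "Washington", NA)
)

# Keep single-county rows, then (state, short name) pairs that occur once
lookup <- t_places[!grepl(",", county_name, fixed = TRUE)]
lookup <- lookup[, if (.N == 1L) .SD, by = .(state_ab, place_nameshort)]
# Attach the full state name so the table can be matched against t_offices
lookup <- merge(lookup, r_states, by = "state_ab")

# U.S. offices with a state and a locality but no county
no_county <- with(t_offices,
  !is.na(country) & country == "United States" &
  !is.na(administrative_area_level_1) &
  is.na(administrative_area_level_2) &
  !is.na(locality))

# Update join: fill the county only where the lookup finds a match
t_offices[no_county, administrative_area_level_2 :=
  lookup[.SD, county_name,
         on = c(state_name = "administrative_area_level_1",
                place_nameshort = "locality")]]
```

In this sketch only the District of Columbia office gets a county. Altoona is skipped because it lists two counties, and Washington, PA is skipped because the short name matches two places in the same state.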

【Comments】: