Matching multiple rows in one dataset to a value in another dataset to create a new column

【Question title】: Matching multiple rows in one dataset to a value in another dataset to create a new column
【Posted】: 2021-11-23 22:31:34
【Question description】:

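I have a dataset containing US cities, but currently every state is labeled "US"; the dataset is called us_data. I found a dataset with all US cities and states, called us, and I am trying to create a new state column in us_data by taking the city name, finding it in us, and pulling the state from us into the new column in us_data.

I am using R but do not know how to do this. I don't believe a regular join will work, because us_data can contain multiple observations for the same city, so every row for a given city needs to be matched. I was thinking of using mutate() from dplyr but am not sure how to reference two datasets in the function call, so any help would be appreciated! I have attached a glimpse of both datasets for reference.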

us_data

> dput(us_data[1:10,1:7])
structure(list(name = c("Carpenter Rd.", "1515 N. Sheridan - Wilmette", 
"S McCarran & E Greg St - Sparks", "Hwy 20 & Tharp - Yuba City", 
"Greenmount & I-64", "Veterans Blvd & Kingman St", "Hampden & Dayton - Denver", 
"50th and Kipling-Wheatridge, CO", "Higuera & Tank Farm", "Burr Ridge-I-55 & County Line Rd"
), url = c("https://www.starbucks.com/store-locator/store/6323", 
"https://www.starbucks.com/store-locator/store/6325", "https://www.starbucks.com/store-locator/store/6327", 
"https://www.starbucks.com/store-locator/store/6328", "https://www.starbucks.com/store-locator/store/6329", 
"https://www.starbucks.com/store-locator/store/6330", "https://www.starbucks.com/store-locator/store/6334", 
"https://www.starbucks.com/store-locator/store/6333", "https://www.starbucks.com/store-locator/store/6331", 
"https://www.starbucks.com/store-locator/store/6340"), street_address = c("3650 Carpenter Rd.", 
"1515 North Sheridan, Building 4", "1560 S. Stanford Way, Suite A", 
"1615 Colusa Hwy, Ste 100", "1126 Central Park Drive", "4312 Veterans Blvd.", 
"9925 East Hampden Ave", "4975 Kipling St", "3971 S. Higuera Street", 
"515 Village Center Dr."), city = c("Pittsfield", "Wilmette", 
"Sparks", "Yuba City", "OFallon", "Metairie", "Denver", "Wheat Ridge", 
"San Luis Obispo", "Burr Ridge"), state = c("US", "US", "US", 
"US", "US", "US", "US", "US", "US", "US"), zip_code = c("48104", 
"600911822", "894316331", "959939437", "622691769", "70006", 
"802314903", "800332340", "934011580", "605274516"), country = c("US", 
"US", "US", "US", "US", "US", "US", "US", "US", "US")), row.names = c(NA, 
10L), class = "data.frame")

> dput(us[1:20,])
structure(list(city = c("New York", "Los Angeles", "Chicago", 
"Miami", "Dallas", "Philadelphia", "Houston", "Atlanta", "Washington", 
"Boston", "Phoenix", "Seattle", "San Francisco", "Detroit", "San Diego", 
"Minneapolis", "Tampa", "Denver", "Brooklyn", "Queens"), city_ascii = c("New York", 
"Los Angeles", "Chicago", "Miami", "Dallas", "Philadelphia", 
"Houston", "Atlanta", "Washington", "Boston", "Phoenix", "Seattle", 
"San Francisco", "Detroit", "San Diego", "Minneapolis", "Tampa", 
"Denver", "Brooklyn", "Queens"), state_id = c("NY", "CA", "IL", 
"FL", "TX", "PA", "TX", "GA", "DC", "MA", "AZ", "WA", "CA", "MI", 
"CA", "MN", "FL", "CO", "NY", "NY"), state_name = c("New York", 
"California", "Illinois", "Florida", "Texas", "Pennsylvania", 
"Texas", "Georgia", "District of Columbia", "Massachusetts", 
"Arizona", "Washington", "California", "Michigan", "California", 
"Minnesota", "Florida", "Colorado", "New York", "New York")), row.names = c(NA, 
20L), class = "data.frame")

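【Question discussion】:

Hello and welcome! Please provide a reproducible example of your datasets by pasting the output of dput(us_data) and dput(us) into your question.
OK. It is not easy without seeing your data, but as far as I can tell from the us_data dataset, you also have ZIP codes. So the simplest approach would be to create a new column holding the 5-digit ZIP code (extract the first 5 digits of each zip_code in the dataset), and then match it to its state using the usa package, which contains ZIP codes: usa::zipcodes.
@Greg I updated the question with the dput output, let me know if I did it wrong!
@famato It works for me. :) Though you may want to include some cities from us whose names appear in us_data.
@famato Did you read my comment? I wrote that you have to shorten your ZIP codes to the first 5 digits. Yes, that is correct!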
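
For illustration, the ZIP-based idea from the comments could be sketched roughly as below. This is only a sketch under assumptions: us_data_demo is a cut-down stand-in for us_data, and zip_lookup is a small hand-written, hypothetical lookup table. It is not the actual structure of usa::zipcodes, whose column names should be checked before adapting this.

```r
# Toy stand-in for us_data: only the columns needed here.
us_data_demo <- data.frame(
  city = c("Pittsfield", "Wilmette", "Sparks", "Denver"),
  zip_code = c("48104", "600911822", "894316331", "802314903"),
  stringsAsFactors = FALSE
)

# Hypothetical ZIP-to-state lookup, written by hand for illustration only.
zip_lookup <- data.frame(
  zip = c("48104", "60091", "89431", "80231"),
  state = c("MI", "IL", "NV", "CO"),
  stringsAsFactors = FALSE
)

# Keep the first 5 characters; the 9-digit values are ZIP+4 without a hyphen.
us_data_demo$zip5 <- substr(us_data_demo$zip_code, 1, 5)

# For each zip5, match() gives its row position in the lookup (NA if absent).
us_data_demo$state_from_zip <-
  zip_lookup$state[match(us_data_demo$zip5, zip_lookup$zip)]

print(us_data_demo)
```

Keeping zip_code as character matters here: converting it to a number would drop leading zeros and break the 5-character prefix.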

Tags: r dplyr

【Solution 1】:

Use match():

us_data$State_NEW <- us$state_name[match(us_data$city, us$city)]
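
To see why this handles repeated cities, here is a minimal sketch on made-up data (us_data_demo and us_demo are illustrative stand-ins, not the real datasets). match() is evaluated once per row of the first vector, so every row with the same city gets the same state. Two things to be aware of: a city missing from the lookup yields NA, and when a city name appears more than once in the lookup, only the first occurrence is used.

```r
# Illustrative stand-ins for the two datasets.
us_data_demo <- data.frame(
  city = c("Denver", "Wilmette", "Denver", "Springfield"),
  stringsAsFactors = FALSE
)
us_demo <- data.frame(
  city = c("Chicago", "Denver", "Springfield", "Springfield"),
  state_name = c("Illinois", "Colorado", "Illinois", "Massachusetts"),
  stringsAsFactors = FALSE
)

# Position of each us_data_demo city within us_demo$city.
idx <- match(us_data_demo$city, us_demo$city)
print(idx)  # 2 NA 2 3

# Index the state column by those positions; NA positions give NA states.
us_data_demo$State_NEW <- us_demo$state_name[idx]
print(us_data_demo)

# Cities that could not be matched and may need manual cleanup.
print(unique(us_data_demo$city[is.na(idx)]))
```

Because "Springfield" silently resolves to the first listed state, city names shared across states are worth checking before trusting the result.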

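【Discussion】:

This works great, thank you!
@famato FYI, if either answer solved your problem, you should upvote it* and mark it as accepted. *You can only vote once you have 15 or more reputation.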

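【Solution 2】:

The following solution with dplyr::*_join() should work, and it should help you get familiar with the dplyr workflow.

I am assuming your goal is to enrich us_data with more detailed information on state.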

library(dplyr)

# ...
# Code to generate 'us_data' and 'us'.
# ...


us_data %>%
  # OPTIONALLY reduce ZIP codes to only their first 5 digits.
  mutate(
    zip_code = substr(zip_code, 1, 5),
  ) %>%
  # Match the US data to the proper states.
  left_join(
    us,
    by = "city"
  ) %>%
  # Remove the unhelpful 'state' column, which only shows "US".
  select(!state)
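
One caveat worth sketching: a left join keeps every row of us_data, including repeated cities, but if a city name occurs more than once in us (the same name in different states), the matching us_data rows are duplicated. The example below uses made-up stand-ins (us_data_demo, us_demo) and shows one possible workaround, keeping only the first lookup row per city with distinct(). That choice is an assumption for illustration; it may pick the wrong state, so joining on a more specific key such as ZIP code is likely safer.

```r
library(dplyr)

# Illustrative stand-ins for the two datasets.
us_data_demo <- data.frame(
  name = c("Store A", "Store B", "Store C"),
  city = c("Denver", "Springfield", "Denver"),
  stringsAsFactors = FALSE
)
us_demo <- data.frame(
  city = c("Denver", "Springfield", "Springfield"),
  state_name = c("Colorado", "Illinois", "Massachusetts"),
  stringsAsFactors = FALSE
)

# Plain left join: "Springfield" matches two lookup rows, so Store B appears
# twice. Newer dplyr versions may also warn about a many-to-many relationship.
joined_raw <- left_join(us_data_demo, us_demo, by = "city")
print(nrow(joined_raw))  # 4 rows from 3 input rows

# Keep the first lookup row per city so each store matches at most one state.
us_unique <- distinct(us_demo, city, .keep_all = TRUE)
joined_unique <- left_join(us_data_demo, us_unique, by = "city")
print(joined_unique)
```

Comparing nrow() before and after the join is a quick way to detect this kind of duplication on the real data.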

【Discussion】: