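【Posted】: 2021-11-23 22:31:34
【Question】:
I have a dataset of US cities, called us_data, in which every state is currently labeled "US". I found a second dataset, called us, that lists all US cities with their states. I am trying to create a new state column in us_data by taking each city name, finding it in us, and pulling the matching state from us into the new column.
I'm using R but don't know how to do this. I don't believe a regular join would work, because us_data can contain multiple observations for the same city, so every row for a given city needs to be matched. I was thinking of using mutate() from dplyr, but I'm not sure how to reference two datasets in one function call, so any help would be appreciated! I've attached a glimpse of both datasets for reference.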
> dput(us_data[1:10,1:7])
structure(list(name = c("Carpenter Rd.", "1515 N. Sheridan - Wilmette",
"S McCarran & E Greg St - Sparks", "Hwy 20 & Tharp - Yuba City",
"Greenmount & I-64", "Veterans Blvd & Kingman St", "Hampden & Dayton - Denver",
"50th and Kipling-Wheatridge, CO", "Higuera & Tank Farm", "Burr Ridge-I-55 & County Line Rd"
), url = c("https://www.starbucks.com/store-locator/store/6323",
"https://www.starbucks.com/store-locator/store/6325", "https://www.starbucks.com/store-locator/store/6327",
"https://www.starbucks.com/store-locator/store/6328", "https://www.starbucks.com/store-locator/store/6329",
"https://www.starbucks.com/store-locator/store/6330", "https://www.starbucks.com/store-locator/store/6334",
"https://www.starbucks.com/store-locator/store/6333", "https://www.starbucks.com/store-locator/store/6331",
"https://www.starbucks.com/store-locator/store/6340"), street_address = c("3650 Carpenter Rd.",
"1515 North Sheridan, Building 4", "1560 S. Stanford Way, Suite A",
"1615 Colusa Hwy, Ste 100", "1126 Central Park Drive", "4312 Veterans Blvd.",
"9925 East Hampden Ave", "4975 Kipling St", "3971 S. Higuera Street",
"515 Village Center Dr."), city = c("Pittsfield", "Wilmette",
"Sparks", "Yuba City", "OFallon", "Metairie", "Denver", "Wheat Ridge",
"San Luis Obispo", "Burr Ridge"), state = c("US", "US", "US",
"US", "US", "US", "US", "US", "US", "US"), zip_code = c("48104",
"600911822", "894316331", "959939437", "622691769", "70006",
"802314903", "800332340", "934011580", "605274516"), country = c("US",
"US", "US", "US", "US", "US", "US", "US", "US", "US")), row.names = c(NA,
10L), class = "data.frame")
> dput(us[1:20,])
structure(list(city = c("New York", "Los Angeles", "Chicago",
"Miami", "Dallas", "Philadelphia", "Houston", "Atlanta", "Washington",
"Boston", "Phoenix", "Seattle", "San Francisco", "Detroit", "San Diego",
"Minneapolis", "Tampa", "Denver", "Brooklyn", "Queens"), city_ascii = c("New York",
"Los Angeles", "Chicago", "Miami", "Dallas", "Philadelphia",
"Houston", "Atlanta", "Washington", "Boston", "Phoenix", "Seattle",
"San Francisco", "Detroit", "San Diego", "Minneapolis", "Tampa",
"Denver", "Brooklyn", "Queens"), state_id = c("NY", "CA", "IL",
"FL", "TX", "PA", "TX", "GA", "DC", "MA", "AZ", "WA", "CA", "MI",
"CA", "MN", "FL", "CO", "NY", "NY"), state_name = c("New York",
"California", "Illinois", "Florida", "Texas", "Pennsylvania",
"Texas", "Georgia", "District of Columbia", "Massachusetts",
"Arizona", "Washington", "California", "Michigan", "California",
"Minnesota", "Florida", "Colorado", "New York", "New York")), row.names = c(NA,
20L), class = "data.frame")
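On the concern about repeated cities: a minimal sketch, assuming dplyr is installed and that each city appears only once in us, suggests that a left join copies the state onto every matching row of us_data. The toy data frames below are cut-down stand-ins for the real ones (Denver is duplicated on purpose), and the name us_data_with_state is hypothetical.

```r
library(dplyr)

# Toy stand-ins for the two data frames in the question.
# Denver is repeated to mimic several observations for one city.
us_data <- data.frame(
  city  = c("Denver", "Pittsfield", "Denver"),
  state = c("US", "US", "US")
)
us <- data.frame(
  city     = c("Chicago", "Denver", "Seattle"),
  state_id = c("IL", "CO", "WA")
)

# Drop the placeholder state column, then look up the real state by city.
# left_join() keeps every row of us_data; cities missing from us get NA.
us_data_with_state <- us_data %>%
  select(-state) %>%
  left_join(select(us, city, state = state_id), by = "city")

print(us_data_with_state)
```

One caveat: if a city name occurs in several states within us, a join on city alone would duplicate the matching rows of us_data, so the lookup table would need to be deduplicated or joined on an additional key.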
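【Comments】:
-
Hello and welcome! Please provide a reproducible example of your datasets by pasting the output of dput(us_data) and dput(us) into your question.
-
OK. It's not easy without seeing your data, but as far as I can tell from the us_data dataset, you also have zip codes. So the easiest way would be to create a new column holding the 5-digit zip code (extract the first 5 digits of each zip_code in the dataset), then match it to its state using the usa package, which includes zip codes: usa::zipcodes.
-
@Greg I updated the question with the dput output, let me know if I did it wrong!
-
@famato It works for me. :) Though you may want to include some cities from us whose names appear in us_data.
-
@famato Did you read my comment? I wrote that you have to shorten your zip codes to the first 5 digits. And yes, that is correct!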
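The zip-code suggestion from the comments can be sketched roughly as follows, in base R. This is an illustration under assumptions, not a tested solution for the full dataset: zip_lookup is a small hypothetical table standing in for usa::zipcodes, and its column names zip and state are assumed rather than taken from the package. The zip5 column name is also made up for the example.

```r
# Toy data: zip_code values copied from the question's dput output.
us_data <- data.frame(
  city     = c("Pittsfield", "Wilmette", "Metairie"),
  zip_code = c("48104", "600911822", "70006")
)

# Hypothetical lookup table standing in for usa::zipcodes.
# The column names `zip` and `state` are assumptions.
zip_lookup <- data.frame(
  zip   = c("48104", "60091", "70006"),
  state = c("MI", "IL", "LA")
)

# Keep only the first 5 characters, so 9-digit ZIP+4 values become 5-digit zips.
us_data$zip5 <- substr(us_data$zip_code, 1, 5)

# match() returns, for each zip5, its row position in the lookup table
# (NA when there is no match), which then indexes the state column.
us_data$state <- zip_lookup$state[match(us_data$zip5, zip_lookup$zip)]

print(us_data)
```

Matching on zip code avoids the ambiguity of city names that exist in more than one state, which is why it may be the more robust choice when zip codes are available.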