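[Posted]: 2020-07-01 09:20:07
[Question]:
I have a data.table, names_nightlight, with standard region names, as shown below. Another data.table, disasters, has a Location column that contains standard and non-standard region names as well as city/municipality names. I want to replace the non-standard region names in disasters$Location with the standard region names from names_nightlight$region.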
names_nightlight:
| country | region | ISO |
|-------------------|-------------|-----|
| American Samoa | Eastern | ASM |
| American Samoa | Manu'a | ASM |
| American Samoa | Unorganized | ASM |
| American Samoa | Western | ASM |
| Antigua & Barbuda | Barbuda | ATG |
| Antigua & Barbuda | Redonda | ATG |
| Antigua & Barbuda | Saint George| ATG |
| ... | ... | ... |
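I need to use grep() to find the rows where disasters$Location contains a region name, then set disasters$Location := names_nightlight$region (the standard name) and flag those rows with disasters$matched := 1. Later, I can use Google to manually look up the region for the disasters$Location values that are cities/municipalities.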
for (j in names_nightlight[!region == "just one region", ISO]){
  for (i in names_nightlight[ISO == j, region]){
    disasters[ISO == j][grep(i, Location), Location := i]
    disasters[ISO == j & Location == i, matched := 1]
  }
}
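However, the grep function in my loop does not seem to work fully: only exact matches are picked up. For example, "Manu'a island" is not matched to "Manu'a", and "Saint George " (with a trailing space) is not matched to "Saint George" (without a trailing space).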
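One likely cause of the symptom described above is the chained call disasters[ISO == j][grep(i, Location), Location := i]: the first [ returns a new subset table, so := updates that temporary copy rather than disasters itself, and only the exact-match line afterwards has any visible effect. The following is a minimal sketch, not a definitive fix, that filters and assigns in a single [ call. The toy tables and the loop variables iso_k and region_k are assumptions made for illustration; the values are taken from the sample data in the question.

```r
library(data.table)

# Toy versions of the two tables (values taken from the question's sample data)
names_nightlight <- data.table(
  region = c("Eastern", "Manu'a", "Saint George", "Saint John"),
  ISO    = c("ASM", "ASM", "ATG", "ATG")
)
disasters <- data.table(
  ISO      = c("ASM", "ASM", "ATG", "ATG", "ATG"),
  Location = c("Eastern", "Manu'a island", "Saint George ", "Saint John", "Crosbies"),
  matched  = NA_real_
)

for (k in seq_len(nrow(names_nightlight))) {
  iso_k    <- names_nightlight$ISO[k]     # hypothetical helper variable
  region_k <- names_nightlight$region[k]  # hypothetical helper variable
  # Filter and assign in ONE `[` call so that `disasters` itself is modified.
  # fixed = TRUE treats the region name as a literal string, not a regex.
  disasters[ISO == iso_k & grepl(region_k, Location, fixed = TRUE),
            `:=`(Location = region_k, matched = 1)]
}

print(disasters)
```

On this toy data, "Manu'a island" and "Saint George " are rewritten to their standard names and flagged, while "Crosbies" keeps matched = NA for manual lookup. Substring matching can over-match when one region name is contained in another location string, so the flagged rows are worth a spot check.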
The rows that were not matched:
disasters[is.na(matched) == TRUE]
| Start.date | End.date | ISO | Location | Disaster.No. | matched |
|------------|------------|-----|---------------|---------------|---------|
| 2005-02-16 | 2005-02-16 | ASM | Manu'a island | 2005-0151 | NA |
| 2017-09-06 | 2017-09-06 | ATG | Saint George | 2017-0381 | NA |
| 2017-09-06 | 2017-09-06 | ATG | Crosbies | 2017-0381 | NA |
| 2017-09-06 | 2017-09-06 | ATG | Fort Road | 2017-0381 | NA |
| 2017-09-06 | 2017-09-06 | ATG | Clare Hall | 2017-0381 | NA |
| 2017-09-06 | 2017-09-06 | ATG | Grays Farm | 2017-0381 | NA |
| ... | ... | ... | ... | ... | ... |
dput(names_nightlight[1:10])
structure(list(country = c("American Samoa", "American Samoa",
"American Samoa", "American Samoa", "Antigua & Barbuda", "Antigua & Barbuda",
"Antigua & Barbuda", "Antigua & Barbuda", "Antigua & Barbuda",
"Antigua & Barbuda"), region = c("Eastern", "Manu'a", "Unorganized",
"Western", "Barbuda", "Redonda", "Saint George", "Saint John",
"Saint Mary", "Saint Paul"), ISO = c("ASM", "ASM", "ASM", "ASM",
"ATG", "ATG", "ATG", "ATG", "ATG", "ATG")), row.names = c(NA,
-10L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x7f835380dae0>)
dput(disasters[1:10])
structure(list(Start.date = structure(c(12422, 12422, 12422,
12422, 12830, 17415, 14167, 14167, 14167, 14167), class = "Date"),
End.date = structure(c(12422, 12422, 12422, 12422, 12830,
17415, 14168, 14168, 14168, 14168), class = "Date"), Country = c("American Samoa",
"American Samoa", "American Samoa", "American Samoa", "American Samoa",
"Anguilla", "Antigua and Barbuda", "Antigua and Barbuda",
"Antigua and Barbuda", "Antigua and Barbuda"), ISO = c("ASM",
"ASM", "ASM", "ASM", "ASM", "AIA", "ATG", "ATG", "ATG", "ATG"
), Location = c("Eastern", "Manu'a", "Unorganized", "Western",
"Manu'a island", "just one region", "Barbuda", "Redonda",
"Saint George", "Saint John"), Latitude = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), Longitude = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), Magnitude.value = c(310, 310, 310, 310, NA, NA,
NA, NA, NA, NA), Magnitude.scale = c("Kph", "Kph", "Kph",
"Kph", "Kph", "Kph", "Kph", "Kph", "Kph", "Kph"), Disaster.type = c("Storm",
"Storm", "Storm", "Storm", "Storm", "Storm", "Storm", "Storm",
"Storm", "Storm"), Disaster.subtype = c("Tropical cyclone",
"Tropical cyclone", "Tropical cyclone", "Tropical cyclone",
"Tropical cyclone", "Tropical cyclone", "Tropical cyclone",
"Tropical cyclone", "Tropical cyclone", "Tropical cyclone"
), Associated.disaster = c("--", "--", "--", "--", "--",
"--", "Flood", "Flood", "Flood", "Flood"), Associated.disaster2 = c("--",
"--", "--", "--", "--", "--", "--", "--", "--", "--"), Total.deaths = c(0L,
0L, 0L, 0L, 0L, 4L, 0L, 0L, 0L, 0L), Total.affected = c(23060L,
23060L, 23060L, 23060L, 0L, 15000L, 25800L, 25800L, 25800L,
25800L), Total.damage...000.US.. = c(150000L, 150000L, 150000L,
150000L, 0L, 200000L, 0L, 0L, 0L, 0L), Insured.losses...000.US.. = c(0,
0, 0, 0, 0, 6700, 0, 0, 0, 0), Disaster.name = c("Heta",
"Heta", "Heta", "Heta", "Olaf", "Hurricane 'Irma'", "Hurricane \"Omar\"",
"Hurricane \"Omar\"", "Hurricane \"Omar\"", "Hurricane \"Omar\""
), Disaster.No. = c("2004-0004", "2004-0004", "2004-0004",
"2004-0004", "2005-0151", "2017-0381", "2008-0604", "2008-0604",
"2008-0604", "2008-0604"), empty_region = c(0, 0, 0, 0, 0,
1, 0, 0, 0, 0), matched = c(NA, NA, NA, NA, NA, 1, NA, NA,
NA, NA)), .internal.selfref = <pointer: 0x7f835380dae0>, row.names = c(NA,
-10L), class = c("data.table", "data.frame"))
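[Comments]:
- Could you make your question reproducible? dput is very handy for sharing a dataset.
- @RomanLuštrik I have just added some output from dput(DT[1:10]), for example. Is this what you wanted?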