【Question title】: grep() doesn't match fully in for loops, only matching the exact characters
【Posted】: 2020-07-01 09:20:07
【Question description】:

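I have a data.table names_nightlight with standard region names, as shown below. Another data.table, disasters, has a column Location that contains standard and non-standard region names as well as city/municipality names. I want to replace the non-standard region names in disasters$Location with the standard region names from names_nightlight$region.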

names_nightlight:

|      country      |   region    | ISO |
|-------------------|-------------|-----|
| American Samoa    | Eastern     | ASM |
| American Samoa    | Manu'a      | ASM |
| American Samoa    | Unorganized | ASM |
| American Samoa    | Western     | ASM |
| Antigua & Barbuda | Barbuda     | ATG |
| Antigua & Barbuda | Redonda     | ATG |
| Antigua & Barbuda | Saint George| ATG |
| ...               | ...         | ... |

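I need to use grep() to find the rows where disasters$Location contains a region name, then set disasters$Location := names_nightlight$region (the standard name) and flag those rows with disasters$matched := 1. Later, I can use Google to manually look up the regions for the cities/municipalities in $Location that were hit by a disaster.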

for (j in names_nightlight[!region == "just one region", ISO]){
    for (i in names_nightlight[ISO == j, region]){
        disasters[ISO == j][grep(i, Location), Location := i]
        disasters[ISO == j & Location == i, matched := 1]
    }
}

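However, the grep function in my loop doesn't seem to work fully and only matches the exact characters. For example, “Manu'a island” is not matched to “Manu'a”, and “Saint George ” (with a trailing space) is not matched to “Saint George” (without a trailing space).

Among the rows with no match: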

disasters[is.na(matched) == TRUE]

| Start.date |  End.date  | ISO |   Location    |  Disaster.No. | matched |
|------------|------------|-----|---------------|---------------|---------|
| 2005-02-16 | 2005-02-16 | ASM | Manu'a island | 2005-0151     | NA      |
| 2017-09-06 | 2017-09-06 | ATG | Saint George  | 2017-0381     | NA      |
| 2017-09-06 | 2017-09-06 | ATG | Crosbies      | 2017-0381     | NA      |
| 2017-09-06 | 2017-09-06 | ATG | Fort Road     | 2017-0381     | NA      |
| 2017-09-06 | 2017-09-06 | ATG | Clare Hall    | 2017-0381     | NA      |
| 2017-09-06 | 2017-09-06 | ATG | Grays Farm    | 2017-0381     | NA      |
| ...        | ...        | ... | ...           | ...           | ...     |

dput(names_nightlight[1:10])

structure(list(country = c("American Samoa", "American Samoa", 
"American Samoa", "American Samoa", "Antigua & Barbuda", "Antigua & Barbuda", 
"Antigua & Barbuda", "Antigua & Barbuda", "Antigua & Barbuda", 
"Antigua & Barbuda"), region = c("Eastern", "Manu'a", "Unorganized", 
"Western", "Barbuda", "Redonda", "Saint George", "Saint John", 
"Saint Mary", "Saint Paul"), ISO = c("ASM", "ASM", "ASM", "ASM", 
"ATG", "ATG", "ATG", "ATG", "ATG", "ATG")), row.names = c(NA, 
-10L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x7f835380dae0>)

dput(disasters[1:10])

structure(list(Start.date = structure(c(12422, 12422, 12422, 
12422, 12830, 17415, 14167, 14167, 14167, 14167), class = "Date"), 
    End.date = structure(c(12422, 12422, 12422, 12422, 12830, 
    17415, 14168, 14168, 14168, 14168), class = "Date"), Country = c("American Samoa", 
    "American Samoa", "American Samoa", "American Samoa", "American Samoa", 
    "Anguilla", "Antigua and Barbuda", "Antigua and Barbuda", 
    "Antigua and Barbuda", "Antigua and Barbuda"), ISO = c("ASM", 
    "ASM", "ASM", "ASM", "ASM", "AIA", "ATG", "ATG", "ATG", "ATG"
    ), Location = c("Eastern", "Manu'a", "Unorganized", "Western", 
    "Manu'a island", "just one region", "Barbuda", "Redonda", 
    "Saint George", "Saint John"), Latitude = c(NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_), Longitude = c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), Magnitude.value = c(310, 310, 310, 310, NA, NA, 
    NA, NA, NA, NA), Magnitude.scale = c("Kph", "Kph", "Kph", 
    "Kph", "Kph", "Kph", "Kph", "Kph", "Kph", "Kph"), Disaster.type = c("Storm", 
    "Storm", "Storm", "Storm", "Storm", "Storm", "Storm", "Storm", 
    "Storm", "Storm"), Disaster.subtype = c("Tropical cyclone", 
    "Tropical cyclone", "Tropical cyclone", "Tropical cyclone", 
    "Tropical cyclone", "Tropical cyclone", "Tropical cyclone", 
    "Tropical cyclone", "Tropical cyclone", "Tropical cyclone"
    ), Associated.disaster = c("--", "--", "--", "--", "--", 
    "--", "Flood", "Flood", "Flood", "Flood"), Associated.disaster2 = c("--", 
    "--", "--", "--", "--", "--", "--", "--", "--", "--"), Total.deaths = c(0L, 
    0L, 0L, 0L, 0L, 4L, 0L, 0L, 0L, 0L), Total.affected = c(23060L, 
    23060L, 23060L, 23060L, 0L, 15000L, 25800L, 25800L, 25800L, 
    25800L), Total.damage...000.US.. = c(150000L, 150000L, 150000L, 
    150000L, 0L, 200000L, 0L, 0L, 0L, 0L), Insured.losses...000.US.. = c(0, 
    0, 0, 0, 0, 6700, 0, 0, 0, 0), Disaster.name = c("Heta", 
    "Heta", "Heta", "Heta", "Olaf", "Hurricane 'Irma'", "Hurricane \"Omar\"", 
    "Hurricane \"Omar\"", "Hurricane \"Omar\"", "Hurricane \"Omar\""
    ), Disaster.No. = c("2004-0004", "2004-0004", "2004-0004", 
    "2004-0004", "2005-0151", "2017-0381", "2008-0604", "2008-0604", 
    "2008-0604", "2008-0604"), empty_region = c(0, 0, 0, 0, 0, 
    1, 0, 0, 0, 0), matched = c(NA, NA, NA, NA, NA, 1, NA, NA, 
    NA, NA)), .internal.selfref = <pointer: 0x7f835380dae0>, row.names = c(NA, 
-10L), class = c("data.table", "data.frame"))

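【Comments】:

  • Can you make your question reproducible? dput is very handy for sharing datasets.
  • @RomanLuštrik I've just added some output from dput(DT[1:10]) as an example. Is that what you wanted?

Tags: r string character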


【Solution 1】:

I'm not entirely sure what you want to achieve, but note that:

disasters[ISO == j][grep(i, Location), Location := i]

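does “nothing”, because disasters[ISO == j] returns a subsetted data.table that you never assign to a variable, and you then run [grep(i, Location), Location := i] on that unassigned object. The update is applied to the temporary subset and is lost, so disasters itself is unchanged. These two are not the same: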

DT[some subsetting, new_var := ...]
DT[some subsetting][, new_var := ...]
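
The difference can be checked on a small made-up table. This is only a sketch: DT and its contents are hypothetical toy data, not the asker's tables. It also shows that grepl() matches the substring, so the pattern matching is not what fails.

```r
library(data.table)

# Hypothetical toy data, not the asker's table
DT <- data.table(ISO = c("ASM", "ASM", "ATG"),
                 Location = c("Manu'a island", "Eastern", "Saint George "))

# Pattern matching is substring-based, so this is TRUE
substring_matches <- grepl("Manu'a", "Manu'a island")

# Chained form: DT[ISO == "ASM"] is a new, temporary data.table,
# so := updates that temporary object and DT is left untouched.
DT[ISO == "ASM"][grep("Manu'a", Location), Location := "Manu'a"]
after_chained <- copy(DT)

# Single-call form: the row filter and := share one [ ],
# so DT is updated by reference.
DT[ISO == "ASM" & grepl("Manu'a", Location), Location := "Manu'a"]
after_single <- copy(DT)
```

After the chained call, after_chained still holds “Manu'a island” in its first row. Only the single-call form changes DT.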

Read the Note section of ?":=". So try replacing:

disasters[ISO == j][grep(i, Location), Location := i]
disasters[ISO == j & Location == i, matched := 1]

with (str_detect() comes from the stringr package):

disasters[ISO == j & str_detect(Location, i), ":="(Location = i, matched = 1)]
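
Put back into the nested loop, the replacement might look like the sketch below. The two tables are cut-down, hypothetical stand-ins for names_nightlight and disasters, and the loop variables are renamed iso_code and region_name for readability. Wrapping the pattern in stringr's fixed() is an addition not in the answer above: it treats the region name as a literal string rather than a regular expression, which is presumably what is wanted for place names.

```r
library(data.table)
library(stringr)

# Hypothetical, cut-down stand-ins for the asker's tables
names_nightlight <- data.table(
  region = c("Eastern", "Manu'a", "Saint George"),
  ISO    = c("ASM", "ASM", "ATG")
)
disasters <- data.table(
  ISO      = c("ASM", "ASM", "ATG", "ATG"),
  Location = c("Eastern", "Manu'a island", "Saint George ", "Crosbies"),
  matched  = NA_real_
)

for (iso_code in unique(names_nightlight$ISO)) {
  for (region_name in names_nightlight[ISO == iso_code, region]) {
    # Filter and assignment in one [ ] call, so disasters is
    # updated by reference; fixed() makes the match literal.
    disasters[ISO == iso_code & str_detect(Location, fixed(region_name)),
              `:=`(Location = region_name, matched = 1)]
  }
}
```

With this toy input, “Manu'a island” and “Saint George ” are both rewritten to their standard names and flagged, while “Crosbies” keeps matched as NA. Keeping the ISO filter restricts each region name to its own country, which was the asker's stated concern.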

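【Comments】:

  • I use disasters[ISO == j] to avoid matches when regions in different countries have similar names.
  • I just tried your code. It gave me all duplicated Location entries, with only the first item that matched i. @det
  • @Ziroque Like I said, I'm not sure what exactly you want. I just pointed out the part of the code that probably doesn't do what you think it does, and suggested how to fix it. That doesn't mean the rest of the code is right for what you're trying to do (or even the “fixed” part).
  • @Ziroque You should probably write out the expected output. Also, I'm not sure why you need the regions. Can't you try something like: for (i in unique(names_nightlight$region)) disasters[str_detect(Location, i), ":="(Location = i, matched = 1)]
  • Thanks for your help and explanation. str_detect() works.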