How to gsub for matching strings and simultaneously remove non-matching strings? Answers

【Question title】: How to gsub for matching strings and simultaneously remove non-matching strings?
【Posted】: 2019-06-03 19:15:26
【Question description】:

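I have a data frame with a column of strings that I want to relabel into the following categories: city, country, and continent. I am using gsub to replace every city with "City", every country with "Country", and every continent with "Continent".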

#This is what I have
dataframe
Color     Letter     Words
red       A          Paris,Asia,parrot,Antarctica,North America,cat,lizard
blue      A          Panama,New York,Africa,dog,Tokyo,Washington DC,fish
red       B          Copenhagen,bird,USA,Japan,Chicago,Mexico,insect
blue      B          Israel,Antarctica,horse,South America,North America,turtle,Brazil

#This is what I want
dataframe
Color     Letter     New
red       A          City,Continent
blue      A          Country,City,Continent
red       B          City,Country
blue      B          Country,Continent


#This is the code I have so far
dataframe$New <- NA

#groups all the cities
dataframe$New <- lapply(dataframe$Words, function(x) {
   gsub("Paris|New York|Tokyo|Washington DC|Copenhagen|Chicago", "City", x)})

#groups all the countries
dataframe$New <- lapply(dataframe$Words, function(x) {
   gsub("Panama|USA|Japan|Mexico|Israel|Brazil", "Country", x)})

#groups all the continents
dataframe$New <- lapply(dataframe$Words, function(x) {
   gsub("Asia|Antarctica|Africa|North America|South America", "Continent", x)})

dataframe$Words <- NULL

How do I keep dataframe$New from being overwritten each time, and how do I remove the extra words (i.e. fish, horse, cat)?

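The data above is an example based on a very large dataset. In the real dataset, the Words column contains many repeated values. See some sample rows from dataframe$Words below: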

Words
Panama,Paris
Panama,Israel,cat
Panama,Paris,horse,
Panama,Asia
Panama
Panama,Chicago
Israel,Chicago
Israel,lizard,Paris
Israel,Panama,horse,Africa

【Question discussion】:

Tags: r lapply gsub

【Solution 1】:

Consider pasting together several ifelse calls, each of which checks for a specific group of strings:

dataframe$New <- paste(ifelse(grepl("Paris|New York|Tokyo|Washington DC|Copenhagen|Chicago", dataframe$Words), "City", "N/A"), 
                       ifelse(grepl("Panama|USA|Japan|Mexico|Israel|Brazil", dataframe$Words), "Country", "N/A"),
                       ifelse(grepl("Asia|Antarctica|Africa|North America|South America", dataframe$Words), "Continent", "N/A"),
                       sep=",")

dataframe$New <- gsub("N/A,|,N/A", "", dataframe$New)

dataframe

#   Color Letter                                                             Words                    New
# 1   red      A             Paris,Asia,parrot,Antarctica,North America,cat,lizard         City,Continent
# 2  blue      A               Panama,New York,Africa,dog,Tokyo,Washington DC,fish City,Country,Continent
# 3   red      B                   Copenhagen,bird,USA,Japan,Chicago,Mexico,insect           City,Country
# 4  blue      B Israel,Antarctica,horse,South America,North America,turtle,Brazil      Country,Continent

Or a DRY-er version with do.call + lapply:

strs <- list(c("Paris|New York|Tokyo|Washington DC|Copenhagen|Chicago", "City"),
             c("Panama|USA|Japan|Mexico|Israel|Brazil", "Country"),
             c("Asia|Antarctica|Africa|North America|South America", "Continent"))

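# Build one label vector per pattern, then paste them together row by row
dataframe$New2 <- do.call(paste,
                   c(lapply(strs, function(s) ifelse(grepl(s[1], dataframe$Words), s[2], "N/A")), 
                     list(sep=",")))
# Strip the "N/A" placeholders along with their adjoining commas
dataframe$New2 <- gsub("N/A,|,N/A", "", dataframe$New2)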

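【Discussion】:

This almost works, but the N/As cause some problems... ifelse requires a "no" argument for every statement, but if I set everything else in the cell to "N/A", the later ifelse statements won't work. The resulting list contains a large number of "N/A"s that shouldn't be there.
What do you mean? As you can see, it works on your sample. I use N/A as a placeholder that is replaced on the next line.
Is there a way to make the "no" argument "do nothing" until the last ifelse statement?
Please update your post with data that reproduces your problem. You can use an empty string "", but your result may then contain empty comma-delimited slots (e.g., ,Country,Continent).
dataframe$Words has many repeats, so when the "no" argument is N/A in a row, the next ifelse statement can't recognize that the string is present. I added an example above.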
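
One way to sidestep the placeholder question raised in this thread is to never create a placeholder at all: for each row, collect only the labels whose pattern matches and paste those. The following is a minimal base R sketch, not the answerer's code. The vector `words`, the named vector `patterns`, and the helper `label_row` are hypothetical names introduced here; the sample rows are borrowed from the question, plus one made-up row with no matches.

```r
# Sample rows from the question, plus a made-up row with no match at all
words <- c("Panama,Paris", "Panama,Israel,cat", "Panama",
           "Israel,lizard,Paris", "cat,horse")

# Label -> regular expression (same alternations as in the answer above)
patterns <- c(
  City      = "Paris|New York|Tokyo|Washington DC|Copenhagen|Chicago",
  Country   = "Panama|USA|Japan|Mexico|Israel|Brazil",
  Continent = "Asia|Antarctica|Africa|North America|South America"
)

# For one row: test every pattern, keep the names of those that match
label_row <- function(w) {
  hit <- vapply(patterns, function(p) grepl(p, w), logical(1))
  paste(names(patterns)[hit], collapse = ",")
}

# Apply to every row; rows without any match become the empty string
new <- vapply(words, label_row, character(1), USE.NAMES = FALSE)
new
```

Labels come out in the fixed order of `patterns` (City, Country, Continent), not in order of appearance in the row. As in the original answer, grepl matches substrings, so a pattern such as Asia would also match inside a longer word.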

【Solution 2】:

It is better to create a list of key/value pairs and then extract the replacement values by matching on the keys

library(gsubfn)
# key val list
lst1 <- list(Paris = "City", `New York` = "City", Tokyo = "City", 
  `Washington DC` = "City", 
    Copenhagen = "City", Chicago = "City", Panama = "Country", 
    USA = "Country", Japan = "Country", Mexico = "Country", Israel = "Country", 
    Brazil = "Country", Asia = "Continent", Antarctica = "Continent",      
    Africa = "Continent", `North America` = "Continent", 
    `South America` = "Continent")

Use strapply to extract the matched values into a list, loop over that list with sapply, and paste together the unique strings 'City', 'Continent' or 'Country'

nm1 <- c("City", "Continent", "Country")
df1$New <- sapply(strapply(df1$Words,  "([^,]+)", lst1), function(x)  
        paste(unique(x[x %in% nm1]), collapse=","))
df1$New
#[1] "City,Continent"         "Country,City,Continent"
#[3] "City,Country"           "Country,Continent"
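
The same key/value idea can also be sketched without gsubfn, assuming the words are always separated by commas: split each row with strsplit, look every token up in a named character vector, and drop the tokens that have no entry. This is an illustrative base R variant, not part of the original answer; `lookup`, `words` and `label_words` are hypothetical names.

```r
cities     <- c("Paris", "New York", "Tokyo", "Washington DC", "Copenhagen", "Chicago")
countries  <- c("Panama", "USA", "Japan", "Mexico", "Israel", "Brazil")
continents <- c("Asia", "Antarctica", "Africa", "North America", "South America")

# Named vector: names are the words, values are the category labels
lookup <- c(setNames(rep("City", length(cities)), cities),
            setNames(rep("Country", length(countries)), countries),
            setNames(rep("Continent", length(continents)), continents))

# The four Words rows from the question
words <- c("Paris,Asia,parrot,Antarctica,North America,cat,lizard",
           "Panama,New York,Africa,dog,Tokyo,Washington DC,fish",
           "Copenhagen,bird,USA,Japan,Chicago,Mexico,insect",
           "Israel,Antarctica,horse,South America,North America,turtle,Brazil")

label_words <- function(w) {
  tokens <- trimws(strsplit(w, ",", fixed = TRUE)[[1]])
  labels <- lookup[tokens]                  # NA for words not in the lookup
  # Keep each label once, in order of first appearance
  paste(unique(labels[!is.na(labels)]), collapse = ",")
}

new <- vapply(words, label_words, character(1), USE.NAMES = FALSE)
new
```

Because tokens are compared as whole words, a word that merely contains a key is not matched, and the labels keep the order in which they first occur in each row.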

Data

df1 <- structure(list(Color = c("red", "blue", "red", "blue"), Letter = c("A", 
"A", "B", "B"), Words = c("Paris,Asia,parrot,Antarctica,North America,cat,lizard", 
"Panama,New York,Africa,dog,Tokyo,Washington DC,fish", 
  "Copenhagen,bird,USA,Japan,Chicago,Mexico,insect", 
"Israel,Antarctica,horse,South America,North America,turtle,Brazil"
)), class = "data.frame", row.names = c(NA, -4L))

【Discussion】: