From string to regex to new string (answers)

【Question title】: From string to regex to new string
【Posted】: 2017-02-14 20:22:10
【Question】:

I have a data frame with a column of messy strings. Each messy string contains the name of a single country somewhere in it. Here is a toy version:

df <- data.frame(string = c("Russia is cool (2015) ",
                            "I like - China",
                            "Stuff happens in North Korea"),
                 stringsAsFactors = FALSE)

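Thanks to the countrycode package, I also have a second data set that includes two useful columns: one with regular expressions for country names (regex) and another with the associated country name (country.name). We can load this data set like so: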

library(countrycode)
data(countrycode_data)

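I want to write code that uses the regular expressions in countrycode_data$regex to spot the country name in each row of df$string; associates that regular expression with the proper country name in countrycode_data$country.name; and, finally, writes that name to the relevant position in a new column, df$country. After this yet-to-be-determined operation, df would look like this: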

                        string                                country
1       Russia is cool (2015)                      Russian Federation
2               I like - China                                  China
3 Stuff happens in North Korea Korea, Democratic People's Republic of

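I can't quite wrap my head around how to do this. I have tried various combinations of grepl, which, tolower, and %in%, but I keep getting the direction or the dimensions (or both) wrong.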

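【Comments】:

I don't see a regex column in the countrycode_data data frame? ... Edit: never mind, I think I found it; it's called country.name.en.regex?
The relevant column in countrycode_data should just be called regex. The associated column with the proper names is country.name.
Maybe something like this would help: stackoverflow.com/questions/21165256/…
@ulfelder The regex column was renamed to country.name.en.regex in version 0.19 of the package. I am the countrycode author, and cjyetman gives the correct answer below. countrycode should work for your use case; you just ran into a known issue with the North Korea regex. It should work for most other countries.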

Tags: r regex string match

【Solution 1】:

This is exactly what the countrycode package is for, so there is no reason to code it yourself. Just use it like this:

library(countrycode)
df <- data.frame(string = c("Russia is cool (2015) ", "I like - China",
                            "Stuff happens in North Korea"), stringsAsFactors = FALSE)

df$country.name <- countrycode(df$string, 'country.name', 'country.name')

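In this particular case, it will not find an unambiguous match for "Stuff happens in North Korea", but that is really a problem with the regexes for North Korea and South Korea (I opened an issue about it here: https://github.com/vincentarelbundock/countrycode/issues/139). Otherwise, what you want to do should work in principle.

(Special note for @ulfelder: a new version of countrycode, v0.19, was just released on CRAN. Since we added new languages, some column names have changed: country.name is now country.name.en, and regex is now country.name.en.regex.)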

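【Comments】:

【Solution 2】:

I am the countrycode maintainer. @cj-yetman gave the correct answer. The specific North Korea problem you ran into is now fixed in the development version of countrycode on GitHub.

You can use countrycode directly to convert sentences to country names or codes: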

> library(devtools)
> install_github('vincentarelbundock/countrycode')
> library(countrycode)
> df <- data.frame(string = c("Russia is cool (2015) ",
+                             "I like - China",
+                             "Stuff happens in North Korea"),
+                  stringsAsFactors = FALSE)
> df$iso3c = countrycode(df$string, 'country.name', 'country.name')
> df
                        string                                 iso3c
1       Russia is cool (2015)                     Russian Federation
2               I like - China                                 China
3 Stuff happens in North Korea Democratic People's Republic of Korea

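【Comments】:

Thanks, @Vincent! In a way, I'm glad I got the more general answers before the countrycode-specific one, because this problem is likely to come up again in situations where no package solves it.
Is there an efficient way to use countrycode to capture multiple country names in a single string? For example, if I have the string "Report of the Secretary-General on the Sudan and South Sudan", I would like to return a string like "Sudan; South Sudan". I know how to do the collapsing. It's returning more than one match that has me stumped.
Not with countrycode out of the box, but if you look at the internal code, the package already keeps track of multiple matches. You could reuse the same code and capture destination_list. See here: github.com/vincentarelbundock/countrycode/blob/master/R/…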
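
For readers who want the multiple-match behaviour without digging into the package internals, here is a minimal base R sketch. It does not use countrycode or its destination_list; the `lookup` table, its two simplified patterns, and the helper name `all_matches` are all made up for illustration, and the package's real patterns are more elaborate.

```r
# Hypothetical lookup table (illustration only, not countrycode's data).
lookup <- data.frame(
  name  = c("Russian Federation", "China"),
  regex = c("\\brussia", "\\bchina\\b"),
  stringsAsFactors = FALSE
)

# For each string, test every pattern and collapse all matching names.
# Strings with no match yield NA.
all_matches <- function(strings, lookup, sep = "; ") {
  vapply(strings, function(s) {
    hit <- vapply(lookup$regex, grepl, logical(1), x = s,
                  ignore.case = TRUE, perl = TRUE)
    if (any(hit)) paste(lookup$name[hit], collapse = sep) else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

res <- all_matches(
  c("Trade between Russia and China", "I like - China", "Nothing here"),
  lookup
)
print(res)
```

Names come back in the order of the lookup table, not the order in which they appear in the string. Overlapping names such as Sudan and South Sudan would need patterns that exclude each other, which this toy table does not attempt.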

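【Solution 3】:

In this case I would use a for loop, but loop over the rows of the countrycode_data data.frame, since it has only about 200 rows, whereas real-world raw data may be several orders of magnitude larger.

Because the names are rather long, I extract the two columns from the countrycode data: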

patt <- countrycode_data$country.name.en.regex[!is.na(countrycode_data$country.name.en.regex)]
name <- countrycode_data$country.name.en[!is.na(countrycode_data$country.name.en.regex)]

Then we can loop and write to the new column:

for(i in seq_along(patt)) {
  df$country[grepl(patt[i], df$string, ignore.case=TRUE, perl=TRUE)] <- name[i]
}

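As others have pointed out, North Korea does not match the regex specified in the countrycode data.

【Comments】:

Elegant, thanks. (And, as it happens, I actually got the expected result for North Korea, too.)
Yes, nice idea. I had the same thought using stringi, something like which(sapply(countrycode_data$country.name.en.regex, stringi::stri_detect_regex, str = tolower(df$string)), arr.ind = TRUE) (where col is the row index in countrycode_data$country.name.en)
@DavidArenburg That's a nice option too. In the end, you have to make one (and only one) loop somehow. stringi would probably speed up the regex matching considerably (and could of course be adopted in my approach as well).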
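
The match-matrix idea from the comments can be sketched in base R as follows. This is only an illustration under stated assumptions: it swaps stringi::stri_detect_regex for base grepl, and the `lookup` table with its simplified patterns is hypothetical rather than taken from countrycode_data.

```r
# Hypothetical lookup table (illustration only, not countrycode's data).
lookup <- data.frame(
  name  = c("Russian Federation", "China"),
  regex = c("russia", "china"),
  stringsAsFactors = FALSE
)
strings <- c("Russia is cool (2015) ", "I like - China", "No country here")

# One loop over the patterns; grepl is vectorised over the strings.
hits <- vapply(lookup$regex, grepl, logical(length(strings)),
               x = strings, ignore.case = TRUE, perl = TRUE)
# Force a strings-by-patterns matrix even when there is a single string.
dim(hits) <- c(length(strings), nrow(lookup))

# Row index = position in strings, column index = row of the lookup table.
idx <- which(hits, arr.ind = TRUE)
country <- rep(NA_character_, length(strings))
country[idx[, "row"]] <- lookup$name[idx[, "col"]]
print(country)
```

If a string matches more than one pattern, the last matching pattern overwrites the earlier ones, which is the same behaviour as the for loop in the answer.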

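【Solution 4】:

Here is a working solution, but I reference different column names in the countrycode_data frame because they appear differently on my system. I also make several *apply calls, which may not be ideal. I'm sure some of this could be vectorised; I'm just not sure how to do it myself.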

matches <- sapply( df$string, function( x ) {

    # find matches by running all regex strings (maybe could be vectorised?)
    find.match <- lapply( countrycode_data$country.name.en.regex, grep, x = x, ignore.case = TRUE, perl = TRUE )

    # note down which patterns came up with a match
    matches <- which( sapply( find.match, length ) > 0 )

    # now cull the matches list down to only those with a match
    find.match <- find.match[ sapply( find.match, length ) > 0 ]

    # get rid of NA matches (not sure why these come up)
    matches <- matches[ sapply( find.match, is.na ) == FALSE ]

    # now only return the value (reference to the match) if there is one (otherwise we get empty returns)
    ifelse( length( matches ) == 0, NA_integer_, matches )
} )

# now use the vector of references to match up country names
df$country <- countrycode_data$country.name.en[ matches ]

> df
                        string            country
1       Russia is cool (2015)  Russian Federation
2               I like - China              China
3 Stuff happens in North Korea               <NA>

grepl( "^(?=.*democrat|people|north|d.*p.*.r).*\\bkorea|dprk|korea.*(d.*p.*r)",
       c( "korea", "north korea", "aaa north korea" ),
       perl = TRUE, ignore.case = TRUE )
# [1] FALSE  TRUE FALSE
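
One possible reading of the result above: inside the lookahead, only the first alternative starts with `.*`, so `people`, `north`, and `d.*p.*.r` must match right at the start of the string. The sketch below compares the pattern as quoted with a hypothetical variant that groups the alternatives. The grouped pattern is an assumption for illustration and is not claimed to be the fix adopted in countrycode.

```r
x <- c("korea", "north korea", "aaa north korea")

# Pattern as quoted above.
published <- "^(?=.*democrat|people|north|d.*p.*.r).*\\bkorea|dprk|korea.*(d.*p.*r)"

# Hypothetical variant: the alternatives are grouped, so the leading .*
# applies to each of them and "north" may appear anywhere in the string.
grouped <- "^(?=.*(democrat|people|north|d.*p.*.r)).*\\bkorea|dprk|korea.*(d.*p.*r)"

res_published <- grepl(published, x, perl = TRUE, ignore.case = TRUE)
res_grouped   <- grepl(grouped,   x, perl = TRUE, ignore.case = TRUE)

print(res_published)  # FALSE  TRUE FALSE
print(res_grouped)    # FALSE  TRUE  TRUE
```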

【Comments】:

【Solution 5】:

Here is a possible solution using a cross join (which will blow up your data):

library(countrycode)
data(countrycode_data)

library(data.table)
df <- data.table(string = c("Russia is cool (2015) ",
                            "I like - China",
                            "Stuff happens in North Korea"),
                 stringsAsFactors = FALSE)

# adding dummy for full cross-join merge
df$dummy <- 0L
country.dt <- data.table(countrycode_data[, c("country.name.en", "country.name.en.regex")])
country.dt$dummy <- 0L

# merging original data to countries to get all possible combinations
res.dt <- merge(df, country.dt, by ="dummy", all = TRUE, allow.cartesian = TRUE)

# there are cases with NA regex
res.dt <- res.dt[!is.na(country.name.en.regex)]

# find matches
res.dt[, match := grepl(country.name.en.regex, string, perl = TRUE, ignore.case = TRUE), by = 1:nrow(res.dt)]

# filter out matches
res.dt <- res.dt[match == TRUE, .(string, country.name.en)]
res.dt

#                    string    country.name.en
# 1:  Russia is cool (2015) Russian Federation
# 2:         I like - China              China

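【Comments】:

Why cross join if you end up operating row by row anyway? A simple sapply would do, IMO.
I agree; in this particular case it is not a great solution, since the expected number of matches is small. But it could be useful for similar tasks.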
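
For comparison, the same cross-join idea can be sketched in base R without the dummy key, since merge() with by = NULL returns the Cartesian product. The `lookup` table and its simplified patterns are hypothetical stand-ins for the countrycode columns.

```r
df <- data.frame(string = c("Russia is cool (2015) ",
                            "I like - China",
                            "Stuff happens in North Korea"),
                 stringsAsFactors = FALSE)

# Hypothetical lookup table (illustration only, not countrycode's data).
lookup <- data.frame(
  name  = c("Russian Federation", "China"),
  regex = c("russia", "china"),
  stringsAsFactors = FALSE
)

# by = NULL pairs every string with every pattern (nrow(df) * nrow(lookup) rows).
pairs <- merge(df, lookup, by = NULL)

# Test each pattern against the string it was paired with.
pairs$match <- mapply(grepl, pairs$regex, pairs$string,
                      MoreArgs = list(ignore.case = TRUE, perl = TRUE))

# Keep only the pairs that matched.
res <- pairs[pairs$match, c("string", "name")]
print(res)
```

As with the data.table version, the intermediate table grows with the product of the two row counts, so this only pays off when the per-pair test is cheap or the inputs are small.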