gsub - adding whitespace before/after the & character

【Question title】: gsub - adding whitespace before/after & character
【Posted】: 2016-11-17 08:48:37
【Question description】:

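First post on stackoverflow, hopefully the first of many.

I'm cleaning a data set in which one of the columns contains a list of authors. When there are multiple authors, they are separated by an ampersand, e.g. Smith & Banks. However, the spacing is not always consistent, e.g. Smith& Banks, Smith &Banks.

To fix this, I tried: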

     gsub('\\S&','\\S &', dataset[,author.col])

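This gives Smith&Banks -> SmitS & Banks. How can I get -> Smith & Banks?

【Question comments】:

Do you ever have cases such as Smith &&Banks, i.e. multiple ampersands between the same authors?
I don't have those cases; the only separator between different names is a single ampersand.

Tags: regex r gsub

【Solution 1】:

Here is a solution that makes two calls to gsub: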

# capture the non-space character next to "&" and re-insert it with a literal space
dataset[,author.col] <- gsub('(\\S)&','\\1 &', dataset[,author.col])
dataset[,author.col] <- gsub('&(\\S)','& \\1', dataset[,author.col])
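
The attempt in the question goes wrong because the replacement string of gsub is not a regex: \\S there is just the letter S, and the character that \\S matched in the pattern is consumed. A minimal sketch of the capture-group idea, with a made-up vector x standing in for dataset[,author.col]:

```r
# x is a hypothetical stand-in for dataset[, author.col]
x <- c("Smith&Banks", "Smith& Banks", "Smith &Banks", "Smith & Banks")

# The question's attempt: the character matched by \\S is lost and a
# literal "S" is written in its place
attempt <- gsub('\\S&', '\\S &', x)

# Capture the neighbouring non-space character and put it back with \\1,
# adding a literal space (not \\s) in the replacement
step1 <- gsub('(\\S)&', '\\1 &', x)       # space before "&" where missing
fixed <- gsub('&(\\S)', '& \\1', step1)   # space after "&" where missing
print(fixed)
```

Note that this only adds missing spaces; runs of several spaces around the ampersand are left as they are.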

【Comments】:

【Solution 2】:

Here is a way that uses only sub:

sub("\\b(?=&)|(?<=&)\\b", " ",  v1, perl = TRUE)
#[1] "Smith & Banks" "Smith & Banks"

With data that has more combinations (above, I only considered the cases shown in the OP's post):

 gsub("\\s*(?=&)|(?<=&)\\s*", " ", data, perl = TRUE)
 #[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"

 gsub("\\s*&+|\\&+\\s*", " & ", data1)
 #[1] "Smith &  Banks" "Smith & Banks"  "Smith & Banks"  
 #[4] "Smith & Banks"  "Smith & Banks"  "Smith &  Banks" "Smith & Banks"

Or strsplit:

sapply(strsplit(data1, "\\s*&+\\s*"), paste, collapse = " & ")
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks" 
#[5] "Smith & Banks" "Smith & Banks" "Smith & Banks"

Essentially, if there are many patterns, the strsplit approach would be better.
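
One way to read that advice: splitting on the separator and cleaning each piece keeps the two concerns apart, so the regex stays simple as more spacing variants appear. A small sketch under that assumption; normalize_amp is a hypothetical helper name, not something from the answer:

```r
# Hypothetical helper: split on runs of "&", trim each name, re-join
normalize_amp <- function(x) {
  parts <- strsplit(x, "&+")                       # one or more "&" is one separator
  vapply(parts,
         function(p) paste(trimws(p), collapse = " & "),
         character(1))
}

test_input <- c("Smith& Banks", "Smith &&Banks", "Smith  &  Banks &Nash", "Smith")
cleaned <- normalize_amp(test_input)
print(cleaned)
```

Entries without an ampersand pass through unchanged, since strsplit returns them as a single piece.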

Data

v1 <- c("Smith& Banks", "Smith &Banks")
data = c("Smith& Banks", "Smith &Banks", "Smith & Banks", 
     "Smith &     Banks", "Smith&Banks")
data1 <- c(v1, "Smith&& Banks", "Smith && Banks", "Smith&&Banks")

【Comments】:

【Solution 3】:

Here is another gsub approach:

# some test cases
authors <- c("Smith& Banks", "Smith   &Banks", "Smith&Banks", "Smith & Banks")
gsub("\\s*&\\s*", " & ", authors)
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"

More test cases (more than 2 authors, single author):

authors <- c("Smith& Banks", "Smith   &Banks &Nash", "Smith&Banks", "Smith & Banks", "Smith")
gsub("\\s*&\\s*", " & ", authors)
#[1] "Smith & Banks"        "Smith & Banks & Nash" "Smith & Banks"        "Smith & Banks"        "Smith"

As the OP noted in the comments on their question, multiple ampersands between two authors do not occur in the data.
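
If runs of ampersands did occur, one possible adjustment would be to let the pattern treat a whole run as a single separator. A short sketch for comparison; the vector x is made up for illustration:

```r
# x is a hypothetical input containing a doubled ampersand
x <- c("Smith &&Banks", "Smith&Banks & Nash")

single <- gsub("\\s*&\\s*", " & ", x)    # each "&" is replaced on its own
run    <- gsub("\\s*&+\\s*", " & ", x)   # a run of "&" counts as one separator
print(single)
print(run)
```

With &+ the doubled ampersand collapses to one " & ", while entries that already use a single ampersand come out the same either way.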

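【Comments】:

If a name contains multiple ampersands, your regex may have unintended consequences.
That depends on your definition of "name". The case of multiple ampersands between the same names in a string (like Smith &&Banks) is problematic - I agree, but that is not how I read the question - whereas multiple ampersands between different names in a vector element (like @ 987654325@) do not cause any problems
Of course, some of the other answers don't handle multiple ampersands correctly either - but I'd first like the OP to clarify whether this can happen
This worked for me, thanks. Many rows contain multiple ampersands, but never more than one separating two names.

【Solution 4】:

An overkill way using stringi: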

v <- c("Smith & Banks", "Smith& Banks", "Smith &Banks", "Smith&Banks", "Smith Banks")

library(stringi)
#create an index of entries containing "&"
indx <- grepl("&", v)
#subset "v" using that index
amp  <- v[indx]
#perform the transformation on that subset and combine the result with the rest of "v"
c(sapply(stri_extract_all_words(amp), 
         function(x) { paste0(x, collapse = " & ") }), v[!indx])

This gives:

#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith Banks"
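
One thing to watch with c(..., v[!indx]): entries without an ampersand are moved to the end, which only preserves the order here because "Smith Banks" happens to be last. A sketch of an order-preserving variant that assigns back into the subset; to keep it dependency-free it swaps in base R's regmatches/gregexpr for stri_extract_all_words, and the vector v is made up:

```r
# Hypothetical input where the entries without "&" are not at the end
v <- c("Smith Banks", "Smith&Banks", "Jones", "Smith &Banks")

has_amp <- grepl("&", v, fixed = TRUE)

# Base R stand-in for stri_extract_all_words: pull out alphanumeric runs
words <- regmatches(v[has_amp], gregexpr("[[:alnum:]]+", v[has_amp]))

# Assign back by index so every element keeps its original position
v[has_amp] <- vapply(words, paste, character(1), collapse = " & ")
print(v)
```

Like the word-extraction approach it is based on, this would also split multi-word names, so it only suits data where each author is a single word.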

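【Comments】:

This solution will add & where there is none: "Smith Banks" -> "Smith & Banks"
@WiktorStribiżew You're right. I just re-read the question and noticed there are also entries without an ampersand.
That's a nice option. I thought you had deleted it, so I used a similar strsplit.
@akrun I initially did, following Wiktor's comment, but ended up tweaking it. Nice strsplit

【Solution 5】: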

data = c("Smith& Banks", "Smith &Banks", "Smith & Banks", 
         "Smith &     Banks", "Smith&Banks")

# Take the 0 or more spaces before and after the ampersand, replace them by " & "
gsub("[ ]*&[ ]*", " & ", data) 
# [1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"
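
To tie this back to the question's setup, the same call can be assigned straight into the column. A minimal sketch on a made-up data frame; the contents of dataset and the column name "author" are invented, only the names dataset and author.col come from the question:

```r
# Hypothetical data frame standing in for the OP's data set
dataset <- data.frame(
  title  = c("Paper A", "Paper B", "Paper C"),
  author = c("Smith&Banks", "Smith &Banks", "Smith"),
  stringsAsFactors = FALSE
)
author.col <- "author"

# Overwrite the author column with the normalized strings
dataset[, author.col] <- gsub("[ ]*&[ ]*", " & ", dataset[, author.col])
print(dataset[, author.col])
```

Note that [ ]* matches literal spaces only; if the data could contain tabs around the ampersand, [[:space:]]* would be the broader choice.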

【Comments】:

【Solution 6】:

Also try this:

gsub("([^& ]+)\\W+([^&]+)","\\1 & \\2",authors)
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"
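
A caveat worth checking before using this pattern: \\W+ matches any run of non-word characters, not just ampersands, so an entry with a plain space between two words also gets an ampersand. A quick sketch of the difference, on a made-up input:

```r
# Hypothetical entry with no ampersand at all
no_amp <- "Smith Banks"

# Solution 6's pattern: the space alone satisfies \\W+, so "&" is inserted
with_W <- gsub("([^& ]+)\\W+([^&]+)", "\\1 & \\2", no_amp)

# A pattern anchored on the ampersand leaves the entry untouched
with_amp <- gsub("\\s*&\\s*", " & ", no_amp)

print(with_W)
print(with_amp)
```

So this pattern seems safest when every entry is known to contain an ampersand between single-word names.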

【Comments】: