Remove strings that contain a colon in R: answers

【Question title】: Remove strings that contain a colon in R
【Posted】: 2018-03-03 20:46:49
【Question description】:

Here is an illustrative excerpt of my data set:

Description;ID;Date
wa119:d Here comes the first row;id_112;2018/03/02
ax21:3 Here comes the second row;id_115;2018/03/02
bC230:13 Here comes the third row;id_234;2018/03/02

I want to remove the words that contain a colon. In this case those are wa119:d, ax21:3, and bC230:13, so my new data set should look like this:

Description;ID;Date
Here comes the first row;id_112;2018/03/02
Here comes the second row;id_115;2018/03/02
Here comes the third row;id_234;2018/03/02

Unfortunately, I could not find a regular expression or a gsub-based solution. Can anyone help?

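【Question comments】:

The third row of the sample data contains two words with a colon. Please clarify the desired output in the question text.
Thanks. I updated the data frame accordingly.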

Tags: r string replace

【Solution 1】:

Here is one way:

## reading in your data
dat <- read.table(text ='
Description;ID;Date
wa119:d Here comes the first row;id_112;2018/03/02
ax21:3 Here comes the second row;id_115;2018/03/02
bC230:13 Here comes the third row;id:234;2018/03/02
', sep = ';', header = TRUE, stringsAsFactors = FALSE)

## \\w+ = one or more word characters
gsub('\\w+:\\w+\\s+', '', dat$Description)

## [1] "Here comes the first row"  
## [2] "Here comes the second row"
## [3] "Here comes the third row"

More on \\w, a shorthand character class equivalent to [A-Za-z0-9_]: https://www.regular-expressions.info/shorthand.html

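【Discussion】:

The output of the gsub call does not seem to match the desired output quoted in the question. The OP is unclear about the actual structure of the data, and there is a discrepancy with the expected output.
Thanks! But how do I remove the leading space that is left after the word with the colon is removed?
You can remove it with gsub('\\w+:\\w+\\s+', '', dat$Description), but I am not sure whether this will work for your actual data. I guess it depends on what your data look like. If the colon word always appears at the front of the string, the problem becomes easier. In that case I would use the start-of-line anchor ^, as in '^\\w+:\\w+\\s+', because it is more explicit and probably more efficient.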
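To make the difference between the unanchored and the anchored pattern concrete, here is a minimal sketch. The vector `x` is made up for illustration (its second element has an extra colon word in the middle) and is not part of the original data.

```r
# Hypothetical input: the second element also has a colon word mid-string.
x <- c("wa119:d Here comes the first row",
       "ax21:3 Here comes b:2 the second row")

# Unanchored: removes every colon word that is followed by whitespace.
unanchored <- gsub("\\w+:\\w+\\s+", "", x)

# Anchored: removes a colon word only at the start of the string.
# sub() is enough here because ^ can match at most once per string.
anchored <- sub("^\\w+:\\w+\\s+", "", x)

print(unanchored)
print(anchored)
```

Which variant is appropriate depends on whether colon words elsewhere in the string should also be dropped.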

【Solution 2】:

Suppose the column you want to modify is dat:

dat <- c("wa119:d Here comes the first row",
         "ax21:3 Here comes the second row",
         "bC230:13 Here comes the third row")

Then you can take each element, split it into words, drop the words that contain a colon, and paste what is left back together to get what you want:

dat_colon_words_removed <- unlist(lapply(dat, function(string){
  words <- strsplit(string, split=" ")[[1]]
  words <- words[!grepl(":", words)]
  paste(words, collapse=" ")
}))
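The same split, filter, and paste steps can also be wrapped in a small reusable function. This is only a sketch: the name `remove_colon_words` is hypothetical, and it assumes words are separated by single spaces, as in the sample data.

```r
dat <- c("wa119:d Here comes the first row",
         "ax21:3 Here comes the second row",
         "bC230:13 Here comes the third row")

# Hypothetical helper: split each string on single spaces, drop every
# word that contains a colon, and paste the remaining words back together.
remove_colon_words <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE), function(words) {
    paste(words[!grepl(":", words, fixed = TRUE)], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

result <- remove_colon_words(dat)
print(result)
```

Unlike the regex in Solution 1, this drops any space-delimited word containing a colon, regardless of which other characters the word contains.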

【Discussion】:

【Solution 3】:

Another solution that exactly matches the OP's expected result could be:

#data
df <- read.table(text = "Description;ID;Date
wa119:d Here comes the first row;id_112;2018/03/02
ax21:3 Here comes the second row;id_115;2018/03/02
bC230:13 Here comes the third row;id:234;2018/03/02", stringsAsFactors = FALSE, sep="\n")

gsub("[a-zA-Z0-9]+:[a-zA-Z0-9]+\\s", "", df$V1)

#[1] "Description;ID;Date"                        
#[2] "Here comes the first row;id_112;2018/03/02" 
#[3] "Here comes the second row;id_115;2018/03/02"
#[4] "Here comes the third row;id:234;2018/03/02"
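The pattern above leaves id:234 alone because that token is followed by a semicolon rather than whitespace. If the data are instead parsed into columns, the same pattern can be applied to the Description column only. The following is a sketch under that assumption; it reads the sample with sep = ";" as in Solution 1.

```r
# Read the sample into three columns instead of one line per row.
df <- read.table(text = "Description;ID;Date
wa119:d Here comes the first row;id_112;2018/03/02
ax21:3 Here comes the second row;id_115;2018/03/02
bC230:13 Here comes the third row;id:234;2018/03/02",
                 sep = ";", header = TRUE, stringsAsFactors = FALSE)

# Apply the pattern to Description only; ID and Date are never touched.
df$Description <- gsub("[a-zA-Z0-9]+:[a-zA-Z0-9]+\\s", "", df$Description)

print(df)
```

This keeps the header row out of the substitution and makes it explicit that only one column is being modified.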

【Discussion】: