Extracting multiple strings from a column with comma separated values: answers

[Question title]: Extracting multiple strings from a column with comma separated values
[Posted]: 2018-10-10 20:20:04
[Question]:

I have a data frame like this:

df <- structure(list(mut = c("Q184H/CAA-CAT", "I219V/ATC-GTC", "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA","0")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

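What I want is to put everything after the "/" into a new column, for every comma-separated value in each row, no matter how many entries a row contains.

What I would like to get: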

    mut                    nt
1   Q184H                  CAA-CAT
2   I219V                  ATC-GTC
3   A314T, P373Q, A653E    GCG-ACG, CCG-CAG, GCG-GAA
4   0                      0

I have tried using a regex for this, but I can't seem to match each of the comma-separated entries.

library(dplyr)
df %>%
    mutate(nt = gsub(".+/(.*?)", "\\1", mut))

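How do I get every entry to match? Do I have to split them first and then match?

[Comments on the question]:

Tags: r regex

[Solution 1]:

You only need to tweak your regex slightly; note how I changed your .s to [^,]s. In a regex, wrapping characters in square brackets and starting with ^ means "match any character other than the ones listed". So [^,]+ matches a run of one or more consecutive characters that are not commas, taking as many as it can.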

df = structure(list(mut = c("Q184H/CAA-CAT", "I219V/ATC-GTC",
                            "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA","0")),
               row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df %>%
    mutate(nt = gsub("[^,]+?/([^,]+?)", "\\1", mut),
           mut = gsub("([^/]+)/[^,]+", "\\1", mut))
#> # A tibble: 4 x 2
#>   mut                 nt                     
#>   <chr>               <chr>                  
#> 1 Q184H               CAA-CAT                
#> 2 I219V               ATC-GTC                
#> 3 A314T, P373Q, A653E GCG-ACG,CCG-CAG,GCG-GAA
#> 4 0                   0

Created on 2018-10-10 by the reprex package (v0.2.1)
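
To see why the pattern from the question only returns the last entry, here is a minimal base R sketch that runs on the third row alone. The variable names and the `[^, ]` variant (which also excludes a space, so the ", " separators survive) are illustrative assumptions and not part of the answer above.

```r
# Third row of the example data, used on its own for illustration.
x <- "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA"

# Pattern from the question: `.` also matches commas, so `.+/` can swallow
# everything up to the last "/" and only the final entry is left.
whole_string <- gsub(".+/(.*?)", "\\1", x)
whole_string
#> [1] "GCG-GAA"

# Illustrative variant: `[^, ]+` cannot cross a comma or a space, so the
# replacement is applied once per entry and the ", " separators are kept.
per_entry <- gsub("[^, ]+/([^,]+)", "\\1", x)
per_entry
#> [1] "GCG-ACG, CCG-CAG, GCG-GAA"
```

The pattern in the answer itself extracts the same codes but removes the space after each comma, as its printed output shows.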

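[Comments]:

Thanks! So the square brackets specify that a comma "can" be there, but doesn't have to be?
@Haakonkas I added some extra explanation.

[Solution 2]:

Do not accept this as the answer (@duckmayr did the regex debugging). I am posting it only to show that with stringi we can write self-documenting regexes, so that our future selves don't end up hating our past selves: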

library(stringi) # it's what stringr uses
library(tidyverse)

xdf <- structure(list(mut = c("Q184H/CAA-CAT", "I219V/ATC-GTC", "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA","0")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

mutate(
  xdf, 
  nt = stri_replace_all_regex(
    str = mut,
    pattern = "
[^,]+?  # one or more non-comma characters, matched lazily (as few as possible)
/       # followed by a forward slash
(       # start of match group
 [^,]+? # same as above
)       # end of match group
",
    replacement = "$1", # take the match group value as the value
    opts_regex = stri_opts_regex(comments=TRUE)
  ),
  mut = stri_replace_all_regex(
    str = mut,
    pattern = "
(      # start of match group
 [^/]+ # match anything but a forward slash
)      # end of match group
/      # followed by a forward slash
[^,]+  # match anything but a comma
",
    replacement = "$1", # take the match group value as the value
    opts_regex = stri_opts_regex(comments=TRUE)
  )
)
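
The same commenting idea can be sketched without stringi, assuming base R's `gsub()` with `perl = TRUE`: the inline `(?x)` flag switches PCRE into extended mode, which ignores unescaped whitespace in the pattern and treats `#` as the start of a comment. The vector `x` and the `[^,\\s]+` prefix (chosen so the space after each comma is kept) are illustrative choices, not taken from the answer above.

```r
# A few cells from the example column, including the "0" placeholder row.
x <- c("Q184H/CAA-CAT", "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA", "0")

# (?x) enables PCRE extended mode: whitespace in the pattern is ignored
# and "#" starts a comment that runs to the end of the line.
pattern <- "(?x)
  [^,\\s]+   # one or more characters that are neither a comma nor whitespace
  /          # followed by a forward slash
  ( [^,]+ )  # capture group: everything up to the next comma
"

# Replace each entry with its captured part; cells without "/" are unchanged.
nt <- gsub(pattern, "\\1", x, perl = TRUE)
nt[2]
#> [1] "GCG-ACG, CCG-CAG, GCG-GAA"
```

Whether this or the stringi version reads better is a matter of taste; the stringi call keeps the option explicit in `stri_opts_regex(comments=TRUE)`, while the base R sketch hides it in the pattern itself.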

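[Comments]:

This is really cool. I didn't know this feature existed.
Yes. It's a lifesaver for those really gnarly regexes. I wanted to provide a compact example (examples of regexes with documentation are rare; I'm as guilty of that as anyone, even when I answer questions with stringi).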