Extracting multiple regex matches from text in R: answers

【Question title】: Multiple regex extract from a text R
【Posted】: 2018-02-09 20:34:18
【Question】:

I have the following df

df = data.frame(id = c(1,2,3), text = c('Label issues as ISS101  and ISS 201 on label 23 with x203 17','issue as ISS5051 with label 01 as l018','there is nothing here'))

From df I want to extract and create the following data frame

id  iss     label  ext1 ext2
 1  ISS101  23     x203  17
 1  ISS201  23     x203  17
 2  ISS5051 01     l018  NA
 3    NA    NA      NA   NA

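As the example shows, the length of iss can vary, and there may or may not be a space between "ISS" and the digits that follow. The lengths of label, ext1 and ext2 are fixed. I have tried various regex approaches with stringr and dplyr, but none of them came close to a solution, so they are not worth showing here. Looking forward to your help; let me know if you need more details.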

【Question comments】:

Tags: r regex gsub stringr alphanumeric

【Solution 1】:

You can use dplyr and stringr like this...

library(dplyr)
library(stringr)
library(tidyr) # provides unnest()

df2 <- df %>% mutate(iss=str_extract_all(str_replace_all(text,"ISS\\s+(\\d+)","ISS\\1"),
                        "ISS\\d+"), #remove spaces and then extract ISSnnn
                 label=str_match(text,"label\\s+(\\d+)")[,2], #extract label + nn
                 ext1=str_match(text,"label\\s+\\d+.*?([a-z]\\d+)")[,2], #extract Xnnn after label
                 ext2=str_match(text,"\\s(\\d+)$")[,2]) %>% #extract digits at end of string
              unnest(iss) %>% #unnest iss (creates one row for each iss)
              select(iss,label,ext1,ext2) #select wanted variables

df2

      iss label ext1 ext2
1  ISS101    23 x203   17
2  ISS201    23 x203   17
3 ISS5051    01 l018 <NA>
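
A possible variation, not part of the original answer: the output above drops `id` and the row for the text with no matches, whereas the requested data frame keeps both. The sketch below assumes tidyr 1.0.0 or later, where `unnest()` accepts `keep_empty = TRUE`, and reuses the same patterns; the name `res` is only illustrative.

```r
library(dplyr)
library(stringr)
library(tidyr)

df <- data.frame(
  id = c(1, 2, 3),
  text = c("Label issues as ISS101  and ISS 201 on label 23 with x203 17",
           "issue as ISS5051 with label 01 as l018",
           "there is nothing here"),
  stringsAsFactors = FALSE
)

res <- df %>%
  mutate(
    # drop the optional space after "ISS", then collect every ISSnnn
    iss   = str_extract_all(str_replace_all(text, "ISS\\s+(\\d+)", "ISS\\1"), "ISS\\d+"),
    label = str_match(text, "label\\s+(\\d+)")[, 2],              # digits after "label"
    ext1  = str_match(text, "label\\s+\\d+.*?([a-z]\\d+)")[, 2],  # first letter+digits after the label
    ext2  = str_match(text, "\\s(\\d+)$")[, 2]                    # digits at the end of the string
  ) %>%
  # assumption: tidyr >= 1.0.0; keep_empty = TRUE keeps rows whose iss list is empty
  unnest(iss, keep_empty = TRUE) %>%
  select(id, iss, label, ext1, ext2)

res
```

With `keep_empty = TRUE`, the row for id 3 should survive with `NA` in every extracted column instead of being dropped.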

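【Comments】:

It works!! Not to take anything away from this, but can you recommend a regex cheat sheet? I have been using the one from RStudio.
@Param When I get stuck, I tend to use this one... rexegg.com/regex-quickstart.html
regex101.com is great IMO; you can also test your expressions there directly and interactively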

【Solution 2】:

This might be a start:

do.call(plyr::rbind.fill, 
  lapply(df$text, function(x) {
    as.data.frame(cbind(
      iss = unlist(stringr::str_extract_all(x, "(ISS\\s?\\d{3,4})")),
      label = unlist(stringr::str_extract_all(x, "(?<=label)\\s?(\\d{1,2})")),
      ext1 = unlist(stringr::str_extract_all(x, "((x|l)\\d{3})")),
      ext2 = unlist(stringr::str_extract_all(x, "(?<=x|l\\d{3})\\s?\\d{1,3}"))
    ))}
    ))

      iss label ext1 ext2
1  ISS101    23 x203  203
2 ISS 201    23 x203  203
3 ISS5051    01 l018 <NA>
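
A note on why `ext2` comes out as 203 rather than 17 in the output above, with a hedged sketch of one possible adjustment (not the answerer's code). In `(?<=x|l\\d{3})` the alternation splits the whole lookbehind, so it reads as "preceded by `x`" or "preceded by `l` plus three digits", and the digits right after `x` are matched. Grouping the prefix letter, here written as the character class `[xl]`, requires the full `x203`-style token before the match. The variable names are illustrative.

```r
library(stringr)

x <- "Label issues as ISS101  and ISS 201 on label 23 with x203 17"

# Pattern from the answer: lookbehind is "x" OR "l" followed by three digits
loose <- str_extract_all(x, "(?<=x|l\\d{3})\\s?\\d{1,3}")[[1]]

# Grouped prefix: lookbehind is ("x" or "l") followed by three digits
grouped <- str_extract_all(x, "(?<=[xl]\\d{3})\\s?\\d{1,3}")[[1]]
grouped <- str_trim(grouped)  # the optional \\s is part of the match, so trim it

loose
grouped
```

The same trimming idea applies to `iss`: something like `str_remove_all(iss, "\\s")` would turn `ISS 201` into `ISS201`.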

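【Comments】:

【Solution 3】:

I have done the best I can based on your description. Without seeing more data I can't guarantee that this will generalize, but it produces the desired output for the df you provided, so it should be a good start.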

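# load the packages used below (gather() comes from tidyr)
library(dplyr)
library(stringr)
library(tidyr)

# create data frame
df = data.frame(id = c(1,2,3), text = c('Label issues as ISS101  and ISS 201 on label 23 with x203 17','issue as ISS5051 with label 01 as l018','there is nothing here'))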

# parse text into fields
df <- df %>% mutate(
    iss = str_extract(text, "ISS\\d+\\D"),
    iss_space = str_extract(text, "ISS\\s\\d+\\D"),
    label = str_extract(text, "label.+\\D"), 
    label = str_extract(label, "\\d+\\D"),
    ext1 = str_extract(text, "\\s\\D\\d{3}"),
    ext2 = str_extract(text, "\\s\\D\\d{3}\\s\\d{2}"),
    ext2 = str_extract(ext2, "\\s\\d{2}")) 

# clean up into correct format
df <- df %>%
    gather(iss, iss_space, key = "type", value = "iss") %>%
    select(-text, -type) %>% 
    distinct() %>%
    filter(!(duplicated(id) & is.na(iss))) %>%
    arrange(id) %>%
    select(id, iss, label, ext1, ext2) %>%
    mutate(iss = str_replace_all(iss, " ", ""))

df

  id     iss label  ext1 ext2
1  1  ISS101   23   x203   17
2  1  ISS201   23   x203   17
3  2 ISS5051   01   l018 <NA>
4  3    <NA>  <NA>  <NA> <NA>

【Comments】: