【Question title】:Split comma- and colon-separated string in R
【Posted】:2021-06-14 17:22:28
【Question】:
Input$Freq                                                          
                                                                             Freq
                                        AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.
     AFR:0.1546,AMR:0.2581,EAS:0.0825,FIN:0.2270,NFE:0.0822,OTH:0.1706,ASJ:0.0729
                                        AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.
     AFR:0.1546,AMR:0.2581,EAS:0.0825,FIN:0.2270,NFE:0.0822,OTH:0.1706,ASJ:0.0729
                                        AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.
                                        AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.

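This is one column of a data frame. It holds strings whose fields are separated by commas, with a colon between each key and its value. I want to extract the dot or number that follows EAS:. The output I want looks like this: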

Output
                                                                                 Freq       EAS
                                            AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.         .
         AFR:0.1546,AMR:0.2581,EAS:0.0825,FIN:0.2270,NFE:0.0822,OTH:0.1706,ASJ:0.0729    0.0825
                                            AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.         .
         AFR:0.1546,AMR:0.2581,EAS:0.0825,FIN:0.2270,NFE:0.0822,OTH:0.1706,ASJ:0.0729    0.0825
                                            AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.         .
                                            AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.         .

I tried extract from tidyr:

maf_snv_intervar <- extract(Input, Freq, into = 'EAS', 
                            "^[^,]+,[^,]+,([^,]+),.*", remove = F, convert = T)

But I got output like this, with the EAS: prefix still attached:

Output
                                                                                 Freq          EAS
                                            AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.        EAS:.
         AFR:0.1546,AMR:0.2581,EAS:0.0825,FIN:0.2270,NFE:0.0822,OTH:0.1706,ASJ:0.0729   EAS:0.0825
                                            AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.        EAS:.
         AFR:0.1546,AMR:0.2581,EAS:0.0825,FIN:0.2270,NFE:0.0822,OTH:0.1706,ASJ:0.0729   EAS:0.0825
                                            AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.        EAS:.
                                            AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.        EAS:.

I don't know how to modify the regular expression.

【Comments】:

    Tags: r regex


    【Solution 1】:

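    We can switch to str_extract and use a regex lookbehind, (?<=EAS:), so that the match begins immediately after the EAS: substring and then takes one or more characters that are not a comma ([^,]+).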

    library(dplyr)
    library(stringr)
    Input <- Input %>%
        mutate(EAS = str_extract(Freq, '(?<=EAS:)[^,]+'))
    

    -Output

    Input
                                                                              Freq    EAS
    1                                    AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.      .
    2 AFR:0.1546,AMR:0.2581,EAS:0.0825,FIN:0.2270,NFE:0.0822,OTH:0.1706,ASJ:0.0729 0.0825
    3                                    AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.      .
    4 AFR:0.1546,AMR:0.2581,EAS:0.0825,FIN:0.2270,NFE:0.0822,OTH:0.1706,ASJ:0.0729 0.0825
    5                                    AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.      .
    6                                    AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.      .
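
    The same lookbehind can be tried without dplyr or stringr. The following is a minimal base R sketch, not part of the original answer; the vector name freq and the two sample strings (copied from the question) are only for illustration. It pre-fills the result with NA because regmatches drops elements that have no match.

    ```r
    # Two sample strings copied from the question's Freq column
    freq <- c("AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.",
              "AFR:0.1546,AMR:0.2581,EAS:0.0825,FIN:0.2270,NFE:0.0822,OTH:0.1706,ASJ:0.0729")

    # perl = TRUE switches to PCRE, which supports the (?<=...) lookbehind
    m <- regexpr("(?<=EAS:)[^,]+", freq, perl = TRUE)

    # Start from NA so that a string without an EAS: field keeps its position
    eas <- rep(NA_character_, length(freq))
    eas[m != -1] <- regmatches(freq, m)

    eas  # c(".", "0.0825")
    ```

    Because the lookbehind anchors on the EAS: key itself, this form does not depend on EAS being the third field.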
    

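    In the OP's code with extract, the capture group ([^,]+) covers the whole third field, key included. Moving EAS: in front of the group leaves only the value inside it, so replace the regex with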

    library(tidyr)
    Input %>% 
        extract(Freq, into = 'EAS', "^[^,]+,[^,]+,EAS:([^,]+),.*", remove = FALSE)
                                                                              Freq    EAS
    1                                    AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.      .
    2 AFR:0.1546,AMR:0.2581,EAS:0.0825,FIN:0.2270,NFE:0.0822,OTH:0.1706,ASJ:0.0729 0.0825
    3                                    AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.      .
    4 AFR:0.1546,AMR:0.2581,EAS:0.0825,FIN:0.2270,NFE:0.0822,OTH:0.1706,ASJ:0.0729 0.0825
    5                                    AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.      .
    6                                    AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.      .
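
    To see the two patterns side by side, here is a small base R sketch that uses sub with a backreference instead of tidyr's extract; it is an illustration under that substitution, not the answer's own code. The variable names x, original and revised are hypothetical. Both patterns assume EAS is always the third field.

    ```r
    # One sample string copied from the question
    x <- "AFR:0.1546,AMR:0.2581,EAS:0.0825,FIN:0.2270,NFE:0.0822,OTH:0.1706,ASJ:0.0729"

    # Original pattern: the group spans the whole third field, key included
    original <- sub("^[^,]+,[^,]+,([^,]+),.*", "\\1", x)

    # Revised pattern: EAS: is matched before the group opens,
    # so only the value ends up in the capture
    revised <- sub("^[^,]+,[^,]+,EAS:([^,]+),.*", "\\1", x)

    original  # "EAS:0.0825"
    revised   # "0.0825"
    ```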
    

    Data

    Input <- structure(list(Freq = c("AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.", 
    "AFR:0.1546,AMR:0.2581,EAS:0.0825,FIN:0.2270,NFE:0.0822,OTH:0.1706,ASJ:0.0729", 
    "AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.", "AFR:0.1546,AMR:0.2581,EAS:0.0825,FIN:0.2270,NFE:0.0822,OTH:0.1706,ASJ:0.0729", 
    "AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:.", "AFR:.,AMR:.,EAS:.,FIN:.,NFE:.,OTH:.,ASJ:."
    )), class = "data.frame", row.names = c(NA, -6L))
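
    If more than one population is needed, splitting each string on commas and then on colons avoids writing one regex per key. The sketch below is an alternative to the regex approaches above, not something from the original answer; parse_freq is a hypothetical helper name, and treating "." as a missing value is an assumption about what the dot means in this data.

    ```r
    # parse_freq is a hypothetical helper, not part of any package.
    # It turns one "KEY:value,KEY:value,..." string into a named character vector.
    parse_freq <- function(x) {
      fields <- strsplit(x, ",", fixed = TRUE)[[1]]   # "AFR:0.1546", "AMR:0.2581", ...
      parts  <- strsplit(fields, ":", fixed = TRUE)   # list of c(key, value) pairs
      values <- vapply(parts, function(p) p[2], character(1))
      names(values) <- vapply(parts, function(p) p[1], character(1))
      values
    }

    x <- "AFR:0.1546,AMR:0.2581,EAS:0.0825,FIN:0.2270,NFE:0.0822,OTH:0.1706,ASJ:0.0729"
    vals <- parse_freq(x)
    vals[["EAS"]]  # "0.0825"

    # Assumption: "." marks a missing frequency, so map it to NA before converting
    vals[vals == "."] <- NA
    eas_num <- as.numeric(vals[["EAS"]])
    ```

    Applied to a whole column, something like lapply(Input$Freq, parse_freq) would give one named vector per row.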
    

    【Comments】:
