【问题标题】:subset a DNAStringSet by subpattern and remove subpattern in R通过子模式对 DNAStringSet 进行子集化并删除 R 中的子模式
【发布时间】:2017-07-07 18:45:34
【问题描述】:

我只想对包含子字符串的行进行子集化,然后删除子字符串。我可以做第一部分,但我不知道如何删除子字符串

这是一个例子

library(Biostrings)
myseq <-DNAStringSet(c("CCCATGAAAGATCGGAAGAGCACACGTCTGAACCCATGAA", "CCCATGAACATAGATCC", "CCCGTACAGATCACGTG"))
names(myseq) <- letters[1:3]
myseq

A DNAStringSet instance of length 3
width seq                                                                                                           names               
[1]    40 CCCATGAAAGATCGGAAGAGCACACGTCTGAACCCATGAA                                                                    a
[2]    17 CCCATGAACATAGATCC                                                                                           b
[3]    17 CCCGTACAGATCACGTG                                                                                           c

我要删除的序列是 AGATCGGAAGAGCACACGTCTGAA,它位于第一行。

matchPattern("AGATCGGAAGAGCACACGTCTGAA", myseq[[1]])

Views on a 40-letter DNAString subject
subject: CCCATGAAAGATCGGAAGAGCACACGTCTGAACCCATGAA
views:
    start end width
[1]     9  32    24 [AGATCGGAAGAGCACACGTCTGAA]

对子集我执行以下操作:

pat <- vmatchPattern("AGATCGGAAGAGCACACGTCTGAA", myseq)
myseq[ lapply(lapply(pat, isEmpty), function(x) x == FALSE) ]

A DNAStringSet instance of length 3
    width seq                                                                                                         names               
[1]    40 CCCATGAAAGATCGGAAGAGCACACGTCTGAACCCATGAA                                                                    a
[2]     0                                                                                                             b
[3]     0                                                                                                             c

输出应该是

A DNAStringSet instance of length 3
    width seq                                                                                                         names               
[1]    11 CCCCCCATGAA                                                                                                 a
[2]     0                                                                                                             b
[3]     0                                                                                                             c

【问题讨论】:

    标签: r regex bioinformatics fasta bioconductor


    【解决方案1】:

    您可以使用vcountPattern 计算ifelse 语句中的匹配项,将匹配项替换为str_replace 的输出,将不匹配项替换为空字符串:

    myseq2 <- DNAStringSet(
                unlist(
                  lapply(
                    vcountPattern(
                     'AGATCGGAAGAGCACACGTCTGAA', myseq) > 0, 
                      ifelse, 
                      str_replace(
                        myseq, 
                       'AGATCGGAAGAGCACACGTCTGAA', 
                       ''),
                    '')
                  )
                )
    names(myseq2) <- names(myseq)
    myseq2
    
    >A DNAStringSet instance of length 3
    >width seq                                                     names               
    >[1]    16 CCCATGAACCCATGAA                                        a
    >[2]     0                                                         b
    >[3]     0                                                         c
    

    使用管道符号更易读:

    lapply(vcountPattern('AGATCGGAAGAGCACACGTCTGAA', myseq) > 0, ifelse, str_replace(myseq, 'AGATCGGAAGAGCACACGTCTGAA', ''), '') %>%
        unlist() %>%
        DNAStringSet() -> myseq2
    

    【讨论】:

      【解决方案2】:

      我不熟悉 bioinformatics 包,但如果您可以将数据转换为列表(我相信应该可以将列表转换为包中使用的格式),可以使用以下方法:

      1) 使用 stringr 库删除所需的模式 2) 计算新模式的长度

      # load biostrings package
      library(Biostrings)
      
      # create sample dataset
      myseq <-DNAStringSet(c("CCCATGAAAGATCGGAAGAGCACACGTCTGAACCCATGAA", "CCCATGAACATAGATCC", "CCCGTACAGATCACGTG"))
      names(myseq) <- letters[1:3]
      
      # remove sequences with no match
      pat <- vmatchPattern("AGATCGGAAGAGCACACGTCTGAA", myseq)
      data <- myseq[ lapply(lapply(pat, isEmpty), function(x) x == FALSE) ]
      
      # load stringr library
      library(stringr)
      
      # replace the matched sequence
      test <- lapply(test, str_replace, "AGATCGGAAGAGCACACGTCTGAA", "")
      # put together the new sequence and its length
      test <- mapply(c, lapply(test, nchar), test, SIMPLIFY = FALSE)
      

      【讨论】:

        猜你喜欢
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        相关资源
        最近更新 更多