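【Question title】: Output left or right part of a matched string [closed]
【Posted】: 2014-09-27 23:01:41
【Question】:

I have two files; file1 contains substrings of the strings in file2. I want to match each entry of file1 against file2 and output the part to the left of the match, not the match itself. I would also like to know how to output what is to the right of the match instead of the match itself. Here is part of my data (these particular strings may not match; they are only sample data):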

file1

 ACUGUACAGGCCACUGCCUUGC
 CUGCGCAAGCUACUGCCUUGCU
 UGGAAUGUAAAGAAGUAUGUAU
 CGAAUCAUUAUUUGCUGCUCUA
 AUCACAUUGCCAGGGAUUACC
 UUCACAGUGGCUAAGUUCUGC

file2

 CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG
 CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG
 GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC
 CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG
 GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC

Example:

file1:

                                                  GCUGUGGAGAUAACUGCGC

file2:

  CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGC

Output:

  CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCC

【Comments】:

    Tags: r perl


    【Answer 1】:

    Here are a couple of ways to keep only the text before the pattern (if the pattern is present):

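    a <- "GCUGUGGAGAUAACUGCGC"
    b <- "CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGC"
    
    # split b at the pattern and keep the first piece (the text left of the match)
    strsplit(b, a)[[1]][1]
    # or delete everything from the pattern to the end of the string
    sub(paste0(a, ".*$"), "", b)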
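
The question also asks for the text to the right of the match, which the two R calls above do not return. As a rough sketch in Python (not part of the original answer; `split_around` is a hypothetical helper name), `str.partition` gives both sides of the first occurrence in one call:

```python
def split_around(text, pattern):
    """Return (left, right) around the first occurrence of pattern.

    Returns None when the pattern does not occur in text.
    Note: str.partition raises ValueError for an empty pattern.
    """
    left, sep, right = text.partition(pattern)
    if not sep:  # sep is "" when the pattern was not found
        return None
    return left, right


a = "GCUGUGGAGAUAACUGCGC"
b = "CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGC"

parts = split_around(b, a)
if parts is not None:
    left, right = parts
    print("left: ", left)
    print("right:", right)
```

Returning None for a missing pattern is a design choice of this sketch; the R calls above return the whole string unchanged in that case.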
    

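    Now you just need to read the files into R and loop over each pattern. I am not sure exactly what you are looking for, but here is one idea:

    # read the data into two variables, a and b
    # (readLines() can read the real files from disk instead)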
    a <- readLines(textConnection("ACUGUACAGGCCACUGCCUUGC
    CUGCGCAAGCUACUGCCUUGCU
    UGGAAUGUAAAGAAGUAUGUAU
    CGAAUCAUUAUUUGCUGCUCUA
    AUCACAUUGCCAGGGAUUACC
    UUCACAGUGGCUAAGUUCUGC"))
    
    b <- readLines(textConnection("CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG
    CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG
    GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC
    CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG
    GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC"))
    

    Now loop over every value from the first file:

    lapply(a, function(x) sapply(strsplit(b, x), "[", 1))
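
For comparison, here is a minimal Python sketch of the same "every pattern against every line" loop. It is an illustration only: `left_of_matches` is a hypothetical name, the short sequences are made-up test data, and unlike the `strsplit` version it drops lines that do not contain the pattern instead of returning them unchanged.

```python
def left_of_matches(patterns, lines):
    """Map each pattern to the left-hand parts of the lines that contain it."""
    result = {}
    for pattern in patterns:
        hits = []
        for line in lines:
            left, sep, _right = line.partition(pattern)
            if sep:  # keep only lines where the pattern actually occurs
                hits.append(left)
        result[pattern] = hits
    return result


# made-up example data, not taken from the question
patterns = ["GGA", "UUC", "AAA"]
lines = ["AAGGACC", "CCUUCAA"]

for pattern, lefts in left_of_matches(patterns, lines).items():
    print(pattern, lefts)
```

A pattern that matches nowhere maps to an empty list, which makes unmatched entries of the first file easy to spot.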
    

    【Comments】:

    • @GracieD: Every element of the output is identical. Try: ll = lapply(a, function(i) sapply(strsplit(b, a[i]), "[[", 1)); for(i in 2:length(ll)) print(identical(ll[[i]], ll[[i-1]]))
    • @rnso Thanks. Updated.
    【Answer 2】:

    Open the filehandles on in-memory strings for testing:

    use strict;
    use warnings;
    use autodie;
    
    open my $fh1, '<', \ "ACUGUACAGGCCACUGCCUUGC\nCUGCGCAAGCUACUGCCUUGCU\nUGGAAUGUAAAGAAGUAUGUAU\nCGAAUCAUUAUUUGCUGCUCUA\nAUCACAUUGCCAGGGAUUACC\nUUCACAGUGGCUAAGUUCUGC\n";
    open my $fh2, '<', \ "CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG\nCUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG\nGCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC\nCUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG\nGGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC\n";
    
    while ( !eof $fh1 && !eof $fh2 ) {
        chomp( my $line1 = <$fh1> );
        chomp( my $line2 = <$fh2> );
    
        # \Q...\E treats $line1 as literal text; the limit 2 splits at the first match only
        print join( ' ', split /\Q$line1\E/, $line2, 2 ), "\n";
    }
    

    Output:

    CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUA CAGG
    CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAA AG
    GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUA UUCAGGC
    CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUG G
    GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAA ACGCAACC
    

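    【Comments】:

      【Answer 3】:

      You can also do this in Perl with $PREMATCH ($`), $POSTMATCH ($') and $MATCH ($&), which hold the text before the match, the text after the match, and the match itself (the long names require use English;):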

      Input files:

      file1.txt:

      ACUGUACAGGCCACUGCCUUGC
      CUGCGCAAGCUACUGCCUUGCU
      UGGAAUGUAAAGAAGUAUGUAU
      CGAAUCAUUAUUUGCUGCUCUA
      AUCACAUUGCCAGGGAUUACC
      UUCACAGUGGCUAAGUUCUGC
      

      file2.txt:

      CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG
      CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG
      GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC
      CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG
      GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC
      

      Code:

      use strict;
      use warnings;
      
      open my $fh1, '<', "file1.txt" or die "Couldn't open the file file1.txt : $!";
      open my $fh2, '<', "file2.txt" or die "Couldn't open the file file2.txt : $!";
      
      # read the two files in parallel, one line from each per iteration
      while(!eof $fh1 && !eof $fh2)
       {
          chomp( my $line1 = <$fh1> );
          chomp( my $line2 = <$fh2> );
      
          # \Q...\E matches $line1 as literal text rather than as a regex
          if($line2 =~ /\Q$line1\E/i)
           {
                print "Prematch: $`\t";
                print "Postmatch: $'\n";
                }
           }     
      close($fh1);
      close($fh2);
      

      Output:

      Prematch: CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUA    Postmatch: CAGG
      Prematch: CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAA Postmatch: AG
      Prematch: GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUA  Postmatch: UUCAGGC
      Prematch: CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUG Postmatch: G
      Prematch: GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAA Postmatch: ACGCAACC
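
The same prematch/postmatch idea can be sketched in Python with the match object's `start()` and `end()` positions. This is only an illustration under a few assumptions: `pre_and_post` is a hypothetical helper, the pattern is escaped so it is matched as literal text, and the search is case-insensitive to mirror the `/i` flag above. The two pairs are the fourth and fifth lines of the sample files.

```python
import re


def pre_and_post(pattern, line):
    """Return (prematch, postmatch) for the first literal match, or None."""
    m = re.search(re.escape(pattern), line, flags=re.IGNORECASE)
    if m is None:
        return None
    return line[:m.start()], line[m.end():]


file1 = [
    "CGAAUCAUUAUUUGCUGCUCUA",
    "AUCACAUUGCCAGGGAUUACC",
]
file2 = [
    "CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG",
    "GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC",
]

# zip pairs line N of file1 with line N of file2 and stops at the shorter list
for pattern, line in zip(file1, file2):
    parts = pre_and_post(pattern, line)
    if parts is not None:
        print(f"Prematch: {parts[0]}\tPostmatch: {parts[1]}")
```

Slicing by match position avoids Perl's special variables entirely, and `re.escape` plays the role of `\Q...\E` if a line in the first file ever contains regex metacharacters.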
      

      【Comments】:
