Splitting files based on some pattern and the information inside the chunk: answers

【Question title】: Splitting files based on some pattern and the information inside the chunk
【Posted】: 2016-12-20 03:42:37
【Question】:

I am working with a lot of files that have this structure:

BEGIN
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=1393
PEPMASS=946.3980102539062
CHARGE=3.0+
USER03=
SEQ=DDDIAAL
TAXONOMY=9606
272.228 126847.000
273.252 33795.000
END
BEGIN IONS
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=1383
PEPMASS=911.3920288085938
CHARGE=2.0+
USER03=
SEQ=QGKFEAAETLEEAAMR
TAXONOMY=9606
1394.637    71404.000
1411.668    122728.000
END
BEGIN IONS
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=2965
PEPMASS=946.3900146484375
CHARGE=3.0+
TAXONOMY=9606
1564.717    92354.000
1677.738    33865.000
END

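This structure repeats thousands of times, with different data each time. As you can see, some of the BEGIN/END blocks have no SEQ and no USER03. That is because the protein was not identified, and this is where my problem comes in.

I want to know how many proteins were identified and how many were not. To do that I decided to use bash, because it makes handling files easier.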

for i in *.txt; do
    echo "$i"

    awk '/^BEGIN/{n++;w=1} n&&w{print > "./cache/out" n ".txt"} /^END/{w=0}' "$i"

done

I found this here (Split a file into multiple files based on a pattern and name the new files by the search pattern in Unix?)

Then I take the output and classify it:

for i in cache/*.txt; do
    echo "$i"

    if grep -q 'SEQ' "$i"; then
        mv "$i" ./archive_identified
    else
        mv "$i" ./archive_unidentified
    fi
done

After that, I want to pull some data out of the classified files (for example: spectrum, USER03, SEQ, TAXONOMY).

for i in archive_identified/*.txt; do
    echo "$i"
    grep 'SEQ' "$i" | cut -d "=" -f2- | tr ',' '\n' >> ./sequences_ide.txt
    grep 'TAXONOMY' "$i" | cut -d "=" -f2- | tr ',' '\n' >> ./taxonomy_ide.txt
    grep 'USER' "$i" | cut -d "=" -f2- >> ./modifications_ide.txt
    grep 'TITLE' "$i" | sed 's/^.*\(spectrum.*\)/\1/g' | cut -d "=" -f2-  >> ./spectrum.txt

done

for i in archive_unidentified/*.txt; do
    echo "$i"
    grep 'TAXONOMY' "$i" | cut -d "=" -f2- | tr ',' '\n' >> ./taxonomy_unide.txt
    grep 'TITLE' "$i" | sed 's/^.*\(spectrum.*\)/\1/g' | cut -d "=" -f2-  >> ./spectrum_unide.txt

done

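The problem is that the first part of the script takes far too long (I submitted the script to LSF 7 days ago and it is still running), because the data is large (12-15 GB per file) and thousands of files are generated. Is there a way to do this in Python or Perl?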

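【Question comments】:

Can you clarify what exactly you need from this data? How many blocks are missing USER03 and SEQ, and how many have both? Anything else? You have not specified what you need.
Do you really need to separate every BEGIN/END block into its own file? You can easily generate all the answers you seem to need with awk or Perl.
@zdim As I said, some blocks have USER03 and SEQ and others do not. Basically, what I want is to separate those parts so that I can classify them.
@MarkSetchell Yes, but with a condition on whether SEQ is present in the block.
'USER03=': is this always empty?

Tags: python bash perl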

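【Solution 1】:

Based on your comment: "I want one file containing only the blocks that have SEQ, and another file containing the blocks of text that do not have SEQ"

In Perl, I would do it like this: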

#!/usr/bin/env perl

use strict;
use warnings;

open ( my $has_seq, '>', 'SEQ' ) or die $!;
open ( my $no_seq, '>', 'NO_SEQ' ) or die $!;
my $seq_count = 0;
my $no_seq_count = 0;

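local $/ = 'END'; 

#iterate stdin or files specified on command line, just like sed/grep
while ( <> ) {
    #skip the whitespace-only remainder after the last END of a file,
    #otherwise it would be counted as a block without SEQ
    next unless m/\S/;
    #check if this chunk contains the word 'SEQ'.
    #regex match, so it'll match this text anywhere. 
    #maybe need to tighten up to ^SEQ= or similar? 
    if ( m/SEQ/ ) { 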
        #choose output filehandle
        $seq_count++;
        select $has_seq;
    }
    else { 
       $no_seq_count++;
       select $no_seq;
    }
    #print current block to selected filehandle. 
    print;
}

select \*STDOUT; 
print "SEQ: $seq_count\n"; 
print "No SEQ: $no_seq_count\n";

This creates two files (creatively named "SEQ" and "NO_SEQ") and splits the blocks from your source between them.
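
Since the question also asks about Python, here is a rough sketch of the same single-pass idea in that language. It is an illustration under assumptions, not part of the answer's method: the names `iter_blocks` and `split_by_seq` are made up for this example, a block is assumed to start at a line beginning with `BEGIN` and to end at a line beginning with `END`, and a block is treated as identified when one of its lines starts with `SEQ=`.

```python
import io


def iter_blocks(lines):
    """Yield each BEGIN..END block as a list of lines (hypothetical helper)."""
    block = None
    for line in lines:
        if line.startswith("BEGIN"):
            block = [line]          # start a new block
        elif block is not None:
            block.append(line)
            if line.startswith("END"):
                yield block
                block = None        # wait for the next BEGIN


def split_by_seq(lines, has_seq_out, no_seq_out):
    """Write every block to one of two streams; return (with_seq, without_seq)."""
    with_seq = without_seq = 0
    for block in iter_blocks(lines):
        # Assumption: an identified block has a line starting with "SEQ=".
        if any(item.startswith("SEQ=") for item in block):
            with_seq += 1
            has_seq_out.writelines(block)
        else:
            without_seq += 1
            no_seq_out.writelines(block)
    return with_seq, without_seq


# In-memory demo with two shortened blocks taken from the question.
sample = io.StringIO(
    "BEGIN IONS\n"
    "TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=1383\n"
    "SEQ=QGKFEAAETLEEAAMR\n"
    "TAXONOMY=9606\n"
    "1394.637    71404.000\n"
    "END\n"
    "BEGIN IONS\n"
    "TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=2965\n"
    "TAXONOMY=9606\n"
    "1564.717    92354.000\n"
    "END\n"
)
identified, unidentified = io.StringIO(), io.StringIO()
counts = split_by_seq(sample, identified, unidentified)
print(counts)
```

On real data the two output streams would be files opened for writing, and `lines` could be `fileinput.input()`, which iterates over the files named on the command line much like Perl's `<>`. Only one block is held in memory at a time, so the 12-15 GB inputs are streamed rather than loaded.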

【Comments】:

I suspect the OP may not realize that he can avoid the for loop he currently has and just run ./YourScript *.txt
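
The same "one script, many files" pattern also covers the second half of the question, where spectrum, USER03, SEQ and TAXONOMY are pulled out of each block. The following Python sketch collects those fields in a single pass, without intermediate per-block files. It rests on assumptions: `iter_records` and `spectrum_id` are hypothetical names, header lines are assumed to have the form `KEY=value`, and peak lines are assumed to contain no `=`.

```python
import io


def iter_records(lines):
    """Yield one dict of KEY=value fields per BEGIN..END block (illustrative)."""
    fields = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("BEGIN"):
            fields = {}
        elif fields is None:
            continue                # text outside any block is ignored
        elif line.startswith("END"):
            yield fields
            fields = None
        elif "=" in line:
            # Split on the first "=" only: TITLE values contain "=" themselves.
            key, value = line.split("=", 1)
            fields[key] = value


def spectrum_id(title):
    """Return the text after the last 'spectrum=' (whole title if absent)."""
    return title.rsplit("spectrum=", 1)[-1]


# In-memory demo with two blocks from the question (peak lists shortened).
sample = io.StringIO(
    "BEGIN IONS\n"
    "TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=1383\n"
    "PEPMASS=911.3920288085938\n"
    "CHARGE=2.0+\n"
    "USER03=\n"
    "SEQ=QGKFEAAETLEEAAMR\n"
    "TAXONOMY=9606\n"
    "1394.637    71404.000\n"
    "END\n"
    "BEGIN IONS\n"
    "TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=2965\n"
    "PEPMASS=946.3900146484375\n"
    "CHARGE=3.0+\n"
    "TAXONOMY=9606\n"
    "1564.717    92354.000\n"
    "END\n"
)

# One tuple per block: (spectrum, SEQ, USER03, TAXONOMY); None marks a missing field.
rows = [
    (spectrum_id(rec.get("TITLE", "")), rec.get("SEQ"),
     rec.get("USER03"), rec.get("TAXONOMY"))
    for rec in iter_records(sample)
]
for row in rows:
    print(row)
```

Replacing `sample` with `fileinput.input()` lets the script be invoked with all input files as arguments, in the spirit of the `./YourScript *.txt` suggestion above. A block whose SEQ entry is `None` is an unidentified one, so the two counts fall out of the same pass.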