在速度方面改进python代码答案

【问题标题】：Improve python code in terms of speed在速度方面改进python代码
【发布时间】：2015-07-29 11:38:13
【问题描述】：

我有一个非常大的文件（15 亿行），格式如下：

1    67108547    67109226    gene1$transcript1    0    +    1    0
1    67108547    67109226    gene1$transcript1    0    +    2    1
1    67108547    67109226    gene1$transcript1    0    +    3    3
1    67108547    67109226    gene1$transcript1    0    +    4    4 

                                 .
                                 .
                                 .

1    33547109    33557650    gene2$transcript1    0    +    239    2
1    33547109    33557650    gene2$transcript1    0    +    240    0

                                 .
                                 .
                                 .

1    69109226    69109999    gene1$transcript1    0    +    351    1
1    69109226    69109999    gene1$transcript1    0    +    352    0

我想要做的是根据第 4 列上的标识符重新组织/排序此文件。该文件由块组成。如果连接第 4、1、2 和 3 列，则为每个块创建唯一标识符。这是字典 all_exons 的键，值是一个 numpy 数组，其中包含第 8 列的所有值。然后我有第二个字典 unique_identifiers 以属性为键从第 4 列开始，并为相应的块标识符列表赋值。作为输出，我以以下形式编写文件：

>gene1
0
1
3
4
1
0
>gene2
2
0

我已经编写了一些代码（见下文），但我的实现速度很慢。运行大约需要 18 个小时。

import os
import sys
import time
from contextlib import contextmanager
import pandas as pd
import numpy as np


def parse_blocks(bedtools_file):
    unique_identifiers = {} # Dictionary with key: gene, value: list of exons
    all_exons = {} # Dictionary contatining all exons

    # Parse file and ...
    with open(bedtools_file) as fp:
        sp_line = []
        for line in fp:
            sp_line = line.strip().split("\t")
            current_id = sp_line[3].split("$")[0]

           identifier="$".join([sp_line[3],sp_line[0],sp_line[1],sp_line[2]])
           if(identifier in all_exons):
               item = float(sp_line[7])
               all_exons[identifier]=np.append(all_exons[identifier],item)
           else:
               all_exons[identifier] = np.array([sp_line[7]],float)

           if(current_id in unique_identifiers):
               unique_identifiers[current_id].add(identifier)
           else:
               unique_identifiers[current_id] =set([identifier])
  return unique_identifiers, all_exons

identifiers, introns = parse_blocks(options.bed)

w = open(options.out, 'w')
for gene in sorted(list(identifiers)):
    w.write(">"+str(gene)+"\n")
    for intron in sorted(list(identifiers[gene])):
        for base in introns[intron]:
            w.write(str(base)+"\n")
w.close()

如何改进上述代码以加快运行速度？

【问题讨论】：

了解如何让您的代码运行得更快的最佳方法是通过分析：stackoverflow.com/questions/582336/…
澄清一下：最终目标是以您发布的格式获取输出文件，还是需要将数据组织在字典中？您是否在此文件中也有带有“transcript2”的条目，还是总是“transcript1”？
最终目标是获取输出文件。我使用字典的原因是我在生成输入文件时通过管道传输它以节省磁盘空间。不，我没有带有成绩单2 的条目（对不起，我忘了提及这一点）。
我做了一个分析，似乎代码在连接时延迟（numpy.core.multiarray.concatenate）。所以问题似乎在于附加到numpy数组......
@fitziano：速度比较怎么样？熊猫解决方案与您的方法相比会很有趣......

标签： python numpy dictionary

【解决方案1】：

您还导入pandas，因此，我提供了一个pandas 解决方案，基本上只需要两行代码。但是，我不知道它在大型数据集上的表现如何，以及这是否比您的方法更快（但我很确定它是）。

在下面的示例中，您提供的数据存储在table.txt 中。然后我使用groupby 获取第 8 列中的所有值，将它们存储在第 4 列中相应标识符的列表中（请注意，我的索引从 0 开始）并将此数据结构转换为字典，然后可以轻松打印。

import pandas as pd

df=pd.read_csv("table.txt", header=None, sep = r"\s+") # replace the separator by e.g. '/t'

op = dict(df.groupby(3)[7].apply(lambda x: x.tolist()))

所以在这种情况下op 看起来像这样：

{'gene1$transcript1': [0, 1, 3, 4, 1, 0], 'gene2$transcript1': [2, 0]}

现在您可以像这样打印输出并将其通过管道传输到某个文件中：

for k,v in op.iteritems():

    print k.split('$')[0]
    for val in v:
        print val

这将为您提供所需的输出：

gene1
0
1
3
4
1
0
gene2
2
0

也许你可以试一试，让我知道它与你的解决方案相比如何！？

编辑2：

在您提到的 cmets 中，您希望以正确的顺序打印基因。你可以这样做：

# add some fake genes to op from above
op['gene0$stuff'] = [7,9]       
op['gene4$stuff'] = [5,9]

# print using 'sorted'
for k,v in sorted(op.iteritems()):

    print k.split('$')[0]
    for val in v:
        print val

给你：

gene0
7
9
gene1
0
1
3
4
1
0
gene2
2
0
gene4
5
9

编辑1：

我不确定是否有意重复，但您可以通过执行以下操作轻松摆脱它们：

op2 = dict(df.groupby(3)[7].apply(lambda x: set(x)))

现在op2 看起来像这样：

{'gene1$transcript1': {0, 1, 3, 4}, 'gene2$transcript1': {0, 2}}

你像以前一样打印输出：

for k,v in op2.iteritems():

    print k.split('$')[0]
    for val in v:
        print val

给你

gene1
0
1
3
4
gene2
0
2

【讨论】：

【解决方案2】：

我会尽量简化你的问题，我的解决方案是这样的：

首先，扫描大文件。对于每个不同的current_id，打开一个临时文件并将第 8 列的值附加到该文件。
完整扫描后，将所有块连接到一个结果文件。

代码如下：

# -*- coding: utf-8 -*-
import os
import tempfile
import subprocess


class ChunkBoss(object):
    """Boss for file chunks"""

    def __init__(self):
        self.opened_files = {}

    def write_chunk(self, current_id, value):
        if current_id not in self.opened_files:
            self.opened_files[current_id] = open(tempfile.mktemp(), 'wb')
            self.opened_files[current_id].write('>%s\n' % current_id)

        self.opened_files[current_id].write('%s\n' % value)

    def cat_result(self, filename):
        """Catenate chunks to one big file
        """
        # Sort the chunks
        chunk_file_list = []
        for current_id in sorted(self.opened_files.keys()):
            chunk_file_list.append(self.opened_files[current_id].name)

        # Flush chunks
        [chunk.flush() for chunk in self.opened_files.values()]

        # By calling cat command
        with open(filename, 'wb') as fp:
            subprocess.call(['cat', ] + chunk_file_list, stdout=fp, stderr=fp)

    def clean_up(self):
        [os.unlink(chunk.name) for chunk in self.opened_files.values()]


def main():
    boss = ChunkBoss()
    with open('bigfile.data') as fp:
        for line in fp:
            data = line.strip().split()
            current_id = data[3].split("$")[0]
            value = data[7]

            # Write value to temp chunk
            boss.write_chunk(current_id, value)

    boss.cat_result('result.txt')
    boss.clean_up()

if __name__ == '__main__':
    main()

我测试了我的脚本的性能，bigfile.data 包含大约 150k 行。在我的笔记本电脑上完成大约需要 0.5 秒。也许你可以试一试。

【讨论】：