从一个巨大的文本文件中获取数据以有效地替换另一个巨大的文本文件中的数据（Python）答案

【问题标题】：Get data from a huge text file to replace data in another huge text file, efficiently (Python)从一个巨大的文本文件中获取数据以有效地替换另一个巨大的文本文件中的数据（Python）
【发布时间】：2014-11-20 17:09:11
【问题描述】：

我已经编程了几个月，所以我不是专家。我有两个巨大的文本文件（omni，~20 GB，~2.5M 行；dbSNP，~10 GB，~60M 行）。它们有前几行，不一定是制表符分隔，以“#”（标题）开头，其余行组织在制表符分隔的列（实际数据）中。

每行的前两列包含染色体编号和染色体上的位置，而第三列包含识别码。在“omni”文件中，我没有 ID，因此我需要在 dbSNP 文件（数据库）中找到位置，并创建第一个包含 ID 的文件的副本。

由于内存限制，我决定逐行读取这两个文件，然后从读取的最后一行重新开始。我对我的代码的效率不满意，因为我觉得它比它可能的要慢。我很确定这是我的错，因为缺乏经验。有没有办法使用 Python 让它更快？问题可能是文件的打开和关闭吗？

我通常像这样在 GNOME 终端（Python 2.7.6，Ubuntu 14.04）中启动脚本：

python -u Replace_ID.py > Replace.log 2> Replace.err

非常感谢您。

全方位 (Omni example)：

...
#CHROM POS ID REF ALT ...
1 534247。 C T ...
...

dbSNP (dbSNP example)：

...
#CHROM POS ID REF ALT ...
1 10019 rs376643643 TA T ...
...

输出应该与 Omni 文件完全相同，但位置后带有 rs ID。

代码：

SNPline = 0    #line in dbSNP file
SNPline2 = 0    #temporary copy
omniline = 0    #line in omni file
line_offset = []    #beginnings of every line in dbSNP file (stackoverflow.com/a/620492)
offset = 0
with open("dbSNP_build_141.vcf") as dbSNP: #database
    for line in dbSNP:
        line_offset.append(offset)
        offset += len(line)
    dbSNP.seek(0)
with open("Omni_replaced.vcf", "w") as outfile:     
    outfile.write("")       
with open("Omni25_genotypes_2141_samples.b37.v2.vcf") as omni:  
    for line in omni:           
        omniline += 1
        print str(omniline) #log
        if line[0] == "#":      #if line is header
            with open("Omni_replaced.vcf", "a") as outfile:
                outfile.write(line) #write as it is
        else:
            split_omni = line.split('\t') #tab-delimited columns
            with open("dbSNP_build_141.vcf") as dbSNP:
                SNPline2 = SNPline          #restart from last line found
                dbSNP.seek(line_offset[SNPline])    
                for line in dbSNP:
                    SNPline2 = SNPline2 + 1 
                    split_dbSNP = line.split('\t')  
                    if line[0] == "#":
                        print str(omniline) + "#" + str(SNPline2) #keep track of what's happening.
                        rs_found = 0    #it does not contain the rs ID
                    else:
                        if split_omni[0] + split_omni[1] == split_dbSNP[0] + split_dbSNP[1]:    #if chromosome and position match
                            print str(omniline) + "." + str(SNPline2) #log
                            SNPline = SNPline2 - 1
                            with open("Omni_replaced.vcf", "a") as outfile:
                                split_omni[2] = split_dbSNP[2]  #replace the ID
                                outfile.write("\t".join(split_omni)) 
                            rs_found = 1    #ID found
                            break        
                        else:
                            rs_found = 0    #ID not found
                if rs_found == 0:   #if ID was not found in dbSNP, then:
                    with open("Omni_replaced.vcf", "a") as outfile:
                        outfile.write("\t".join(split_omni)) #keep the line unedited
                else:   #if ID was found:
                    pass    #no need to do anything, line already written
    print "End."

【问题讨论】：

您能否提供更多信息以进行测试？例如，为每个文件提供 5 行和显示替换过程完成的 5 行输出？它可能会让您更好地理解您提供的代码，并且还可以测试一些修改并验证一致性。
@ArthurVaïsse 谢谢你的建议。我编辑了原始消息。我只能添加两个链接，所以这里是输出文件示例：lucabetti.altervista.org/file/Output_example.vcf

标签： python performance text replace large-files

【解决方案1】：

这是我对您的问题的贡献。首先，这是我对您的问题的理解，只是为了检查我是否正确：您有两个文件，每个都是制表分隔值文件。第一个，dbSNP，包含数据，其第三列是对应于基因的染色体编号（第 1 列）和基因在染色体上的位置（第 2 列）的标识符。

任务包括获取omni文件并使用来自dbNSP文件的所有值填写ID列（基于染色体编号和基因位置）。

问题来自文件的大小。你试图在文件中保留每一行的位置做seek并直接去好行，以避免你把所有的dbnsp文件放到内存中。由于打开多个文件，这种方法对您来说不够快，这是我的建议。

解析一次 dbNSP 文件以仅保留基本信息，即夫妻(number,position):ID。从您的示例中对应于：

1   534247  rs201475892
1   569624  rs6594035
1   689186  rs374789455

这相当于不到 10% 的文件大小在内存方面，所以从一个 20GB 的文件开始，你将在小于 2GB 的内存中加载，它可能是负担得起的（不知道你之前尝试过什么样的加载)。

所以这是我执行此操作的代码。不要犹豫，要求解释，因为与您不同，我使用对象编程。

import argparse

#description of this script
__description__ = "This script parse a Database file in order to find the genes identifiers and provide them to a other file.[description to correct]\nTake the IDs from databaseFile and output the targetFile content enriched with IDs"

# -*- -*-  -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- 
#classes used to handle and operate on data
class ChromosomeIdentifierIndex():
    def __init__(self):
        self.chromosomes = {}

    def register(self, chromosomeNumber, positionOnChromosome, identifier):
        if not chromosomeNumber in self.chromosomes:
            self.chromosomes[chromosomeNumber] = {}

        self.chromosomes[chromosomeNumber][positionOnChromosome] = identifier

    def __setitem__(self, ref, ID):
        """ Allows to use alternative syntax to chrsIndex.register(number, position, id) : chrsIndex[number, position] = id """
        chromosomeNumber, positionOnChromosome = ref[0],ref[1]
        self.register(chromosomeNumber, positionOnChromosome, ID)

    def __getitem__(self, ref):
        """ Allows to get IDs using the syntax: chromosomeIDindex[chromosomenumber,positionOnChromosome] """
        chromosomeNumber, positionOnChromosome = ref[0],ref[1]
        try:
            return self.chromosomes[chromosomeNumber][positionOnChromosome]
        except:
            return "."

    def __repr__(self):
        for chrs in self.chromosomes.keys():
            print "Chromosome : ", chrs
            for position in self.chromosomes[chrs].keys():
                print "\t", position, "\t", self.chromosomes[chrs][position]

class Chromosome():
    def __init__(self, string):
        self.values   = string.split("\t")
        self.chrs     = self.values[0]
        self.position = self.values[1]
        self.ID       = self.values[2]

    def __str__(self):
        return "\t".join(self.values)

    def setID(self, ID):
        self.ID = ID
        self.values[2] = ID

class DefaultWritter():
    """ Use to print if no output is specified """
    def __init__(self):
        pass
    def write(self, string):
        print string
    def close(self):
        pass

# -*- -*-  -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- 
#The code executed when the scrip is called
if __name__ == "__main__":

    #initialisation
    parser = argparse.ArgumentParser(description = __description__)
    parser.add_argument("databaseFile"  , help="A batch file that contains many informations, including the IDs.")
    parser.add_argument("targetFile"    , help="A file that contains informations, but miss the IDs.")
    parser.add_argument("-o", "--output", help="The output file of the script. If no output is specified, the output will be printed on the screen.")
    parser.add_argument("-l", "--logs"  , help="The log file of the script. If no log file is specified, the logs will be printed on the screen.")
    args = parser.parse_args()

    output = None
    if args.output == None:
        output = DefaultWritter()
    else:
        output = open(args.output, 'w')

    logger = None
    if args.logs == None:
        logger = DefaultWritter()
    else:
        logger = open(args.logs, 'w')

    #start of the process

    idIndex = ChromosomeIdentifierIndex()

    #build index by reading the database file.
    with open(args.databaseFile, 'r') as database:
        for line in database:
            if not line.startswith("#"):
                chromosome = Chromosome(line)
                idIndex[chromosome.chrs, chromosome.position] = chromosome.ID

    #read the target, replace the ID and output the result
    with open(args.targetFile, 'r') as target:
        for line in target:
            if not line.startswith("#"):
                chromosome = Chromosome(line)
                chromosome.setID(idIndex[chromosome.chrs, chromosome.position])
                output.write(str(chromosome))
            else:
                output.write(line)

    output.close()
    logger.close()

主要思想是一次性解析 dbNSP 文件并收集字典中的所有 ID。然后逐行读取omnifile并输出结果。

你可以这样调用脚本：

python replace.py ./dbSNP_example.vcf ./Omni_example.vcf -o output.vcf

我用来处理参数的argparse模块也提供了自动帮助，所以参数的描述可以用

python replace.py -h

或

python replace.py --help

我相信他的方法会比你的更快，因为我只读取一次文件，然后在 RAM 上工作，我邀请你测试它。

注意：我不知道你是否熟悉对象编程，所以我不得不提一下，这里所有的类都在同一个文件中，以便在堆栈溢出时发布。在现实生活用例中，好的做法是将所有类放在单独的文件中，例如“Chromosome.py”、“ChromosomeIdentifierIndex.py”和“DefaultWritter.py”，然后将它们导入“replace.py” "文件。

【讨论】：

感谢您的回答。我测试了它，它实际上要快得多。我还学到了很多关于 Python 对象编程的知识，这是最重要的。再见