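【Question title】: Optimize Python key searching in a hierarchical dictionary
【Posted】: 2012-11-30 01:38:09
【Question】:

I'm trying to optimize my code because it becomes very slow when I load a huge dictionary. I think the cause is the key lookup in the dictionary. I've been reading about Python's defaultdict and I think it could be a good improvement, but I haven't managed to use it here. As you can see, this is a hierarchical dictionary structure. Any hints would be appreciated.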

class Species:
    '''This structure contains all the information needed for all genes.
    One species has several genes; one gene has several proteins.'''
    def __init__(self, name):
        self.name = name  # name of the species
        self.genes = {}
    def addProtein(self, gene, protname, len):
        #Converting a line from the input file into a protein and/or an exon
        if gene in self.genes:
            #Gene in the structure
            self.genes[gene].proteins[protname] = Protein(protname, len)
            self.genes[gene].updateProts()
        else:
            self.genes[gene] = Gene(gene) 
            self.updateNgenes()
            self.genes[gene].proteins[protname] = Protein(protname, len)
            self.genes[gene].updateProts()
    def updateNgenes(self):
    #Updating the number of genes
        self.ngenes = len(self.genes.keys())    

The Protein and Gene classes are defined as follows:

class Protein:
    # The Protein class holds the length of the protein and a list of its exons (each with its own attributes)
    def __init__(self, name, len):
        self.name = name
        self.len = len

class Gene:
    # The Gene class holds information about the gene and a dict of its proteins (each with its own attributes)
    def __init__(self, name):
        self.name = name
        self.proteins = {}
        self.updateProts()
    def updateProts(self):
        #Update number of proteins
        self.nproteins = len(self.proteins)

【Comments】:

    Tags: python optimization dictionary defaultdict


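    【Answer 1】:

    You can't use defaultdict directly here: its default factory is called with no arguments, while your __init__ methods require one (for example, Gene needs its name).

    This is probably one of your bottlenecks: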
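If the auto-creating behaviour of defaultdict is still wanted, one possible workaround is a dict subclass that defines `__missing__`, which receives the missing key and can therefore pass it to the constructor. The following is only a minimal sketch: `GeneDict` is a hypothetical helper name, not part of the original code, and `Gene` is reduced to the two attributes needed for the illustration.

```python
class Gene:
    # Reduced version of the Gene class from the question.
    def __init__(self, name):
        self.name = name
        self.proteins = {}


class GeneDict(dict):
    """Hypothetical helper: creates Gene(key) the first time a key is looked up."""

    def __missing__(self, key):
        # Called by dict.__getitem__ when the key is absent; unlike
        # defaultdict's factory, it receives the key itself.
        gene = self[key] = Gene(key)
        return gene


genes = GeneDict()
genes["geneA"].proteins["prot1"] = 120   # "geneA" is created on first access
genes["geneA"].proteins["prot2"] = 340   # the existing Gene is reused
```

Note that a membership test such as `"geneB" in genes` does not trigger `__missing__`, so it does not create entries by accident.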

    def updateNgenes(self):
    #Updating the number of genes
        self.ngenes = len(self.genes.keys()) 
    

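    len(self.genes.keys()) builds a list of all the keys before taking its length (this is the Python 2 behaviour; in Python 3, keys() returns a view rather than a list). That means every time you add a gene, you create a list and throw it away. The more genes you have, the more expensive that list is to build. To avoid the intermediate list, just call len(self.genes).

    Better still, make ngenes a property, so that it is only computed when needed.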
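A rough way to see the difference is to time both forms with `timeit`. This is a sketch under stated assumptions: the dictionary contents and sizes are made up for illustration, and `list(genes.keys())` is used to emulate the Python 2 behaviour of `keys()` so that the example also shows the cost on Python 3.

```python
import timeit

# Illustrative data only: 100,000 placeholder gene names.
genes = {"gene%d" % i: None for i in range(100000)}


def length_via_key_list():
    # Emulates Python 2's dict.keys(): materialise every key, then count.
    return len(list(genes.keys()))


def length_direct():
    # Asks the dict for its size directly; no intermediate list is built.
    return len(genes)


t_list = timeit.timeit(length_via_key_list, number=100)
t_direct = timeit.timeit(length_direct, number=100)
print("via key list: %.4fs, direct: %.4fs" % (t_list, t_direct))
```

The exact timings depend on the machine and interpreter, but the list-building version does work proportional to the number of keys on every call, which is what makes it costly inside a loading loop.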

    @property
    def ngenes(self):
        return len(self.genes)
    

    The same can be done for nproteins in the Gene class.

    Here is the refactored code:

    class Species:
        '''This structure contains all the information needed for all genes.
        One species has several genes; one gene has several proteins.'''
    
        def __init__(self, name):
            self.name = name  # name of the species
            self.genes = {}
    
        def addProtein(self, gene, protname, len):
            #Converting a line from the input file into a protein and/or an exon
            if gene not in self.genes:
                self.genes[gene] = Gene(gene) 
            self.genes[gene].proteins[protname] = Protein(protname, len)
    
        @property
        def ngenes(self):
            return len(self.genes)
    
    class Protein:
        # The Protein class holds the length of the protein and a list of its exons (each with its own attributes)
        def __init__(self, name, len):
            self.name = name
            self.len = len
    
    class Gene:
        # The Gene class holds information about the gene and a dict of its proteins (each with its own attributes)
        def __init__(self, name):
            self.name = name
            self.proteins = {}
    
        @property
        def nproteins(self):
            return len(self.proteins)
    

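    【Comments】:

    • Or even def __init__(): self.ngenes = 0 followed by def addProtein(): self.ngenes += 1, depending on how often ngenes is actually accessed. If it is hit continuously, as in while myspecies.ngenes < limit: ..., that may be faster.
    • It is now orders of magnitude faster, and I've learned a few tricks. Thanks!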
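The counter idea from the comment above could look roughly like the sketch below. It is an interpretation rather than the commenter's exact code: the increment is placed inside the `if` branch so that only genes seen for the first time are counted, the third parameter is named `length` here, and the protein length is stored directly instead of a Protein object to keep the example short.

```python
class Gene:
    # Reduced version of the Gene class from the answer.
    def __init__(self, name):
        self.name = name
        self.proteins = {}


class Species:
    def __init__(self, name):
        self.name = name
        self.genes = {}
        self.ngenes = 0  # maintained by hand instead of computed

    def addProtein(self, gene, protname, length):
        if gene not in self.genes:
            self.genes[gene] = Gene(gene)
            self.ngenes += 1  # count a gene only the first time it appears
        # Simplification for this sketch: store the length, not a Protein.
        self.genes[gene].proteins[protname] = length


species = Species("example species")
species.addProtein("geneA", "prot1", 120)
species.addProtein("geneA", "prot2", 340)
species.addProtein("geneB", "prot3", 95)
```

The trade-off is that the counter must be kept in sync by every method that adds or removes genes, whereas the property version always reflects the current dictionary.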