Optimizing Python key lookups in a hierarchical dictionary: answers

【Question title】: Optimize Python key-searching in hierarchical dictionary
【Posted】: 2012-11-30 01:38:09
【Question】:

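I'm trying to optimize my code because it gets very slow when I load huge dictionaries. I think this is because it searches the dictionary for a key. I've been reading about Python's defaultdict, which I think could be a good improvement, but I haven't managed to implement it here. As you can see, this is a hierarchical dictionary structure. Any hints would be appreciated.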

class Species:
    '''This structure contains all the information needed for all genes.
    One species has several genes; one gene has several proteins.'''
    def __init__(self, name):
        self.name = name  # name of the species
        self.genes = {}
    def addProtein(self, gene, protname, len):
        #Converting a line from the input file into a protein and/or an exon
        if gene in self.genes:
            #Gene in the structure
            self.genes[gene].proteins[protname] = Protein(protname, len)
            self.genes[gene].updateProts()
        else:
            self.genes[gene] = Gene(gene) 
            self.updateNgenes()
            self.genes[gene].proteins[protname] = Protein(protname, len)
            self.genes[gene].updateProts()
    def updateNgenes(self):
        #Updating the number of genes
        self.ngenes = len(self.genes.keys())

The Protein and Gene classes are defined as follows:

class Protein:
    #The class Protein contains information about the length of the protein and a list with its exons (each with its own attributes)
    def __init__(self, name, len):
        self.name = name
        self.len = len

class Gene:
    #The class Gene contains information about the gene and a dict with its proteins (each with its own attributes)
    def __init__(self, name):
        self.name = name
        self.proteins = {}
        self.updateProts()
    def updateProts(self):
        #Update number of proteins
        self.nproteins = len(self.proteins)

【Comments】:

Tags: python optimization dictionary defaultdict

【Solution 1】:

You can't use defaultdict directly here: it calls its default factory with no arguments, while your __init__ methods require one.
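To make that concrete, here is a minimal sketch of the failure, followed by one possible workaround: a dict subclass that overrides __missing__ so the missing key is passed to the constructor. GeneDict is a hypothetical name introduced only for illustration, and the Gene class is trimmed down from the question; this is not something the original code uses.

```python
from collections import defaultdict


class Gene:
    def __init__(self, name):
        self.name = name
        self.proteins = {}


# defaultdict calls its factory with no arguments, so Gene() raises
# TypeError because __init__ requires a name.
genes = defaultdict(Gene)
try:
    genes["geneA"]
    failed = False
except TypeError:
    failed = True


class GeneDict(dict):
    """Hypothetical helper: builds a Gene from the missing key itself."""

    def __missing__(self, key):
        # Called by dict.__getitem__ only when the key is absent.
        gene = self[key] = Gene(key)
        return gene


auto_genes = GeneDict()
# Gene("geneA") is created on first access, then reused afterwards.
auto_genes["geneA"].proteins["protA1"] = 120
```

Membership tests such as `"geneB" in auto_genes` do not trigger __missing__, so checking for a key never creates an entry by accident.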

This is probably one of your bottlenecks:

def updateNgenes(self):
    #Updating the number of genes
    self.ngenes = len(self.genes.keys())

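len(self.genes.keys()) builds a list of all the keys (in Python 2) before computing its length. That means every time you add a gene, you create a list and throw it away, and the more genes you have, the more expensive that list becomes. To avoid the intermediate list, just call len(self.genes).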
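A rough way to see the cost is to time both forms. This is only a sketch: on Python 3, keys() returns a view rather than a list, so the code wraps it in list() to emulate the Python 2 behavior described above. The dictionary size and repetition count are arbitrary choices, and absolute timings will vary by machine.

```python
import timeit

# Arbitrary stand-in for a large gene dictionary.
genes = {"gene%d" % i: None for i in range(100000)}


def count_via_list():
    # Copies every key into a new list, as Python 2's dict.keys() did.
    return len(list(genes.keys()))


def count_direct():
    # Asks the dict for its size; no intermediate list is built.
    return len(genes)


t_list = timeit.timeit(count_via_list, number=100)
t_direct = timeit.timeit(count_direct, number=100)
print("via list: %.4fs, direct: %.4fs" % (t_list, t_direct))
```

Because the list-building version runs once per added gene in the original code, its cost grows with the dictionary, so the total work of loading n genes grows roughly quadratically.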

Better still, make ngenes a property so that it is computed only when it is needed.

@property
def ngenes(self):
    return len(self.genes)

The same can be done for nproteins in the Gene class.

Here is the refactored code:

class Species:
    '''This structure contains all the information needed for all genes.
    One species has several genes; one gene has several proteins.'''

    def __init__(self, name):
        self.name = name  # name of the species
        self.genes = {}

    def addProtein(self, gene, protname, len):
        #Converting a line from the input file into a protein and/or an exon
        if gene not in self.genes:
            self.genes[gene] = Gene(gene) 
        self.genes[gene].proteins[protname] = Protein(protname, len)

    @property
    def ngenes(self):
        return len(self.genes)

class Protein:
    #The class Protein contains information about the length of the protein and a list with its exons (each with its own attributes)
    def __init__(self, name, len):
        self.name = name
        self.len = len

class Gene:
    #The class Gene contains information about the gene and a dict with its proteins (each with its own attributes)
    def __init__(self, name):
        self.name = name
        self.proteins = {}

    @property
    def nproteins(self):
        return len(self.proteins)

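【Discussion】:

Or even def __init__(): self.ngenes = 0 followed by def addProtein(): self.ngenes += 1, depending on how often ngenes is actually accessed. If it is hit continuously, as in while myspecies.ngenes < limit: ..., that may be faster.
It's now orders of magnitude faster, and I've learned a few tricks. Thanks!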
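
The counter-based variant suggested in the first comment could look roughly like the sketch below. It assumes the counter should only increase when a gene is seen for the first time, which the comment leaves implicit; the class bodies are trimmed from the answer's code, and the gene and protein names are made up for the example.

```python
class Protein:
    def __init__(self, name, len):
        self.name = name
        self.len = len


class Gene:
    def __init__(self, name):
        self.name = name
        self.proteins = {}


class Species:
    def __init__(self, name):
        self.name = name
        self.genes = {}
        self.ngenes = 0  # running count, kept in sync by addProtein

    def addProtein(self, gene, protname, len):
        if gene not in self.genes:
            self.genes[gene] = Gene(gene)
            self.ngenes += 1  # count only genes seen for the first time
        self.genes[gene].proteins[protname] = Protein(protname, len)


# Made-up input: two proteins for geneA, one for geneB.
species = Species("example")
species.addProtein("geneA", "protA1", 120)
species.addProtein("geneA", "protA2", 95)
species.addProtein("geneB", "protB1", 300)
print(species.ngenes)
```

The trade-off is that a plain attribute is cheaper to read in a tight loop than a property, but it can drift out of sync if self.genes is ever modified outside addProtein.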