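【Posted】: 2017-07-15 10:31:38
【Question】:
I am writing a long piece of code that takes too long to execute. I ran cProfile on it and found that the following function is called 150 times and takes 1.3 seconds per call, so this function alone accounts for roughly 200 seconds. The function is -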
def makeGsList(sentences,org):
    gs_list1=[]
    gs_list2=[]
    for s in sentences:
        if s.startswith(tuple(StartWords)):
            s = s.lower()
            if org=='m':
                gs_list1 = [k for k in m_words if k in s]
            if org=='h':
                gs_list1 = [k for k in h_words if k in s]
            for gs_element in gs_list1:
                gs_list2.append(gs_element)
    gs_list3 = list(set(gs_list2))
    return gs_list3
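The function takes a list of sentences and a flag org. It goes through each sentence, checks whether it starts with any of the words in the list StartWords, and if so lowercases it. Then, depending on the value of org, it builds a list of all entries of m_words or h_words that occur in the current sentence. It keeps appending these entries to another list, gs_list2. Finally it builds a set from gs_list2 and returns it as a list.
Can anyone suggest how to optimize this function to reduce the execution time?
Note - the entries in h_words/m_words are not all single words; many of them are phrases of 3-4 words.
Some examples -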
StartWords = ['!Series_title','!Series_summary','!Series_overall_design','!Sample_title','!Sample_source_name_ch1','!Sample_characteristics_ch1']
sentences = [u'!Series_title\t"Transcript profiles of DCs of PLOSL patients show abnormalities in pathways of actin bundling and immune response"\n', u'!Series_summary\t"This study was aimed to identify pathways associated with loss-of-function of the DAP12/TREM2 receptor complex and thus gain insight into pathogenesis of PLOSL (polycystic lipomembranous osteodysplasia with sclerosing leukoencephalopathy). Transcript profiles of PLOSL patients\' DCs showed differential expression of genes involved in actin bundling and immune response, but also for the stability of myelin and bone remodeling."\n', u'!Series_summary\t"Keywords: PLOSL patient samples vs. control samples"\n', u'!Series_overall_design\t"Transcript profiles of in vitro differentiated DCs of three controls and five PLOSL patients were analyzed."\n', u'!Series_type\t"Expression profiling by array"\n', u'!Sample_title\t"potilas_DC_A"\t"potilas_DC_B"\t"potilas_DC_C"\t"kontrolli_DC_A"\t"kontrolli_DC_C"\t"kontrolli_DC_D"\t"potilas_DC_E"\t"potilas_DC_D"\n', u'!Sample_characteristics_ch1\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\n', u'!Sample_description\t"DAP12mut"\t"DAP12mut"\t"DAP12mut"\t"control"\t"control"\t"control"\t"TREM2mut"\t"TREM2mut"\n']
h_words = ['pp1665', 'glycerophosphodiester phosphodiesterase domain containing 5', 'gde2', 'PLOSL patients', 'actin bundling', 'glycerophosphodiester phosphodiesterase 2', 'glycerophosphodiester phosphodiesterase domain-containing protein 5']
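m_words is similar.
Regarding sizes -
The lists h_words and m_words each contain about 250,000 entries. Each entry is on average 2 words long. The list of sentences is about 10-20 sentences long; the sample list above gives an idea of how long each sentence can be.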
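Given these sizes (about 250,000 phrases but only 10-20 sentences per call), one possible direction is to invert the search: rather than testing every phrase against each sentence, generate the short word sequences of each sentence and look them up in a set. The following is only a minimal sketch under stated assumptions, not a drop-in replacement. The names make_gs_list_ngrams, phrase_set, max_words and START_WORDS are hypothetical, and the tokenization is an assumption: it matches whole words joined by single spaces, which is not the same as the substring test `k in s` used above (a phrase embedded inside a longer token, or containing punctuation other than hyphens, would no longer match).

```python
import re

# Shortened prefix tuple for the example; the real list is StartWords.
START_WORDS = ('!Series_title', '!Series_summary')

def make_gs_list_ngrams(sentences, phrase_set, max_words):
    """Return the phrases of phrase_set that occur as whole-word runs
    in the sentences starting with one of START_WORDS."""
    found = set()
    for s in sentences:
        if not s.startswith(START_WORDS):
            continue
        # Assumed tokenization: lowercase letters, digits and hyphens.
        tokens = re.findall(r"[a-z0-9-]+", s.lower())
        for i in range(len(tokens)):
            # Try every run of 1..max_words tokens starting at position i.
            for n in range(1, max_words + 1):
                if i + n > len(tokens):
                    break
                candidate = ' '.join(tokens[i:i + n])
                if candidate in phrase_set:  # average O(1) set lookup
                    found.add(candidate)
    return list(found)

# Build the set and the longest phrase length once, outside the hot function.
phrase_set = {'actin bundling', 'gde2', 'pp1665'}
max_words = max(len(p.split()) for p in phrase_set)

sentences = [
    '!Series_title\t"Transcript profiles show abnormalities in pathways of actin bundling and immune response"\n',
    '!Series_type\t"actin bundling gde2"\n',  # skipped: prefix not in START_WORDS
]
print(sorted(make_gs_list_ngrams(sentences, phrase_set, max_words)))  # ['actin bundling']
```

With this shape the work per call depends on the number of tokens in the sentences times max_words, not on the 250,000 entries, provided the set is built once and reused across the 150 calls.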
【Comments】:
-
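If your code works correctly, this belongs on Code Review Stack Exchange.
-
1. Think about what is in gs_list1 after each iteration. 2. Why not start with a set?
-
Can you guarantee that org will always be 'm' or 'h'?
-
@PM2Ring, yes, only those two.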
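The hints in the comments above could be sketched roughly as follows: accumulate into a set from the start, build the prefix tuple once, and choose the word list once instead of testing org for every sentence. This is a minimal sketch, not a measured fix. The name make_gs_list and the explicit start_words, m_words and h_words parameters are hypothetical, and it relies on org being only 'm' or 'h' as confirmed above. The substring matching is unchanged, so the dominant cost (testing every phrase against every sentence) remains and any speedup is likely to be modest.

```python
def make_gs_list(sentences, org, start_words, m_words, h_words):
    """Same substring matching as the original, collecting into a set."""
    prefixes = tuple(start_words)                # built once, not per sentence
    words = m_words if org == 'm' else h_words   # assumes org is 'm' or 'h'
    found = set()
    for s in sentences:
        if s.startswith(prefixes):
            s = s.lower()
            # Add every phrase that occurs as a substring of this sentence.
            found.update(k for k in words if k in s)
    return list(found)

example = make_gs_list(
    ['!Series_title\t"Actin Bundling and GDE2"\n', '!Other\t"pp1665"\n'],
    'h',
    ['!Series_title'],
    ['x'],
    ['actin bundling', 'gde2', 'pp1665', 'PLOSL patients'],
)
print(sorted(example))  # ['actin bundling', 'gde2']
```

Note that an entry containing uppercase letters, such as 'PLOSL patients', can never match here or in the original, because each sentence is lowercased before the test.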
Tags: python performance list optimization