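【Question title】: Can this function be optimized for speed?
【Posted】: 2017-07-15 10:31:38
【Question】:

I am writing a fairly long piece of code that takes too long to execute. I ran cProfile on it and found that the following function is called 150 times and takes 1.3 seconds per call, so this function alone accounts for roughly 200 seconds. The function is: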

def makeGsList(sentences,org):
    gs_list1=[]
    gs_list2=[]
    for s in sentences:
        if s.startswith(tuple(StartWords)):
            s = s.lower()
            if org=='m':
                gs_list1 = [k for k in m_words if k in s]
            if org=='h':
                gs_list1 = [k for k in h_words if k in s]
            for gs_element in gs_list1:
                gs_list2.append(gs_element)
    gs_list3 = list(set(gs_list2))
    return gs_list3

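The function takes a list of sentences and a flag org. It goes through each sentence and checks whether it starts with any of the words in the list StartWords; if so, it lowercases the sentence. Then, depending on the value of org, it builds a list of all the entries of m_words or h_words that occur in the current sentence. It keeps appending these entries to another list, gs_list2. Finally it builds a set from gs_list2 and returns it as a list.

Can anyone suggest how I could optimize this function to reduce the execution time?

Note: the entries in h_words/m_words are not all single words; many of them are phrases of 3-4 words.

Some examples: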

StartWords = ['!Series_title','!Series_summary','!Series_overall_design','!Sample_title','!Sample_source_name_ch1','!Sample_characteristics_ch1']

sentences = [u'!Series_title\t"Transcript profiles of DCs of PLOSL patients show abnormalities in pathways of actin bundling and immune response"\n', u'!Series_summary\t"This study was aimed to identify pathways associated with loss-of-function of the DAP12/TREM2 receptor complex and thus gain insight into pathogenesis of PLOSL (polycystic lipomembranous osteodysplasia with sclerosing leukoencephalopathy). Transcript profiles of PLOSL patients\' DCs showed differential expression of genes involved in actin bundling and immune response, but also for the stability of myelin and bone remodeling."\n', u'!Series_summary\t"Keywords: PLOSL patient samples vs. control samples"\n', u'!Series_overall_design\t"Transcript profiles of in vitro differentiated DCs of three controls and five PLOSL patients were analyzed."\n', u'!Series_type\t"Expression profiling by array"\n', u'!Sample_title\t"potilas_DC_A"\t"potilas_DC_B"\t"potilas_DC_C"\t"kontrolli_DC_A"\t"kontrolli_DC_C"\t"kontrolli_DC_D"\t"potilas_DC_E"\t"potilas_DC_D"\n',  u'!Sample_characteristics_ch1\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\n', u'!Sample_description\t"DAP12mut"\t"DAP12mut"\t"DAP12mut"\t"control"\t"control"\t"control"\t"TREM2mut"\t"TREM2mut"\n']

h_words = ['pp1665', 'glycerophosphodiester phosphodiesterase domain containing 5', 'gde2', 'PLOSL patients', 'actin bundling', 'glycerophosphodiester phosphodiesterase 2', 'glycerophosphodiester phosphodiesterase domain-containing protein 5']

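m_words is similar.

About the sizes:

The two lists h_words and m_words each contain about 250,000 entries. Each entry is on average 2 words long. The sentences list is about 10-20 sentences long; I have provided a sample list to give you an idea of how long each sentence is.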

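【Comments】:

If your code works correctly, this belongs on the Code Review Stack Exchange.
1. Think about what gs_list1 contains after each iteration. 2. Why not start with a set?
Can you guarantee that org will always be 'm' or 'h'?
@PM2Ring, yes, only those two.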
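
The two hints in the comments above (drop the intermediate lists, start with a set) could be read as the following minimal sketch. The name make_gs_list and the extra parameters are assumptions made so that the example is self-contained; they are not from the question. Note that this only tidies the bookkeeping: it still tests every entry of the word list against every matching sentence, so on its own it is unlikely to change the running time much.

```python
def make_gs_list(sentences, org, start_words, m_words, h_words):
    # org is assumed to be either 'm' or 'h', as the asker confirmed.
    words = m_words if org == 'm' else h_words
    prefixes = tuple(start_words)  # build the tuple once, not once per sentence
    found = set()                  # collect into a set from the start
    for s in sentences:
        if s.startswith(prefixes):
            s = s.lower()
            # Same substring test as the original, without the intermediate lists.
            found.update(k for k in words if k in s)
    return list(found)
```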

Tags: python performance list optimization

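【Answer 1】:

Don't use global variables for m_words and h_words.
Move the if statements outside the for loop.
Convert StartWords to a tuple once and for all.
Use a programmatically built regular expression instead of a list comprehension.
Precompile everything you can.
Extend your list directly instead of iterating over it to append() each element.
Use a set from the start instead of a list.
Use a set comprehension instead of an explicit for loop.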

import re

# Compile once, outside the function; re.escape makes every entry match literally.
m_reg = re.compile("|".join(re.escape(w) for w in m_words))
h_reg = re.compile("|".join(re.escape(w) for w in h_words))

def make_gs_list(sentences, start_words, m_reg, h_reg, org):
    if org == 'm':
        reg = m_reg
    elif org == 'h':
        reg = h_reg

    matched = {w for s in sentences if s.startswith(start_words)
                 for w in reg.findall(s.lower())}

    return matched

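【Comments】:

Thanks for the detailed answer. As I mentioned in my edit, since the lists m_words and h_words are both very large (250,000 entries each), would it be better to compile them just once in the main function (for both) and then pass reg_m and reg_h to this function? Also, why do you think passing start_words, m_words and h_words as function arguments is faster than using global variables for them?
@user1993 If you are able to precompile the regexes, then yes, you should! Local variables are generally faster to look up than global variables.
@user1993 Also, let me know how much faster it is compared with your first function; I'm curious.
@user1993 Like Gribouillis suggested, I'm also curious about using a regex instead of startswith. The * is needed to unpack the elements of the matched words; otherwise it would build the set out of lists.
I tried this, but it gives an error: sre_constants.error: bad character range l-c at position 194661, raised in re.compile. It is hard to check what is causing it.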
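
An error like the one in the last comment typically arises when an entry containing characters such as [ and - reaches re.compile unescaped; whether that is what happened here cannot be told from the thread. The sketch below, with made-up entries and a hypothetical helper name compile_phrase_regex, shows that escaping each entry lets such text compile and match literally. Sorting the entries longest-first is an additional assumption, made so that a longer phrase is preferred over a shorter entry that is its prefix.

```python
import re

def compile_phrase_regex(words):
    # Longest first: in an alternation the first alternative that matches wins,
    # so "actin bundling" must come before "actin".
    ordered = sorted(set(words), key=len, reverse=True)
    # re.escape turns characters such as '[', '-' and ']' into literals.
    return re.compile("|".join(re.escape(w) for w in ordered))

words = ["il-c", "[l-c]", "actin", "actin bundling"]  # made-up entries
reg = compile_phrase_regex(words)
matches = reg.findall("actin bundling near [l-c]")
```

Compiling the entry "[l-c]" without escaping it raises re.error, because l-c is read as a reversed character range.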

【Answer 2】:

I would try this:

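import re

# Optionally change these regexes.
# FIRST_WORD_RE has to match a whole start word; the sample StartWords begin
# with '!' and contain '_' and digits, hence '!' followed by \w+.
FIRST_WORD_RE = re.compile(r"^!\w+")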
LOWER_WORD_RE = re.compile(r"[a-z]+")
m_or_h_words = {'m': set(m_words), 'h': set(h_words)}
startwords_set = set(StartWords)

def makeGsList(sentences, org):
    words = m_or_h_words[org]
    gs_set2 = set()
    for s in sentences:
        mo = FIRST_WORD_RE.match(s)
        if mo and mo.group(0) in startwords_set:
            gs_set2 |= set(LOWER_WORD_RE.findall(s.lower())) & words
    return list(gs_set2)

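【Comments】:

Thanks for your answer. I have a couple of questions: 1. Why do you think using a regex is better than .startswith? 2. What does |= do?
I don't know the .startswith() algorithm well. I suspect it loops over all the words, which could be inefficient if there are many of them. Also, I know from programming-language parsing techniques that the most efficient way to recognize keywords in text is to use a word regex and a lookup in a set or dictionary. You can use timeit to compare the two approaches. |= performs an in-place set union (| is the binary union operator for set types).
I think there is an error in this line: gs_set2 |= set(LOWER_WORD_RE.findall(s.lower()) & words, because my interpreter gives SyntaxError: invalid syntax on whatever line I put after it.
OK, it should be gs_set2 |= set(LOWER_WORD_RE.findall(s.lower())) & words. A parenthesis was missing.
I tried your approach and added spaces and hyphens to LOWER_WORD_RE, giving LOWER_WORD_RE = re.compile(r"[a-z0-9 /-]+"), so that it can also pick up the multi-word entities in h_words. The only problem is that this regex is very greedy, so the matches it finds (using LOWER_WORD_RE.findall(s.lower())) only end when a character outside [a-z0-9 /-] is encountered. As a result, the matches often contain more than the entities present in h_words, so the & operation does not give the right result. Any ideas?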
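
One possible way around the greedy-match problem raised in the last comment, offered only as a sketch and not as something proposed in the thread, is to keep the tokens single words and look up every run of 1 to max_words consecutive tokens in the set. The names TOKEN_RE, find_phrases and max_words are hypothetical. The sketch assumes the phrases are stored lowercased with single spaces between words, and it matches whole words only, which differs from the substring test in the question.

```python
import re

# Assumption: a token is a run of lowercase letters, digits, '/' or '-'.
TOKEN_RE = re.compile(r"[a-z0-9/-]+")

def find_phrases(sentence, phrases, max_words=4):
    """Return the members of `phrases` that occur as runs of whole words in `sentence`."""
    tokens = TOKEN_RE.findall(sentence.lower())
    found = set()
    for n in range(1, max_words + 1):          # phrase length in words
        for i in range(len(tokens) - n + 1):   # start position of the run
            candidate = " ".join(tokens[i:i + n])
            if candidate in phrases:           # set lookup, independent of set size
                found.add(candidate)
    return found

phrases = {"actin bundling", "plosl patients", "gde2"}  # lowercased sample entries
sentence = ('!Series_title\t"Transcript profiles of DCs of PLOSL patients show '
            'abnormalities in pathways of actin bundling and immune response"\n')
found = find_phrases(sentence, phrases)
```

The work per sentence grows with the sentence length times max_words rather than with the 250,000 entries, but max_words has to be at least as large as the longest phrase in the set, and entries containing characters outside the token pattern would never be found.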

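【Answer 3】:

I think you could take a first crack at this by tokenizing your sentences.

So you would do this:

(You would use a regular expression here rather than split, but split is used just to illustrate.)

sentences = tuple(s.split(' ') for s in sentences)

Then, instead of using startswith, put your StartWords in a set:

sw_set = {w for w in StartWords}

Then, as you iterate over the sentences, do this:

if s[0] in sw_set:
    # continue with the rest of your logic

I think this is where you are taking the biggest performance hit.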
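
A minimal sketch of the first-token lookup described above. In the sample sentences from the question the start word is followed by a tab rather than a space, so this version splits on any whitespace; the names START_WORDS and starts_with_known_word are hypothetical. Unlike startswith, it requires the first token to equal a start word exactly.

```python
START_WORDS = {'!Series_title', '!Series_summary'}  # subset of the sample StartWords

def starts_with_known_word(sentence, start_words):
    # split(None, 1) cuts at the first run of whitespace (space, tab or newline),
    # so only the first token is separated from the rest of the sentence.
    parts = sentence.split(None, 1)
    return bool(parts) and parts[0] in start_words
```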

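【Comments】:

【Answer 4】:

In Python, membership tests on a set are much faster than on a list, so you can always convert your list to a set and then look words up in the set instead of the list. Here is my sample code: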

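from nltk.corpus import stopwords  # needs the NLTK stopwords corpus to be installed

def clean_reviews(raw_review, num_reviews):  # hypothetical name for the enclosing function
    # Convert the stop words from a list to a set, once, before the loop
    stops = set(stopwords.words("english"))
    BS_reviews = []
    for i in range(num_reviews):
        text = raw_review["review"][i].lower()  # Convert to lower case
        words = text.split()  # Split into words
        # Remove stop words from "words"
        meaningful_words = [w for w in words if w not in stops]
        # Join the words back into one string
        BS_reviews.append(" ".join(meaningful_words))
    return BS_reviews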
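
The set-versus-list claim above can be checked with timeit. The sketch below uses made-up data and prints the timings rather than asserting them, since they vary by machine.

```python
import timeit

words_list = ["word%d" % i for i in range(100000)]  # made-up data
words_set = set(words_list)
probe = "word99999"  # last element: the worst case for the list scan

# Time 100 membership tests against each container.
list_time = timeit.timeit(lambda: probe in words_list, number=100)
set_time = timeit.timeit(lambda: probe in words_set, number=100)
print("list: %.6f s, set: %.6f s" % (list_time, set_time))
```

A set lookup tests whole-item equality, so it helps when each sentence is first split into tokens; it does not speed up a substring test such as k in s from the question.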

【Comments】: