When is the CPython set `in` operator O(n)? Answers

【Question title】: When is the CPython set `in` operator O(n)?
【Posted】: 2020-03-31 23:39:50
【Question description】:

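I was reading about the time complexity of set operations in CPython and learned that the `in` operator for sets has an average time complexity of O(1) and a worst-case time complexity of O(n). I also learned that the worst case wouldn't occur in CPython unless the set's hash table's load factor is too high.

This made me wonder: when would such a case occur in the CPython implementation? Is there a simple demo that shows a set with a clearly observable O(n) time complexity for the `in` operator?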

【Question comments】:

Tags: python set time-complexity cpython in-operator

【Answer 1】:

Load factor is a red herring. In CPython, sets (and dicts) automatically resize to keep the load factor under 2/3. There's nothing you can do in Python code to stop that.
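
The automatic resizing can be observed indirectly. The following is a minimal sketch, assuming CPython, where `sys.getsizeof` on a set includes its hash table; the helper name `resize_points` is made up for illustration. It records each point at which the reported size of a growing set jumps.

```python
import sys


def resize_points(n):
    """Return (length, size_in_bytes) pairs at each point where the set's reported size changes."""
    s = set()
    last = sys.getsizeof(s)
    points = []
    for i in range(n):
        s.add(i)
        size = sys.getsizeof(s)
        if size != last:
            # The reported size changed, so the underlying table was reallocated.
            points.append((len(s), size))
            last = size
    return points


for length, size in resize_points(1000):
    print(f"after {length:4d} elements: {size} bytes")
```

The exact lengths and byte counts depend on the CPython version and platform, so none are listed here.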

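O(N) behavior can occur when a great many elements have exactly the same hash code. They then map to the same hash bucket, and set lookup degenerates to a slow form of linear search.

The simplest way to contrive such bad elements is to create a class with a horrible hash function. Like, e.g., and untested: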

class C:
    def __init__(self, val):
        self.val = val
    def __eq__(a, b):
        return a.val == b.val
    def __hash__(self):
        return 3

Then hash(C(i)) == 3 regardless of the value of i.
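
To make the linear behavior visible without relying on timings, here is a sketch that counts `__eq__` calls during a single failed `in` lookup. The class name `ConstantHash` and the helper `comparisons_for_missing_lookup` are hypothetical; the class is a variant of `C` above with a call counter added.

```python
class ConstantHash:
    """Hypothetical key type whose instances all collide (every hash is 3)."""

    eq_calls = 0  # class-wide counter of __eq__ invocations

    def __init__(self, val):
        self.val = val

    def __eq__(self, other):
        ConstantHash.eq_calls += 1
        return isinstance(other, ConstantHash) and self.val == other.val

    def __hash__(self):
        return 3


def comparisons_for_missing_lookup(n):
    """Build a set of n colliding keys, then count __eq__ calls for one failed lookup."""
    s = {ConstantHash(i) for i in range(n)}
    ConstantHash.eq_calls = 0  # ignore the comparisons made while building the set
    found = ConstantHash(-1) in s
    return found, ConstantHash.eq_calls


for n in (100, 200, 400):
    found, calls = comparisons_for_missing_lookup(n)
    print(n, found, calls)  # the number of comparisons grows with n
```

Because every stored key shares the probe's hash, the lookup has to compare against each of them before it can conclude that the key is absent, so the count is at least `n`. Building the set is itself quadratic for the same reason, which is why `n` is kept small.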

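Doing the same with a built-in type requires deep knowledge of its CPython implementation details. For example, here's a way to create an arbitrarily large number of distinct ints with the same hash code: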

>>> import sys
>>> M = sys.hash_info.modulus
>>> set(hash(1 + i*M) for i in range(10000))
{1}

This shows that the ten thousand distinct ints created all have hash code 1.
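
Building on that, a rough timing sketch can compare a failed `in` lookup on a set of colliding ints with one on an ordinary set of the same size. The helper names `build_colliding_ints` and `time_failed_lookup` are made up for this example, and it assumes CPython's int hashing as described above.

```python
import sys
import timeit

M = sys.hash_info.modulus  # CPython detail: hash(x) == x % M for non-negative ints


def build_colliding_ints(n):
    """Return n distinct ints that all have hash code 1."""
    return [1 + i * M for i in range(n)]


def time_failed_lookup(n, repeats=200):
    """Time a failed `in` lookup on a colliding set and on an ordinary set of size n."""
    colliding = set(build_colliding_ints(n))
    ordinary = set(range(n))
    missing_colliding = 1 + n * M  # hashes to 1 but is not in the set
    missing_ordinary = n           # not in range(n)
    t_bad = timeit.timeit(lambda: missing_colliding in colliding, number=repeats)
    t_ok = timeit.timeit(lambda: missing_ordinary in ordinary, number=repeats)
    return t_bad, t_ok


for n in (500, 1000, 2000):
    t_bad, t_ok = time_failed_lookup(n)
    print(f"n={n:5d}  colliding: {t_bad:.6f}s  ordinary: {t_ok:.6f}s")
```

Absolute numbers vary by machine. The thing to look for is that the colliding column grows roughly in proportion to `n` while the ordinary column stays about flat.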

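【Comments】:

【Answer 2】:

You can view the set source here, which may help: https://github.com/python/cpython/blob/723f71abf7ab0a7be394f9f7b2daa9ecdf6fb1eb/Objects/setobject.c#L429-L441

It's difficult to devise a concrete example, but luckily the theory is fairly simple :) The set stores its keys using a hash of the value. As long as that hash is unique enough, you'll end up with the expected O(1) performance.

If for some weird reason all of your items have different data but the same hash, they collide, and every one of them has to be checked separately.

To illustrate, you can think of the set as a dict like this: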

import collections


# Maps each hash value to the list of stored values that share it
your_set = collections.defaultdict(list)


def add(value):
    your_set[hash(value)].append(value)


def contains(value):
    # This is where your O(n) can occur, all values the same hash()
    values = your_set.get(hash(value), [])
    for v in values:
        if v == value:
            return True
    return False
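
The same model can be instrumented to count comparisons. This is only a sketch: the class `ChainedSet` is hypothetical, and it models collisions with per-hash lists (chaining), whereas CPython's real set uses open addressing in a single table. The point about identical hashes carries over either way.

```python
import collections
import sys


class ChainedSet:
    """Toy set model: one list of values per distinct hash code."""

    def __init__(self):
        self._buckets = collections.defaultdict(list)
        self.comparisons = 0  # number of == checks made by membership tests

    def add(self, value):
        bucket = self._buckets[hash(value)]
        if value not in bucket:
            bucket.append(value)

    def __contains__(self, value):
        # All values with the same hash() land in one list and are scanned linearly.
        for v in self._buckets.get(hash(value), []):
            self.comparisons += 1
            if v == value:
                return True
        return False


M = sys.hash_info.modulus
n = 1000

colliding = ChainedSet()
for i in range(n):
    colliding.add(1 + i * M)  # every value hashes to 1

spread = ChainedSet()
for i in range(n):
    spread.add(i)  # small ints hash to themselves

print((1 + n * M) in colliding, colliding.comparisons)  # scans the whole bucket
print(n in spread, spread.comparisons)                  # no bucket to scan
```

In this model the failed lookup on the colliding set makes exactly `n` comparisons, while the failed lookup on the spread-out set makes none.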

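【Comments】:

【Answer 3】:

This is sometimes called the "amortization" of a set or dictionary. It shows up now and then as an interview question. As @TimPeters says, resizing happens automatically at 2/3 capacity, so you'll only hit O(n) if you force the hash yourself.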

In computer science, amortized analysis is a method for analyzing a given algorithm's complexity, or how much of a resource, especially time or memory, it takes to execute. The motivation for amortized analysis is that looking at the worst-case run time per operation, rather than per algorithm, can be too pessimistic.

`/* GROWTH_RATE. Growth rate upon hitting maximum load.
 * Currently set to used*3.
 * This means that dicts double in size when growing without deletions,
 * but have more head room when the number of deletions is on a par with the
 * number of insertions.  See also bpo-17563 and bpo-33205.
 *
 * GROWTH_RATE was set to used*4 up to version 3.2.
 * GROWTH_RATE was set to used*2 in version 3.3.0
 * GROWTH_RATE was set to used*2 + capacity/2 in 3.4.0-3.6.0.
 */
#define GROWTH_RATE(d) ((d)->ma_used*3)`

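More to the point of efficiency: why 2/3? The Wikipedia article has a nice graph, https://upload.wikimedia.org/wikipedia/commons/1/1c/Hash_table_average_insertion_time.png, that accompanies it. (For our purposes, the linear probing curve corresponds to the slide from O(1) toward O(n); chaining is a more complex hashing approach.) See https://en.wikipedia.org/wiki/Hash_table for the complete article.

Say you have a set or dictionary that is stable and sits at 2/3 - 1 of its underlying capacity. Do you really want sluggish performance forever? You may wish to force it to resize upwards.

"If the keys are always known in advance, you can store them in a set and build your dictionaries from the set using dict.fromkeys()." Plus some other useful, if dated, observations, from: Improving performance of very large dictionary in Python

For a good read, see dictresize() (dict was in Python before set): https://github.com/python/cpython/blob/master/Objects/dictobject.c#L415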

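【Comments】:

This has nothing to do with amortization; the time complexity of a single `in` operation on a Python set is O(1) in the average case. You don't have to average over the whole algorithm; each operation takes O(1) time.
OP: "worst-case time complexity of O(n)." As the hash table fills up, every "in" request in a set or dictionary lookup trends toward O(n).
Yes, but that has nothing to do with amortization; it is the time a single `in` operation takes in the worst case. It is not an average over a large number of `in` operations.
O(n), sure, with resizing. "By instrumenting a pure Python model for dictionaries (such as this one), it is possible to count the weighted-average number of probes for an alternative insertion order. For example, inserting dict.fromkeys([11100, 22200, 44400, 33300]) averages 1.75 probes per lookup. That beats the 2.25 average probes per lookup for dict.fromkeys([33300, 22200, 11100, 44400])." Raymond Hettinger, from the "Improving performance" link in my answer above.
The `in` operation doesn't modify the state of the set; it cannot trigger a resize of the set's underlying array. Your quote from Raymond Hettinger doesn't mention amortization, and he isn't talking about an amortized average either; just an average.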