Is List<T>.Contains() very slow? Answers

【Question title】: List<T>.Contains() is very slow?
【Posted】: 2009-05-05 08:21:24
【Question】:

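Can anyone explain why the generic List.Contains() function is so slow?

I have a List<long> with about a million numbers, and code that constantly checks whether a specific number is among them.

I tried doing the same thing with a Dictionary<long, byte> and the Dictionary.ContainsKey() function, and it was about 10-20 times faster than with the List.

Of course, I don't really want to use a Dictionary for this purpose, because it isn't meant to be used that way.

So the real question is: is there an alternative to List<T>.Contains() that isn't as awkward as Dictionary<K,V>.ContainsKey()?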

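【Question comments】:

What's wrong with a dictionary? It is intended for cases like yours.
@Kamarey: HashSet may be a better option.
HashSet is what I was looking for.

Tags: .net arrays generics list

【Answer 1】:

If you are only checking for existence, HashSet<T> in .NET 3.5 is your best option: dictionary-like performance, but no key/value pair, just the values: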

    Random rand = new Random(); // random source for the sample data
    HashSet<int> data = new HashSet<int>();
    for (int i = 0; i < 1000000; i++)
    {
        data.Add(rand.Next(50000000));
    }
    bool contains = data.Contains(1234567); // etc

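【Comments】:

【Answer 2】:

List.Contains is an O(n) operation.

Dictionary.ContainsKey is an O(1) operation on average, because it uses the key's hash code to jump straight to the matching bucket instead of scanning every entry, which gives you much faster lookups.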
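
To make the O(n) versus O(1) difference concrete, here is a minimal timing sketch. The class name LookupComparison, the helper CountHits, and the data sizes are assumptions made for illustration; actual timings depend on the machine and runtime, so none are quoted here.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class LookupComparison
{
    // Counts how many probes are present in the collection and reports the elapsed time.
    // ICollection<long>.Contains dispatches to List<T>.Contains or HashSet<T>.Contains.
    public static int CountHits(ICollection<long> haystack, long[] probes, out TimeSpan elapsed)
    {
        Stopwatch sw = Stopwatch.StartNew();
        int hits = 0;
        foreach (long probe in probes)
        {
            if (haystack.Contains(probe))
                hits++;
        }
        sw.Stop();
        elapsed = sw.Elapsed;
        return hits;
    }

    public static void Main()
    {
        const int size = 1000000;              // assumed size, matching the question
        var list = new List<long>(size);
        for (long i = 0; i < size; i++)
            list.Add(i * 2);                   // even numbers only
        var set = new HashSet<long>(list);     // same values, hash-based lookup

        long[] probes = new long[200];         // assumed number of lookups
        for (int i = 0; i < probes.Length; i++)
            probes[i] = size + i;

        TimeSpan listTime, setTime;
        int listHits = CountHits(list, probes, out listTime);
        int setHits = CountHits(set, probes, out setTime);

        Console.WriteLine("List:    {0} hits in {1}", listHits, listTime);
        Console.WriteLine("HashSet: {0} hits in {1}", setHits, setTime);
    }
}
```

Both collections report the same hits; only the time spent per lookup differs, and the gap grows with the size of the list.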

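I don't think scanning a List containing a million entries to find a few entries is a good idea.

Isn't it possible to save those million entities into an RDBMS, for instance, and perform queries on that database?

If that is not possible, I would use a Dictionary anyway if you want to do key lookups.

【Comments】:

I don't think there is anything inappropriate about a list with a million items; it's just that you probably don't want to keep running linear searches through it.
Agreed, there is nothing wrong with a list or an array with that many entries. Just don't scan it for values.

【Answer 3】:

I think I have the answer! Yes, it's true that Contains() on a list (array) is O(n), but if the array is short and you are using value types, it should still be quite fast. But using the CLR Profiler [a free download from Microsoft], I discovered that Contains() was boxing values in order to compare them, which requires heap allocation, which is VERY expensive (slow). [Note: This is .Net 2.0; other .Net versions not tested.]

Here is the full story and solution. We have an enumeration called "VI" and made a class called "ValueIdList", which is an abstract type for a list (array) of VI objects. The original implementation dates from the ancient .Net 1.1 days, and it used an encapsulated ArrayList. We discovered recently in http://blogs.msdn.com/b/joshwil/archive/2004/04/13/112598.aspx that a generic list (List<VI>) performs much better than ArrayList on value types (like our enum VI) because the values do not have to be boxed. That is true, and it worked... almost.

The CLR Profiler revealed a surprise. Here is a portion of the Allocation Graph: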

ValueIdList::Contains bool(VI) 5.5MB (34.81%)
Generic.List::Contains bool() 5.5MB (34.81%)
Generic.ObjectEqualityComparer::Equals bool () 5.5MB (34.88%)
Values.VI 7.7MB (49.03%)

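As you can see, Contains() surprisingly calls Generic.ObjectEqualityComparer.Equals(), which apparently requires boxing a VI value, and that in turn requires an expensive heap allocation. It is odd that Microsoft would eliminate boxing in the list, only to require it again for a simple operation like this.

Our solution was to rewrite the Contains() implementation, which in our case was easy to do since we were already encapsulating the generic list object (_items). Here is the simple code: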

public bool Contains(VI id) 
{
  return IndexOf(id) >= 0;
}

public int IndexOf(VI id) 
{ 
  int i, count;

  count = _items.Count;
  for (i = 0; i < count; i++)
    if (_items[i] == id)
      return i;
  return -1;
}

public bool Remove(VI id) 
{
  int i;

  i = IndexOf(id);
  if (i < 0)
    return false;
  _items.RemoveAt(i);

  return true;
}

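The comparison of VI values is now done in our own version of IndexOf(), which requires no boxing and is very fast. Our particular program sped up by 20% after this simple rewrite. O(n)... no problem! Just avoid the wasted memory usage!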
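
A related way to keep enum comparisons free of boxing is to pass an explicit IEqualityComparer<VI>. The following is only a sketch: the VI members, the ViComparer class, and the ViLookup helper are hypothetical names, and whether the default comparer boxes at all depends on the runtime version (the measurements above were taken on .Net 2.0).

```csharp
using System;
using System.Collections.Generic;

public enum VI { None = 0, Alpha = 1, Beta = 2 }   // hypothetical members, for illustration only

// Compares VI values directly with ==, so no value is converted to object.
public sealed class ViComparer : IEqualityComparer<VI>
{
    public bool Equals(VI x, VI y) { return x == y; }
    public int GetHashCode(VI value) { return (int)value; }
}

public static class ViLookup
{
    // Linear search that routes every comparison through the typed comparer.
    public static int IndexOf(IList<VI> items, VI id, IEqualityComparer<VI> comparer)
    {
        for (int i = 0; i < items.Count; i++)
            if (comparer.Equals(items[i], id))
                return i;
        return -1;
    }

    public static void Main()
    {
        var items = new List<VI> { VI.Alpha, VI.Beta };
        var comparer = new ViComparer();

        Console.WriteLine(IndexOf(items, VI.Beta, comparer));   // prints 1

        // The same comparer can also be handed to a hash-based collection.
        var set = new HashSet<VI>(items, comparer);
        Console.WriteLine(set.Contains(VI.None));               // prints False
    }
}
```

The comparer is the reusable part: it works for a hand-written loop as well as for HashSet<VI> or Dictionary<VI, TValue>.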

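【Comments】:

Thanks for the tip, I was getting caught by bad boxing performance myself. A custom Contains implementation was way faster for my use case.

【Answer 4】:

A dictionary isn't that bad, because the keys in a dictionary are designed to be found quickly. To find a number in a list, it has to iterate through the whole list in the worst case.

Of course, the dictionary only works if your numbers are unique and not ordered.

I think there is also a HashSet<T> class in .NET 3.5; it also allows only unique elements.

【Comments】:

A Dictionary can also effectively store non-unique objects: use an integer to count the number of duplicates. For example, you would store the list {a,b,a} as {a=2,b=1}. It does lose the ordering, of course.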
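
The counting idea from this comment can be sketched as follows. The names MultisetSketch and CountOccurrences are made up for this example.

```csharp
using System;
using System.Collections.Generic;

public static class MultisetSketch
{
    // Stores each distinct item once, together with the number of times it occurred.
    public static Dictionary<string, int> CountOccurrences(IEnumerable<string> items)
    {
        var counts = new Dictionary<string, int>();
        foreach (string item in items)
        {
            int current;
            counts.TryGetValue(item, out current);   // current stays 0 when the item is new
            counts[item] = current + 1;
        }
        return counts;
    }

    public static void Main()
    {
        Dictionary<string, int> counts = CountOccurrences(new[] { "a", "b", "a" });

        Console.WriteLine(counts["a"]);              // prints 2
        Console.WriteLine(counts["b"]);              // prints 1
        Console.WriteLine(counts.ContainsKey("c"));  // prints False
    }
}
```

Membership is still a ContainsKey() call; the original order of the list is not recoverable from the counts.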

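【Answer 5】:

This isn't exactly an answer to your question, but I have a class that improves the performance of Contains() on a collection. I subclassed a Queue and added a Dictionary that maps a hash code to a list of objects. The Dictionary.ContainsKey() function is O(1), whereas List.Contains(), Queue.Contains(), and Stack.Contains() are O(n).

The value type of the dictionary is a queue holding the objects that share the same hash code. The caller can supply a custom class object that implements IEqualityComparer. You could use this pattern for Stacks or Lists as well; the code would need only a few changes.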

/// <summary>
/// This is a class that mimics a queue, except the Contains() operation is O(1) rather than O(n) thanks to an internal dictionary.
/// The dictionary remembers the hashcodes of the items that have been enqueued and dequeued.
/// Hashcode collisions are stored in a queue to maintain FIFO order.
/// </summary>
/// <typeparam name="T"></typeparam>
private class HashQueue<T> : Queue<T>
{
    private readonly IEqualityComparer<T> _comp;
    public readonly Dictionary<int, Queue<T>> _hashes; //_hashes.Count doesn't always equal base.Count (due to collisions)

    public HashQueue(IEqualityComparer<T> comp = null) : base()
    {
        this._comp = comp;
        this._hashes = new Dictionary<int, Queue<T>>();
    }

    public HashQueue(int capacity, IEqualityComparer<T> comp = null) : base(capacity)
    {
        this._comp = comp;
        this._hashes = new Dictionary<int, Queue<T>>(capacity);
    }

    public HashQueue(IEnumerable<T> collection, IEqualityComparer<T> comp = null) : base(collection)
    {
        this._comp = comp;

        this._hashes = new Dictionary<int, Queue<T>>(base.Count);
        foreach (var item in collection)
        {
            this.EnqueueDictionary(item);
        }
    }

    public new void Enqueue(T item)
    {
        base.Enqueue(item); //add to queue
        this.EnqueueDictionary(item);
    }

    private void EnqueueDictionary(T item)
    {
        int hash = this._comp == null ? item.GetHashCode() : this._comp.GetHashCode(item);
        Queue<T> temp;
        if (!this._hashes.TryGetValue(hash, out temp))
        {
            temp = new Queue<T>();
            this._hashes.Add(hash, temp);
        }
        temp.Enqueue(item);
    }

    public new T Dequeue()
    {
        T result = base.Dequeue(); //remove from queue

        int hash = this._comp == null ? result.GetHashCode() : this._comp.GetHashCode(result);
        Queue<T> temp;
        if (this._hashes.TryGetValue(hash, out temp))
        {
            temp.Dequeue();
            if (temp.Count == 0)
                this._hashes.Remove(hash);
        }

        return result;
    }

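    public new bool Contains(T item)
    { //This is O(1) on average, whereas Queue.Contains is O(n)
        int hash = this._comp == null ? item.GetHashCode() : this._comp.GetHashCode(item);
        Queue<T> bucket;
        if (!this._hashes.TryGetValue(hash, out bucket))
            return false;

        //a matching hash code alone does not prove membership, so compare the items in the bucket
        IEqualityComparer<T> comp = this._comp ?? EqualityComparer<T>.Default;
        foreach (T candidate in bucket)
            if (comp.Equals(candidate, item))
                return true;
        return false;
    }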

    public new void Clear()
    {
        foreach (var item in this._hashes.Values)
            item.Clear(); //clear collision lists

        this._hashes.Clear(); //clear dictionary

        base.Clear(); //clear queue
    }
}

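My simple testing shows that my HashQueue.Contains() runs much faster than Queue.Contains(). Running the test code with count set to 10,000 takes 0.00045 seconds for the HashQueue version and 0.37 seconds for the Queue version. With a count of 100,000, the HashQueue version takes 0.0031 seconds whereas the Queue takes 36.38 seconds!

Here is my test code: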

static void Main(string[] args)
{
    int count = 10000;

    { //HashQueue
        var q = new HashQueue<int>(count);

        for (int i = 0; i < count; i++) //load queue (not timed)
            q.Enqueue(i);

        System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
        for (int i = 0; i < count; i++)
        {
            bool contains = q.Contains(i);
        }
        sw.Stop();
        Console.WriteLine(string.Format("HashQueue, {0}", sw.Elapsed));
    }

    { //Queue
        var q = new Queue<int>(count);

        for (int i = 0; i < count; i++) //load queue (not timed)
            q.Enqueue(i);

        System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
        for (int i = 0; i < count; i++)
        {
            bool contains = q.Contains(i);
        }
        sw.Stop();
        Console.WriteLine(string.Format("Queue,     {0}", sw.Elapsed));
    }

    Console.ReadLine();
}

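【Comments】:

I just added a third test case for HashSet, which seems to get even better results than your solution: HashQueue, 00:00:00.0004029; Queue, 00:00:00.3901439; HashSet, 00:00:00.0001716

【Answer 6】:

A SortedList will be faster to search (but slower to insert items).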
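
A minimal sketch of a sorted lookup, assuming the numbers can be kept in order. It shows SortedList<long, bool> with a placeholder value, and List<T>.BinarySearch on a sorted list as a closely related alternative; the names SortedLookupSketch and ContainsSorted are made up for this example.

```csharp
using System;
using System.Collections.Generic;

public static class SortedLookupSketch
{
    // Binary search over an already sorted list: O(log n) comparisons per lookup.
    // BinarySearch returns a non-negative index only when the value is found.
    public static bool ContainsSorted(List<long> sortedValues, long value)
    {
        return sortedValues.BinarySearch(value) >= 0;
    }

    public static void Main()
    {
        // SortedList keeps its keys ordered; the bool value is an unused placeholder.
        var sorted = new SortedList<long, bool>();
        foreach (long n in new long[] { 42, 7, 19 })
            sorted.Add(n, true);                       // an insert may have to shift existing entries

        Console.WriteLine(sorted.ContainsKey(19));     // prints True
        Console.WriteLine(sorted.ContainsKey(20));     // prints False

        var values = new List<long> { 42, 7, 19 };
        values.Sort();                                 // BinarySearch requires sorted input
        Console.WriteLine(ContainsSorted(values, 7));  // prints True
    }
}
```

Lookups cost O(log n) rather than O(n), which sits between a plain list scan and a hash-based lookup.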

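【Comments】:

【Answer 7】:

Why is a dictionary inappropriate?

To see whether a particular value is in the list, you need to walk the entire list. With a dictionary (or another hash-based container), it is much quicker to narrow down the number of objects you need to compare against. The key (in your case, the number) is hashed, and that gives the dictionary the small subset of objects to compare against.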
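
The narrowing step can be illustrated with a toy container. This is a teaching sketch only, not how Dictionary or HashSet is implemented internally; ToyHashSet, BucketIndex, and CandidateCount are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Toy hash container: each value lands in exactly one bucket, chosen by its hash code.
public sealed class ToyHashSet
{
    private readonly List<long>[] _buckets;

    public ToyHashSet(int bucketCount)
    {
        _buckets = new List<long>[bucketCount];
        for (int i = 0; i < bucketCount; i++)
            _buckets[i] = new List<long>();
    }

    // Masking the sign bit keeps the bucket index non-negative.
    private int BucketIndex(long value)
    {
        return (value.GetHashCode() & 0x7FFFFFFF) % _buckets.Length;
    }

    public bool Add(long value)
    {
        List<long> bucket = _buckets[BucketIndex(value)];
        if (bucket.Contains(value))
            return false;
        bucket.Add(value);
        return true;
    }

    // Only the items in one bucket are compared, never the whole collection.
    public bool Contains(long value)
    {
        return _buckets[BucketIndex(value)].Contains(value);
    }

    // Number of stored items a lookup for this value has to compare against.
    public int CandidateCount(long value)
    {
        return _buckets[BucketIndex(value)].Count;
    }
}

public static class ToyHashSetDemo
{
    public static void Main()
    {
        var set = new ToyHashSet(1024);               // assumed bucket count
        for (long i = 0; i < 100000; i++)
            set.Add(i * 7);

        Console.WriteLine(set.Contains(700));         // 700 = 100 * 7 was added
        Console.WriteLine(set.Contains(701));         // 701 is not a multiple of 7
        Console.WriteLine(set.CandidateCount(700));   // far fewer than the 100000 stored items
    }
}
```

A real hash container also grows its bucket array as items are added, so that buckets stay small on average.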

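【Comments】:

【Answer 8】:

I'm using this in the Compact Framework, where there is no support for HashSet. I have opted for a Dictionary<string, string> where both strings are the value I am looking for.

It means I get list functionality with dictionary performance. It's a bit hacky, but it works.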
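
One way to hide the hack is to wrap the dictionary in a small set-like class. This is a sketch under assumptions: StringSet is a made-up name, and a bool placeholder is used as the dictionary value instead of repeating the key.

```csharp
using System;
using System.Collections.Generic;

// Set-like wrapper for platforms without HashSet<T>; only the dictionary keys carry information.
public sealed class StringSet
{
    private readonly Dictionary<string, bool> _items = new Dictionary<string, bool>();

    // Returns false when the value was already present, mirroring HashSet<T>.Add.
    public bool Add(string value)
    {
        if (_items.ContainsKey(value))
            return false;
        _items.Add(value, true);   // the bool is a placeholder and is never read
        return true;
    }

    public bool Contains(string value)
    {
        return _items.ContainsKey(value);
    }

    public bool Remove(string value)
    {
        return _items.Remove(value);
    }

    public int Count
    {
        get { return _items.Count; }
    }
}

public static class StringSetDemo
{
    public static void Main()
    {
        var seen = new StringSet();
        Console.WriteLine(seen.Add("alpha"));        // prints True
        Console.WriteLine(seen.Add("alpha"));        // prints False (duplicate)
        Console.WriteLine(seen.Contains("alpha"));   // prints True
        Console.WriteLine(seen.Contains("beta"));    // prints False
        Console.WriteLine(seen.Count);               // prints 1
    }
}
```

Callers then see only Add, Contains, and Remove, so the backing store can be swapped for HashSet<string> on platforms that have it.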

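【Comments】:

If you're using a Dictionary in lieu of a HashSet, you might as well set the value to "" rather than the same string as the key. That way you will use less memory. Alternatively, you could even use Dictionary<string, bool> and set them all to true (or false). I don't know which would use less memory, an empty string or a bool. My guess would be bool.
In the dictionary, a string reference and a bool value make a difference of 3 or 7 bytes, for 32-bit or 64-bit systems respectively. Note, however, that the size of each entry is rounded up to a multiple of 4 or 8, respectively. The choice between string and bool might therefore make no difference in size at all. The empty string "" always exists in memory as the static field string.Empty, so it makes no difference whether you use it in the dictionary or not. (It is used elsewhere anyway.)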