Why is hashset.exceptwith twice as fast as iterating and checking !contains on the other collection? (Answers)

【Question title】: Why is hashset.exceptwith twice as fast as iterating and checking !contains on the other collection?
【Posted】: 2017-03-25 12:16:33
【Question description】:

I was just doing some optimization and was baffled by this.

My original code looked like this:

    HashSet<IExampleAble> alreadyProcessed; // a few million items

    void someSetRoutineSlower(HashSet<IExampleAble> exampleSet)
    {
        foreach (var item in exampleSet)
        {
            if (!alreadyProcessed.Contains(item))
            {
                // do Stuff
            }
        }
    }

This takes about 1.2 million ticks to process.

Then I tried the same thing with ExceptWith:

    void someSetRoutineFaster(HashSet<IExampleAble> exampleSet)
    {
        // Doesn't this have to check each item of its collection against
        // the other one, thus actually looping twice?
        exampleSet.ExceptWith(alreadyProcessed);
        foreach (var item in exampleSet)
        {
            // do Stuff
        }
    }

It runs in about 0.4 to 0.7 million ticks.

What optimizations are made in ExceptWith? Doesn't it also have to check every item, just as I do in the first code snippet?

【Comments】:

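@harold posted what looks like the correct answer, but he deleted it for some reason... ExceptWith() removes items from the set, so each removed element shortens the search for the next one. With .Contains(), the set never gets smaller, so the search time per element does not decrease.
@MatthewWatson But ExceptWith iterates the complete other set, which is orders of magnitude larger than the example set. My first idea was to avoid iterating "alreadyProcessed" and instead do a Contains check while iterating the example set once. Iterating the whole of alreadyProcessed is exactly what I was trying to avoid, yet that version is faster.
@MatthewWatson I just tried it, and it is just as fast as using ExceptWith. I still don't understand why.
Can you share your performance test?
Please post executable code that includes a simple benchmark. Is this an x64 Release build with no debugger attached? As stated, the result is impossible, because alreadyProcessed is much larger. So the benchmark must be wrong in some way.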

Tags: c# .net performance optimization

【Solution 1】:

According to the Reference Source, the HashSet ExceptWith method in .NET Framework 4.7.2 looks like this:

    public void ExceptWith(IEnumerable<T> other) {
        if (other == null) {
            throw new ArgumentNullException("other");
        }
        Contract.EndContractBlock();

        // this is already the enpty set; return
        if (m_count == 0) {
            return;
        }

        // special case if other is this; a set minus itself is the empty set
        if (other == this) {
            Clear();
            return;
        }

        // remove every element in other from this
        foreach (T element in other) {
            Remove(element);
        }
    }

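The only explicit optimizations in the method are for the special cases where the set is empty or is being "excepted" with itself.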
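As the listing shows, ExceptWith enumerates `other` and calls Remove on the receiving set, so the receiver is modified in place. The following is a minimal sketch of that behavior, not the question's actual code: it assumes `int` elements and made-up values in place of `IExampleAble`, and the class and method names are hypothetical. It shows that both routines select the same items, while only the ExceptWith variant changes `exampleSet`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExceptWithSemantics
{
    // Variant 1 from the question: probe the large set once per item.
    public static List<int> ViaContains(HashSet<int> exampleSet, HashSet<int> alreadyProcessed)
    {
        var result = new List<int>();
        foreach (var item in exampleSet)
        {
            if (!alreadyProcessed.Contains(item))
            {
                result.Add(item);
            }
        }
        return result;
    }

    // Variant 2 from the question: ExceptWith enumerates alreadyProcessed
    // and removes each of its elements from exampleSet, mutating it.
    public static List<int> ViaExceptWith(HashSet<int> exampleSet, HashSet<int> alreadyProcessed)
    {
        exampleSet.ExceptWith(alreadyProcessed);
        return exampleSet.ToList();
    }

    public static void Main()
    {
        // Illustrative values only.
        var alreadyProcessed = new HashSet<int> { 1, 2, 3, 4, 5 };
        var a = new HashSet<int> { 4, 5, 6, 7 };
        var b = new HashSet<int> { 4, 5, 6, 7 };

        var viaContains = ViaContains(a, alreadyProcessed);
        var viaExceptWith = ViaExceptWith(b, alreadyProcessed);

        Console.WriteLine(string.Join(",", viaContains.OrderBy(x => x)));   // 6,7
        Console.WriteLine(string.Join(",", viaExceptWith.OrderBy(x => x))); // 6,7
        Console.WriteLine(a.Count); // 4: the Contains loop leaves the set untouched
        Console.WriteLine(b.Count); // 2: ExceptWith removed 4 and 5
    }
}
```

If the caller still needs the original contents of `exampleSet` afterwards, the ExceptWith variant has to work on a copy.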

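The speedup you are seeing probably comes from the difference between calling Contains(T) and iterating over all elements, when the number of Contains(T) calls is comparable to the size of the set. On the face of it, the two versions should perform the same: the old implementation calls Contains(T) explicitly, and the new one performs the same kind of search inside Remove(T). The difference is that, as elements are removed, the set's internal structure becomes sparser. Statistically, that leaves fewer items (slots, in the source code's terminology) per bucket, so finding an element becomes quicker: if it is present, it is likely the first item in its bucket.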
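One way to check this explanation against a real workload is to time both variants on separate copies of the same data. The sketch below is only a starting point under stated assumptions: it uses `int` elements rather than `IExampleAble`, the set sizes and overlap are hypothetical, and all names are invented for illustration. As the comments on the question point out, the numbers only mean something in a Release build with no debugger attached, and they will differ between machines and runtimes.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class ExceptWithTiming
{
    // Builds a set holding the integers [start, start + count).
    public static HashSet<int> MakeSet(int start, int count)
    {
        var set = new HashSet<int>();
        for (int i = 0; i < count; i++)
        {
            set.Add(start + i);
        }
        return set;
    }

    // Times the Contains loop; 'kept' is the number of items that pass the filter.
    public static long TimeContains(HashSet<int> exampleSet, HashSet<int> alreadyProcessed, out int kept)
    {
        var sw = Stopwatch.StartNew();
        kept = 0;
        foreach (var item in exampleSet)
        {
            if (!alreadyProcessed.Contains(item)) kept++;
        }
        sw.Stop();
        return sw.ElapsedTicks;
    }

    // Times ExceptWith plus the follow-up loop; mutates exampleSet.
    public static long TimeExceptWith(HashSet<int> exampleSet, HashSet<int> alreadyProcessed, out int kept)
    {
        var sw = Stopwatch.StartNew();
        exampleSet.ExceptWith(alreadyProcessed);
        kept = 0;
        foreach (var item in exampleSet) kept++;
        sw.Stop();
        return sw.ElapsedTicks;
    }

    public static void Main()
    {
        // Hypothetical sizes; adjust them to match the real workload.
        var alreadyProcessed = MakeSet(0, 2_000_000);
        var first = MakeSet(1_900_000, 200_000);  // half of it overlaps alreadyProcessed
        var second = MakeSet(1_900_000, 200_000); // separate copy, since ExceptWith mutates

        long t1 = TimeContains(first, alreadyProcessed, out int kept1);
        long t2 = TimeExceptWith(second, alreadyProcessed, out int kept2);

        Console.WriteLine($"Contains loop: {t1} ticks, kept {kept1}");
        Console.WriteLine($"ExceptWith:    {t2} ticks, kept {kept2}");
    }
}
```

Because `int` hashes to its own value, this data set has almost no collisions. Reproducing the effect described above would presumably require the real element type and its GetHashCode implementation.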

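This all depends on the quality of the hash function for your objects. Ideally, each object would sit alone in its bucket, but most real hash functions, given millions of elements, will produce collisions (multiple elements in the same bucket).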
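To make the effect of hash quality visible, the sketch below counts how many times `Equals` is invoked during lookups, as a rough proxy for how far each search walks through colliding entries. Everything here is an assumption for illustration: `CountingComparer` and `HashQualityDemo` are hypothetical names, the elements are plain `int` values, and the "poor" hash simply maps all values onto 16 hash codes. Exact counts depend on the runtime's HashSet implementation, so none are stated.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer that delegates hashing to a supplied function
// and counts how often Equals is called.
public sealed class CountingComparer : IEqualityComparer<int>
{
    private readonly Func<int, int> _hash;
    public long EqualsCalls;

    public CountingComparer(Func<int, int> hash)
    {
        _hash = hash;
    }

    public bool Equals(int x, int y)
    {
        EqualsCalls++;
        return x == y;
    }

    public int GetHashCode(int obj)
    {
        return _hash(obj);
    }
}

public static class HashQualityDemo
{
    // Fills a set with 0..n-1, then looks every element up again and
    // returns the number of Equals calls made by the lookups alone.
    public static long CountEqualsCalls(Func<int, int> hash, int n)
    {
        var comparer = new CountingComparer(hash);
        var set = new HashSet<int>(comparer);
        for (int i = 0; i < n; i++)
        {
            set.Add(i);
        }

        comparer.EqualsCalls = 0; // ignore comparisons made while filling
        for (int i = 0; i < n; i++)
        {
            set.Contains(i);
        }
        return comparer.EqualsCalls;
    }

    public static void Main()
    {
        long good = CountEqualsCalls(x => x, 1000);      // every element has its own hash code
        long poor = CountEqualsCalls(x => x % 16, 1000); // only 16 distinct hash codes

        Console.WriteLine($"distinct hash codes: {good} Equals calls");
        Console.WriteLine($"16 hash codes:       {poor} Equals calls");
    }
}
```

The colliding hash should need noticeably more comparisons for the same lookups, which is the kind of per-bucket cost the answer attributes to a crowded set.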

【Comments】: