Generic List Contains() performance and alternatives: answers

【Question title】: Generic List Contains() performance and alternatives
【Posted】: 2013-12-10 16:44:48
【Question】:

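I need to store a large number of key/value pairs, where the key is not unique. Both key and value are strings. The item count is about 5 million.

My goal is to keep only unique pairs.

I tried using List<KeyValuePair<string, string>>, but Contains() is extremely slow. LINQ Any() seems a little faster, but it is still too slow.

Is there an alternative that searches a generic list faster? Or should I use a different kind of storage?

【Comments on the question】:

Consider using a database.

Tags: c# performance generics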

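【Solution 1】:

I would use a Dictionary<string, HashSet<string>> that maps each key to all of its values.

Here is a complete solution. First, write two extension methods: one to add a (key, value) pair to your Dictionary and another to get all (key, value) pairs. Note that I use arbitrary types for keys and values; you can substitute string without any problem. You could also write these methods somewhere else rather than as extensions, or use no methods at all and just put this code somewhere in your program.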

using System;
using System.Collections.Generic;
using System.Linq;

public static class Program
{
  public static void Add<TKey, TValue>(
    this Dictionary<TKey, HashSet<TValue>> data, TKey key, TValue value)
  {
    HashSet<TValue> values = null;
    if (!data.TryGetValue(key, out values)) {
      // first time using this key? create a new HashSet 
      values = new HashSet<TValue>();
      data.Add(key, values);
    }
    values.Add(value);
  }
  public static IEnumerable<KeyValuePair<TKey, TValue>> KeyValuePairs<TKey, TValue>(
    this Dictionary<TKey, HashSet<TValue>> data)
  {
    return data.SelectMany(k => k.Value,
                           (k, v) => new KeyValuePair<TKey, TValue>(k.Key, v));
  }
}

Now you can use it as follows:

public static void Main(string[] args)
{
  Dictionary<string, HashSet<string>> data = new Dictionary<string, HashSet<string>>();
  data.Add("k1", "v1.1");
  data.Add("k1", "v1.2");
  data.Add("k1", "v1.1"); // already in, so nothing happens here
  data.Add("k2", "v2.1");

  foreach (var kv in data.KeyValuePairs())
     Console.WriteLine(kv.Key + " : " + kv.Value);
}

This prints:

k1 : v1.1
k1 : v1.2
k2 : v2.1

If your keys mapped to a List<string> instead, you would have to handle duplicates yourself. HashSet<string> already does that for you.
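The same structure also answers membership questions without scanning all pairs. The following is a minimal sketch, not part of the original answer; `PairLookup`, `ContainsPair` and `CountValues` are hypothetical names chosen for illustration.

```csharp
using System.Collections.Generic;

public static class PairLookup
{
    // Hypothetical helper: true if the exact (key, value) pair is stored.
    public static bool ContainsPair(
        Dictionary<string, HashSet<string>> data, string key, string value)
    {
        HashSet<string> values;
        return data.TryGetValue(key, out values) && values.Contains(value);
    }

    // Hypothetical helper: number of distinct values stored under a key (0 if the key is absent).
    public static int CountValues(
        Dictionary<string, HashSet<string>> data, string key)
    {
        HashSet<string> values;
        return data.TryGetValue(key, out values) ? values.Count : 0;
    }
}
```

Both helpers do one dictionary lookup and, at most, one set lookup, so their cost does not grow with the total number of stored pairs in the way List.Contains() does.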

【Discussion】:

【Solution 2】:

I guess a Dictionary<string, List<string>> would do the trick.
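This answer gives no code. As a rough sketch of what it might look like, assuming a hypothetical `AddUnique` helper (the name and the `ListBackedStore` class are not from the answer), the duplicate check has to be done by hand:

```csharp
using System.Collections.Generic;

public static class ListBackedStore
{
    // Hypothetical helper: adds the pair only if it is not already present.
    // Returns true when the value was added, false when it was a duplicate.
    public static bool AddUnique(
        Dictionary<string, List<string>> data, string key, string value)
    {
        List<string> values;
        if (!data.TryGetValue(key, out values))
        {
            // First time this key is seen: create its list.
            values = new List<string>();
            data.Add(key, values);
        }
        if (values.Contains(value))   // linear scan of this key's values only
            return false;
        values.Add(value);
        return true;
    }
}
```

The scan is limited to the values of one key, so this is likely fine when each key has few values, but it degrades if a single key accumulates many.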

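【Discussion】:

【Solution 3】:

I would consider using an in-process NoSQL database such as RavenDB (in this case RavenDB Embedded). As they say on their website:

RavenDB can be used for applications that need to store millions of records and have fast query times.

Using it does not require much boilerplate (example from the RavenDB website):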

var myCompany = new Company
                {
                    Name = "Hibernating Rhinos",
                    Employees = {
                                   new Employee
                                   {
                                       Name = "Ayende Rahien"
                                   }
                                 },
                    Country = "Israel"
                };

// Store the company in our RavenDB server
using (var session = documentStore.OpenSession())
{
    session.Store(myCompany);
    session.SaveChanges();
}

// Create a new session, retrieve an entity, and change it a bit
using (var session = documentStore.OpenSession())
{
    Company entity = session.Query<Company>()
        .Where(x => x.Country == "Israel")
        .FirstOrDefault();

    // We can also load by ID: session.Load<Company>(companyId);
    entity.Name = "Another Company";
    session.SaveChanges(); // will send the change to the database
}

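【Discussion】:

【Solution 4】:

To create a unique list, you want to generate it with .Distinct() rather than with .Contains(). However, whatever class holds your strings must implement .GetHashCode() and .Equals() correctly to get good performance; otherwise you must pass in a custom comparer.

Here is how to do it with a custom comparer: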

    private static void Main(string[] args)
    {

        List<KeyValuePair<string, string>> giantList = Populate();
        var uniqueItems = giantList.Distinct(new MyStringEquater()).ToList();
    }

    class MyStringEquater : IEqualityComparer<KeyValuePair<string, string>>
    {
        // Choose the comparer based on whether comparisons should be case-sensitive or not
        private static StringComparer comparer = StringComparer.OrdinalIgnoreCase; 

        public bool Equals(KeyValuePair<string, string> x, KeyValuePair<string, string> y)
        {
            return comparer.Equals(x.Key, y.Key) && comparer.Equals(x.Value, y.Value);
        }

        public int GetHashCode(KeyValuePair<string, string> obj)
        {
            unchecked
            {
                int x = 27;
                x = x*11 + comparer.GetHashCode(obj.Key);
                x = x*11 + comparer.GetHashCode(obj.Value);
                return x;
            }
        }
    }

Also, per your comment in the other answer, you could use the comparer above in a HashSet and let it store your unique items that way. You just pass the comparer into the constructor.

var hashSetWithComparer = new HashSet<KeyValuePair<string, string>>(new MyStringEquater());
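To show how such a set behaves while it is being filled, here is a small sketch. It repeats an equivalent comparer under the hypothetical name `IgnoreCasePairComparer` only so that the snippet compiles on its own; `UniquePairs.CountDistinct` is likewise an illustrative name, not something from the answer.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for a case-insensitive pair comparer, repeated here so the sketch is self-contained.
public sealed class IgnoreCasePairComparer : IEqualityComparer<KeyValuePair<string, string>>
{
    private static readonly StringComparer Cmp = StringComparer.OrdinalIgnoreCase;

    public bool Equals(KeyValuePair<string, string> x, KeyValuePair<string, string> y)
    {
        return Cmp.Equals(x.Key, y.Key) && Cmp.Equals(x.Value, y.Value);
    }

    public int GetHashCode(KeyValuePair<string, string> obj)
    {
        unchecked
        {
            // Assumes Key and Value are non-null; StringComparer.GetHashCode throws on null.
            return (27 * 11 + Cmp.GetHashCode(obj.Key)) * 11 + Cmp.GetHashCode(obj.Value);
        }
    }
}

public static class UniquePairs
{
    // Hypothetical helper: counts how many of the given pairs are distinct under the comparer.
    public static int CountDistinct(IEnumerable<KeyValuePair<string, string>> pairs)
    {
        var set = new HashSet<KeyValuePair<string, string>>(new IgnoreCasePairComparer());
        foreach (var pair in pairs)
            set.Add(pair);   // Add returns false and leaves the set unchanged for duplicates
        return set.Count;
    }
}
```

Because HashSet<T>.Add rejects duplicates itself, no separate Contains() call is needed before adding.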

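【Discussion】:

KeyValuePair<string, string> equality should produce the result he expects without modification. The reason (I think) is that KeyValuePair<TKey, TValue> is a struct and the string values are interned, so the structs are compared for equality based on their string contents. Correct me if I'm wrong.
@ken You may be right, but there is a lot of guesswork in whether that is true (it probably works). You would also have to rely on reference equality, whereas the solution I provided lets you test for other kinds of equality (for example the OrdinalIgnoreCase I used in my example).

【Solution 5】:

If you use HashSet<KeyValuePair<string, string>>, you will most likely see an improvement.

The test below takes about 10 seconds to complete on my machine. If I change...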

var collection = new HashSet<KeyValuePair<string, string>>();

...to...

var collection = new List<KeyValuePair<string, string>>();

...I get tired of waiting for it to finish (more than a few minutes).

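The advantage of using KeyValuePair<string, string> is that equality is determined by the values of Key and Value. KeyValuePair<TKey, TValue> is a struct and string equality compares contents, so pairs with the same Key and Value are treated as equal by the runtime.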

You can see the equality with this test:

    var hs = new HashSet<KeyValuePair<string, string>>();
    hs.Add(new KeyValuePair<string, string>("key", "value"));
    var b = hs.Contains(new KeyValuePair<string, string>("key", "value"));
    Console.WriteLine(b);

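Note that this equality does not depend on string interning: the default struct comparison calls Equals on each field, and string.Equals compares contents. Strings that come from a file or are built at run time, like the concatenated keys in the benchmark below, therefore compare equal as well.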
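One way to check the interning question directly is to look up a pair whose strings are built at run time and are therefore different objects from the literals. This is only a sketch; `InterningCheck` and its method names are made up for the example.

```csharp
using System.Collections.Generic;

public static class InterningCheck
{
    // Builds a fresh string instance that is not the interned literal "key".
    private static string FreshKey()
    {
        return new string(new[] { 'k', 'e', 'y' });
    }

    // True if the run-time string is the very same object as the literal.
    public static bool IsSameInstanceAsLiteral()
    {
        return object.ReferenceEquals(FreshKey(), "key");
    }

    // True if a pair made from run-time strings is found in a set filled from literals.
    public static bool FoundWithRuntimeStrings()
    {
        var set = new HashSet<KeyValuePair<string, string>>();
        set.Add(new KeyValuePair<string, string>("key", "value"));

        string value = new string(new[] { 'v', 'a', 'l', 'u', 'e' });
        return set.Contains(new KeyValuePair<string, string>(FreshKey(), value));
    }
}
```

In this sketch the strings are distinct instances, yet the lookup still succeeds, because both the hash code and the equality check of string are based on its contents.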

using System;
using System.Collections.Generic;
using System.Diagnostics;

namespace ConsoleApplication1 {

    internal class Program {

        static void Main(string[] args) {

            var key = default(string);
            var value = default(string);

            var collection = new HashSet<KeyValuePair<string, string>>();

            for (var i = 0; i < 5000000; i++) {

                if (key == null || i % 2 == 0) {
                    key = "k" + i;
                }
                value = "v" + i;

                collection.Add(new KeyValuePair<string, string>(key, value));
            }

            var found = 0;

            var sw = new Stopwatch();
            sw.Start();
            for (var i = 0; i < 5000000; i++) {

                if (collection.Contains(new KeyValuePair<string, string>("k" + i, "v" + i))) {
                    found++;
                }
            }
            sw.Stop();

            Console.WriteLine("Found " + found);
            Console.WriteLine(sw.Elapsed);
            Console.ReadLine();
        }
    }
}

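【Discussion】:

With this structure it is easy to see whether a given key/value pair exists, but seeing whether a given key exists, or what all the values for a given key are, would be just as slow as with a list.
@Servy - The question was about the Contains method, so that is what I tested. The question really isn't very clear about what kind of performance is needed.

【Solution 6】:

Have you tried a HashSet? It is much faster than a list when a lot of data is involved, though I don't know whether it would still be too slow.

This answer is very informative: HashSet vs. List performance

【Discussion】:

Thanks, I'll try HashSet. I know HashSet works well for simple types, but what about KeyValuePair?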