Best way to search list with another list?

【Question title】: Best way to search list with another list?
【Posted】: 2021-01-15 05:29:47
【Question】:

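I have a large HashSet called resultList (about a million records).

I need to find matches against a Dictionary called sampleList that contains about 10,000 records. There will not necessarily be a match.

On a 12-thread CPU this takes about 40-50 seconds. I keep loading new data into sampleList and comparing it against resultList.

My question is: can this be done faster or more elegantly?

Here is my code: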

HashSet<string> resultList = new HashSet<string>()
{
    "0000000000000000000000000000000000000000",
    "0000000000000000000000000000000000000001",
    "0000000000000000000000000000000000000002",
    "0000000000000000000000000000000000000003",
    "0000000000000000000000000000000000000004",
    "0000000000000000000000000000000000000005"
    //... this list is about million records
};

Dictionary<string, string> sampleList = new Dictionary<string, string>()
{
   { "0000000003000000300000000000000000000005", "This is a value"  },
   { "0000000000100000000000002000000800000001", "This is a value 1"  },
   { "0000000000000000000000000000000000000004", "This is what I'm trying to match" },
   { "0000000200000000100000000000000000000000", "This is a value 2" },
   { "0000005000000000000000000050000000000004", "This is a value 3" },
   { "0000000080000000000200000000000000000004", "This is a value 4" },
   { "0000000000200000000000000000800000000004", "This is a value 5" }
   //... this list is about 10.000 records
};

//first try to find any match - found that Any is faster than Where and the chance to find a match is little, so...
if (resultList.AsParallel().WithDegreeOfParallelism(MaxDegreeOfParallelism).Any(x => sampleList.Any(y => x == y.Key)))
{
    //then if there is a match, fetch it.
    foreach (var found in resultList.AsParallel().WithDegreeOfParallelism(MaxDegreeOfParallelism).Where(x => sampleList.Any(y => x == y.Key)))
    {
        //do something with the found matches
    }
}

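【Comments】:

Have you tried Intersect?
What is the format of the keys?
I would iterate over the dictionary, since it has fewer entries, and look up each of its keys in the hash set instead.
Are they really string values?
I'm not sure I understand the question. In any case, Jonathan gave you a good answer; why don't you try it?

Tags: c# list dictionary search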

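【Solution 1】:

At the moment you are using nested calls to Any, which has O(n²) complexity in the worst case: every element of resultList may be compared against every key in sampleList.

You should take advantage of HashSet.Contains, which has O(1) complexity on average: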

var matches = sampleList
    .Where(kvp => resultList.Contains(kvp.Key))
    .Select(kvp => kvp.Value);

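This now has O(n) complexity, where n is the number of entries in sampleList.

As for AsParallel(), it will most likely hurt performance here, because the work done per partition is computationally very cheap.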
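
For reference, here is a minimal, self-contained sketch of the lookup above, run sequentially on a handful of the sample keys from the question. The class name MatchFinder and the method name FindMatches are made up for illustration; only HashSet.Contains and the LINQ calls come from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; the names are not from the original post.
public static class MatchFinder
{
    // Returns the value of every dictionary entry whose key is also in the set.
    // Each entry costs one hash lookup, so the large set is never scanned.
    public static List<string> FindMatches(
        HashSet<string> resultList,
        Dictionary<string, string> sampleList)
    {
        return sampleList
            .Where(kvp => resultList.Contains(kvp.Key))
            .Select(kvp => kvp.Value)
            .ToList();
    }

    public static void Main()
    {
        var resultList = new HashSet<string>
        {
            "0000000000000000000000000000000000000003",
            "0000000000000000000000000000000000000004"
        };

        var sampleList = new Dictionary<string, string>
        {
            { "0000000003000000300000000000000000000005", "This is a value" },
            { "0000000000000000000000000000000000000004", "This is what I'm trying to match" }
        };

        // No AsParallel(): a single pass over the small dictionary is cheap.
        foreach (var value in FindMatches(resultList, sampleList))
        {
            Console.WriteLine(value);
        }
    }
}
```

With the two-entry sample above, only the entry whose key ends in ...0004 is in both collections, so only its value is printed. Whether parallelism helps on the real data is something to measure rather than assume.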

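【Comments】:

I can confirm this is the correct answer. It does the same job in 3-4 seconds, where it used to take 40-50 seconds. Thank you, good sir. P.S. I tested this without AsParallel().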

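【Solution 2】:

First, you can replace sampleList.Any(y => x == y.Key) with sampleList.ContainsKey(x).

Second, and this matters less, you are not really getting any benefit from using a HashSet, because you are essentially just looping over its contents.

Depending on your data, you could look at other structures that would speed up the lookup.

You could also try reversing the search: take the keys from the dictionary and look them up in the HashSet.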
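
The two directions suggested above can be sketched side by side. This is only an illustration under assumed names (LookupDirections, MatchByContainsKey and MatchByReverseLookup are hypothetical); both variants use one hash lookup per element and differ only in which collection is iterated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; the names are not from the original post.
public static class LookupDirections
{
    // Variant A: still iterate the large set, but replace the inner Any(...)
    // with the dictionary's own hash lookup.
    public static List<string> MatchByContainsKey(
        HashSet<string> resultList,
        Dictionary<string, string> sampleList)
    {
        return resultList.Where(x => sampleList.ContainsKey(x)).ToList();
    }

    // Variant B: reverse the search. Iterate the small dictionary's keys
    // and probe the large set.
    public static List<string> MatchByReverseLookup(
        HashSet<string> resultList,
        Dictionary<string, string> sampleList)
    {
        return sampleList.Keys.Where(k => resultList.Contains(k)).ToList();
    }

    public static void Main()
    {
        var resultList = new HashSet<string> { "k1", "k2", "k3", "k4" };
        var sampleList = new Dictionary<string, string>
        {
            { "k3", "value 3" },
            { "k9", "value 9" }
        };

        Console.WriteLine(string.Join(",", MatchByContainsKey(resultList, sampleList)));
        Console.WriteLine(string.Join(",", MatchByReverseLookup(resultList, sampleList)));
    }
}
```

Both variants find the same keys. With the sizes in the question, variant A performs about a million lookups and variant B about 10,000, which is why iterating the smaller collection is usually the better default.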

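【Comments】:

The HashSet is there to remove duplicates, because I am using multiple threads.
Removing duplicates is fine; I'm not sure what it has to do with multithreading here.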