【Question title】: Finding largest sequence of bytes in two byte arrays
【Posted】: 2016-08-22 19:51:10
【Question】:

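Example:

{ 54, 87, 23, 87, 45, 67, 7, 85, 65, 65, 3, 4, 55, 76, 65, 64, 5, 6, 4, 54, 45, 6, 4 };

{ 76, 57, 65, 3, 4, 55, 76, 65, 64, 5, 6, 4, 54, 45, 8, 65, 66, 57, 6, 7, 7, 56, 6, 7, 44, 57, 8, 76, 54, 67 };

Basically, I have two byte[] arrays and need to find the longest sequence of identical bytes that appears in both.

I have tried the obvious thing and written some code to brute-force the result: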

var bestIndex = 0;
var bestCount = 0;
for (var i1 = 0; i1 + bestCount < data1.Length; i1++)
{
    var currentCount = 0;
    for (var i2 = 0; i2 < data2.Length; i2++)
    {
        if (data1[i1 + currentCount] == data2[i2])
        {
            currentCount++;
            if (i1 + currentCount == data1.Length)
            {
                bestCount = currentCount;
                bestIndex = i1;
                break;
            }
        }
        else
        {
            if (currentCount > bestCount)
            {
                bestCount = currentCount;
                bestIndex = i1;
            }
            currentCount = 0;
        }
    }
    if (currentCount > bestCount)
    {
        bestCount = currentCount;
        bestIndex = i1;
    }
}

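However, in my application the byte arrays will be much larger, possibly gigabytes in size. So basically I need a hint or some code on how to do this more efficiently.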

【Comments】:

Tags: c# arrays lcs


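【Solution 1】:

I have a few thoughts on this. I'm not sure whether it helps or hurts, but have you considered working backwards from the largest possible candidates first, so that you can terminate as soon as you find a match?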

        byte[] b1 = { 54, 87, 23, 87, 45, 67, 7, 85, 65, 65, 3, 4, 55, 76, 65, 64, 5, 6, 4, 54, 45, 6, 4 };
        byte[] b2 = { 76, 57, 65, 3, 4, 55, 76, 65, 64, 5, 6, 4, 54, 45, 8, 65, 66, 57, 6, 7, 7, 56, 6, 7, 44, 57, 8, 76, 54, 67 };

        //figure out which one is smaller, since that one will limit the range options
        byte[] smaller;
        byte[] bigger;

        if (b1.Count() > b2.Count())
        {
            bigger = b1;
            smaller = b2;
        }
        else
        {
            bigger = b2;
            smaller = b1;
        }


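        // doesn't matter what order we put these in, since they will be ordered later by length
        List<Tuple<int, int>> ranges = new List<Tuple<int, int>>();
        object rangesLock = new object(); // List<T>.Add is not thread-safe
        Parallel.For(0, smaller.Count(), (i1) => {
            // Item2 is an exclusive end index, so it runs up to smaller.Count() inclusive
            for (int i2 = i1 + 1; i2 <= smaller.Count(); i2++)
            {
                lock (rangesLock)
                {
                    ranges.Add(new Tuple<int, int>(i1, i2));
                }
            }
        });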

        // order by length of slice produced by range in descending order
        // this way, once we get an answer, we know nothing else can be longer
        ranges = ranges.OrderByDescending(x => x.Item2 - x.Item1).ToList();

        Tuple<int, int> largestMatchingRange = new Tuple<int, int>(0, 0);

        foreach (Tuple<int, int> range in ranges)
        {
            bool match = true; // set in outer loop to allow for break

            for (int i1 = 0; i1 < bigger.Count(); i1++)
            {
                if (bigger.Count() < i1 + (range.Item2 - range.Item1))
                {
                    //short cut if the available slice from the bigger array is shorter than the range length
                    //(a slice that ends exactly at the end of the bigger array still fits)
                    match = false;
                    break;
                }

                match = true; // reset to true to allow for new attempt for each larger array slice

                for (int i2 = range.Item1, i1Temp = i1; i2 < range.Item2; i2++, i1Temp++)
                {
                    if (bigger[i1Temp] != smaller[i2])
                    {
                        match = false;
                        break;
                    }
                }
                if (match)
                {
                    largestMatchingRange = range;
                    break;
                }
            }
            if (match)
            {
                break;
            }
        }

        byte[] largestMatchingBytes = smaller.Skip(largestMatchingRange.Item1).Take(largestMatchingRange.Item2 - largestMatchingRange.Item1).ToArray();

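【Comments】:

  • P.S. This might be slower if there is no match, and it requires more up-front processing, so you may not see a gain in every case. P.P.S. If you work with larger arrays, you may need to swap int for long.
  • Great idea! I think this will help in my case, since I will always be doing this with one fairly small array and one very large one.
  • Also, does anyone know how to get the ranges in length order without generating all of them and then sorting? I couldn't figure out the pattern for a nested loop that would cut out the up-front range generation and sorting.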
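On the last comment, one way to produce the ranges longest-first without building and sorting a list is an outer loop over the length and an inner loop over the start position. This is only a minimal sketch; the names `RangeOrder` and `ByDescendingLength` are made up for illustration, and the second tuple item is assumed to be an exclusive end index, as in the answer above.

```csharp
using System;
using System.Collections.Generic;

static class RangeOrder
{
    // Yields (start, endExclusive) pairs for an array of length n,
    // longest slices first, so nothing has to be stored or sorted.
    public static IEnumerable<Tuple<int, int>> ByDescendingLength(int n)
    {
        for (int length = n; length > 0; length--)
        {
            // a slice of this length can start at 0 .. n - length
            for (int start = 0; start + length <= n; start++)
            {
                yield return Tuple.Create(start, start + length);
            }
        }
    }
}
```

For n = 3 this yields (0,3), (0,2), (1,3), (0,1), (1,2), (2,3). Because it is a lazy iterator, a `foreach` over it can `break` on the first match without ever materializing the shorter ranges.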
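【Solution 2】:

Instead of checking the bytes one by one, you could store the index positions of each byte value in a dictionary of lists. In your case, an array of 256 lists is probably better.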

List<int>[] index(byte[] a) {        // List<long> if the array can be more than 2GB
    var lists = new List<int>[256];
    for(int i = 0; i < a.Length; i++) {
        var b = a[i];
        if (lists[b] == null) lists[b] = new List<int>();
        lists[b].Add(i);
    }
    return lists;
}

Then you can loop over the 256 possible byte values:

byte[] data1 = { 54, 87, 23, 87, 45, 67, 7, 85, 65, 65, 3, 4, 55, 76, 65, 64, 5, 6, 4, 54, 45, 6, 4 };
byte[] data2 = { 76, 57, 65, 3, 4, 55, 76, 65, 64, 5, 6, 4, 54, 45, 8, 65, 66, 57, 6, 7, 7, 56, 6, 7, 44, 57, 8, 76, 54, 67 };

var indexes1 = index(data1);
var indexes2 = index(data2);

var bestIndex = 0;
var bestCount = 0;

for (var b = 0; b < 256; b++)
{
    var list1 = indexes1[b]; if (list1 == null) continue;
    var list2 = indexes2[b]; if (list2 == null) continue;

    foreach(var index1 in list1)
    {
        foreach (var index2 in list2)
        {
            // data1[index1] == data2[index2] here, so extend the match forward from this pair
            var currentCount = 0;
            while (index1 + currentCount < data1.Length
                && index2 + currentCount < data2.Length
                && data1[index1 + currentCount] == data2[index2 + currentCount])
            {
                currentCount++;
            }
            if (currentCount > bestCount)
            {
                bestCount = currentCount;
                bestIndex = index1;
            }
        }
    }
}

var best = data1.Skip(bestIndex).Take(bestCount);
Debug.Print(bestIndex + ", " + bestCount + ": " + string.Join(", ", best));

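In theory this feels like it needs fewer comparisons for larger arrays, but in practice it will incur more memory cache misses, so I'm not sure it would be faster than the more linear, parallel version in the other answer. I haven't thought about it much, but hopefully it gives you some ideas in case I got it wrong.

Update

I just realized how bad this idea is for an ordinary machine with less than 32 GB of memory, since the index lists will take more than 4 times the memory of the byte array.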
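The "more than 4 times" figure follows from storing one 4-byte int index per input byte (8 bytes if long is needed for arrays over 2 GB). A rough back-of-the-envelope sketch is below; the helper name `MinimumIndexBytes` is hypothetical, and it deliberately ignores the spare capacity and per-object overhead of `List<T>`, which only push the real number higher.

```csharp
using System;

static class IndexMemoryEstimate
{
    // Lower bound on the payload of the index lists: one entry per input byte.
    // Spare List<T> capacity and object headers are not counted.
    public static long MinimumIndexBytes(long inputLength, bool useLongIndexes)
    {
        long bytesPerEntry = useLongIndexes ? sizeof(long) : sizeof(int);
        return inputLength * bytesPerEntry;
    }
}
```

For a 1 GiB input, `MinimumIndexBytes(1L << 30, false)` gives 4 GiB, and 8 GiB with long indexes, on top of the input array itself.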

【Comments】:

    【Solution 3】:

    I figured out the loop; this one should be faster.

    byte[] data1 = { 54, 87, 23, 87, 45, 67, 7, 85, 65, 65, 3, 4, 55, 76, 65, 64, 5, 6, 4, 54, 45, 6, 4 };
    byte[] data2 = { 76, 57, 65, 3, 4, 55, 76, 65, 64, 5, 6, 4, 54, 45, 8, 65, 66, 57, 6, 7, 7, 56, 6, 7, 44, 57, 8, 76, 54, 67 };
    
    
    //figure out which one is smaller, since that one will limit the range options
    byte[] smaller;
    byte[] bigger;
    
    if (data1.Count() > data2.Count())
    {
        bigger = data1;
        smaller = data2;
    }
    else
    {
        bigger = data2;
        smaller = data1;
    }
    
    Tuple<int, int> largestMatchingRange = new Tuple<int, int>(0, 0);
    
    //iterate over slices in reverse length order, starting with the whole smaller array
    for (int length = smaller.Count(); length > 0; length--)
    {
        int numberOfSlicesForLength = smaller.Count() - length + 1;
    
        bool match = true; // set in outer loop to allow for break
    
        for (int start = 0; start < numberOfSlicesForLength; start++)
        {
            //within a collection of similarly sized slices, we start with the slice found first within the array
            Tuple<int, int> range = new Tuple<int, int>(start, start + length);
    
            for (int i1 = 0; i1 < bigger.Count(); i1++)
            {
                if (bigger.Count() < i1 + (range.Item2 - range.Item1))
                {
                    //short cut if the available slice from the bigger array is shorter than the range length
                    //(a slice that ends exactly at the end of the bigger array still fits)
                    match = false;
                    break;
                }
    
                match = true; // reset to true to allow for new attempt for each larger array slice
    
                for (int i2 = range.Item1, i1Temp = i1; i2 < range.Item2; i2++, i1Temp++)
                {
                    if (bigger[i1Temp] != smaller[i2])
                    {
                        match = false;
                        break;
                    }
                }
                if (match)
                {
                    largestMatchingRange = range;
                    break;
                }
            }
            if (match)
            {
                break;
            }
        }
    
        if (match)
        {
            break;
        }
    }
    
    byte[] largestMatchingBytes = smaller.Skip(largestMatchingRange.Item1).Take(largestMatchingRange.Item2 - largestMatchingRange.Item1).ToArray();
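For comparison, here is a sketch of a different, standard technique (the dynamic-programming formulation of longest common substring), not the loop structure used in this answer. It keeps two rows of counters sized to the second array, so with one small and one very large array the extra memory stays proportional to the small one if it is passed as `b`. The class and method names are made up for illustration, and int indexes are assumed to be sufficient.

```csharp
using System;

static class CommonRun
{
    // Returns (startInA, length) of the longest run of bytes that occurs in both arrays.
    // cur[j + 1] holds the length of the common run that ends at a[i] and b[j].
    public static Tuple<int, int> Longest(byte[] a, byte[] b)
    {
        int bestLength = 0, bestEndInA = 0;
        int[] prev = new int[b.Length + 1];
        int[] cur = new int[b.Length + 1];

        for (int i = 0; i < a.Length; i++)
        {
            for (int j = 0; j < b.Length; j++)
            {
                // extend the run that ended at a[i - 1], b[j - 1], or reset on a mismatch
                cur[j + 1] = a[i] == b[j] ? prev[j] + 1 : 0;
                if (cur[j + 1] > bestLength)
                {
                    bestLength = cur[j + 1];
                    bestEndInA = i + 1; // exclusive end in a
                }
            }
            // swap rows; index 0 stays 0 in both, the rest is overwritten next pass
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return Tuple.Create(bestEndInA - bestLength, bestLength);
    }
}
```

With the sample arrays from the question, `CommonRun.Longest(data1, data2)` returns (9, 12): the shared run starts at index 9 of data1 and is 12 bytes long. The running time is proportional to a.Length * b.Length regardless of how long the best match is, so whether it beats the early-exit search above depends on the data.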
    

    【Comments】:
