【Question title】: Implement recursive hashing algorithm
【Posted】: 2011-12-07 03:08:45
【Question】:

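Let's say file A has the bytes 2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12.

I have a simple hashing algorithm in which I store the sum of each group of three consecutive bytes, so: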

2   
5   
8   - = 8+5+2 = 15
0   
33  
90  - = 90+33+0 = 123
1   
3   
200 - = 204
201 
23  
12  - = 236

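So I can represent file A as 15, 123, 204, 236.

Let's say I copy that file to a new computer B and make some minor modifications, so that file B has the bytes 255, 2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12, 55, 255, 255.

Note that the difference is one extra byte at the beginning of the file and three extra bytes at the end, but the rest is very similar.

So I can run the same algorithm to determine whether some parts of the files are the same. Remember that file A is represented by the hash codes 15, 123, 204, 236; let's see whether file B gives me some of those hash codes!

On file B I have to do this for every 3 consecutive bytes: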

int[] sums; // array where we will hold the sum of the last bytes


255 sums[0]  =  255                  = 255
2   sums[1]  =  2   + sums[0]        = 257
5   sums[2]  =  5   + sums[1]        = 262
8   sums[3]  =  8   + sums[2]        = 270   hash = sums[3]  - sums[0]  = 15   --> MATCHES FILE A!
0   sums[4]  =  0   + sums[3]        = 270   hash = sums[4]  - sums[1]  = 13
33  sums[5]  =  33  + sums[4]        = 303   hash = sums[5]  - sums[2]  = 41
90  sums[6]  =  90  + sums[5]        = 393   hash = sums[6]  - sums[3]  = 123  --> MATCHES FILE A!
1   sums[7]  =  1   + sums[6]        = 394   hash = sums[7]  - sums[4]  = 124
3   sums[8]  =  3   + sums[7]        = 397   hash = sums[8]  - sums[5]  = 94
200 sums[9]  =  200 + sums[8]        = 597   hash = sums[9]  - sums[6]  = 204  --> MATCHES FILE A!
201 sums[10] =  201 + sums[9]        = 798   hash = sums[10] - sums[7]  = 404
23  sums[11] =  23  + sums[10]       = 821   hash = sums[11] - sums[8]  = 424
12  sums[12] =  12  + sums[11]       = 833   hash = sums[12] - sums[9]  = 236  --> MATCHES FILE A!
55  sums[13] =  55  + sums[12]       = 888   hash = sums[13] - sums[10] = 90
255 sums[14] =  255 + sums[13]       = 1143  hash = sums[14] - sums[11] = 322
255 sums[15] =  255 + sums[14]       = 1398  hash = sums[15] - sums[12] = 565

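So by looking at that table I know that file B contains the bytes of file A plus some additional ones, because the hash codes match.

The reason I show this algorithm is that it is of order n; in other words, I can compute the hash of the last 3 consecutive bytes without having to iterate over them again!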
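The prefix-sum trick from the table can be sketched in C# roughly as follows. This is only an illustration under some assumptions: the class and method names (`WindowSums`, `Blocks`, `Rolling`) are made up for this example, and `sums` is shifted by one slot compared with the table (here `sums[0]` is 0), so the first window of file B (255, 2, 5) also gets a hash.

```csharp
using System.Collections.Generic;

static class WindowSums
{
    // Sums of non-overlapping blocks of `window` bytes (how file A is summarised).
    public static List<int> Blocks(byte[] data, int window)
    {
        var result = new List<int>();
        for (int start = 0; start + window <= data.Length; start += window)
        {
            int total = 0;
            for (int i = start; i < start + window; i++)
                total += data[i];
            result.Add(total);
        }
        return result;
    }

    // Sum of every run of `window` consecutive bytes, one entry per window start.
    // Each byte is touched once, so the whole pass is O(n).
    public static List<int> Rolling(byte[] data, int window)
    {
        var result = new List<int>();
        var sums = new int[data.Length + 1]; // sums[i] = total of data[0..i-1]
        for (int i = 0; i < data.Length; i++)
        {
            sums[i + 1] = sums[i] + data[i];
            if (i + 1 >= window)
                result.Add(sums[i + 1] - sums[i + 1 - window]);
        }
        return result;
    }
}
```

For the example bytes, `WindowSums.Blocks(fileA, 3)` gives 15, 123, 204, 236, and `WindowSums.Rolling(fileB, 3)` gives 262, 15, 13, 41, 123, and so on, so every hash of file A shows up among the rolling hashes of file B.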

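If I had a more complex algorithm, such as taking the md5 of the last 3 bytes, it would no longer be of order n: as I iterate through file B, I would need an inner for loop that recomputes the hash of the last three bytes at every position, which costs n times the window size.

So my question is:

How can I improve the algorithm while keeping it of order n, that is, computing the hash just once? If I use an existing hashing algorithm such as md5, I will have to place an inner loop inside the algorithm, which will significantly increase its order.

Note that the same thing can be done with multiplication instead of addition, but the counter grows very fast. Maybe I can combine multiplication with addition and subtraction...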

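Edit

If I google for:

recursive hashing functions in-grams

a lot of information comes up, and I find those algorithms very difficult to understand...

I have to implement this algorithm for a project, which is why I am reinventing the wheel... I know there are a lot of algorithms out there.

Another alternative I thought of is to run the same algorithm plus a second, stronger one. So on file A I would compute the sum of every 3 bytes plus the md5 of every 3 bytes. On the second file I would run the second algorithm only where the first one produces a match...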
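The two-level idea in the last paragraph might look something like the sketch below. It is a rough illustration, not a finished design: `TwoTierMatcher` and `FindMatches` are names invented here, the weak hash is the plain byte sum from the question, and MD5 (from `System.Security.Cryptography`) is used as the strong hash only because the question mentions it.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

static class TwoTierMatcher
{
    // Returns the offsets in fileB at which some block of fileA reappears.
    public static List<int> FindMatches(byte[] fileA, byte[] fileB, int blockSize)
    {
        var matches = new List<int>();
        using (var md5 = MD5.Create())
        {
            // Index the non-overlapping blocks of file A: weak sum -> strong hashes.
            var index = new Dictionary<int, HashSet<string>>();
            for (int start = 0; start + blockSize <= fileA.Length; start += blockSize)
            {
                int weak = 0;
                for (int i = start; i < start + blockSize; i++)
                    weak += fileA[i];
                if (!index.TryGetValue(weak, out var strongSet))
                    index[weak] = strongSet = new HashSet<string>();
                strongSet.Add(Convert.ToBase64String(md5.ComputeHash(fileA, start, blockSize)));
            }

            // Slide over file B, updating the weak sum in O(1) per byte.
            int rolling = 0;
            for (int i = 0; i < fileB.Length; i++)
            {
                rolling += fileB[i];
                if (i >= blockSize)
                    rolling -= fileB[i - blockSize];
                if (i + 1 < blockSize)
                    continue; // window not full yet

                int offset = i + 1 - blockSize;
                if (index.TryGetValue(rolling, out var candidates))
                {
                    // The expensive hash runs only when the cheap one already matches.
                    string strong = Convert.ToBase64String(md5.ComputeHash(fileB, offset, blockSize));
                    if (candidates.Contains(strong))
                        matches.Add(offset);
                }
            }
        }
        return matches;
    }
}
```

With the example files and a block size of 3, this reports matches at offsets 1, 4, 7 and 10 of file B. The strong hash is computed only at positions where the sum already agrees, so its cost depends on how often the weak hash collides.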

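【Question comments】:

Sounds like you are reinventing rsync. en.wikipedia.org/wiki/Rolling_hash may be of interest.
Yes, unfortunately I need to reinvent the wheel... I need to do this for my data structures class at school...
Why is this order n!? I see order n*m, where n is the number of lines in the file and m is the number of previous values you add together. If n == m then all your sums will be the same, which would be worthless, but it would still be n^2. You can make this order n if you use a deque: add a value to the current sum when you push a new value, and subtract a value from the current sum when you pop one.
I meant the ! as an exclamation mark, sorry. Thanks, I will fix it.
Do you have a constraint that the implementation must be recursive? A non-recursive implementation would be faster.

Tags: c# algorithm filecompare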

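【Solution 1】:

Edit:

The more I think about what you mean by "recursive", the more I doubt that the solution I presented earlier is what you should implement to do anything useful.

You probably want to implement a hash tree algorithm, which is a recursive operation.

To do this, you hash the list, divide the list in two, and recurse into those two sub-lists. Terminate either when your list is of size 1 or at a minimum desired hash size, since each level of recursion doubles the size of your total hash output.

Pseudocode: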

create-hash-tree(input list, minimum size: default = 1):
  initialize the output list
  hash-sublist(input list, output list, minimum size)
  return output list

hash-sublist(input list, output list, minimum size):
  add sum-based-hash(list) result to output list // easily swap hash styles here
  if size(input list) > minimum size:
    split the list into two halves
    hash-sublist(first half of list, output list, minimum size)
    hash-sublist(second half of list, output list, minimum size)

sum-based-hash(list):
  initialize the running total to 0

  for each item in the list:
    add the current item to the running total

  return the running total

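I believe the running time of the whole algorithm is O(hash(m)); m = n * (log(n) + 1), with hash(m) usually being linear time.

The storage space is something like O(hash * s); s = 2n - 1, with the hash usually being of constant size.

Note that for C#, I would make the output list a List<HashType>, but I would make the input list an IEnumerable<ItemType> to save on storage space, and use Linq to quickly "split" the list without allocating two new sub-lists.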
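As a rough C# sketch of the hash tree pseudocode above: the names `HashTree`, `Create`, `HashSublist` and `SumHash` are invented for this example, the sum-based hash is only a stand-in for whatever hash you actually use, and the "split" is done with a start offset and a count over an `IReadOnlyList<byte>` rather than with Linq, which is one possible way to avoid allocating sub-lists.

```csharp
using System;
using System.Collections.Generic;

static class HashTree
{
    // Stand-in hash: the sum of the items in the given range. Swap in another hash here.
    static long SumHash(IReadOnlyList<byte> items, int start, int count)
    {
        long total = 0;
        for (int i = start; i < start + count; i++)
            total += items[i];
        return total;
    }

    // Hashes the whole list, then each half, then each quarter, and so on (pre-order).
    public static List<long> Create(IReadOnlyList<byte> input, int minimumSize = 1)
    {
        if (minimumSize < 1)
            throw new ArgumentOutOfRangeException(nameof(minimumSize));
        var output = new List<long>();
        if (input.Count > 0)
            HashSublist(input, 0, input.Count, output, minimumSize);
        return output;
    }

    static void HashSublist(IReadOnlyList<byte> input, int start, int count,
                            List<long> output, int minimumSize)
    {
        output.Add(SumHash(input, start, count));
        if (count > minimumSize)
        {
            int half = count / 2; // at least 1 here, because count >= 2
            HashSublist(input, start, half, output, minimumSize);
            HashSublist(input, start + half, count - half, output, minimumSize);
        }
    }
}
```

For the input 1, 2, 3, 4 this produces 10, 3, 1, 2, 7, 3, 4, which is 2n - 1 = 7 hashes; with a minimum size of 2 it stops one level earlier and produces 10, 3, 7.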

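Original:

I think you can get this down to O(n + m) execution time, where n is the size of the list, m is the size of the running tally, and m < n (otherwise all the sums will be equal).

With a double-ended queue

Memory consumption will be the stack size plus temporary storage of size m.

To do this, use a double-ended queue and a running total. Push each newly encountered value onto the queue while adding it to the running total, and once the queue grows past size m, pop the front of the queue and subtract the popped value from the running total.

Here's some pseudocode: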

initialize the running total to 0

for each item in the list:
  add the current item to the running total
  push the current value onto the end of the deque
  if deque.length > m:
    pop off the front of the deque
    subtract the popped value from the running total
  assign the running total to the current sum slot in the list

reset the index to the beginning of the list

while the deque isn't empty:
  add the item in the list at the current index to the running total
  pop off the front of the deque
  subtract the popped value from the running total
  assign the running total to the current sum slot in the list
  increment the index

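This is not recursive, it is iterative.

A run of this algorithm looks like this (each sum below covers four values, which corresponds to m = 4 in the pseudocode above):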

value   sum slot   overwritten sum slot
2       2          92
5       7          74
8       15         70
0       15         15
33      46
90      131
1       124
3       127
200     294
201     405
23      427
12      436
55      291
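
The queue-based pseudocode could be written in C# along these lines. This is a sketch with invented names (`DequeWindow`, `CircularSums`); since the pseudocode only pushes at the back and pops at the front, a plain `Queue<int>` is assumed to be enough here.

```csharp
using System;
using System.Collections.Generic;

static class DequeWindow
{
    // sums[i] = total of the m values ending at index i, wrapping around
    // to the end of the list for the first few slots.
    public static int[] CircularSums(int[] values, int m)
    {
        if (m < 1 || m > values.Length)
            throw new ArgumentOutOfRangeException(nameof(m));

        var sums = new int[values.Length];
        var queue = new Queue<int>();
        int total = 0;

        // First pass: plain sliding window.
        for (int i = 0; i < values.Length; i++)
        {
            total += values[i];
            queue.Enqueue(values[i]);
            if (queue.Count > m)
                total -= queue.Dequeue();
            sums[i] = total;
        }

        // Second pass: overwrite the first slots, whose windows wrap around.
        int index = 0;
        while (queue.Count > 0)
        {
            total += values[index];
            total -= queue.Dequeue();
            sums[index] = total;
            index++;
        }
        return sums;
    }
}
```

Calling `DequeWindow.CircularSums` on the 13 values in the table above with a window of 4 gives 92, 74, 70, 15, 46, 131, 124, 127, 294, 405, 427, 436, 291, which are the final slot values shown there.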

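With indexes

You can do away with the queue and with reassigning any slots by taking the sum of the last m values up front and using an index offset instead of popping a deque, e.g. array[i - m].

This won't decrease your execution time, since you still need two loops, one to build up the running tally and another to populate all the values. But it will reduce your memory usage to stack space only (effectively O(1)).

Here's some pseudocode: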

initialize the running total to 0

for the last m items in the list:
  add those items to the running total

for each item in the list:
  add the current item to the running total
  subtract the value of the item m slots earlier from the running total
  assign the running total to the current sum slot in the list

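"m slots earlier" is the tricky part. You can either split this up into two loops:

one that indexes from the end of the list, minus m, plus i
one that indexes from i minus m

Or you can "wrap" the index with modulo arithmetic when i - m < 0. In C# the % operator keeps the sign of the left operand, so add n before taking the remainder to keep the index non-negative:

int valueToSubtract = array[(i - m + n) % n];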
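
Put together, the index-based variant might look like the sketch below. The names `IndexedWindow` and `CircularSums` are made up for this example, and the results are written to a separate `sums` array for clarity. Note that in C# an expression such as `(0 - 4) % 13` evaluates to -4, which is why the index is normalised before use.

```csharp
using System;

static class IndexedWindow
{
    // Same result as the queue-based version, but without a queue:
    // the value leaving the window is looked up by index instead.
    public static int[] CircularSums(int[] values, int m)
    {
        int n = values.Length;
        if (m < 1 || m > n)
            throw new ArgumentOutOfRangeException(nameof(m));

        // Prime the running total with the last m items of the list.
        int total = 0;
        for (int k = n - m; k < n; k++)
            total += values[k];

        var sums = new int[n];
        for (int i = 0; i < n; i++)
        {
            total += values[i];
            // ((i - m) % n + n) % n is always in the range 0..n-1.
            total -= values[((i - m) % n + n) % n];
            sums[i] = total;
        }
        return sums;
    }
}
```

For the values 2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12, 55 and a window of 4, the first slot becomes 291 + 2 - 201 = 92, and the remaining slots follow the same pattern.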

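【Discussion】:

Thanks a lot. If I understand correctly, I believe you are using the sum as the hash... When the sum is used as the hash, it will find matching hashes in file B even when the chunks are different. That is why I wanted to implement my own algorithm. I don't care if it is slow. I just need it to be of order less than n^2, like yours, and to have a very low chance of false matches... Thanks for the help!
@TonoNam: This algorithm only works if the pieces of your hashing algorithm are reversible and lossless. Bitwise XOR would work. Subtraction would work. Multiplication and division would both work, aside from division-by-zero issues. Also, for the diff part, check out the Longest Common Subsequence Problem.
@TonoNam: This isn't recursive :) If you need it to be recursive, or to support lossy hashes, let me know and I can extend my answer further. It would be O(n * m) though.
@TonoNam: The m assumes that your hash function runs in linear time. It would look very similar to the deque version, except that there would be no running tally and the whole queue would be fed to the hash algorithm without being destroyed. Oh, and you may want to ignore the comment about the longest common subsequence; I think hashing makes that pointless/duplicate work ;)
@TonoNam: The more I think about it, the less valuable this approach seems for the recursive hash tree problem, which is probably what your assignment is asking you to implement. Your "every three" solution makes more sense. You don't want it to run over the whole thing; you probably want to hash the whole thing, then the halves, then the quarters, and so on.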

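【Solution 2】:

http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm uses an updateable hash function, which it calls a http://en.wikipedia.org/wiki/Rolling_hash. This will be much easier to compute than MD5/SHA, and might not be inferior.

You can prove something about it: it is a polynomial of degree d in the chosen constant a. Suppose somebody provides two stretches of text and you choose a at random. What is the probability of a collision? Well, if the hash values are the same, subtracting them gives you a polynomial that has a as a root. Since a non-zero polynomial has at most d roots, and a was chosen at random, the probability is at most d/modulus, which will be very small for a large modulus.

Of course MD5/SHA are cryptographic hashes and this rolling hash is not, but see http://cr.yp.to/mac/poly1305-20050329.pdf for a secure variant.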
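
A polynomial rolling hash of the kind Rabin-Karp uses could be sketched as follows. The details are assumptions made for illustration: the class name `PolyRollingHash`, the prime modulus 1000000007, and the convention that the oldest byte in the window carries the highest power of a.

```csharp
using System.Collections.Generic;

static class PolyRollingHash
{
    const ulong Mod = 1000000007UL; // assumed prime modulus, chosen for this sketch

    // Hash of one window computed from scratch (Horner's rule), for comparison.
    public static ulong Direct(byte[] data, int start, int window, ulong a)
    {
        a %= Mod;
        ulong h = 0;
        for (int i = start; i < start + window; i++)
            h = (h * a + data[i]) % Mod;
        return h;
    }

    // Hash of every window of `window` bytes, updated in O(1) per byte.
    public static List<ulong> Rolling(byte[] data, int window, ulong a)
    {
        var result = new List<ulong>();
        if (window < 1 || data.Length < window)
            return result;

        a %= Mod;
        ulong high = 1; // a^(window-1) mod Mod, the weight of the oldest byte
        for (int i = 1; i < window; i++)
            high = high * a % Mod;

        ulong h = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (i >= window) // drop the byte that leaves the window
                h = (h + Mod - data[i - window] * high % Mod) % Mod;
            h = (h * a + data[i]) % Mod; // shift and append the new byte
            if (i + 1 >= window)
                result.Add(h);
        }
        return result;
    }
}
```

With a = 257 and a window of 3, the window 2, 5, 8 hashes to 2·257² + 5·257 + 8 = 133391 whether it is computed directly or reached by rolling over file B from the question. In practice a would be picked at random, as the argument above requires.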

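【Discussion】:

【Solution 3】:

This is what I have so far. I am only missing the steps that should not take much time, such as comparing the arrays of hashes and opening the files for reading.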

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace RecursiveHashing
{
    static class Utilities
    {

        // used for circular arrays. If my circular array is of size 5 and its
        // current position is 2, then shifting 3 units to the left should put me at
        // index 4 of the array.
        public static int Shift(this int number, int shift, int divisor)
        {
            var tempa = (number + shift) % divisor;
            if (tempa < 0)
                tempa = divisor + tempa;
            return tempa;
        }

    }
    class Program
    {
        const int CHUNCK_SIZE = 4; // split the files in chunks of 4 bytes

        /* 
         * formula that I will use to compute the hash
         * 
         *      formula =  sum(chunk) * (a[c]+1)*(a[c-1]+1)*(a[c-2]+1)*((-1)^a[c])
         *      
         *          where:
         *              sum(chunk)  = sum of the current chunk
         *              a[c]        = current byte
         *              a[c-1]      = previous byte
         *              a[c-2]      = byte before the previous one
         *              (-1)^a[c]   = either -1 or +1
         *              
         *      this formula is efficient because I can get the sum of the chunk ending at any index by keeping track of the overall sum
         *      thus this algorithm should be of order n
         */

        static void Main(string[] args)
        {
            Part1(); // Missing implementation to open file for reading
            Part2();
        }



        // first part: compute the hashes of the first file
        static void Part1()
        {
            // pretend the first file contains these bytes
            byte[] FileB = new byte[]{2,3,5,8,2,0,1,0,0,0,1,2,4,5,6,7,8,2,3,4,5,6,7,8,11,};

            // create an array in which to store the hashes
            // column 0 will use a fast hash algorithm; column 1 will use a more secure hashing algorithm
            Int64[,] hashes = new Int64[(FileB.Length / CHUNCK_SIZE) + 10, 2];


            // used to track on what index of the file we are at
            int counter = 0;
            byte[] current = new byte[CHUNCK_SIZE + 1]; // circular array needed to remember the last few bytes
            UInt64[] sum = new UInt64[CHUNCK_SIZE + 1]; // circular array needed to remember the last sums
            int index = 0; // position where in circular array

            int numberOfHashes = 0; // number of hashes created so far


            while (counter < FileB.Length)
            {
                int i = 0;
                for (; i < CHUNCK_SIZE; i++)
                {
                    if (counter == 0)
                    {
                        sum[index] = FileB[counter];
                    }
                    else
                    {
                        sum[index] = FileB[counter] + sum[index.Shift(-1, CHUNCK_SIZE + 1)];
                    }
                    current[index] = FileB[counter];
                    counter++;

                    if (counter % CHUNCK_SIZE == 0 || counter == FileB.Length)
                    {
                        // get the sum of the last chunk
                        var a = (sum[index] - sum[index.Shift(1, CHUNCK_SIZE + 1)]);
                        Int64 tempHash = (Int64)a;

                        // compute my hash function
                        tempHash = tempHash * ((Int64)current[index] + 1)
                                          * ((Int64)current[index.Shift(-1, CHUNCK_SIZE + 1)] + 1)
                                          * ((Int64)current[index.Shift(-2, CHUNCK_SIZE + 1)] + 1)
                                          * (Int64)(Math.Pow(-1, current[index]));


                        // store both hashes of this chunk in the same row
                        hashes[numberOfHashes, 0] = tempHash;
                        hashes[numberOfHashes, 1] = -1; // placeholder: later store a stronger hash here
                        numberOfHashes++;

                        // MISSING IMPLEMENTATION TO STORE A SECOND STRONGER HASH FUNCTION

                        if (counter == FileB.Length)
                            break;
                    }

                    index++;
                    index = index % (CHUNCK_SIZE + 1); // if index is out of bounds in circular array place it at position 0
                }
            }
        }


        static void Part2()
        {
            // simulate file read of a similar file
            byte[] FileB = new byte[]{1,3,5,8,2,0,1,0,0,0,1,2,4,5,6,7,8,2,3,4,5,6,7,8,11};            

            // place where we will place all matching hashes
            Int64[,] hashes = new Int64[(FileB.Length / CHUNCK_SIZE) + 10, 2];


            int counter = 0;
            byte[] current = new byte[CHUNCK_SIZE + 1]; // circular array
            UInt64[] sum = new UInt64[CHUNCK_SIZE + 1]; // circular array
            int index = 0; // position where in circular array



            while (counter < FileB.Length)
            {
                int i = 0;
                for (; i < CHUNCK_SIZE; i++)
                {
                    if (counter == 0)
                    {
                        sum[index] = FileB[counter];
                    }
                    else
                    {
                        sum[index] = FileB[counter] + sum[index.Shift(-1, CHUNCK_SIZE + 1)];
                    }
                    current[index] = FileB[counter];
                    counter++;

                    // here we compute the hash every time and we are missing implementation to 
                    // check if hash is contained by the other file
                    if (counter >= CHUNCK_SIZE)
                    {
                        var a = (sum[index] - sum[index.Shift(1, CHUNCK_SIZE + 1)]);

                        Int64 tempHash = (Int64)a;

                        tempHash = tempHash * ((Int64)current[index] + 1)
                                          * ((Int64)current[index.Shift(-1, CHUNCK_SIZE + 1)] + 1)
                                          * ((Int64)current[index.Shift(-2, CHUNCK_SIZE + 1)] + 1)
                                          * (Int64)(Math.Pow(-1, current[index]));

                        if (counter == FileB.Length)
                            break;
                    }

                    index++;
                    index = index % (CHUNCK_SIZE + 1);
                }
            }
        }
    }
}

The same file represented in a table, using the same algorithm:

first file: hashes (one per 4-byte chunk)
bytes  running sum  Ac  A[c-1]  A[c-2]  (-1)^Ac  hash = sum(chunk) * (Ac+1) * (A[c-1]+1) * (A[c-2]+1) * (-1)^Ac
2      2
3      5
5      10           5   3       2       -1
8      18           8   5       3       1        3888
2      20           2   8       5       1
0      20           0   2       8       1
1      21           1   0       2       -1
0      21           0   1       0       1        6
0      21           0   0       1       1
0      21           0   0       0       1
1      22           1   0       0       -1
2      24           2   1       0       1        18
4      28           4   2       1       1
5      33           5   4       2       -1
6      39           6   5       4       1
7      46           7   6       5       -1       -7392
8      54           8   7       6       1
2      56           2   8       7       1
3      59           3   2       8       -1
4      63           4   3       2       1        1020
5      68           5   4       3       -1
6      74           6   5       4       1
7      81           7   6       5       -1
8      89           8   7       6       1        13104
11     100          11  8       7       -1       -27648






file b: rolling hashes (one per position)
bytes  running sum  Ac  A[c-1]  A[c-2]  (-1)^Ac  hash = sum(chunk) * (Ac+1) * (A[c-1]+1) * (A[c-2]+1) * (-1)^Ac
1      1
3      4
5      9            5   3       1       -1
8      17           8   5       3       1        3672
2      19           2   8       5       1        2916
0      19           0   2       8       1        405
1      20           1   0       2       -1       -66
0      20           0   1       0       1        6
0      20           0   0       1       1        2
0      20           0   0       0       1        1
1      21           1   0       0       -1       -2
2      23           2   1       0       1        18
4      27           4   2       1       1        210
5      32           5   4       2       -1       -1080
6      38           6   5       4       1        3570
7      45           7   6       5       -1       -7392
8      53           8   7       6       1        13104
2      55           2   8       7       1        4968
3      58           3   2       8       -1       -2160
4      62           4   3       2       1        1020
5      67           5   4       3       -1       -1680
6      73           6   5       4       1        3780
7      80           7   6       5       -1       -7392
8      88           8   7       6       1        13104
11     99           11  8       7       -1       -27648
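
The rolling values in the table for file b can be reproduced with a much shorter function than the program above. The sketch below is an assumption-laden illustration, not the code from this answer: `ChunkFormula` and `Rolling` are invented names, and the window sum is kept as a single running value instead of a circular array of sums.

```csharp
using System;

static class ChunkFormula
{
    // hash = sum(chunk) * (a[c]+1) * (a[c-1]+1) * (a[c-2]+1) * (-1)^a[c],
    // evaluated for every window of chunkSize bytes ending at index c.
    public static long[] Rolling(byte[] data, int chunkSize)
    {
        if (chunkSize < 3)
            throw new ArgumentOutOfRangeException(nameof(chunkSize));

        var hashes = new long[Math.Max(0, data.Length - chunkSize + 1)];
        long windowSum = 0;
        for (int c = 0; c < data.Length; c++)
        {
            windowSum += data[c];
            if (c >= chunkSize)
                windowSum -= data[c - chunkSize]; // byte that left the window
            if (c + 1 < chunkSize)
                continue; // window not full yet

            long sign = data[c] % 2 == 0 ? 1 : -1; // (-1)^a[c]
            hashes[c + 1 - chunkSize] = windowSum
                * (data[c] + 1) * (data[c - 1] + 1) * (data[c - 2] + 1) * sign;
        }
        return hashes;
    }
}
```

For the 25 bytes of file b and a chunk size of 4 this yields 22 values starting 3672, 2916, 405, -66 and ending with -27648. The hashes of the first file are the entries that end on a chunk boundary, plus the final window.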

【Discussion】: