Generate Checksum for String: Answers

【Question Title】: Generate Checksum for String
【Posted】: 2016-03-13 05:56:49
【Question】:

I want to generate a checksum for a string/data.

1. The same data should always produce the same checksum.
2. Two different data strings should not produce the same checksum. A random collision rate of 0.1% is acceptable.
3. No encryption or decryption of the data is needed.
4. The checksum need not be very long, and it may contain letters and other characters.
5. It must be fast and efficient. Generating checksums for 100 MB of text data should take less than 5 minutes, and generating 1000 checksums for segments of less than 1 KB each should take less than 10 seconds.

Any algorithm or implementation references and suggestions are much appreciated.

【Comments】:

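Use hashing... if you can't encrypt/decrypt.
@vish4071, rather than just "use hashing", it would be better to suggest some specific hashing techniques.
SHA-1, SHA-2, MD5, etc. Or define a custom hash function. (Also... you could be less rude.)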
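If a 0.1% collision tolerance means that only 1000 possible checksums are needed, the sum fits in 10 bits (2^10 = 1024). Simply adding all the bytes modulo 1024 gives a very simple, very fast checksum.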
Take a look at MurmurHash; it has a very low collision rate and acceptable performance (in my experience): stackoverflow.com/questions/11899616/murmurhash-what-is-it
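
The additive checksum suggested in the comments above could be sketched roughly as follows. This is only an illustration, assuming the sum is kept to 10 bits (modulo 1024); the function name `sum10` is hypothetical and does not come from the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: add up every byte and keep only the low 10 bits,
   so the result is one of 1024 possible checksum values. */
uint16_t sum10(const unsigned char *data, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum = (sum + data[i]) & 0x3FFu;   /* reduce modulo 1024 at every step */
    }
    return (uint16_t)sum;
}
```

Because addition is commutative, a sum like this cannot tell "abc" from "cba", so it only suits cases where reordered input is not a concern.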

Tags: string algorithm checksum

【Solution 1】:

A very common fast checksum is CRC-32, a 32-bit polynomial cyclic redundancy check. Below are three C implementations of CRC-32 that differ in speed and complexity (from http://www.hackersdelight.org/hdcodetxt/crc.c.txt):

#include <stdio.h>
#include <stdlib.h>

// ---------------------------- reverse --------------------------------

// Reverses (reflects) bits in a 32-bit word.
unsigned reverse(unsigned x) {
   x = ((x & 0x55555555) <<  1) | ((x >>  1) & 0x55555555);
   x = ((x & 0x33333333) <<  2) | ((x >>  2) & 0x33333333);
   x = ((x & 0x0F0F0F0F) <<  4) | ((x >>  4) & 0x0F0F0F0F);
   x = (x << 24) | ((x & 0xFF00) << 8) |
       ((x >> 8) & 0xFF00) | (x >> 24);
   return x;
}

// ----------------------------- crc32a --------------------------------

/* This is the basic CRC algorithm with no optimizations. It follows the
logic circuit as closely as possible. */

unsigned int crc32a(unsigned char *message) {
   int i, j;
   unsigned int byte, crc;

   i = 0;
   crc = 0xFFFFFFFF;
   while (message[i] != 0) {
      byte = message[i];            // Get next byte.
      byte = reverse(byte);         // 32-bit reversal.
      for (j = 0; j <= 7; j++) {    // Do eight times.
         if ((int)(crc ^ byte) < 0)
              crc = (crc << 1) ^ 0x04C11DB7;
         else crc = crc << 1;
         byte = byte << 1;          // Ready next msg bit.
      }
      i = i + 1;
   }
   return reverse(~crc);
}

// ----------------------------- crc32b --------------------------------

/* This is the basic CRC-32 calculation with some optimization but no
table lookup. The byte reversal is avoided by shifting the crc reg
right instead of left and by using a reversed 32-bit word to represent
the polynomial.
   When compiled to Cyclops with GCC, this function executes in 8 + 72n
instructions, where n is the number of bytes in the input message. It
should be doable in 4 + 61n instructions.
   If the inner loop is strung out (approx. 5*8 = 40 instructions),
it would take about 6 + 46n instructions. */

unsigned int crc32b(unsigned char *message) {
   int i, j;
   unsigned int byte, crc, mask;

   i = 0;
   crc = 0xFFFFFFFF;
   while (message[i] != 0) {
      byte = message[i];            // Get next byte.
      crc = crc ^ byte;
      for (j = 7; j >= 0; j--) {    // Do eight times.
         mask = -(crc & 1);
         crc = (crc >> 1) ^ (0xEDB88320 & mask);
      }
      i = i + 1;
   }
   return ~crc;
}

// ----------------------------- crc32c --------------------------------

/* This is derived from crc32b but does table lookup. First the table
itself is calculated, if it has not yet been set up.
Not counting the table setup (which would probably be a separate
function), when compiled to Cyclops with GCC, this function executes in
7 + 13n instructions, where n is the number of bytes in the input
message. It should be doable in 4 + 9n instructions. In any case, two
of the 13 or 9 instructions are load byte.
   This is Figure 14-7 in the text. */

unsigned int crc32c(unsigned char *message) {
   int i, j;
   unsigned int byte, crc, mask;
   static unsigned int table[256];

   /* Set up the table, if necessary. */

   if (table[1] == 0) {
      for (byte = 0; byte <= 255; byte++) {
         crc = byte;
         for (j = 7; j >= 0; j--) {    // Do eight times.
            mask = -(crc & 1);
            crc = (crc >> 1) ^ (0xEDB88320 & mask);
         }
         table[byte] = crc;
      }
   }

   /* Through with table setup, now calculate the CRC. */

   i = 0;
   crc = 0xFFFFFFFF;
   while ((byte = message[i]) != 0) {
      crc = (crc >> 8) ^ table[(crc ^ byte) & 0xFF];
      i = i + 1;
   }
   return ~crc;
}

If you just google "CRC32", you will get more information than you can possibly absorb.
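
The three routines above stop at the first zero byte, so they only work on NUL-terminated text. For data that may contain zero bytes, a length-based variant is one option. The following is a minimal sketch under that assumption; the names `crc32_buf` and `crc32_to_hex` are hypothetical, and the table setup is not thread-safe.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical variant of the table-driven routine: same reflected
   polynomial (0xEDB88320), but the length is passed in explicitly. */
uint32_t crc32_buf(const unsigned char *data, size_t len) {
    static uint32_t table[256];
    static int table_ready = 0;

    if (!table_ready) {                       /* build the table on first use */
        for (uint32_t b = 0; b < 256; b++) {
            uint32_t crc = b;
            for (int j = 0; j < 8; j++) {
                uint32_t mask = 0u - (crc & 1u);
                crc = (crc >> 1) ^ (0xEDB88320u & mask);
            }
            table[b] = crc;
        }
        table_ready = 1;
    }

    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc = (crc >> 8) ^ table[(crc ^ data[i]) & 0xFFu];
    }
    return crc ^ 0xFFFFFFFFu;                 /* final inversion */
}

/* Format the checksum as 8 lowercase hex digits; out needs 9 bytes. */
void crc32_to_hex(uint32_t crc, char out[9]) {
    snprintf(out, 9, "%08lx", (unsigned long)crc);
}
```

As a sanity check, the standard CRC-32 check input "123456789" (9 bytes) should give 0xCBF43926, which `crc32_to_hex` renders as the short letters-and-digits string "cbf43926".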

【Comments】:

【Solution 2】:

You can write a custom hash function (C++):

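#include <string>

// Simple multiplicative hash; the result is always in [0, 1000000009).
long long hash(const std::string& s) {
    long long k = 7;
    for (unsigned char c : s) {   // unsigned, so bytes >= 0x80 never go negative
        k *= 23;
        k += c;
        k *= 13;
        k %= 1000000009;
    }
    return k;
}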

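This should give you a good hash value (collision-free for most samples).

【Comments】:

Are the multipliers 23 and 13 arbitrary, or is there some logic behind them?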
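
One way to read the loop above: since (23k + c) * 13 = 299k + 13c, each step is the same as a single multiply-and-add with multiplier 299, taken modulo 1000000009. The sketch below, written in C with hypothetical function names, shows both forms side by side so the equivalence can be checked. It treats bytes as unsigned, which is an assumption not stated in the answer.

```c
#include <stddef.h>
#include <stdint.h>

#define HASH_MOD 1000000009LL

/* The loop as posted: multiply by 23, add the byte, multiply by 13. */
int64_t hash_two_step(const unsigned char *s, size_t len) {
    int64_t k = 7;
    for (size_t i = 0; i < len; i++) {
        k *= 23;
        k += s[i];
        k *= 13;
        k %= HASH_MOD;
    }
    return k;
}

/* Folded form of the same step: (23k + c) * 13 = 299k + 13c. */
int64_t hash_folded(const unsigned char *s, size_t len) {
    int64_t k = 7;
    for (size_t i = 0; i < len; i++) {
        k = (299 * k + 13 * (int64_t)s[i]) % HASH_MOD;
    }
    return k;
}
```

Both functions return the same value for any input. For the single byte 'a' (97) that value is (7 * 23 + 97) * 13 = 3354. The result is always below 1000000009, so it carries roughly 30 bits.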