【Question title】: compute md5 hash of multi part data (multiple strings)
【Posted】: 2011-06-14 16:35:10
【Question】:

I am trying to create a [single] md5 hash of multiple strings [in Java]. That is, I want

md5(string1, string2, string3, ..., stringN)

Currently I concatenate all the strings with some rarely used separator such as #. That is

md5(string1#string2#...#stringN)

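This looks hacky, and I worry that some odd string will actually contain the separator. What is the best way to do this?

【Comments】:

Hey, md5 is not collision resistant anyway!

Tags: java algorithm hash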

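【Answer 1】:

It does not matter whether the separator is part of the strings. You probably don't even need a separator, since you are not going to split the concatenated string back into its parts.

【Comments】:

If the strings are "ab", "c" and "a", "bc", they end up with the same hash code if I don't use a separator. With a separator, I worry about "ab#", "c" and "ab", "#c", which would also have the same hash code. [Note that I treat an md5 match as practically the same as equality, since collisions are very rare - I use it as the key for a Hadoop reducer, and I don't want to do an equality comparison again if I can help it.]
Does that really happen often? If it does, then indeed use a separator that cannot appear in the strings (if you know their contents). If it doesn't, then... don't bother!
It is not a collision-resistant hash function at all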

【Answer 2】:

This might be better:

md5(md5(string1) + md5(string2) + ... + md5(stringN))

It removes the separator problem, but it is hard to say how good it is.
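A minimal Java sketch of the hash-of-hashes idea above, for illustration only. It assumes that "+" means concatenating the raw 16-byte digests (the answer could equally mean hex strings) and that the strings are encoded as UTF-8; the class name `NestedMd5` is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class NestedMd5 {

    /** Returns md5(md5(s1) + md5(s2) + ... + md5(sN)) as 16 raw bytes. */
    static byte[] digest(String... parts) {
        try {
            MessageDigest outer = MessageDigest.getInstance("MD5");
            MessageDigest inner = MessageDigest.getInstance("MD5");
            for (String part : parts) {
                // Assumption: parts are encoded as UTF-8 before hashing.
                byte[] partDigest = inner.digest(part.getBytes(StandardCharsets.UTF_8));
                // Every inner digest is exactly 16 bytes, so the boundaries
                // between parts are unambiguous without any separator.
                outer.update(partDigest);
            }
            return outer.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}
```

The cost is one extra MD5 computation per part, which may matter for large inputs.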

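【Comments】:

This is fine, cryptographically.
Indeed, otherwise a,b,c would be the same as c,b,a.
@Nick - any links? Won't it reduce the entropy a lot? I have an [unproven but fairly strong] feeling that it may have less entropy. For one thing, it is now less likely to produce hash values at the low end. We are less likely to end up with the hash values 0, 1, 2 and so on. For 0, we would need all of these values to produce 0, etc. It is no longer uniform over the md5 range.
@Fakrudeen Secure hash algorithms (like MD5) are designed so that you cannot learn anything about the input from the output. If bias could be introduced into the resulting hash simply by using it in a particular way, that would be a fatal weakness. I am not sure what you are trying to say about hash values - what you describe is not how secure hash functions work.
I now realize from GregS's question what Sebastian was saying. Since md5() produces a 128-bit number, I thought you were summing all the md5 numbers rather than concatenating them. Now this solution does look good!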

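【Answer 3】:

I had a similar problem before, and the best solution I could come up with was to use a non-typeable ASCII character as the separator. Look at "man ascii" and pick one. My favorite is '\a', the ASCII symbol for "bell".

【Comments】:

【Answer 4】:

Just don't separate them. It is a hash method: separating them is useless...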

// Feed every string into the same digest; this hashes their plain concatenation.
MessageDigest md5 = MessageDigest.getInstance("MD5");
for (String toHash: stringsToHash) {
  md5.update(toHash.getBytes("UTF-8"));
}
byte[] digest = md5.digest();

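【Comments】:

It is useful if the goal is to generate a most-likely-unique key from some fields.

【Answer 5】:

If you want to make sure that moving text from one string to another cannot produce a collision, I recommend this scheme: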

md5(<len1>+str1+<len2>+str2...)

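Here, len1 is a fixed-length representation of the length of str1. For md5, a four-byte int value is the most appropriate (assuming you know the strings will not be longer than 2**31). Alternatively, use "decimal length#", i.e. (in Python notation)

md5(str(len(str1))+"#"+str1+str(len(str2))+"#"+str2+...)

This cannot produce a collision by moving text from one string to another, because the lengths would change.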
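The four-byte variant of this scheme could be sketched in Java roughly as follows. The class name `LengthPrefixedMd5`, the UTF-8 encoding, and the big-endian byte order are assumptions made for the example, not part of the answer; the length written is the encoded byte count, not the number of characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class LengthPrefixedMd5 {

    /** Returns md5(len1 + str1 + len2 + str2 + ...) as 16 raw bytes. */
    static byte[] digest(String... parts) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String part : parts) {
                // Assumption: parts are encoded as UTF-8 before hashing.
                byte[] bytes = part.getBytes(StandardCharsets.UTF_8);
                // Fixed-width prefix: four bytes, big-endian (ByteBuffer's default order).
                md5.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
                md5.update(bytes);
            }
            return md5.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}
```

Because the digest is updated incrementally, no concatenated copy of the input has to be built in memory.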

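【Comments】:

From a collision standpoint this looks the safest. But I worry about whether it reduces the effective entropy [not covering the whole md5 range], since some characters in the string are always ints.
If the binary sequence of some other combination of <len1>+str1+<len2>+str2 matches the current combination, we would still have a collision

【Answer 6】:

For strings, I think +Coly Klein's solution of adding a non-typeable character is the best.

If you also want a solution that works for binary data, or you are not sure that the strings don't contain those characters, you can use a recursive hash, e.g.:

md5(md5(str1)+md5(str2)+md5(str3)+...)

Depending on the amount of data, this solution may need a lot of resources (a while ago I profiled a program and found that it spent 97% of its time computing sha1, so I have to warn you...)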

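【Comments】:

Sorry, I did not notice him among all the other answers. Other than that, I think the warning is worth sharing.
Yes - that [warning] is useful.

【Answer 7】:

Putting all the answers together, here is a class with a single public static method that solves the question in an efficient way. Feel free to comment on, criticize, or use this code (public domain and all)...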

import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * MD5Summer is a utility class that abstracts the complexity of computing
 * the MD5 sum of an array of Strings.
 * <p>
 * Submitted as an answer to the StackOverflow bounty question:
 * <a href="http://stackoverflow.com/questions/4785275/compute-md5-hash-of-multi-part-data-multiple-strings">
 * compute md5 hash of multi part data (multiple strings)</a>
 * <p>
 * This solution uses the 'fast' "byte[] to hex string" mechanism described here in
 * <a href="http://stackoverflow.com/questions/9655181/convert-from-byte-array-to-hex-string-in-java">
 * Convert from byte array to hex string in java</a>.
 * <p>
 * The MD5 sum is always calculated by converting the inputStrings to bytes based on
 * the UTF-8 representation of those Strings. Different platforms using this class
 * will thus always calculate the same MD5sum for the same Java Strings.
 * <p>
 * Using a ThreadLocal for storing the MessageDigest instance significantly reduces the amount of time spent
 * obtaining a Digest instance from the java.security subsystem.
 * <p>
 * <i>Copyright - This code is released in to the public domain</i>
 */
public final class MD5Summer {

    /**
     * Calculate the MD5 sum on the input Strings.
     * <p>
     * The MD5 sum is calculated as if the input values were concatenated
     * together. The sum is returned as a String value containing the
     * hexadecimal representation of the MD5 sum.
     * <p>
     * The MD5 sum is always calculated by converting the inputStrings to bytes based on
     * the UTF-8 representation of those Strings. Different platforms using this class
     * will thus always calculate the same MD5sum for the same Java Strings.
     * 
     * @param values The string values to calculate the MD5 sum on.
     * @return the calculated MD5 sum as a String of hexadecimal.
     * @throws IllegalStateException in the highly unlikely event that the MD5 digest is not installed.
     * @throws NullPointerException if the input, or any of the input values is null.
     */
    public static final String digest(final String ...values) {
        return LOCAL_MD5.get().calculateMD5(values);
    }

    /**
     * A Thread-Local instance of the MD5Digest saves construct time significantly,
     * while avoiding the need for any synchronization.
     */
    private static final ThreadLocal<MD5Summer> LOCAL_MD5 = new ThreadLocal<MD5Summer>() {
        @Override
        protected MD5Summer initialValue() {
            return new MD5Summer();
        }   
    };

    private static final char[] HEXCHARS = "0123456789abcdef".toCharArray();
    private static final Charset UTF8 = Charset.forName("UTF-8");


    private final MessageDigest md5digest;

    /**
     * Private constructor - cannot create instances of this class from outside
     */
    private MD5Summer () {
        // private constructor making only thread-local instances possible.
        try {
            md5digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 should always be available.
            throw new IllegalStateException("Unable to get MD5 MessageDigest instance.", e);
        }
    }

    /**
     * Private implementation on the Thread-local instance.
     * @param values The string values to calculate the MD5 sum on.
     * @return the calculated MD5 sum as a String of hexadecimal bytes.
     */
    private String calculateMD5(final String ... values) {
        try {
            for (final String val : values) {
                md5digest.update(val.getBytes(UTF8));
            }
            final byte[] digest = md5digest.digest();
            final char[] chars = new char[digest.length * 2];
            int c = 0;
            for (final byte b : digest) {
                chars[c++] = HEXCHARS[(b >>> 4) & 0x0f];
                chars[c++] = HEXCHARS[(b      ) & 0x0f];
            }
            return new String(chars);
        } finally {
            md5digest.reset();
        }
    }

}

I compared the result of this program with Linux's md5sum program using the following small test:

public class MD5Tester {
//    [rolf@rolfl ~/md5data]$ echo "Frodo Baggins" >> frodo
//    [rolf@rolfl ~/md5data]$ echo "Bilbo Baggins" >> bilbo
//    [rolf@rolfl ~/md5data]$ cat frodo bilbo 
//    Frodo Baggins
//    Bilbo Baggins
//    [rolf@rolfl ~/md5data]$ cat frodo bilbo | md5sum 
//    a8a25988435405b9a62634c887287b40 *-
//    [rolf@rolfl ~/md5data]$ 


    public static void main(String[] args) {
        String[] data = {"Frodo Baggins\n", "Bilbo Baggins\n"};
        String md5data = MD5Summer.digest(data);
        System.out.println("Expect a8a25988435405b9a62634c887287b40");
        System.out.println("Got    " + md5data);
        if (!"a8a25988435405b9a62634c887287b40".equals(md5data)) {
            System.out.println("Data does not match!!!!");
        }
    }
}

【Comments】:

【Answer 8】:

You can encode the strings with base64 before appending them. Then, inside the md5 function, split your string and decode it. For example

public class MutilMd5 {

public static void main(String[] args) throws Base64DecodingException {
    String s1 = "12#3";
    String s2 = "#12345";

    multMd5(Base64.encode(s1.getBytes()) + "#" + Base64.encode(s2.getBytes()));

}

public static void multMd5(String value) throws Base64DecodingException {
    String md5 = "";
    String[] encodeStrings = value.split("#");
    if (encodeStrings != null) {
        for (String encodeString : encodeStrings) {
            System.out.println(new String(Base64.decode(encodeString.getBytes())));
            md5 = md5 + DigestUtils.md5Hex(encodeString);
        }
    }

    System.out.println(md5);
}

}

The output is

12#3

#12345

13094636ff02b51be53c496d04d39bc2375704c2e00da07d2c9acc7646b2a844

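【Comments】:

Is this any different from Sebastian's answer?

【Answer 9】:

You can use the MessageDigest class to generate your hash. If I were you, and I knew the length of each string at the preprocessing stage, I would pass them as a single string. If the strings you pass have different, random lengths that cannot be known in advance, I would hash them one by one, but I would need to know whether the sending entity and the receiving entity are well synchronized, so that both know the lengths of the messages passed between them.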

private static final char[] HEXADECIMAL = { '0', '1', '2', '3',
    '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f' };

public  String hash(String stringToHash)  {
    try {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] bytes = md.digest(stringToHash.getBytes());
        StringBuilder sb = new StringBuilder(2 * bytes.length);
        for (int i = 0; i < bytes.length; i++) {
            int low = (int)(bytes[i] & 0x0f);
            int high = (int)((bytes[i] & 0xf0) >> 4);
            sb.append(HEXADECIMAL[high]);
            sb.append(HEXADECIMAL[low]);
        }
        return sb.toString();
    } catch (NoSuchAlgorithmException e) {
        //exception handling goes here
        return null;
    }
}

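【Comments】:

【Answer 10】:

As I understand it, you want to hash a list of strings such that no two different lists give the same result. This can be solved without thinking about the hash function at all.

You need a function String f(List<String> l) for which no two input values produce the same output (an injective function from List<String> to String). With that, you can feed the output to your hash function and be sure that no collisions occur beyond what the hash function itself allows (note that MD5 was broken years ago, so it may not be a suitable choice). Here are two ways to implement f:

Converting to a subset of the character set

The most straightforward way is to map every input to a subset of the character set that does not contain the separator character: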

public static String hex(String s) {
    try {
        String o = "";
        for (byte b : s.getBytes("utf-8"))
            o += String.format("%02x", b & 0xff);
        return o;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

public static String f(String... l) {
    if (l.length == 0) return "";
    String o = hex(l[0]);
    if (l.length == 1) return o;
    for (int i = 1; i < l.length; i++) o += "#" + hex(l[i]);
    return o;
}

f("a#","b") => 6123#62
f("a","#b") => 61#2362

Length prefix

This is also simple, but has the drawback that it cannot be rewritten to work on a stream.

public static String prefix(String s) {
    return s.length() + "." + s;
}

public static String f(String... l) {
    if (l.length == 0) return "";
    String o = prefix(l[0]);
    if (l.length == 1) return o;
    for (int i = 1; i < l.length; i++) o += "#" + prefix(l[i]);
    return o;
}

f("a#","b") => 2.a##1.b
f("a","#b") => 1.a#2.#b

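【Comments】:

@Fakrudeen: Yes, it is the same as his "md5(<len1>+str1+<len2>+str2...)", except that it adds a little extra in case an input starts with a digit, which is probably unnecessary but makes it easier to see that it works this way. I would also suggest escaping occurrences of # (by changing them to \#) - then you don't need to encode the whole string as in my first solution, but getting it right is not trivial, and a naive implementation introduces subtleties (e.g., f("a\\","#b") => a\#\#b and f("a##b") => a\#\#b).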
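One way to avoid the subtlety described in this comment is to escape the escape character itself as well. The following is only a sketch under that assumption. The class name `EscapedJoin` is hypothetical, and ending every part with '#' (instead of placing '#' between parts) is an extra choice made here so that an empty list and a list holding one empty string encode differently.

```java
final class EscapedJoin {

    /**
     * Encodes a list of strings into one string such that different
     * lists always give different results: '\' and '#' inside a part
     * are prefixed with '\', and every part is terminated by a bare '#'.
     */
    static String f(String... parts) {
        StringBuilder out = new StringBuilder();
        for (String part : parts) {
            for (int i = 0; i < part.length(); i++) {
                char c = part.charAt(i);
                if (c == '\\' || c == '#') {
                    out.append('\\'); // escape both the separator and the escape char
                }
                out.append(c);
            }
            out.append('#'); // unescaped '#' marks the end of a part
        }
        return out.toString();
    }
}
```

With this encoding, the two inputs from the comment no longer clash: f("a\\","#b") gives a\\#\#b# while f("a##b") gives a\#\#b#.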

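【Answer 11】:

Since you are using strings as input, and we know the strings contain no NULL characters, you can use the NULL character as the separator between all strings. You check the input for NULL characters when validating it.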

md5(String1+NULL+String2+NULL+String3....)

This saves you the time of computing multiple MD5s.
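A rough Java sketch of this separator approach. Note that a Java String can in fact contain the character U+0000, so the validation step mentioned above is included as an explicit check. The class name `NulSeparatedMd5` and the UTF-8 encoding are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class NulSeparatedMd5 {

    /** Returns md5(s1 + NUL + s2 + NUL + ... + sN) as 16 raw bytes. */
    static byte[] digest(String... parts) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (int i = 0; i < parts.length; i++) {
                // Validation: the separator must never occur inside a part.
                if (parts[i].indexOf('\0') >= 0) {
                    throw new IllegalArgumentException("part " + i + " contains NUL");
                }
                if (i > 0) {
                    md5.update((byte) 0); // single NUL byte between parts
                }
                // Assumption: UTF-8, where only U+0000 encodes to a zero byte.
                md5.update(parts[i].getBytes(StandardCharsets.UTF_8));
            }
            return md5.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}
```

One caveat of this sketch: an empty list and a list containing a single empty string both hash the empty input, so callers that need to tell those apart would need an additional convention.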

【Comments】: