Why is md5 hashing so much faster on strings than on numpy arrays in python? (Answer)

【Question title】: Why is md5 hashing so much faster on strings than on numpy arrays in python?
【Posted】: 2014-04-08 12:39:40
【Question description】:

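In python/numpy, I have a 10,000x10,000 array named random_matrix. I use md5 to compute the hash of str(random_matrix) and of random_matrix itself. The string version takes 0.00754404067993 seconds and the numpy array version takes 1.6968960762 seconds. When I make it a 20,000x20,000 array, the string version takes 0.0778470039368 seconds and the numpy array version takes 60.641119957 seconds. Why is this? Do numpy arrays take up much more memory than strings? Also, if I want to name files after these matrices, is converting to a string before computing the hash a good idea, or are there drawbacks?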

Tags: python numpy hash md5

【Solution 1】:

str(random_matrix) does not contain the whole matrix, because numpy abbreviates large arrays with "...":

>>> import numpy as np
>>> x = np.ones((1000, 1000))
>>> print(str(x))
[[ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 ..., 
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]]

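So when you hash str(random_matrix), you are not really hashing all of the data. The abbreviated string is only a few hundred characters long, while the array itself holds every element, which is why hashing the string is so much faster.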
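A minimal sketch of the problem, assuming Python 3 and NumPy's default print options (under which large arrays are abbreviated). The arrays a and b are made up for illustration: they differ in one element that falls inside the elided region, so their strings (and the md5 of those strings) are identical, while the md5 of the raw bytes differs.

```python
import hashlib
import numpy as np

a = np.zeros((1000, 1000))
b = a.copy()
b[500, 500] = 1.0  # change one element in the middle, which str() elides

# With default print options both arrays are abbreviated to the same text.
same_text = str(a) == str(b)

# Hash of the abbreviated string: cannot tell a and b apart.
digest_str_a = hashlib.md5(str(a).encode("utf-8")).hexdigest()
digest_str_b = hashlib.md5(str(b).encode("utf-8")).hexdigest()

# Hash of the full raw data: sees the changed element.
digest_raw_a = hashlib.md5(a.tobytes()).hexdigest()
digest_raw_b = hashlib.md5(b.tobytes()).hexdigest()

print(same_text, digest_str_a == digest_str_b, digest_raw_a == digest_raw_b)
# -> True True False

# Size of what is actually hashed in each case: characters vs. bytes.
print(len(str(a)), a.nbytes)
```

This also hints at the drawback for file naming: two different matrices can end up with the same name if the name is derived from str(random_matrix).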

See this previous question and this one for how to hash a numpy array.
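For reference, one common way to hash the full contents looks roughly like the sketch below. This is not taken from the linked questions; array_md5 is a hypothetical helper name, and mixing the dtype and shape into the hash is my own assumption so that arrays with the same bytes but a different layout get different digests. Note that tobytes() makes a copy of the data, which matters for very large matrices.

```python
import hashlib
import numpy as np

def array_md5(arr):
    """Hypothetical helper: md5 hex digest over dtype, shape, and raw data."""
    arr = np.asarray(arr)
    h = hashlib.md5()
    # Include dtype and shape so a reshaped or re-typed array hashes differently.
    h.update(str(arr.dtype).encode("utf-8"))
    h.update(str(arr.shape).encode("utf-8"))
    # tobytes() returns the elements in C order (this copies the data).
    h.update(arr.tobytes())
    return h.hexdigest()

m = np.arange(6).reshape(2, 3)
print(array_md5(m) == array_md5(m.copy()))         # same contents, same digest
print(array_md5(m) == array_md5(m.reshape(3, 2)))  # same bytes, different shape
```

The resulting hex string can then be used as part of a file name, e.g. array_md5(random_matrix) + ".npy" (the ".npy" suffix is only an example).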
