【Question】: Persisting hashlib state
【Posted】: 2011-01-09 00:12:37
【Description】:

I'd like to create a hashlib instance, update() it, then somehow persist its state. Later, I want to recreate the object from that state data and continue to update() it. Finally, I want the hexdigest() of the total cumulative data. The state persistence must survive across multiple runs of the program.

Example:

import hashlib
m = hashlib.sha1()
m.update('one')
m.update('two')
# somehow, persist the state of m here

#later, possibly in another process
# recreate m from the persisted state
m.update('three')
m.update('four')
print m.hexdigest()
# at this point, m.hexdigest() should equal hashlib.sha1('onetwothreefour').hexdigest()
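Worth noting: within a single process, hashlib objects already support cheap snapshots via copy(), which clones the internal context without re-hashing anything - the problem here is specifically that the state cannot be persisted across processes. A minimal sketch (Python 3 syntax, so the inputs are bytes):

```python
import hashlib

m = hashlib.sha1()
m.update(b'one')
m.update(b'two')
snapshot = m.copy()      # clones the internal context; cost is independent of data hashed so far

m.update(b'junk')        # the original diverges; the snapshot is unaffected
snapshot.update(b'three')
snapshot.update(b'four')

assert snapshot.hexdigest() == hashlib.sha1(b'onetwothreefour').hexdigest()
```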

Edit:

Back in 2010 I couldn't find a good way to do this in Python, and ended up writing a small helper application in C to do it. However, there are some great answers below that I either didn't know about at the time, or that didn't exist yet.

【Comments】:

  • Could you write up your solution somewhere?
  • @EsseTi, it's been many years, but I recall that I was able to capture the state of the SHA_CTX and then recreate a context in the same state later, in a different process.

标签: python persistence hash pickle hashlib


【Solution 1】:

You can do this using ctypes; no helper application in C is needed:

rehash.py

#! /usr/bin/env python

''' A resumable implementation of SHA-256 using ctypes with the OpenSSL crypto library

    Written by PM 2Ring 2014.11.13
'''

import os
from ctypes import *

SHA_LBLOCK = 16
SHA256_DIGEST_LENGTH = 32

class SHA256_CTX(Structure):
    # Field layout mirrors OpenSSL's SHA256_CTX; SHA_LONG is a 32-bit unsigned int
    _fields_ = [
        ("h", c_uint * 8),
        ("Nl", c_uint),
        ("Nh", c_uint),
        ("data", c_uint * SHA_LBLOCK),
        ("num", c_uint),
        ("md_len", c_uint)
    ]

HashBuffType = c_ubyte * SHA256_DIGEST_LENGTH

# The SHA256_* functions live in OpenSSL's crypto library; adjust the name
# for your platform (e.g. "libcrypto.so.3" on newer Linux systems)
crypto = cdll.LoadLibrary("libeay32.dll" if os.name == "nt" else "libcrypto.so")

class sha256(object):
    digest_size = SHA256_DIGEST_LENGTH

    def __init__(self, datastr=None):
        self.ctx = SHA256_CTX()
        crypto.SHA256_Init(byref(self.ctx))
        if datastr:
            self.update(datastr)

    def update(self, datastr):
        crypto.SHA256_Update(byref(self.ctx), datastr, c_int(len(datastr)))

    #Clone the current context
    def _copy_ctx(self):
        ctx = SHA256_CTX()
        pointer(ctx)[0] = self.ctx
        return ctx

    def copy(self):
        other = sha256()
        other.ctx = self._copy_ctx()
        return other

    def digest(self):
        #Preserve context in case we get called before hashing is
        # really finished, since SHA256_Final() clears the SHA256_CTX
        ctx = self._copy_ctx()
        hashbuff = HashBuffType()
        crypto.SHA256_Final(hashbuff, byref(self.ctx))
        self.ctx = ctx
        return str(bytearray(hashbuff))

    def hexdigest(self):
        return self.digest().encode('hex')

#Tests
def main():
    import cPickle
    import hashlib

    data = ("Nobody expects ", "the spammish ", "imposition!")

    print "rehash\n"

    shaA = sha256(''.join(data))
    print shaA.hexdigest()
    print repr(shaA.digest())
    print "digest size =", shaA.digest_size
    print

    shaB = sha256()
    shaB.update(data[0])
    print shaB.hexdigest()

    #Test pickling
    sha_pickle = cPickle.dumps(shaB, -1)
    print "Pickle length:", len(sha_pickle)
    shaC = cPickle.loads(sha_pickle)

    shaC.update(data[1])
    print shaC.hexdigest()

    #Test copying. Note that the copy can be pickled
    shaD = shaC.copy()

    shaC.update(data[2])
    print shaC.hexdigest()

    shaD.update(data[2])
    print shaD.hexdigest()


    #Verify against hashlib.sha256()
    print "\nhashlib\n"

    shaD = hashlib.sha256(''.join(data))
    print shaD.hexdigest()
    print repr(shaD.digest())
    print "digest size =", shaD.digest_size
    print

    shaE = hashlib.sha256(data[0])
    print shaE.hexdigest()

    shaE.update(data[1])
    print shaE.hexdigest()

    #Test copying. Note that hashlib copy can NOT be pickled
    shaF = shaE.copy()
    shaF.update(data[2])
    print shaF.hexdigest()


if __name__ == '__main__':
    main()

resumable_SHA-256.py

#! /usr/bin/env python

''' Resumable SHA-256 hash for large files using the OpenSSL crypto library

    The hashing process may be interrupted by Control-C (SIGINT) or SIGTERM.
    When a signal is received, hashing continues until the end of the
    current chunk, then the current file position, total file size, and
    sha object are saved to a file. The name of this file is formed by
    appending '.hash' to the name of the file being hashed.

    Just re-run the program to resume hashing. The '.hash' file will be deleted
    once hashing is completed.

    Written by PM 2Ring 2014.11.14
'''

import cPickle as pickle
import os
import signal
import sys

import rehash

quit = False

blocksize = 1<<16   # 64kB
blocksperchunk = 1<<8

chunksize = blocksize * blocksperchunk

def handler(signum, frame):
    global quit
    print "\nGot signal %d, cleaning up." % signum
    quit = True


def do_hash(fname, filesize):
    hashname = fname + '.hash'
    if os.path.exists(hashname):
        with open(hashname, 'rb') as f:
            pos, fsize, sha = pickle.load(f)
        if fsize != filesize:
            print "Error: file size of '%s' doesn't match size recorded in '%s'" % (fname, hashname)
            print "%d != %d. Aborting" % (fsize, filesize)
            exit(1)
    else:
        pos, fsize, sha = 0, filesize, rehash.sha256()

    finished = False
    with open(fname, 'rb') as f:
        f.seek(pos)
        while not (quit or finished):
            for _ in xrange(blocksperchunk):
                block = f.read(blocksize)
                if block == '':
                    finished = True
                    break
                sha.update(block)

            pos = f.tell()  # use the real position; pos += chunksize would overshoot on the final partial chunk
            sys.stderr.write(" %6.2f%% of %d\r" % (100.0 * pos / fsize, fsize))
            if finished or quit:
                break

    if quit:
        with open(hashname, 'wb') as f:
            pickle.dump((pos, fsize, sha), f, -1)
    elif os.path.exists(hashname):
        os.remove(hashname)

    return (not quit), pos, sha.hexdigest()


def main():
    if len(sys.argv) != 2:
        print "Resumable SHA-256 hash of a file."
        print "Usage:\npython %s filename\n" % sys.argv[0]
        exit(1)

    fname = sys.argv[1]
    filesize = os.path.getsize(fname)

    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)

    finished, pos, hexdigest = do_hash(fname, filesize)
    if finished:
        print "%s  %s" % (hexdigest, fname)
    else:
        print "sha-256 hash of '%s' incomplete" % fname
        print "%s" % hexdigest
        print "%d / %d bytes processed." % (pos, filesize)


if __name__ == '__main__':
    main()

Demo

import rehash
import pickle
sha=rehash.sha256("Hello ")
s=pickle.dumps(sha.ctx)
sha=rehash.sha256()
sha.ctx=pickle.loads(s)
sha.update("World")
print sha.hexdigest()

Output

a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e
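As a sanity check, the demo hashes "Hello " and "World" incrementally, so its output should equal one-shot hashing of the concatenation (Python 3 syntax below, where hashlib takes bytes):

```python
import hashlib

print(hashlib.sha256(b"Hello World").hexdigest())
# a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e
```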

Note: I would like to thank PM 2Ring for his wonderful code.

【Discussion】:

  • This is an amazing answer, and a great example of using ctypes. Thank you.
  • I ended up writing a library that does something very similar: github.com/kislyuk/rehash
  • @anthony Glad you like it. ;) FWIW, my original answer is here
【Solution 2】:

hashlib.sha1 is a wrapper around a C library, so you won't be able to pickle it.

It would need to implement the __getstate__ and __setstate__ methods for Python to access its internal state.

If it's fast enough for your requirements, you could use a pure Python implementation of sha1.
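The failure is easy to reproduce (Python 3 shown; on Python 2 the error message differs, but pickling fails just the same):

```python
import hashlib
import pickle

m = hashlib.sha1(b'one')
try:
    pickle.dumps(m)
except TypeError as exc:
    # CPython's hash objects define no pickle support, so this always raises
    print("not picklable:", exc)
```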

【Discussion】:

【Solution 3】:

I also hit this problem and couldn't find any existing solution, so I ended up writing a library that does something very similar to what Devesh Saini described: https://github.com/kislyuk/rehash. Example:

import pickle, rehash
hasher = rehash.sha256(b"foo")
state = pickle.dumps(hasher)

hasher2 = pickle.loads(state)
hasher2.update(b"bar")

assert hasher2.hexdigest() == rehash.sha256(b"foobar").hexdigest()

【Discussion】:

【Solution 5】:

You can easily build a wrapper object around the hash object that transparently persists the data.

The obvious drawback is that it needs to retain the hashed data in full in order to restore the state - so depending on the size of the data you're dealing with, this may not suit your needs. But it should work fine up to a few tens of MB.

Unfortunately, hashlib does not expose the hash algorithms as proper classes; it only offers factory functions that construct the hash objects - so we can't properly subclass them without loading reserved symbols, which I'd rather avoid. That only means you have to build your wrapper class from scratch, which is not much overhead in Python anyway.

Here is sample code that may well meet your needs:

import hashlib
from cStringIO import StringIO

class PersistentSha1(object):
    def __init__(self, salt=""):
        self.__setstate__(salt)

    def update(self, data):
        self.__data.write(data)
        self.hash.update(data)

    def __getattr__(self, attr):
        return getattr(self.hash, attr)

    def __setstate__(self, salt=""):
        self.__data = StringIO()
        self.__data.write(salt)
        self.hash = hashlib.sha1(salt)

    def __getstate__(self):
        return self.data

    def _get_data(self):
        self.__data.seek(0)
        return self.__data.read()

    data = property(_get_data, __setstate__)
      

You can access the "data" member itself to get and set the state directly, or you can use Python's pickling functions:

>>> a = PersistentSha1()
>>> a
<__main__.PersistentSha1 object at 0xb7d10f0c>
>>> a.update("lixo")
>>> a.data
'lixo'
>>> a.hexdigest()
'6d6332a54574aeb35dcde5cf6a8774f938a65bec'
>>> import pickle
>>> b = pickle.dumps(a)
>>>
>>> c = pickle.loads(b)
>>> c.hexdigest()
'6d6332a54574aeb35dcde5cf6a8774f938a65bec'

>>> c.data
'lixo'

【Discussion】:

  • This is a nice example of how to build a picklable class, but storing the data being hashed is a non-starter - it can be huge. The hash context itself is small, but it seems Python just may not expose it.