Assigning git's SHA of a directory without git: answers

【Question title】: Assigning git's SHA of a directory without git
【Posted】: 2016-08-08 00:45:43
【Question】:

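So, I found this question: How to assign a Git SHA1's to a file without Git?

But I'm not sure how to apply that method to a directory. How can I hash a directory in a program, without using git, so that the result matches the sha1 that git gives?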

【Comments】:

Tags: python git sha1

【Solution 1】:

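This turned out to be harder than I expected, but I do have it working now.

As I commented and hobbs answered, computing a tree hash is nontrivial. You must hash each file in each sub-tree, compute the hashes of those sub-trees, and use those hashes to compute the hash of the top-level tree.

The attached python code seems to work for at least some test cases (e.g., computing the tree hash for the git source itself). I put in, as comments, explanations of some unexpected things I discovered along the way.

This is also in my github "scripts" repository.

[Edit: the github version now has some Python3 fixes, and will probably be generally more updated/better.]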

#! /usr/bin/env python

"""
Compute git hash values.

This is meant to work with both Python2 and Python3, but
has only been tested with Python2.7.
"""

from __future__ import print_function

import argparse
import os
import stat
import sys

from hashlib import sha1

def strmode(mode):
    """
    Turn internal mode (octal with leading 0s suppressed) into
    print form (i.e., left pad => right justify with 0s as needed).
    """
    return mode.rjust(6, '0')

def classify(path):
    """
    Return git classification of a path (as both mode,
    100644/100755 etc, and git object type, i.e., blob vs tree).
    Also throw in st_size field since we want it for file blobs.
    """
    # We need the X bit of regular files for the mode, so
    # might as well just use lstat rather than os.isdir().
    st = os.lstat(path)
    if stat.S_ISLNK(st.st_mode):
        gitclass = 'blob'
        mode = '120000'
    elif stat.S_ISDIR(st.st_mode):
        gitclass = 'tree'
        mode = '40000' # note: no leading 0!
    elif stat.S_ISREG(st.st_mode):
        # 100755 if any execute permission bit set, else 100644
        gitclass = 'blob'
        mode = '100755' if (st.st_mode & 0o111) != 0 else '100644'
    else:
        raise ValueError('un-git-able file system entity %s' % path)
    return mode, gitclass, st.st_size

def blob_hash(stream, size):
    """
    Return (as hash instance) the hash of a blob,
    as read from the given stream.
    """
    hasher = sha1()
    hasher.update(b'blob %u\0' % size)
    nread = 0
    while True:
        # We read just 64K at a time to be kind to
        # runtime storage requirements.
        data = stream.read(65536)
        if not data:
            break
        nread += len(data)
        hasher.update(data)
    if nread != size:
        raise ValueError('%s: expected %u bytes, found %u bytes' %
            (stream.name, size, nread))
    return hasher

def symlink_hash(path):
    """
    Return (as hash instance) the hash of a symlink.
    Caller must use hexdigest() or digest() as needed on
    the result.
    """
    hasher = sha1()
    # XXX os.readlink produces a string, even though the
    # underlying data read from the inode (which git will hash)
    # are raw bytes.  It's not clear what happens if the raw
    # data bytes are not decode-able into Unicode; it might
    # be nice to have a raw_readlink.
    data = os.readlink(path).encode('utf8')
    hasher.update(b'blob %u\0' % len(data))
    hasher.update(data)
    return hasher


def tree_hash(path, args):
    """
    Return the hash of a tree.  We need to know all
    files and sub-trees.  Since order matters, we must
    walk the sub-trees and files in their natural (byte) order,
    so we cannot use os.walk.

    This is also slightly defective in that it does not know
    about .gitignore files (we can't just read them since git
    retains files that are in the index, even if they would be
    ignored by a .gitignore directive).

    We also do not (cannot) deal with submodules here.
    """
    # Annoyingly, the tree object encodes its size, which requires
    # two passes, one to find the size and one to compute the hash.
    contents = os.listdir(path)
    tsize = 0
    to_skip = ('.', '..') if args.keep_dot_git else ('.', '..', '.git')
    pass1 = []
    for entry in contents:
        if entry not in to_skip:
            fullpath = os.path.join(path, entry)
            mode, gitclass, esize = classify(fullpath)
            # git stores as mode<sp><entry-name>\0<digest-bytes>
            encoded_form = entry.encode('utf8')
            tsize += len(mode) + 1 + len(encoded_form) + 1 + 20
            pass1.append((fullpath, mode, gitclass, esize, encoded_form))

    # Git's cache sorts foo/bar before fooXbar but after foo-bar,
    # because it actually stores foo/bar as the literal string
    # "foo/bar" in the index, rather than using recursion.  That is,
    # a directory name should sort as if it ends with '/' rather than
    # with '\0'.  Sort pass1 contents with funky sorting.
    #
    # (i[4] is the utf-8 encoded form of the name, i[1] is the
    # mode which is '40000' for directories.)
    pass1.sort(key = lambda i: i[4] + b'/' if i[1] == '40000' else i[4])

    args.depth += 1
    hasher = sha1()
    hasher.update(b'tree %u\0' % tsize)
    for (fullpath, mode, gitclass, esize, encoded_form) in pass1:
        sub_hash = generic_hash(fullpath, mode, esize, args)
        if args.debug: # and args.depth == 0:
            print('%s%s %s %s\t%s' % ('    ' * args.depth,
                strmode(mode), gitclass, sub_hash.hexdigest(),
                encoded_form.decode('utf8')))

        # Annoyingly, git stores the tree hash as 20 bytes, rather
        # than 40 ASCII characters.  This is why we return the
        # hash instance (so we can use .digest() directly).
        # The format here is <mode><sp><path>\0<raw-hash>.
        hasher.update(b'%s %s\0' % (mode.encode('ascii'), encoded_form))
        hasher.update(sub_hash.digest())
    args.depth -= 1
    return hasher

def generic_hash(path, mode, size, args):
    """
    Hash an object based on its mode.
    """
    if mode == '120000':
        hasher = symlink_hash(path)
    elif mode == '40000':
        hasher = tree_hash(path, args)
    else:
        # regular file (mode 100644 or 100755): hash its contents as a blob
        with open(path, 'rb') as stream:
            hasher = blob_hash(stream, size)
    return hasher

def main():
    """
    Parse arguments and invoke hashers.
    """
    parser = argparse.ArgumentParser('compute git hashes')
    parser.add_argument('-d', '--debug', action='store_true')
    parser.add_argument('-k', '--keep-dot-git', action='store_true')
    parser.add_argument('path', nargs='+')
    args = parser.parse_args()
    args.depth = -1 # for debug print
    status = 0
    for path in args.path:
        try:
            try:
                mode, gitclass, size = classify(path)
            except ValueError:
                print('%s: unhashable!' % path)
                status = 1
                continue
            hasher = generic_hash(path, mode, size, args)
            result = hasher.hexdigest()
            if args.debug:
                print('%s %s %s\t%s' % (strmode(mode), gitclass, result,
                    path))
            else:
                print('%s: %s hash = %s' % (path, gitclass, result))
        except OSError as err:
            print(str(err))
            status = 1
    sys.exit(status)

if __name__ == '__main__':
    try:
        sys.exit(main())
    except KeyboardInterrupt:
        sys.exit('\nInterrupted')
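
The sorting rule described in the comments of tree_hash above is easy to get wrong, so here is a small standalone sketch of it. The helper name git_sort_key and the sample entry names are made up for illustration; the rule itself (a directory name sorts as if it ended with '/') is the one the code above relies on.

```python
def git_sort_key(name, is_dir):
    """Sort key for a tree entry: directories compare as if named name + b'/'."""
    return name + b"/" if is_dir else name

# (name, is_dir) pairs; "foo" is a directory, the other two are files.
entries = [(b"foo", True), (b"foo-bar", False), (b"fooXbar", False)]

plain_order = [n for n, _ in sorted(entries, key=lambda e: e[0])]
git_order = [n for n, _ in sorted(entries, key=lambda e: git_sort_key(*e))]

print(plain_order)  # [b'foo', b'foo-bar', b'fooXbar']
print(git_order)    # [b'foo-bar', b'foo', b'fooXbar']
```

A plain byte-wise sort puts the directory foo first; with the '/' suffix it lands between foo-bar and fooXbar, because '-' (0x2d) < '/' (0x2f) < 'X' (0x58).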

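【Comments】:

Indeed! Though I must admit I wonder what the OP is doing: in essence, when we have to reconstruct not only the hashes of single files but the "checksum" of whole trees, we are really doing "git"; so a git add on an external working tree of all these files would be easier.
Thanks for the detailed answer. What I'm trying to build is an auto-updater that pulls from GitHub, and since the GitHub API only allows so many connections from a single IP, I'd like to cut down on API calls. I can get the contents of the repository root in one API call, but each directory needs another API call to unpack. I have the SHA hashes of the folders, and if I can compare them against what is on disk, I can check whether a folder has changed. If it hasn't, I can skip further API calls for it and all of its sub-directories.
Aha. I've never studied the github API, but you're basically duplicating what git fetch already does when deciding which objects to request. It would be much easier to just keep a repository and fetch into it...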

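【Solution 2】:

The state of all the files in a directory in git is represented by a "tree" object, described in this SO answer and this section of the Git book. To compute the hash of a tree object, you have to produce the tree yourself.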

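For each item in the directory you need four things:

1. Its name. Objects are stored in the tree sorted by name (if everyone didn't agree on a canonical order, everyone might have a different representation of the same tree, and a different hash).
2. Its mode. Modes are based on Unix modes (the st_mode field of struct stat), but restricted to a few values: the main ones in use are 040000 for directories, 100644 for non-executable files, 100755 for executable files, and 120000 for symlinks.
3. The hash of the object that represents that item. For a file, this is its blob hash. For a symlink, it is the hash of a blob containing the symlink target. For a subdirectory, it is the hash of that directory's tree object, so this is a recursive algorithm.
4. The type of the object mentioned in 3.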

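If you collect this information for all of the entries in the directory, and write out the data in the correct format and in the correct order, you will have a valid tree object. If you compute the SHA of this object, you should get the same result as git and GitHub.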
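
As a rough sketch of that last step, assuming the per-entry information has already been collected, a tree object can be assembled and hashed as follows. The function names blob_digest and tree_sha1 and the sample file names are hypothetical. The entry layout (mode, a space, the name, a NUL byte, then the raw 20-byte digest), the "tree <size>" header, and the directory mode written as 40000 without a leading zero all follow the code in Solution 1.

```python
from hashlib import sha1

def blob_digest(data):
    """Raw 20-byte SHA-1 of a blob whose content is `data` (bytes)."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return sha1(header + data).digest()

def tree_sha1(entries):
    """
    Hash a tree built from (mode, name, raw_digest) tuples, all bytes.
    Directories use mode b"40000" and sort as if their name ended in b"/".
    Returns the hashlib object so callers can pick digest() or hexdigest().
    """
    def sort_key(entry):
        mode, name, _ = entry
        return name + b"/" if mode == b"40000" else name

    body = b"".join(
        mode + b" " + name + b"\0" + digest
        for mode, name, digest in sorted(entries, key=sort_key)
    )
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return sha1(header + body)

# A subdirectory with one file, then a top-level tree that contains it.
src = tree_sha1([(b"100644", b"main.py", blob_digest(b"print('hi')\n"))])
top = tree_sha1([
    (b"40000", b"src", src.digest()),
    (b"100644", b"README", blob_digest(b"hello\n")),
])
print(top.hexdigest())
```

Because the entries are sorted inside tree_sha1, the caller can pass them in any order, and the subtree digest is computed before its parent, which is the recursion the list above describes.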

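【Comments】:

【Solution 3】:

How can I hash a directory in a program, without using git, so that the result matches the sha1 that git gives?

Not at all; git doesn't "hash" directories. It hashes the contained files and has a representation of the tree; see the git object storage docs.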
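
For the "hashes the contained files" part, a minimal sketch of how a single file's content maps to a git blob hash is shown below. The function name git_blob_sha1 is made up for illustration; the "blob <size>" header followed by a NUL byte matches what the blob_hash function in Solution 1 feeds to SHA-1.

```python
from hashlib import sha1

def git_blob_sha1(data):
    """Hex SHA-1 that git assigns to a blob whose content is `data` (bytes)."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return sha1(header + data).hexdigest()

print(git_blob_sha1(b""))                # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_blob_sha1(b"test content\n"))  # d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

The second value is the one the Git book shows for `echo 'test content' | git hash-object --stdin`.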

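【Comments】:

The point is that there is a difference between a directory and the tree that represents it. The OP's question explicitly expects git to assign a SHA hash to a directory.
Why the downvote? This is the correct answer; to compute the SHA-1 that git would assign to a particular tree, you must compute the SHA-1 for each blob and sub-tree, which basically means repeating everything git does except writing a commit.
@torek Your comment is far more of an answer than this answer is. Actually explaining that would make a good answer.
@hobbs: OK, I'll combine it with the other linked answers, just for the heck of it :-)
So, if I want to compare the SHA of a directory with the one provided by GitHub's API, to avoid having to hash every file in the directory, am I out of luck?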