Python3: Recursively compare two directories based on file contents

【Question title】: Python3: Recursively compare two directories based on file contents
【Posted】: 2019-02-11 10:34:14
【Question】:

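I have two directories, each containing a bunch of files and sub-folders. I want to check whether the file contents in the two directories are the same (ignoring file names). The sub-folder structure should be the same as well.

I looked at filecmp.dircmp, but it does not help because it does not consider file contents; filecmp.dircmp() has no shallow=False option, see here.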
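
To illustrate what the shallow distinction means, here is a minimal sketch using filecmp.cmp, which does accept the flag. It writes two files of the same size but different contents and forces identical modification times, so their os.stat signatures match. The file names and the timestamp are made up for the example.

```python
import filecmp
import os
import tempfile

def shallow_vs_deep():
    """Return (shallow_result, deep_result) for two same-size files with different bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "a.bin")  # hypothetical file names
        b = os.path.join(tmp, "b.bin")
        with open(a, "wb") as f:
            f.write(b"AAAA")
        with open(b, "wb") as f:
            f.write(b"BBBB")
        # Force identical access/modification times so the stat signatures match.
        os.utime(a, (1_000_000_000, 1_000_000_000))
        os.utime(b, (1_000_000_000, 1_000_000_000))
        filecmp.clear_cache()
        shallow = filecmp.cmp(a, b, shallow=True)   # looks at type, size and mtime only
        deep = filecmp.cmp(a, b, shallow=False)     # compares the actual bytes
        return shallow, deep

print(shallow_vs_deep())
```

On a typical filesystem the shallow comparison should report the files as equal while the deep comparison reports them as different, which is why a content check needs shallow=False.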

The workaround in this SO answer does not work either, because it takes file names into account.

What is the best way to do the comparison?

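【Comments】:

So you want to compare every file in one directory against every file in the other to find out whether a possible match exists? That seems like a very lengthy task, and perhaps an XY problem. Can you clarify why you want to do this? You basically want the workaround, but allowing a match between any pair of files.
Yes, the workaround looks fine, except for the fact that it takes file names into account (and other os.stat data too, I guess).
Can you address my other question? If you have two directories with 100 differently named files each, in the worst case you will be comparing files 10,000 times. That seems excessive, especially for large files.
I want to do this because I need to know whether two folders have the same structure and contain the same files. If they do, I have a "duplicate" and can delete one of the two.
The worst case is unlikely to happen if I try to stop comparing as early as possible, e.g. by first comparing the total size, then the number of files, and so on.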
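
The early-exit idea from the last comment could be sketched roughly as follows. This is only an illustration: tree_summary and may_be_duplicates are hypothetical helper names, and the sketch assumes the trees contain no broken symlinks (os.path.getsize would raise on them).

```python
import os

def tree_summary(root):
    """Return (file_count, dir_count, total_bytes) for the tree under root."""
    n_files = n_dirs = total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        n_dirs += len(dirnames)
        for name in filenames:
            n_files += 1
            total += os.path.getsize(os.path.join(dirpath, name))
    return n_files, n_dirs, total

def may_be_duplicates(d1, d2):
    """False means the trees certainly differ; True means a full comparison is still needed."""
    return tree_summary(d1) == tree_summary(d2)
```

Equal summaries are only a necessary condition. Two trees with the same counts and total size can still differ in content or nesting, so a True result has to be followed by a real comparison.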

Tags: python-3.x file stat

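【Solution 1】:

Got around to this. After some light testing it seems to work, though more testing is needed. Again, this can take a very long time, depending on both the number of files and their sizes: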

import filecmp
import os
from collections import defaultdict
from sys import argv

def compareDirs(d1, d2):
    # Regular files are grouped by size so that only same-sized files
    # are ever compared byte by byte.
    files1 = defaultdict(set)
    files2 = defaultdict(set)
    subd1 = set()
    subd2 = set()
    # Collect both listings before comparing anything, on the guess that
    # a mismatch is the more likely outcome. Files could be compared
    # directly if that is not true.
    with os.scandir(d1) as it:
        for entry in it:
            if entry.is_dir(): subd1.add(entry)
            else: files1[os.path.getsize(entry)].add(entry)
    with os.scandir(d2) as it:
        for entry in it:
            if entry.is_dir(): subd2.add(entry)
            else: files2[os.path.getsize(entry)].add(entry)

    # Structure is not the same. Checked prior to content.
    # Note: len(files1) counts distinct file sizes, not individual files.
    if len(subd1) != len(subd2) or len(files1) != len(files2): return False

    for size in files2:
        for entry in files2[size]:
            for fname in files1[size]: #If size does not exist will go to else
                if filecmp.cmp(fname,entry,shallow=False): break
            else: return False
            files1[size].remove(fname)
            if not files1[size]: del files1[size]

    #Missed a file
    if files1: return False

    #This is enough since we checked lengths - if every sd1 is matched,
    #every sd2 will be accounted for.
    for sd1 in subd1:
        for sd2 in subd2:
            if compareDirs(sd1,sd2): break
        else: return False #Did not find a sub-directory
        subd2.remove(sd2)

    return True

if __name__ == "__main__":
    print(compareDirs(argv[1], argv[2]))

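The function recurses into both directories. It compares the files at the top level and fails if they do not match. It then tries to match each sub-directory of the first directory recursively against any sub-directory of the second, until all of them are matched.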

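This is the most naive solution. In the general case it would probably be beneficial to traverse the tree first and match only sizes and structure. The function would look similar in that case, except that we would compare getsize results instead of using filecmp, and save the matched tree structures so that the second run is faster.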
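
A size-and-structure pre-pass along these lines might look like the sketch below. Note that it is a variation rather than the approach described above: instead of matching pairwise and saving the matches, it builds one canonical, name-independent summary per tree and compares the two summaries. The names shape and same_shape are hypothetical, and the sketch assumes there are no broken symlinks (entry.stat() would raise on them).

```python
import os

def shape(root):
    """Name-independent summary: (sorted file sizes, sorted shapes of sub-directories)."""
    sizes = []
    children = []
    with os.scandir(root) as it:
        for entry in it:
            if entry.is_dir():
                children.append(shape(entry.path))
            else:
                sizes.append(entry.stat().st_size)
    # Sorting removes any dependence on listing order or on names.
    return (tuple(sorted(sizes)), tuple(sorted(children)))

def same_shape(d1, d2):
    """False means the trees certainly differ; True means contents still need checking."""
    return shape(d1) == shape(d2)
```

Since no file is opened, this pass is cheap, and a False result rules out a match before any bytes are read.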

Of course, if several sub-directories have exactly the same structure and sizes, we would still need to compare all the possible matches.
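
One way to sidestep trying all possible matches, offered here as a different technique from the code above, is to hash contents bottom-up so that each directory gets a single name-independent digest. The sketch below assumes SHA-256 collisions can be ignored in practice; file_digest, dir_digest and same_contents are hypothetical names.

```python
import hashlib
import os

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's bytes, read in chunks to keep memory use bounded."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()

def dir_digest(root):
    """Digest of a directory that ignores names but respects nesting."""
    file_hashes = []
    dir_hashes = []
    with os.scandir(root) as it:
        for entry in it:
            if entry.is_dir():
                dir_hashes.append(dir_digest(entry.path))
            else:
                file_hashes.append(file_digest(entry.path))
    h = hashlib.sha256()
    # Sorting makes the result independent of listing order and of names;
    # the one-byte tag keeps files and sub-directories distinct.
    for tag, digests in ((b"F", sorted(file_hashes)), (b"D", sorted(dir_hashes))):
        for d in digests:
            h.update(tag + d)
    return h.digest()

def same_contents(d1, d2):
    return dir_digest(d1) == dir_digest(d2)
```

The trade-off is that every file is read in full exactly once, with no early exit on the first mismatch, so this pays off mainly when many sub-directories look alike or when the digests can be reused across runs.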

【Comments】: