What's the most efficient way to check for duplicate data between multiple files? (Answers)

【Question title】: What's the most efficient way to check for duplicate data between multiple files?
【Posted】: 2018-10-24 06:47:32
【Question】:

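Suppose you have a folder containing hundreds or thousands of .csv or .txt files. They may hold different information, but you want to make sure that joe041.txt does not accidentally contain the same data as joe526.txt.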

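Rather than loading everything into a single file (which could be cumbersome if each file has thousands of lines), I've been using a Python script that reads each file in the directory and computes a checksum, which can then be compared across the thousands of files.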

Is there a more efficient way to do this?

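Even filecmp seems less efficient, because that module only offers file-vs-file and dir-vs-dir comparisons, with no file-vs-dir command. To use it you would have to iterate on the order of x² times (every file in the directory against every other file in the directory).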

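import os
import hashlib

folder = "D:/Testing/New folder"
outputfile = []

# Build "name,md5" rows for every regular file in the folder.
for x in os.listdir(folder):
    path = os.path.join(folder, x)
    if not os.path.isfile(path):
        continue  # skip subdirectories, which open() cannot read
    with open(path, "rb") as openfile:
        text = openfile.read()  # reads the whole file into memory
    outputfile.append(x)
    outputfile.append(",")
    outputfile.append(hashlib.md5(text).hexdigest())
    outputfile.append("\n")

print(outputfile)

# Note: output.csv is written into the same folder, so a second run hashes it too.
with open(os.path.join(folder, "output.csv"), "w") as openfile:
    for x in outputfile:
        openfile.write(x)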

【Comments】:

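Maybe it could be optimized by comparing file sizes in a first pass, then reading only the first two or three lines in a second pass, and finally comparing the entire file contents in a third pass to eliminate any remaining false positives.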
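Using filecmp with its defaults is unreliable here: in shallow mode it treats two files as equal when their type, size, and modification time match, without reading them (pass shallow=False to force a content comparison). Also, don't use a list to hold the checksums; use a dict and the condition if md5 in dict.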
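
The dict suggestion in the comment above might look roughly like the following. This is only a minimal sketch: the function name `find_duplicates_by_md5` and the choice to return `(duplicate, first_seen)` pairs are assumptions made for illustration, not part of the original script. Because a dict lookup takes roughly constant time, each file is hashed once and checked once, instead of being compared against every other file.

```python
import hashlib
import os


def find_duplicates_by_md5(dir_path):
    """Return (duplicate_name, first_seen_name) pairs for files with equal MD5."""
    first_seen = {}   # md5 hex digest -> name of the first file with that digest
    duplicates = []   # (later file, earlier file with the same digest)
    for name in sorted(os.listdir(dir_path)):
        path = os.path.join(dir_path, name)
        if not os.path.isfile(path):
            continue  # ignore subdirectories
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        if digest in first_seen:  # the "if md5 in dict" check from the comment
            duplicates.append((name, first_seen[digest]))
        else:
            first_seen[digest] = name
    return duplicates
```

Calling `find_duplicates_by_md5("D:/Testing/New folder")` would return one pair for each file whose digest was already seen. For large files, the whole-file `f.read()` could be replaced by chunked reads.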

Tags: python checksum

【Solution 1】:

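Inspired by @sɐunıɔןɐqɐp's comment, you could try an iterative approach: first run a cheap operation on all files (getting the file size), then do deeper comparisons only on files that share the same size.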

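This code first compares sizes, then the first few characters of each file (the n_chars argument), and finally the MD5 hash of the whole file. Feel free to adapt it to your use case.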

I used long variable names to keep it explicit; don't let that distract you.

import os
import hashlib

def calc_md5(file_path):
    hash_md5 = hashlib.md5()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

def get_duplicates_by_size(dir_path):
    files_by_size = {}

    for elem in os.listdir(dir_path):
        file_path = os.path.join(dir_path, elem)
        if os.path.isfile(file_path):
            size = os.stat(file_path).st_size

            if size not in files_by_size:
                files_by_size[size] = []
            files_by_size[size].append(file_path)

    # keep only entries with more than one file;
    # the others don't need to be kept in memory
    return {
        size: file_list
        for size, file_list in files_by_size.items()
        if len(file_list) > 1}

def get_duplicates_by_first_content(files_by_size, n_chars):
    files_by_size_and_first_content = {}

    for size, file_list in files_by_size.items():
        d = {}
        for file_path in file_list:
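            # binary mode avoids decoding errors and newline translation,
            # so n_chars is effectively a byte count here
            with open(file_path, 'rb') as f:
                first_content = f.read(n_chars)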

            if first_content not in d:
                d[first_content] = []
            d[first_content].append(file_path)

        # keep only entries with more than one file;
        # the others don't need to be kept in memory
        d = {
            (size, first_content): file_list_2
            for first_content, file_list_2 in d.items()
            if len(file_list_2) > 1}
        files_by_size_and_first_content.update(d)

    return files_by_size_and_first_content

def get_duplicates_by_hash(files_by_size_and_first_content):
    files_by_size_and_first_content_and_hash = {}

    for (size, first_content), file_list in files_by_size_and_first_content.items():
        d = {}
        for file_path in file_list:
            file_hash = calc_md5(file_path)

            if file_hash not in d:
                d[file_hash] = []
            d[file_hash].append(file_path)

        # keep only entries with more than one file;
        # the others don't need to be kept in memory
        d = {
            (size, first_content, file_hash): file_list_2
            for file_hash, file_list_2 in d.items()
            if len(file_list_2) > 1}
        files_by_size_and_first_content_and_hash.update(d)

    return files_by_size_and_first_content_and_hash

if __name__ == '__main__':
    r = get_duplicates_by_size('D:/Testing/New folder')
    r = get_duplicates_by_first_content(r, 20)  # customize the number of chars to read
    r = get_duplicates_by_hash(r)

    for k, v in r.items():
        print('Key:', k)
        print('  Files:', v)
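
The comment on the question suggested a last pass over the full file contents to rule out false positives. Two different files sharing an MD5 digest by accident is extremely unlikely, but if certainty is needed, a byte-for-byte check on the surviving groups could be added. The sketch below is an assumption about how that might look. The helper name `confirm_duplicates` is hypothetical, and it only expects a dict whose values are lists of file paths, such as the result of `get_duplicates_by_hash`.

```python
import filecmp


def confirm_duplicates(groups):
    """Split each candidate group into lists of byte-for-byte identical files.

    `groups` maps any key to a list of file paths (assumed to have the same
    shape as the dict returned by get_duplicates_by_hash).
    """
    confirmed = []
    for file_list in groups.values():
        remaining = list(file_list)
        while remaining:
            reference = remaining.pop(0)
            # shallow=False makes filecmp compare contents, not just os.stat() data
            same = [p for p in remaining
                    if filecmp.cmp(reference, p, shallow=False)]
            if same:
                confirmed.append([reference] + same)
            # files that matched the reference need no further comparison
            remaining = [p for p in remaining if p not in same]
    return confirmed
```

Because this runs only on files that already agree in size, leading content, and hash, the pairwise comparisons stay confined to small groups rather than the whole directory.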

【Discussion】: