【Question title】: Python's os.walk() visits all folders instead of only the given folder
【Posted】: 2023-06-03 22:30:01
【Question】:

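I want to use a simple script to collect all images under a given folder and compare them to find duplicates.

Why reinvent the wheel when the first step of a solution already exists: Finding duplicate files and removing them

But it already fails at that first step, because it visits every folder on the given USB flash drive. I stripped out all the hashing code and tried to get only the list of files, but even then it runs forever and visits every file on the USB drive.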

from __future__ import print_function   # py2 compatibility
from collections import defaultdict
import hashlib
import os
import sys


folder_to_check = r"D:\FileCompareTest"  # raw string, so the backslash is not read as an escape

def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes_by_size = defaultdict(list)  # dict of size_in_bytes: [full_path_to_file1, full_path_to_file2, ]
    hashes_on_1k = defaultdict(list)  # dict of (hash1k, size_in_bytes): [full_path_to_file1, full_path_to_file2, ]
    hashes_full = {}   # dict of full_file_hash: full_path_to_file_string

    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            # get all files that have the same size - they are the collision candidates
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                try:
                    # if the target is a symlink (soft one), this will 
                    # dereference it - change the value to the actual target file
                    full_path = os.path.realpath(full_path)
                    file_size = os.path.getsize(full_path)
                    hashes_by_size[file_size].append(full_path)
                except (OSError,):
                    # not accessible (permissions, etc) - pass on
                    continue




check_for_duplicates(folder_to_check)

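Instead of getting the hashes_by_size list within milliseconds, I am either stuck in a seemingly endless loop, or the program exits after hours having gone through all the files on the USB drive.

What am I not understanding about os.walk()?

【Comments】:

Tags: python python-3.x os.walk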


【Answer 1】:

You should call it like this:

paths_to_check = []
paths_to_check.append(folder_to_check)
check_for_duplicates(paths_to_check)

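check_for_duplicates() expects a list of paths and loops over it with "for path in paths". The way you call it, you pass a single string, so the loop iterates over the characters of that string and you get an os.walk() generator for each character of the path instead of one for the actual path.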
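To make the cause visible, here is a minimal sketch of what the loop actually sees, plus one possible guard. The helpers normalize_paths() and list_files() are hypothetical names introduced only for this illustration; they are not part of the original script. On Windows a lone "\" is normally resolved as the root of the current drive, which would probably explain why the whole USB drive was scanned if the script was started from that drive, but that part is an assumption about the asker's setup.

```python
import os


def normalize_paths(paths):
    """Hypothetical helper: wrap a single path in a list, pass iterables through."""
    if isinstance(paths, (str, bytes, os.PathLike)):
        return [paths]
    return list(paths)


def list_files(paths):
    """Hypothetical helper: collect the full paths of all files below the given folder(s)."""
    found = []
    for path in normalize_paths(paths):
        for dirpath, dirnames, filenames in os.walk(path):
            for filename in filenames:
                found.append(os.path.join(dirpath, filename))
    return found


folder_to_check = r"D:\FileCompareTest"

# Iterating over a string yields its characters, so the original loop
# effectively called os.walk("D"), os.walk(":"), os.walk("\\"), ...
print(list(folder_to_check)[:3])         # ['D', ':', '\\']

# With the guard, a bare string is treated as exactly one path.
print(normalize_paths(folder_to_check))  # ['D:\\FileCompareTest']
```

With a guard like this, list_files(folder_to_check) and list_files([folder_to_check]) walk the same single folder. Wrapping the string in a list at the call site, as shown in the answer, works just as well.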

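【Comments】:

  • Wow. Thanks a lot. That was it. I would never have figured it out on my own. I have to wait another 6 minutes before I can accept the answer.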