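[Posted]: 2023-06-03 22:30:01
[Question]:
I want to use a simple script that takes all images under a given folder and compares them to find duplicates.
Why reinvent the wheel when a solution for the first step already exists: Finding duplicate files and removing them
But it fails at the very first step, because it visits every folder on the given USB flash drive. I stripped out all the hashing and tried to get only the list of files, but even then it runs forever and visits every file on the USB drive.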
from __future__ import print_function  # py2 compatibility
from collections import defaultdict
import hashlib
import os
import sys

folder_to_check = r"D:\FileCompareTest"


def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes_by_size = defaultdict(list)  # dict of size_in_bytes: [full_path_to_file1, full_path_to_file2, ]
    hashes_on_1k = defaultdict(list)  # dict of (hash1k, size_in_bytes): [full_path_to_file1, full_path_to_file2, ]
    hashes_full = {}  # dict of full_file_hash: full_path_to_file_string
    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            # get all files that have the same size - they are the collision candidates
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                try:
                    # if the target is a symlink (soft one), this will
                    # dereference it - change the value to the actual target file
                    full_path = os.path.realpath(full_path)
                    file_size = os.path.getsize(full_path)
                    hashes_by_size[file_size].append(full_path)
                except (OSError,):
                    # not accessible (permissions, etc) - pass on
                    continue


check_for_duplicates(folder_to_check)
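Instead of getting the hashes_by_size list within a few milliseconds, I am stuck in what looks like an endless loop, or the program exits hours later having collected every file on the USB drive.
What is it about os.walk() that I don't understand?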
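A likely explanation, assuming the script is run exactly as posted: check_for_duplicates receives a single string rather than a list of paths, so `for path in paths` iterates over the characters of that string and os.walk is called once per character. Roots that do not exist are skipped silently, since os.walk ignores listing errors by default. On Windows, however, the root "\" refers to the top of the current drive, which would account for the whole USB drive being visited. The sketch below only shows the iteration; as_path_list is a hypothetical helper and not part of the original script.

```python
import os

folder_to_check = r"D:\FileCompareTest"

# Iterating over a str yields its characters, not the path as a whole.
walk_roots = [path for path in folder_to_check]
print(walk_roots[:3])  # the first three "paths" that would reach os.walk


def as_path_list(paths):
    """Hypothetical helper: wrap a single path in a list, copy other iterables."""
    if isinstance(paths, (str, bytes, os.PathLike)):
        return [paths]
    return list(paths)


# A single path now stays in one piece.
print(as_path_list(folder_to_check))
```

Calling check_for_duplicates([folder_to_check]) should have the same effect without the helper.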
[Comments]:
-
So I can't restrict os.walk to walking only under a specific folder? I find that hard to believe.
-
You can find details on how os.walk works in the documentation: https://docs.python.org/3/library/os.html#os.walk.
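For what it's worth, os.walk only descends from the top directory it is given, and with the default followlinks=False it does not follow symlinked directories either. A minimal sketch, assuming a throwaway directory tree built with tempfile; files_by_size and the file names are made up for illustration, and the realpath step from the question is left out for brevity:

```python
import os
import tempfile
from collections import defaultdict


def files_by_size(paths):
    """Group the files found under each root in `paths` by size in bytes."""
    by_size = defaultdict(list)
    for root in paths:
        for dirpath, dirnames, filenames in os.walk(root):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                try:
                    by_size[os.path.getsize(full_path)].append(full_path)
                except OSError:
                    continue  # unreadable file: skip it
    return by_size


# Throwaway tree: one file outside the folder of interest, two inside it.
with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "FileCompareTest")
    os.makedirs(os.path.join(target, "sub"))
    sample_files = [
        "outside.bin",
        os.path.join("FileCompareTest", "a.bin"),
        os.path.join("FileCompareTest", "sub", "b.bin"),
    ]
    for rel_path in sample_files:
        with open(os.path.join(base, rel_path), "wb") as handle:
            handle.write(b"xx")  # every sample file is 2 bytes long

    result = files_by_size([target])  # note the list around the path
    found = sorted(os.path.relpath(p, target) for p in result[2])
    print(found)  # outside.bin is not among the results
```

The file next to the target folder never shows up, because the walk starts at the target and only goes downward.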
Tags: python python-3.x os.walk