【Question Title】: Finding duplicate files with Python
【Posted】: 2012-09-24 12:56:37
【Question Description】:

I'm trying to write a Python script that crawls a directory, finds all duplicate files, and reports them. What's the best way to approach this?

import os, sys

def crawlDirectories(directoryToCrawl):
    crawledDirectory = [os.path.join(path, subname)
                        for path, dirnames, filenames in os.walk(directoryToCrawl)
                        for subname in dirnames + filenames]
    return crawledDirectory

#print 'Files crawled',crawlDirectories(sys.argv[1])

directoriesWithSize = {}
def getByteSize(crawledDirectory):
    for eachFile in crawledDirectory:
        size = os.path.getsize(eachFile)
        directoriesWithSize[eachFile] = size
    return directoriesWithSize

getByteSize(crawlDirectories(sys.argv[1]))

#print directoriesWithSize.values()

duplicateItems = {}

def compareSizes(dictionaryDirectoryWithSizes):
    for key,value in dictionaryDirectoryWithSizes.items():
        if directoriesWithSize.values().count(value) > 1:
            duplicateItems[key] = value

compareSizes(directoriesWithSize)

#print directoriesWithSize.values().count(27085)

compareSizes(directoriesWithSize)

print duplicateItems

Why is this error thrown?

Traceback (most recent call last):
  File "main.py", line 16, in <module>
    getByteSize(crawlDirectories(sys.argv[1]))
  File "main.py", line 12, in getByteSize
    size = os.path.getsize(eachFile)
  File     "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/genericpath.py", line 49, in getsize
OSError: [Errno 2] No such file or directory:        '../Library/Containers/com.apple.ImageKit.RecentPictureService/Data/Documents/iChats'

【Question Comments】:

  • When run as >>python filename.py folderNameInHome there is no error
  • It seems to be related to symbolic links. Is there a way to avoid crawling those?
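
Regarding the symlink comment: one common way to keep `os.walk` away from symbolic links is to prune them out of `dirnames` in place and to filter symlinked files. A minimal sketch (the helper name `crawl_skipping_links` and the pruning approach are my own illustration, not from the thread):

```python
import os

def crawl_skipping_links(root):
    """Collect regular files under root, ignoring symbolic links."""
    found = []
    for path, dirnames, filenames in os.walk(root):
        # Prune symlinked directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames
                       if not os.path.islink(os.path.join(path, d))]
        for name in filenames:
            full = os.path.join(path, name)
            if not os.path.islink(full):
                found.append(full)
    return found
```

Modifying `dirnames` in place is the documented way to control which subdirectories `os.walk` visits.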

标签: python file duplicates directory web-crawler


【Solution 1】:

It looks to me like your crawlDirectories function is more complicated than it needs to be. By joining dirnames as well as filenames you also feed directory entries (and broken symbolic links, like the iChats one in your traceback) to os.path.getsize, which is what raises the OSError. Collect only the files:

def crawlDirectories(directoryToCrawl):
    output = []
    for path, dirnames, filenames in os.walk(directoryToCrawl):
        for fname in filenames:
            output.append(os.path.join(path,fname))
    return output
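
Matching sizes only tell you two files *might* be identical; a common follow-up is to confirm candidates with a content hash. A minimal end-to-end sketch combining the corrected crawl with that idea (the `find_duplicates` name and the MD5 choice are assumptions, not from the answer):

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    """Group files by size first (cheap), then confirm with an MD5 digest."""
    by_size = defaultdict(list)
    for path, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(path, name)
            try:
                by_size[os.path.getsize(full)].append(full)
            except OSError:
                continue  # broken symlink, permission problem, etc.

    by_digest = defaultdict(list)
    for size, files in by_size.items():
        if len(files) < 2:
            continue  # a unique size can't be a duplicate
        for f in files:
            with open(f, 'rb') as fh:
                by_digest[hashlib.md5(fh.read()).hexdigest()].append(f)

    return [group for group in by_digest.values() if len(group) > 1]
```

Grouping by size first avoids hashing every file; only files whose size matches at least one other file get read in full.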

【Discussion】:

    【Solution 2】:

    I'd suggest trying:

    def crawlDirectories(directoryToCrawl):
        crawledDirectory = [os.path.realpath(os.path.join(p, name))
                            for (p, d, filenames) in os.walk(directoryToCrawl)
                            for name in filenames]
        return crawledDirectory
    

    That is, use canonical paths in your crawl instead of relative ones.
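
    To see what os.path.realpath buys you: it resolves symbolic links and normalizes `..` components, so two paths that reach the same file compare equal as strings (the path below is purely illustrative):

    ```python
    import os

    # realpath collapses the '..' and returns an absolute, link-free path.
    canon = os.path.realpath(os.path.join("sub", "..", "file.txt"))
    ```

    This also means size lookups run against a path that is valid regardless of the current working directory.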

    【Discussion】:
