【Posted】:2020-04-01 19:31:52
【Question】:
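I have a directory structure containing many directories with non-ASCII characters in their names, mostly Sanskrit. I'm indexing these directories/files in a script, but I don't know how best to handle these cases. Here is my process:
- Recursively hash all files, writing each file's path, filename, and hash to a .tsv file.
- Go through this file and sort each row according to whether a duplicate hash exists. This produces a dictionary with entries of the form {'path': columns[0], 'filename': columns[1], 'status': True}, where status determines whether the file is acted on later.
- Go through this dictionary, moving duplicates out of their original location to an offset root path (e.g. ./duplicates instead of ./).
- For each move, write to a file a command that will reverse the move if needed (just mv a b); this isn't critical, but I figured I'd include it.
Here is some sample data, along with what I've written so far: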
Sample of the generated tsv (path/name/hash):
./Personal Research/Ramnad 9"14"10 DSC_0004.JPG 850cd9dcb0075febd4c0dcd549dd7860
./Personal Research/Ramnad 9"14"10 DSC_0010.JPG 9db2219fc4c9423016fb9e295452f1ad
./Personal Research/Ramnad 9"14"10 DSC_0006.JPG ef7d13b88bbaabc029390bcef1319bb1
The " is actually a Unicode character:
Block: Private Use Area
Unicode: U+F019
UTF-8: 0xEF 0x80 0x99
JavaScript: 0xF019
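As a quick check of the values listed above, this minimal sketch encodes U+F019 in Python 3 and looks up its general category. It only illustrates that the character itself round-trips through UTF-8; the variable names are illustrative.

```python
import unicodedata

# U+F019, the code point reported above for the character in the directory name.
ch = "\uf019"

utf8_bytes = ch.encode("utf-8")          # the bytes used when the name is written as UTF-8
category = unicodedata.category(ch)      # "Co" is the Private Use general category
round_trip = utf8_bytes.decode("utf-8")  # decoding gives the same character back

print(" ".join("0x{:02X}".format(b) for b in utf8_bytes), category, round_trip == ch)
```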
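Code: writing the above to a file (fulltsv):
import csv
import hashlib
import os
import re

for root, dirs, files in os.walk(SRC_DIR, topdown=True):
    # keep only non-hidden files whose name contains one of the wanted extensions
    files[:] = [f for f in files if any(ext in f for ext in EXT_LIST) if not f.startswith('.')]
    for file in files:
        with open(os.path.join(root,file),'r') as f:
            # one tsv per directory, named after the lower-cased directory name
            with open(SAVE_DIR+re.sub(r'\W+', '', os.path.basename(root).lower())+'.tsv', 'a') as fout:
                # QUOTE_MINIMAL quotes any field that contains the delimiter or the quotechar
                writer = csv.writer(fout, delimiter='\t', quotechar='\"', quoting=csv.QUOTE_MINIMAL)
                checksums = []
                with open(os.path.join(root, file), 'rb') as _file:
                    checksums.append([root, file, hashlib.md5(_file.read()).hexdigest()])
                writer.writerows(checksums)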
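For comparison, here is a minimal sketch of the same indexing step that opens the output file once, hashes in chunks, and pins the encoding. The helper names `md5_of` and `write_index`, the chunk size, the single output file, and the suffix-based extension test are all assumptions made for illustration; they are not part of the script above.

```python
import csv
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_index(src_dir, tsv_path, extensions):
    """Walk src_dir and write (directory, filename, md5) rows to tsv_path.

    `extensions` is a tuple of lower-case suffixes such as (".jpg", ".nef").
    """
    # newline="" lets the csv module control line endings itself.
    with open(tsv_path, "w", newline="", encoding="utf-8") as fout:
        writer = csv.writer(fout, delimiter="\t")
        for root, dirs, files in os.walk(src_dir):
            for name in sorted(files):
                # skip hidden files and files without a wanted suffix
                if name.startswith(".") or not name.lower().endswith(extensions):
                    continue
                writer.writerow([root, name, md5_of(os.path.join(root, name))])
```

Hashing in chunks avoids loading large image files into memory in one read, which may matter for RAW files such as .NEF.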
Reading from that file:
# generate list of all tsv
for (dir, subs, files) in os.walk(ROOT):
    # remove the new-root from the search (in place, so os.walk skips it)
    subs[:] = [s for s in subs if NROOT not in s]
    for f in files:
        fpath = os.path.join(dir,f)
        if ".tsv" in fpath:
            TSVLIST.append(fpath)
# open/append all TSV content to a single new TSV
with open(FULL,'w') as wfd:
    for f in TSVLIST:
        with open(f,'r') as fd:
            content = fd.read()
            wfd.write(content)
            # number of lines in the tsv that was just appended
            lines = len(content.splitlines())
# add all entries to a dictionary keyed to their hash
entrydict = {}
ec = 0
with open(FULL, 'r') as fulltsv:
    for line in fulltsv:
        columns = line.strip().split('\t')
        if not columns[2].startswith('.'):
            if columns[2] not in entrydict.keys():
                entrydict[str(columns[2])] = []
            entrydict[str(columns[2])].append({'path': columns[0], 'filename': columns[1], 'status': True})
            if len(entrydict[str(columns[2])]) > 1:
                ec += 1
ed = {k:v for k,v in entrydict.items() if len(v)>=2}
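The grouping step can also be written with `collections.defaultdict`. This is a minimal sketch that takes already-parsed rows as input; the function name `group_by_hash` and the sample rows are hypothetical.

```python
from collections import defaultdict


def group_by_hash(rows):
    """Group (path, filename, digest) rows by digest and keep groups of two or more."""
    groups = defaultdict(list)
    for path, filename, digest in rows:
        groups[digest].append({"path": path, "filename": filename, "status": True})
    return {digest: entries for digest, entries in groups.items() if len(entries) >= 2}


# Hypothetical rows: two files share the digest "aaa", one is unique.
rows = [
    ("./a", "one.jpg", "aaa"),
    ("./b", "two.jpg", "aaa"),
    ("./c", "three.jpg", "bbb"),
]
duplicates = group_by_hash(rows)
```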
Moving the duplicates:
for e in f:
    if len(f)-mvcnt > 1:
        if e['status'] is True:
            p = e['path'] # path
            n = e['filename'] # name
            n0,n0ext = os.path.splitext(n)
            n1 = n
            # directory structure for new file
            FROOT = p.replace(p.split('/')[0],NROOT,1)
            rebk = 'mv {0}/{1} {2}/{3}'.format(FROOT,n,p,n)
            shutil.move('{0}/{1}'.format(p,n),'{0}/{1}'.format(FROOT,n))
            dupelist.write('{0} #{1}\n'.format(rebk,str(h)))
            mvcnt += 1
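A minimal sketch of one way to perform a single move and build its undo command so that spaces and quote characters in the names survive. The function name `move_with_undo` and its parameters are hypothetical. It builds paths with `os.path.join`, creates the destination directory first, and quotes the undo command with `shlex.quote`.

```python
import os
import shlex
import shutil


def move_with_undo(src_dir, filename, src_root, dup_root):
    """Move src_dir/filename under dup_root, mirroring its position below src_root.

    Returns a shell command string that would reverse the move.
    """
    relative = os.path.relpath(src_dir, src_root)
    dest_dir = os.path.normpath(os.path.join(dup_root, relative))
    os.makedirs(dest_dir, exist_ok=True)  # shutil.move does not create parent directories

    src = os.path.join(src_dir, filename)
    dest = os.path.join(dest_dir, filename)
    shutil.move(src, dest)

    # shlex.quote protects spaces, quotes and other shell-special characters.
    return "mv {} {}".format(shlex.quote(dest), shlex.quote(src))
```

The returned string can be written to the undo file one line per move, as in the loop above.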
The error I'm getting:
Traceback (most recent call last):
  File "/usr/lib/python3.6/shutil.py", line 550, in move
    os.rename(src, real_dst)
FileNotFoundError: [Errno 2] No such file or directory: '"./Personal Research/Ramnad 9""14""10"/DSC_0003.NEF' -> './duplicateRoot/Personal Research/Ramnad 9""14""10"/DSC_0003.NEF'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "dCompare.py", line 164, in <module>
    shutil.move('{0}/{1}'.format(p,n),'{0}/{1}'.format(FROOT,n))
  File "/usr/lib/python3.6/shutil.py", line 564, in move
    copy_function(src, real_dst)
  File "/usr/lib/python3.6/shutil.py", line 263, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/lib/python3.6/shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '"./Personal Research/Ramnad 9""14""10"/DSC_0003.NEF'
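Clearly this has something to do with how I'm handling the Unicode characters, but I've never worked with this before and I'm not sure at what point, or how, I should be handling the filenames. I'm using Ubuntu 10 under the Windows Subsystem for Linux, with Python 3.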
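One possible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: the leading quote and the doubled quotes in the failing path look like CSV quoting rather than anything Unicode-specific. The minimal sketch below assumes a directory name that contains literal double-quote characters (the digest is a placeholder). It writes one row with the same `csv.writer` settings used above, then reads it back two ways.

```python
import csv
import io

# Hypothetical row: a directory name with literal double quotes and a placeholder digest.
row = ['./Personal Research/Ramnad 9"14"10', "DSC_0003.NEF", "0" * 32]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(row)
line = buffer.getvalue()

# Splitting on tabs keeps the csv quoting as part of the path string.
naive = line.strip().split("\t")

# csv.reader with the same delimiter and quotechar undoes the quoting.
parsed = next(csv.reader(io.StringIO(line), delimiter="\t", quotechar='"'))

print(naive[0])
print(parsed[0])
```

If that assumption holds, reading the combined tsv with `csv.reader` instead of `line.strip().split('\t')` would return the paths as they were originally written.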
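【Discussion】:
-
It's not directly related to the question, but it might be worth using pathlib.
-
I don't see with open(src, 'rb') as fsrc: in the source listing you provided. How are you building the string src?
-
@HeatfanJohn That comes from shutil.py
-
@AMC I will, but I'm interested in understanding the cause of the problem I'm running into, rather than just throwing modules at it.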