[Posted]: 2019-06-29 04:09:42
[Problem description]:
My code below runs very slowly. It is a program that splits a large file (80 GB) and sorts its lines into a tree-shaped folder structure for fast lookup. I added several comments in the code to help you follow it.
# Libraries
import os

# Variables
file = "80_gig_file.txt"
outputdirectory = "sorted"
depth = 4  # This is the tree depth

# Preparations
os.makedirs(outputdirectory)

# Process each line in the file
def pipeline(line):
    # Strip non-alphanumeric characters from the line
    line_stripped = ''.join(e for e in line if e.isalnum())
    # Reverse the line
    line_stripped_reversed = line_stripped[::-1]
    file = outputdirectory
    # Build the path in the folder-based tree
    for i in range(min(depth, len(line_stripped))):
        file = os.path.join(file, line_stripped_reversed[i])
    # Create the folders if they don't exist
    os.makedirs(os.path.dirname(file), exist_ok=True)
    # Name the file, with "-file" appended
    file = file + "-file"
    # This is the operation that slows everything down:
    # it opens, writes and closes a lot of small files.
    # I cannot keep them open because in the worst case half a million
    # files (n = 26^4) would be open at once.
    f = open(file, "a")
    f.write(line)
    f.close()

# Read the file line by line, without loading it entirely into memory.
# A queue might work here, I think, but how to do it properly without
# loading too much into memory?
with open(file) as infile:
    for line in infile:
        pipeline(line)
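A common way to reduce the cost of the open/append/close pattern in the code above is to buffer lines in memory per target path and flush them in batches, so each output file is opened far less often. The sketch below is an illustration of that idea, not the original code; `flush_threshold`, `buffers`, and the helper names are assumptions:

```python
import os
from collections import defaultdict

outputdirectory = "sorted"
depth = 4
flush_threshold = 10000  # illustrative: flush a path once this many lines are buffered

buffers = defaultdict(list)  # maps output path -> pending lines

def target_path(line):
    # Same path logic as the original pipeline: reversed alphanumeric prefix
    stripped = ''.join(c for c in line if c.isalnum())
    reversed_chars = stripped[::-1]
    path = outputdirectory
    for i in range(min(depth, len(stripped))):
        path = os.path.join(path, reversed_chars[i])
    return path + "-file"

def add_line(line):
    # Buffer the line instead of writing it immediately
    path = target_path(line)
    buffers[path].append(line)
    if len(buffers[path]) >= flush_threshold:
        flush(path)

def flush(path):
    # One open/write/close for a whole batch of lines
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a") as f:
        f.writelines(buffers[path])
    buffers[path].clear()

def flush_all():
    # Call once at the end of the run to write out remaining buffered lines
    for path in list(buffers):
        flush(path)
```

With up to 26^4 distinct paths possible, `flush_threshold` should be chosen so that the total buffered data still fits in memory; `flush_all()` must run at the end so no buffered lines are lost.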
Is there a way to get multithreading to work? I tried a few examples I found online, but they loaded everything into memory and froze my computer several times.
[Discussion]:
-
Since the bottleneck is HDD access, don't expect parallelization to speed things up much (you might gain something if your system somehow supports parallel file access, but since the CPU is not what's saturated, adding more cores won't help).
-
Only one core is at 100% usage, and according to the system monitor my disk usage is below 4%. I have an NVMe SSD, so I really do think there is room for improvement with multiple cores.
-
That sounds promising. Does the large file need to stay as it is, or could you split it into pieces? Parallelization is much easier if it is split into chunks.
-
The large file could be preprocessed and split into chunks. I'm not familiar with working with chunks, so if you could point me to a few examples I can look into how to solve this. I was looking at blopig.com/blog/2016/08/processing-large-files-using-python, but somehow the last code block gives me
ValueError: I/O operation on closed file.
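The chunk-based approach discussed above can be sketched with `multiprocessing`: compute byte offsets aligned to line boundaries, then give each worker its own slice of the file. This is a minimal sketch under assumptions, not a tested solution for the 80 GB case; `find_chunks`, `process_chunk`, and `n_workers` are illustrative names:

```python
import os
from multiprocessing import Pool

def find_chunks(path, n_chunks):
    # Split a file into byte ranges whose boundaries fall on line starts
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(size * i // n_chunks)
            f.readline()  # advance to the start of the next full line
            offsets.append(f.tell())
    offsets.append(size)
    return list(zip(offsets[:-1], offsets[1:]))

def process_chunk(args):
    # Each worker reads only its own byte range, line by line
    path, start, end = args
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline().decode()
            # ... call the per-line pipeline here ...
            count += 1
    return count

def process_file(path, n_workers=4):
    chunks = [(path, s, e) for s, e in find_chunks(path, n_workers)]
    with Pool(n_workers) as pool:
        return sum(pool.map(process_chunk, chunks))
```

One caveat: if several workers can append to the same output file, the writes need coordination, for example a lock per file, or per-worker output directories that are merged afterwards.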
Tags: python python-3.x multithreading queue python-multithreading