If you want to write a new file chunk1.txt ... chunkN.txt for each chunk, you could do it like this:
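def chunk_file(name, lines_per_chunk, chunks_per_file):
    """Split the file `name` into chunk1.txt ... chunkN.txt."""

    def write_chunk(chunk_no, chunk):
        with open("chunk{}.txt".format(chunk_no), "w") as outfile:
            outfile.write("".join(chunk))

    # count: 1-based position of the current line within its chunk
    # chunk_no: number used for the next output file
    # chunk_count: chunks buffered since the last write
    count, chunk_no, chunk_count, chunk = 1, 1, 0, []
    with open(name, "r") as f:
        for row in f:
            # A blank line ends a chunk only once the minimum size is reached.
            if count > lines_per_chunk and row == "\n":
                chunk_count += 1
                count = 1
                chunk.append("\n")
                if chunk_count == chunks_per_file:
                    write_chunk(chunk_no, chunk)
                    chunk = []
                    chunk_count = 0
                    chunk_no += 1
            else:
                count += 1
                chunk.append(row)
    # Write whatever is left after the last separator.
    if chunk:
        write_chunk(chunk_no, chunk)

chunk_file("test.txt", 3, 1)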
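You have to specify how many lines belong to a chunk; after that many lines, a blank line is expected.
Say you want to chunk this file: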
Some Data belonnging to chunk 1

Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1

More Data, belonnging to chunk 2
More Data, belonnging to chunk 2
More Data, belonnging to chunk 2
The first chunk differs strongly in its line count from the second chunk (7 lines vs. 3 lines).
The output for this example would be chunk1.txt:
Some Data belonnging to chunk 1

Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1
Some Data belonnging to chunk 1

And chunk2.txt:
More Data, belonnging to chunk 2
More Data, belonnging to chunk 2
More Data, belonnging to chunk 2
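This approach treats lines_per_chunk as a minimum chunk size, so it works even if the chunks have different line counts. A blank line ends the chunk only once the minimum chunk size has been reached.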
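The minimum-size rule can be looked at in isolation with an in-memory variant. This is only a sketch: `iter_chunks` and `min_lines` are hypothetical names that do not appear in the code above, and the function yields lists of lines instead of writing files.

```python
def iter_chunks(lines, min_lines):
    """Yield lists of lines (hypothetical helper).

    A blank line ends a chunk only if at least min_lines lines
    precede it in that chunk.
    """
    chunk = []
    for line in lines:
        chunk.append(line)
        # Same condition as `count > lines_per_chunk and row == "\n"`.
        if line == "\n" and len(chunk) > min_lines:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


first = "Some Data belonnging to chunk 1\n"
second = "More Data, belonnging to chunk 2\n"
# Blank line on line 2 (too early to count) and on line 8 (real separator).
lines = [first, "\n"] + [first] * 5 + ["\n"] + [second] * 3

sizes = [len(c) for c in iter_chunks(lines, 3)]
print(sizes)  # [8, 3]: the first chunk includes its trailing separator line
```

The early blank line stays inside the first chunk because only one line precedes it, which is below the minimum of 3.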
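In the example above, the blank line on line 2 is no problem, because the minimum chunk size has not been reached yet. If a blank line occurred on line 4 and the chunk data continued afterwards, there would be a problem, because the specified criteria (line count and blank line) could not identify the chunks on their own.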
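The failure case can be checked before splitting a real file. A minimal sketch, assuming the file has already been read into a list of lines: `split_points` is a hypothetical helper that applies the same test as chunk_file and reports the 1-based line numbers at which a chunk would end.

```python
def split_points(lines, lines_per_chunk):
    """Return the 1-based numbers of the blank lines that would end a chunk
    (hypothetical helper mirroring the condition used in chunk_file)."""
    points, count = [], 1
    for line_no, row in enumerate(lines, start=1):
        if count > lines_per_chunk and row == "\n":
            points.append(line_no)
            count = 1
        else:
            count += 1
    return points


data = "data\n"
# Blank line on line 2: ignored, the minimum size is not reached yet.
ok = [data, "\n", data, data, data, data, data, "\n", data, data, data]
# Blank line on line 4: treated as a separator although the chunk continues.
bad = [data, data, data, "\n", data, data, data, "\n", data, data, data]

print(split_points(ok, 3))   # [8]
print(split_points(bad, 3))  # [4, 8]
```

If the reported split points do not match the chunk boundaries you expect, lines_per_chunk is too small for that file, or the data needs a different separator criterion.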