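【Question title】: Dividing the large file into smaller chunks after specific character, in python?
【Posted】: 2020-07-03 04:45:10
【Question description】:

I am trying to read a large file (1.1 GB) into Python. The file contains the word "HERE", but I don't know on which line it appears. I read the file in chunks. My first chunk is the data up to the word "HERE", and my code works fine up to that point (that is, it stores the data before "HERE" and processes it). However, I cannot go on to read the data after "HERE", because that data is too large. Is there a way to read the data after "HERE" line by line? I used this question as a reference: Reading a file until a specific character in python. My code is: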

def each_chunk(stream, separator):
  buffer = ''
  while True:  # until EOF
    chunk = stream.read()  # I propose 4096 or so
    if not chunk:  # EOF?
      yield buffer
      break
    buffer += chunk
    while True:  # until no separator is found
      try:
        part, buffer = buffer.split(separator, 1)
      except ValueError:
        break
      else:
        yield part

def first_chunk(chunk):
    .... #my function

def chunk_after(data_line_by_line):
    .... #my function

global This_1st_chunk
This_1st_chunk=True

myFile= open(r"C:\Users\Mavis\myFile.txt","r")
for chunk in each_chunk(myFile, separator='HERE'):
    if This_1st_chunk:
        first_chunk(chunk)
        This_1st_chunk=False
    elif not This_1st_chunk:
        print('*******after 1st chunk*********')
        #**I WANT TO READ THE DATA LINE BY LINE HERE.**
        chunk_after(data_line_by_line)

【Question discussion】:

Tags: python file

【Solution 1】:

As I understand the question, you want to split a text file into smaller chunks at the HERE marker in Python. If that is right, try this:

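with open(r"C:\Users\Mavis\myFile.txt", "r") as file:  # path taken from the question
    Data = file.read()  # note: this loads the whole file into memory
    # creates a list where each item is the text between
    # the HERE markers, not including the markers themselves
    DataList = Data.split("HERE")
    for n, i in enumerate(DataList):
        # one output file per chunk; a single fixed name opened with "w"
        # would be overwritten on every iteration
        with open(f"Random_{n}.txt", "w") as f:
            f.write(i)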

This writes the different "chunks" out to files. You can do the same thing with newlines as the separator:

DataList = Data.split("\n")  # a list containing every line
for i in DataList:
    print(i)  # prints every line
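
For a file too large to hold in memory, a lighter alternative to splitting the whole string is to iterate over the file object itself, which hands back one line at a time. The sketch below is only an illustration: the helper name `iter_lines` is hypothetical, and `io.StringIO` stands in for a real file opened with `open(...)`.

```python
import io


def iter_lines(stream):
    """Yield each line of a text stream without its trailing newline.

    Hypothetical helper: only one line is held in memory at a time,
    unlike Data.split("\n"), which needs the whole file as one string.
    """
    for line in stream:  # a text file object yields one line per iteration
        yield line.rstrip("\n")


# io.StringIO stands in for a real file here
demo = io.StringIO("first\nsecond\nthird\n")
lines = list(iter_lines(demo))
print(lines)  # ['first', 'second', 'third']
```

Unlike `Data.split("\n")`, this does not produce a trailing empty string when the text ends with a newline.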

You can also use

file.readline()  # returns one line; call it on the file object, not on the string Data

You can join the pieces back together like this:

"string between the items".join(DataList)

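Hope this helps!

【Discussion】:

【Solution 2】:

The problem is that the .read() method reads the entire file by default. If the file is large enough, you will run out of memory. As the official documentation puts it: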

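To read a file's contents, call f.read(size), which reads some quantity of data and returns it as a string (in text mode) or bytes object (in binary mode). size is an optional numeric argument. When size is omitted or negative, the entire contents of the file will be read and returned; it's your problem if the file is twice as large as your machine's memory. Otherwise, at most size characters (in text mode) or size bytes (in binary mode) are read and returned. If the end of the file has been reached, f.read() will return an empty string ('').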

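You can find more information here: https://docs.python.org/3/tutorial/inputoutput.html.

Instead, as the documentation suggests, you can pass a size argument to read(), or use readline() to get a single line.

Example from the documentation: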

>>> f.read()
'This is the entire file.\n'
>>> f.read()
''
>>> f.readline()
'This is the first line of the file.\n'
>>> f.readline()
'Second line of the file\n'
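
To make the size argument concrete, here is a minimal sketch of reading a stream in bounded blocks. The helper name `read_in_blocks` and the default of 4096 characters are assumptions chosen for illustration (4096 echoes the comment in the question's code), and `io.StringIO` stands in for a real file.

```python
import io


def read_in_blocks(stream, size=4096):
    """Yield successive blocks of at most `size` characters from a text stream."""
    while True:
        block = stream.read(size)  # never reads more than `size` characters
        if not block:  # an empty string signals end of file
            break
        yield block


# io.StringIO stands in for a real file here
demo = io.StringIO("abcdefghij")
blocks = list(read_in_blocks(demo, size=4))
print(blocks)  # ['abcd', 'efgh', 'ij']
```

With a bounded read like this, memory use depends on the block size rather than on the size of the file.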

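【Discussion】:

【Solution 3】:

It may be simpler to read the file line by line up to the end of the first chunk (the one delimited by "HERE"), collect those lines, process that chunk, and then carry on reading the rest of the file line by line.

Something like this: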

with open(r"C:\Users\Mavis\myFile.txt", "r") as myFile:
    chunk = []
    remainder = ""
    first_chunk_found = False
    while not first_chunk_found:
        line = myFile.readline()
        if not line:  # EOF reached without finding "HERE"
            break
        if "HERE" in line:
            first_chunk_found = True
            # split only at the first "HERE" on this line
            line, remainder = line.split("HERE", 1)
            line += "HERE"  # current line up to "HERE"
        chunk.append(line)
    chunk = ''.join(chunk)
    # do whatever you want with the first chunk here.
    # also, the variable remainder has the rest of the line
    # that contained the word "HERE", in case you want it
    for line in myFile:
        # now we process the rest of the file line by line
        pass  # e.g. call chunk_after(line) from the question here
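
The same idea can be packaged as a reusable function. The following is a sketch only, not the answer author's code: the name `split_at_marker` is hypothetical, the marker is assumed not to span a line break, and unlike the snippet above the returned first chunk excludes the marker itself. `io.StringIO` stands in for a real file.

```python
import io
import itertools


def split_at_marker(stream, marker="HERE"):
    """Return (first_chunk, rest_lines) for a text stream.

    first_chunk is all text before the first `marker` (marker excluded).
    rest_lines lazily yields everything after the marker, line by line.
    """
    collected = []
    for line in stream:
        before, found, after = line.partition(marker)
        collected.append(before)
        if found:
            # Put the tail of the marker line in front of the unread lines.
            tail = [after] if after else []
            return "".join(collected), itertools.chain(tail, stream)
    # Marker never appeared: everything is the first chunk, nothing follows.
    return "".join(collected), iter(())


# io.StringIO stands in for a real file here
sample = io.StringIO("a\nb HERE c\nd\ne\n")
first, rest = split_at_marker(sample)
rest_lines = list(rest)
print(repr(first))  # 'a\nb '
print(rest_lines)   # [' c\n', 'd\n', 'e\n']
```

Only the first chunk is held in memory; the lines after the marker are read from the stream on demand as the caller iterates.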

【Discussion】: