Using a Python generator to process large text files

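【Question title】: Using a Python generator to process large text files
【Posted】: 2018-09-20 00:05:13
【Question】:

I'm new to using generators and have read up on them a bit, but I need some help processing large text files in chunks. I know this topic has been covered before, but the example code usually comes with very little explanation, which makes it hard to modify if you don't understand what is going on.

My problem is fairly simple: I have a series of large text files containing human genome sequencing data in the following format: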

chr22   1   0
chr22   2   0
chr22   3   1
chr22   4   1
chr22   5   1
chr22   6   2

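The files are between 1 GB and ~20 GB in size, which is too large to read into RAM. I would therefore like to read the lines in chunks/bins of 10,000 lines at a time, so that I can run calculations on the final column within each bin.

Based on the link here, I wrote the following: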

def read_large_file(file_object):
    """A generator function to read a large file lazily."""

    bin_size=5000
    start=0
    end=start+bin_size

    # Read a block from the file: data
    while True:
        data = file_object.readlines(end) 
        if not data:
            break
        start=start+bin_size
        end=end+bin_size
        yield data


def process_file(path):

    try:
        # Open a connection to the file
        with open(path) as file_handler:
            # Create a generator object for the file: gen_file
            for block in read_large_file(file_handler):
                print(block)
                # process block

    except (IOError, OSError):
        print("Error opening / processing file")    
    return    

if __name__ == '__main__':
    path = 'C:/path_to/input.txt'
    process_file(path)

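In the "process block" step, I expected the returned "block" object to be a list 10,000 elements long, but it isn't. The first list has 843 elements and the second has 2,394.

I want to get back "N" lines per block, but I'm very confused about what is happening here.

The solution here looks like it could help, but again I don't understand how to modify it to read N lines at a time.

This one here also looks like a really great solution, but again there isn't enough background explanation for me to understand it well enough to modify the code.

Any help would be greatly appreciated.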

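【Comments】:

Use pandas: pandas.pydata.org/pandas-docs/stable/generated/…
From the docs for readlines(): "If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes are read." So readlines(10000) will never give you 10,000 lines.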
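
A quick way to see the behaviour described in that comment is to call readlines() with a size hint on an in-memory file. This is a minimal sketch for Python 3, where the argument is called hint. The sample data and the hint value are made up for illustration, and the line counts on a real file will differ (Python 2's file object, for instance, may round the hint up to an internal buffer size). Note also that in the question's code the hint (`end`) grows on every pass, so later blocks come back larger.

```python
import io

# Hypothetical sample data: 1,000 identical lines of 10 characters each.
line = "chr22\t1\t0\n"
sample = io.StringIO(line * 1000)

# The argument is a size hint (characters/bytes), not a line count.
hint = 25
first = sample.readlines(hint)

print(len(first))                  # far fewer than 1,000 lines
print(sum(len(s) for s in first))  # total size is roughly the hint
```

Reading stops once the accumulated size of the lines passes the hint, which is why the block length depends on line width rather than on a fixed number of lines.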

Tags: python generator large-files chunks

【Solution 1】:

Instead of playing with offsets in the file, try building and yielding lists of 10,000 elements from a loop:

def read_large_file(file_handler, block_size=10000):
    block = []
    for line in file_handler:
        block.append(line)
        if len(block) == block_size:
            yield block
            block = []

    # don't forget to yield the last block
    if block:
        yield block

with open(path) as file_handler:
    for block in read_large_file(file_handler):
        print(block)
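
To tie this back to the question's goal of computing something over the last column of each bin, the same idea can also be written with itertools.islice. This is only a sketch: read_in_blocks and block_sums are hypothetical names, the sum is a placeholder calculation, and it assumes whitespace-separated columns with an integer in the last column.

```python
import io
from itertools import islice


def read_in_blocks(file_handler, block_size=10000):
    """Yield lists of up to block_size lines from an open file."""
    while True:
        block = list(islice(file_handler, block_size))
        if not block:
            return
        yield block


def block_sums(file_handler, block_size=10000):
    """Yield the sum of the last column for each block (placeholder calculation)."""
    for block in read_in_blocks(file_handler, block_size):
        # Assumes whitespace-separated columns with an integer last column.
        yield sum(int(line.split()[-1]) for line in block if line.strip())


# Small in-memory stand-in for a real file, using the rows from the question.
sample = io.StringIO(
    "chr22\t1\t0\nchr22\t2\t0\nchr22\t3\t1\n"
    "chr22\t4\t1\nchr22\t5\t1\nchr22\t6\t2\n"
)
sums = list(block_sums(sample, block_size=4))
print(sums)
```

Only one block of lines is held in memory at a time, so the same functions should work on an open handle to a multi-gigabyte file.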

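【Comments】:

This works well, and thanks for the explanation. I've accepted it as the answer since it is a complete working solution. That said, I decided to go with the pandas solution suggested by Dimitrii K, as it is very concise and easy to understand. I'll post my code below.

【Solution 2】:

In case it helps anyone else with a similar problem, here is a solution based on the one linked here: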

import pandas as pd

def process_file(path,binSize):

    for chunk in pd.read_csv(path, sep='\t', chunksize=binSize):
        print(chunk)
        print(chunk.iloc[:, 2])  # get 3rd col by position (.ix was removed in pandas 1.0)
        # Do something with chunk....  

if __name__ == '__main__':
    path='path_to/infile.txt'
    binSize=5000
    process_file(path,binSize)
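
One thing to watch with this approach: the sample file in the question has no header row, and read_csv treats the first line as column names by default. The sketch below assumes a headerless, whitespace-separated file and uses made-up column names (chrom, pos, depth). The function name chunk_means is hypothetical and the per-chunk mean is just a placeholder calculation.

```python
import io

import pandas as pd


def chunk_means(source, bin_size):
    """Return the mean of the third column for each chunk (placeholder calculation)."""
    means = []
    reader = pd.read_csv(
        source,
        sep=r"\s+",                       # assumes tab- or space-separated columns
        header=None,                      # the sample data has no header row
        names=["chrom", "pos", "depth"],  # hypothetical column names
        chunksize=bin_size,
    )
    for chunk in reader:
        means.append(float(chunk["depth"].mean()))
    return means


# Small in-memory stand-in for a real file, using the rows from the question.
sample = io.StringIO(
    "chr22\t1\t0\nchr22\t2\t0\nchr22\t3\t1\n"
    "chr22\t4\t1\nchr22\t5\t1\nchr22\t6\t2\n"
)
means = chunk_means(sample, bin_size=4)
print(means)
```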

【Comments】:

【Solution 3】:

Not a proper answer, but finding out the reason for this behaviour takes about 27 seconds:

(blook)bruno@bigb:~/Work/blookup/src/project$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
pythonrc start
pythonrc done
>>> help(file.readlines)

Help on method_descriptor:

readlines(...)
    readlines([size]) -> list of strings, each a line from the file.

    Call readline() repeatedly and return a list of the lines so read.
    The optional size argument, if given, is an approximate bound on the
    total number of bytes in the lines returned.

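I understand that not everyone here is a professional programmer, and of course the documentation is not always enough to solve a problem (I'm happy to answer those kinds of questions), but the number of questions whose answer is spelled out in plain text at the very start of the documentation is getting a bit annoying.

【Comments】:

To be fair, I did suspect the output had something to do with the function working in bytes rather than lines, but knowing that alone doesn't really help with getting a function to return objects by line count, which was the main goal of this post.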