Python: go to a line in a text file without reading the previous lines (answers)

【Question title】: Python goto text file line without reading previous lines
【Posted】: 2015-06-30 05:01:15
【Question description】:

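I am working with a very large text file (tsv) with about 200 million entries. One of the columns is a date, and the records are sorted by date. I now want to start reading records from a given date. At the moment I simply read from the beginning, which is very slow because I have to read roughly 100 to 150 million records before I reach the one I want. I was thinking that if I could use binary search to speed this up, I could get there with at most about 28 extra record reads (log2 of 200 million). Does Python allow reading the n-th line without caching or reading the lines before it?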

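【Comments on the question】:

Unless your lines all have a fixed length, Python cannot simply know where a line is. It has to read everything to find the \n characters that mark the line ends. Unless your data structure lets you compute the byte offset of a line ending somehow, there is no magic workaround.
Possible duplicate of How to jump to a particular line in a huge text file?
@deceze Yes, you are right, Python has no way of knowing where the '\n' characters are. Unfortunately my current file does not have a fixed line size in bytes. I will keep this in mind for the future. Once the byte size of a line is known, how do I skip lines?
If you have to do this regularly, it may be worth converting the tsv into a database (such as sqlite) and putting an index on the column of interest.
@Naman Hard to say. There is obviously the overhead of importing into the database (once per file). Once the data is in the database, I would guess that querying and extracting is at least as fast as reading the tsv, but I am not sure. You should build a quick prototype database with dummy data to find out.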
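The database suggestion in the comments above could be prototyped roughly as follows. This is only a sketch under assumptions that are not in the original thread: the date is the first tab-separated column, it is stored in ISO `YYYY-MM-DD` form (so plain text comparison orders it correctly), and the table name `records`, the index name and the helper names are made up for illustration.

```python
import sqlite3


def load_tsv(conn, tsv_path):
    """Import a TSV into SQLite and index the date column.

    Assumes (hypothetically) that column 0 is an ISO date; the rest of
    each line is stored unparsed in `payload`.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS records (date TEXT, payload TEXT)")
    with open(tsv_path, encoding="utf-8") as fh:
        rows = (
            # partition() splits on the first tab only: (date, "\t", rest)
            line.rstrip("\n").partition("\t")[::2]
            for line in fh
            if line.strip()
        )
        conn.executemany("INSERT INTO records VALUES (?, ?)", rows)
    # The index is what lets a date lookup avoid a full table scan.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_records_date ON records (date)")
    conn.commit()


def records_from(conn, start_date):
    """Yield (date, payload) rows with date >= start_date, in date order."""
    return conn.execute(
        "SELECT date, payload FROM records WHERE date >= ? ORDER BY date",
        (start_date,),
    )
```

The import still reads the whole file once, which is the one-off cost mentioned in the comments; only the later queries benefit from the index.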

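Tags: python

【Solution 1】:

If the lines are not of fixed length, you are out of luck: something has to read the file. If the lines are of fixed length, you can open the file, call file.seek(line*linesize), and then read from there.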
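For variable-length lines, the file does have to be read, but possibly only once: a single pass can record where each line starts, and later lookups can seek directly. The sketch below is an illustration, not part of the original answer; `build_line_index` and `read_line` are hypothetical helper names, and UTF-8 text is assumed.

```python
from array import array


def build_line_index(path):
    """Scan the file once and return the byte offset of every line start."""
    offsets = array("Q")  # unsigned 64-bit ints, far more compact than a list
    position = 0
    with open(path, "rb") as fh:  # binary mode: offsets are exact byte counts
        for line in fh:
            offsets.append(position)
            position += len(line)
    return offsets


def read_line(path, offsets, n):
    """Return line n (0-based) by seeking straight to its recorded offset."""
    with open(path, "rb") as fh:
        fh.seek(offsets[n])
        return fh.readline().decode("utf-8").rstrip("\r\n")
```

With about 200 million lines the index itself takes roughly 1.6 GB at 8 bytes per entry, so in practice it might be written to disk or sampled (for example every 1000th line) rather than kept fully in memory.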

【Comments】:

【Solution 2】:

If the file to read is large and you do not want to load the whole file into memory at once:

with open("file") as fp:
    for i, line in enumerate(fp):
        if i == 25:
            print(line)  # 26th line
        elif i == 29:
            print(line)  # 30th line
        elif i > 29:
            break  # nothing further is needed, stop reading

Note that i == n-1 on the n-th line.
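The same idea can be written more compactly with `itertools.islice`. This is a sketch with a hypothetical helper name; like the loop above, it still reads and discards every line before `start`, so it saves memory but not I/O.

```python
from itertools import islice


def read_lines(path, start, stop):
    """Return lines start..stop-1 (0-based) without loading the whole file.

    Lines before `start` are still read from disk and thrown away.
    """
    with open(path) as fp:
        return [line.rstrip("\n") for line in islice(fp, start, stop)]
```

For example, `read_lines("file", 25, 30)` would return the 26th through 30th lines.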

【Comments】:

【Solution 3】:

You can use the fileObject.seek(offset[, whence]) method.

#offset -- This is the position of the read/write pointer within the file.

#whence -- This is optional and defaults to 0 which means absolute file positioning, other values are 1 which means seek relative to the current position and 2 means seek relative to the file's end.
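To make the three `whence` values concrete, here is a small self-contained sketch (the function name is made up for illustration). It uses a temporary file opened in binary mode, because in Python 3 text-mode files only support seeking relative to the start of the file.

```python
import os
import tempfile


def demo_whence():
    """Show seek() with each whence value on a 10-byte temporary file."""
    with tempfile.TemporaryFile() as fh:  # binary read/write mode by default
        fh.write(b"0123456789")

        fh.seek(3, os.SEEK_SET)   # whence=0: absolute position 3
        from_start = fh.read(2)   # reads bytes 3-4, position is now 5

        fh.seek(2, os.SEEK_CUR)   # whence=1: 2 bytes forward from position 5
        from_current = fh.read(1)  # reads byte 7

        fh.seek(-2, os.SEEK_END)  # whence=2: 2 bytes before the end of file
        from_end = fh.read()      # reads bytes 8-9

        return from_start, from_current, from_end
```

Calling `demo_whence()` returns `(b"34", b"7", b"89")`.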


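# Binary mode, so that seek offsets are exact byte counts
with open("test.txt", "rb") as file:
    # 6 digits plus the line ending; 8 assumes "\r\n" endings (use 7 for "\n")
    line_size = 8
    line_number = 5
    file.seek(line_number * line_size, 0)
    for i in range(5):
        print(file.readline().decode().rstrip())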

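For this code, I used the following file:

【Comments】:

【Solution 4】:

Python cannot skip "lines" in a file. The best approach I know of is to use a generator that yields lines based on some condition, e.g. date > 'YYYY-MM-DD'. At the very least this reduces memory usage and the time spent on I/O.

Example: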

# using python 3.4 syntax (parameter type annotation)

from datetime import datetime

def yield_right_dates(filepath: str, mydate: datetime):

    with open(filepath, 'r') as myfile:

        # assume:
        #    the file is tab separated (because .tsv is the extension)
        #    the date column has column-index == 0
        #    the date format is '%Y-%m-%d'
        for line in myfile:
            line_splt = line.split('\t')
            if datetime.strptime(line_splt[0], '%Y-%m-%d') > mydate:
                yield line_splt

# note: datetime(2015,01,01) is a syntax error in Python 3 (leading zeros)
my_file_gen = yield_right_dates(filepath='/path/to/my/file', mydate=datetime(2015, 1, 1))
# then you can do whatever processing you need on the stream, or put it in one giant list.
desired_lines = [line for line in my_file_gen]

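But this still limits you to a single processor :(

Assuming you are on a Unix-like system with bash as your shell, I would split the file with the shell utility split, and then use multiprocessing together with the generator defined above.

I do not have a large file to test with right now, but I will update this answer later with a benchmark that compares iterating over the whole file against splitting it and then iterating over the pieces with the generator and the multiprocessing module.

With more knowledge about the file (for example, whether all the desired dates are clustered at the beginning | middle | end), you may be able to optimize the reading further.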
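One piece of knowledge the question does provide is that the file is sorted by date. Under that assumption, a binary search over byte offsets (rather than line numbers) is possible even with variable-length lines: seek to the middle byte, skip to the next line start, and compare that line's date. The following is a sketch, not the answer author's method; the function names are hypothetical, and it assumes the date is the first tab-separated column in `YYYY-MM-DD` form so that byte-wise comparison matches date order.

```python
import os


def _next_line_start(fh, pos):
    """Position fh at the first line start at or after byte pos; return it."""
    if pos == 0:
        fh.seek(0)
        return 0
    fh.seek(pos - 1)
    fh.readline()  # consume through the newline of the line holding byte pos-1
    return fh.tell()


def find_first_line(path, target, key=lambda line: line.split(b"\t", 1)[0]):
    """Return the byte offset of the first line whose key is >= target.

    Requires the file to be sorted by key. Returns the file size if every
    line's key is smaller than target.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        lo, hi = 0, size
        while lo < hi:
            mid = (lo + hi) // 2
            _next_line_start(fh, mid)
            line = fh.readline()
            if not line or key(line) >= target:
                hi = mid      # the answer starts at or before this point
            else:
                lo = mid + 1  # the answer starts strictly after byte mid
        return _next_line_start(fh, lo)


def lines_from(path, offset):
    """Yield raw lines (bytes) starting at the given byte offset."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        yield from fh
```

A call such as `lines_from(path, find_first_line(path, b"2015-01-01"))` would then stream only the records from that date onward. The number of probes grows with the logarithm of the file size in bytes, so it is a few dozen seeks even for a file of many gigabytes.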

【Comments】:

Can you explain how using yield helps?
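As a rough illustration of what yield contributes (this is not a reply from the answer's author, and all names are made up): a generator produces one item at a time on demand, so nothing is stored beyond the current row and the consumer can stop early. It does not avoid reading the lines that come before the first match.

```python
from itertools import islice


def numbered_source(limit, consumed):
    """Stand-in for a large file: records each row it actually produces."""
    for i in range(limit):
        consumed.append(i)
        yield "row %d" % i


def keep_even(rows):
    """Lazily pass through rows whose number is even."""
    for row in rows:
        if int(row.split()[1]) % 2 == 0:
            yield row


consumed = []
# Ask for only the first three matches out of a million potential rows.
first_three = list(islice(keep_even(numbered_source(1000000, consumed)), 3))
```

Here `first_three` is `['row 0', 'row 2', 'row 4']` and only five source rows are ever produced, whereas a list-based version would build all one million rows before filtering.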