How to read only specific rows from a text file? [duplicate]

【Question title】: How to read only specific rows from a text file? [duplicate]
【Posted】: 2015-08-15 16:53:28
【Question description】:

I am trying to process data stored in a text file, test.dat, that looks like this:

-1411.85  2.6888   -2.09945   -0.495947   0.835799   0.215353   0.695579   
-1411.72  2.82683   -0.135555   0.928033   -0.196493   -0.183131   -0.865999   
-1412.53  0.379297   -1.00048   -0.654541   -0.0906588   0.401206   0.44239   
-1409.59  -0.0794765   -2.68794   -0.84847   0.931357   -0.31156   0.552622   
-1401.63  -0.0235102   -1.05206   0.065747   -0.106863   -0.177157   -0.549252   
....
....

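The file is several GB in size, so I would very much like to read it in small blocks of rows. I would like to use numpy's loadtxt function, since it quickly converts everything to a numpy array. However, I have not managed to do so, because the function only seems to offer a selection of columns, as here: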

data = np.loadtxt("test.dat", delimiter='  ', skiprows=1, usecols=range(1,7))

Any ideas on how to achieve this? If it is not possible with loadtxt, are there any other options in Python?

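【Comments】:

The fname argument of loadtxt can be a generator, so to read small blocks of rows, use a file-reading generator like the one shown in nosklo's answer at stackoverflow.com/questions/519633/…, but changed to read a small number of lines rather than bytes.
See also: stackoverflow.com/a/27962976/901925 - Fastest way to read every n-th row with numpy's genfromtxt

Tags: python numpy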
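
As an illustration of the "every n-th row" idea named in the link above, here is a minimal sketch. It assumes whitespace-separated columns, and the helper name read_every_nth_row is made up for this example; it is not taken from the linked answer. The sample rows are the ones from the question, written to a temporary file so the snippet runs on its own.

```python
import itertools
import os
import tempfile

import numpy as np

# Sample rows copied from the question.
SAMPLE = """\
-1411.85  2.6888   -2.09945   -0.495947   0.835799   0.215353   0.695579
-1411.72  2.82683   -0.135555   0.928033   -0.196493   -0.183131   -0.865999
-1412.53  0.379297   -1.00048   -0.654541   -0.0906588   0.401206   0.44239
-1409.59  -0.0794765   -2.68794   -0.84847   0.931357   -0.31156   0.552622
-1401.63  -0.0235102   -1.05206   0.065747   -0.106863   -0.177157   -0.549252
"""


def read_every_nth_row(path, n):
    """Parse rows 0, n, 2n, ... of a whitespace-separated text file."""
    with open(path) as f_in:
        # islice with a step hands genfromtxt only every n-th line;
        # the skipped lines are still read from disk but never parsed.
        return np.genfromtxt(itertools.islice(f_in, 0, None, n), dtype=float)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.dat")
    with open(path, "w") as f_out:
        f_out.write(SAMPLE)
    every_second = read_every_nth_row(path, 2)

print(every_second.shape)
```

The whole file is still scanned line by line, but only the selected lines are converted to floats, which is usually the expensive part.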

【Solution 1】:

hpaulj pointed me in the right direction with his comment.

The following code works perfectly for me:

import itertools

import numpy as np

with open('test.dat') as f_in:
    # islice(f_in, 1, 12) skips line 0 and passes lines 1 to 11 to genfromtxt
    x = np.genfromtxt(itertools.islice(f_in, 1, 12, None), dtype=float)
    print(x[0, :])

Thanks a lot!

【Discussion】:
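
The snippet above reads one fixed range of lines. To walk through the whole file block by block, the same islice idea can be repeated on a single open file handle. The following is only a sketch under assumptions: the helper name iter_row_blocks is hypothetical, the columns are assumed to be whitespace-separated, and np.loadtxt is used in place of genfromtxt so that ndmin=2 can keep every block two-dimensional.

```python
import itertools
import os
import tempfile

import numpy as np

# Sample rows copied from the question.
SAMPLE = """\
-1411.85  2.6888   -2.09945   -0.495947   0.835799   0.215353   0.695579
-1411.72  2.82683   -0.135555   0.928033   -0.196493   -0.183131   -0.865999
-1412.53  0.379297   -1.00048   -0.654541   -0.0906588   0.401206   0.44239
-1409.59  -0.0794765   -2.68794   -0.84847   0.931357   -0.31156   0.552622
-1401.63  -0.0235102   -1.05206   0.065747   -0.106863   -0.177157   -0.549252
"""


def iter_row_blocks(path, nrows):
    """Yield consecutive blocks of up to nrows rows as 2-D float arrays."""
    with open(path) as f_in:
        while True:
            # Each islice call resumes where the previous one stopped,
            # because the file object keeps its position.
            lines = list(itertools.islice(f_in, nrows))
            if not lines:
                break
            # ndmin=2 keeps a one-row final block two-dimensional.
            yield np.loadtxt(lines, ndmin=2)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.dat")
    with open(path, "w") as f_out:
        f_out.write(SAMPLE)
    shapes = [block.shape for block in iter_row_blocks(path, 2)]

print(shapes)
```

Only nrows lines are held in memory at a time, so the block size can be tuned to the available RAM.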

【Solution 2】:

If you can use pandas, it is even simpler:

In [2]: import pandas as pd

In [3]: df = pd.read_table('test.dat', delimiter='  ', skiprows=1, usecols=range(1,7), nrows=3, header=None)

In [4]: df.values
Out[4]:
array([[ 2.82683  , -0.135555 ,  0.928033 , -0.196493 , -0.183131 ,
        -0.865999 ],
       [ 0.379297 , -1.00048  , -0.654541 , -0.0906588,  0.401206 ,
         0.44239  ],
       [-0.0794765, -2.68794  , -0.84847  ,  0.931357 , -0.31156  ,
         0.552622 ]])

Edit

If you want to read k rows at a time, you can specify chunksize. For example,

reader = pd.read_table('test.dat', delimiter='  ', usecols=range(1,7), header=None, chunksize=2)
for chunk in reader:
    print(chunk.values)

Output:

[[ 2.6888   -2.09945  -0.495947  0.835799  0.215353  0.695579]
 [ 2.82683  -0.135555  0.928033 -0.196493 -0.183131 -0.865999]]
[[ 0.379297  -1.00048   -0.654541  -0.0906588  0.401206   0.44239  ]
 [-0.0794765 -2.68794   -0.84847    0.931357  -0.31156    0.552622 ]]
[[-0.0235102 -1.05206    0.065747  -0.106863  -0.177157  -0.549252 ]]

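How you store or process the chunks inside the for loop is up to you. Note that in this case reader is a TextFileReader, not a DataFrame, so you can iterate over it lazily.

You can read this for more details.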
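
One way to handle the chunks inside the loop is to reduce each one immediately, so that the full file never has to sit in memory. The sketch below keeps running column sums to compute column means. It is an illustration only: io.StringIO stands in for the real file, and sep=r"\s+" (any run of whitespace) is an assumption about the file format rather than the delimiter used above.

```python
import io

import numpy as np
import pandas as pd

# Sample rows copied from the question; StringIO stands in for test.dat.
SAMPLE = """\
-1411.85  2.6888   -2.09945   -0.495947   0.835799   0.215353   0.695579
-1411.72  2.82683   -0.135555   0.928033   -0.196493   -0.183131   -0.865999
-1412.53  0.379297   -1.00048   -0.654541   -0.0906588   0.401206   0.44239
-1409.59  -0.0794765   -2.68794   -0.84847   0.931357   -0.31156   0.552622
-1401.63  -0.0235102   -1.05206   0.065747   -0.106863   -0.177157   -0.549252
"""

reader = pd.read_csv(
    io.StringIO(SAMPLE),
    sep=r"\s+",           # assumption: columns separated by runs of whitespace
    header=None,
    usecols=range(1, 7),  # same column selection as in the answer
    chunksize=2,
)

column_sums = np.zeros(6)
row_count = 0
for chunk in reader:
    values = chunk.to_numpy()          # ndarray with at most 2 rows
    column_sums += values.sum(axis=0)  # fold the chunk into the running total
    row_count += len(values)

column_means = column_sums / row_count
print(row_count)
print(column_means)
```

If the individual blocks are needed later, appending chunk.to_numpy() to a list works as well, at the cost of holding everything in memory again.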

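【Discussion】:

I can't see how I would read the first three rows, then the next three, and so on. Could you explain? Thanks for your effort!
Do you mean reading the first three rows into one ndarray, then the next three into another ndarray, and so on?
Yes, that is what I need!
@andi That was not stated very clearly in your question, though. I did not get it at first either.
Possibly with a try: read_table(...) / except EOFError: break statement nested inside an infinite while loop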
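
The "three rows, then the next three" request can be sketched with an explicit loop in the spirit of the last comment. This is an illustration under assumptions: it uses iterator=True together with TextFileReader.get_chunk, treats the file as whitespace-separated via sep=r"\s+", and relies on the exhausted reader raising StopIteration, which is how pandas behaves as far as I know (rather than the EOFError guessed in the comment).

```python
import io

import pandas as pd

# Sample rows copied from the question; StringIO stands in for test.dat.
SAMPLE = """\
-1411.85  2.6888   -2.09945   -0.495947   0.835799   0.215353   0.695579
-1411.72  2.82683   -0.135555   0.928033   -0.196493   -0.183131   -0.865999
-1412.53  0.379297   -1.00048   -0.654541   -0.0906588   0.401206   0.44239
-1409.59  -0.0794765   -2.68794   -0.84847   0.931357   -0.31156   0.552622
-1401.63  -0.0235102   -1.05206   0.065747   -0.106863   -0.177157   -0.549252
"""

# iterator=True returns a TextFileReader instead of a DataFrame.
reader = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+", header=None, iterator=True)

blocks = []
while True:
    try:
        # Ask for the next three rows; the last block may be shorter.
        blocks.append(reader.get_chunk(3).to_numpy())
    except StopIteration:
        # No rows left in the file.
        break

print([block.shape for block in blocks])
```

Passing chunksize=3 and using a plain for loop over the reader should give the same blocks without the explicit try/except.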

【Solution 3】:

You might want to use the grouper recipe from the itertools documentation.

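from itertools import islice, zip_longest  # izip_longest was renamed in Python 3

import numpy as np


def grouper(n, iterable, fillvalue=None):
    # Collect items into groups of length n, padding the last one with fillvalue.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


def lazy_reader(fp, nlines, sep, skiprows, usecols):
    with open(fp) as inp:
        # Skip the leading rows once, not at the start of every chunk.
        lines = islice(inp, skiprows, None)
        for chunk in grouper(nlines, lines, ""):
            # Drop the "" padding that fills out the final group.
            rows = [line for line in chunk if line]
            yield np.loadtxt(rows, delimiter=sep, usecols=usecols)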

The function returns a generator of arrays.

lazy_data = lazy_reader(...)
next(lazy_data)  # this will give you the next chunk
# or you can iterate 
for chunk in lazy_data:
    ...

【Discussion】: