How do I get one specific chunk from a Python iterator? Answers

【Question title】: How to get an exact one of python iterator?
【Posted】: 2017-03-25 16:46:21
【Question】:

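I am using pandas to read a very large CSV file (10 GB), and read_csv(filename, chunksize=chunksize) returns an iterator (say it is named "reader"). Now I want to get one specific chunk, because I only need a few rows (for example, the CSV file has 1000000000 rows and I want row 50000000 and the 1000 rows after it). How can I do this other than iterating through the reader until it reaches the chunk I want?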

This is my previous code:

import pandas as pd

# get_file_line_no and get_chunk_size are the asker's own helpers (not shown here).
def get_lines_by_chunk(file_name, line_beg, line_end, chunk_size=-1):
    line_no = get_file_line_no(file_name)

    if chunk_size < 0:
        chunk_size = get_chunk_size(line_no, line_beg, line_end)

    reader = pd.read_csv(file_name, chunksize=chunk_size)
    data = pd.DataFrame({})

    flag = 0  # number of chunks consumed so far

    for chunk in reader:
        # The current chunk covers rows [line_before, line_after).
        line_before = flag * chunk_size
        flag = flag + 1
        line_after = flag * chunk_size
        if line_beg >= line_before and line_beg <= line_after:
            # The requested range starts inside this chunk.
            if line_end >= line_after:
                temp = chunk[line_beg - line_before : chunk_size]
                data = pd.concat([data, temp], ignore_index=True)
            else:
                temp = chunk[line_beg - line_before : line_end - line_before]
                data = pd.concat([data, temp], ignore_index=True)
                return data
        elif line_end <= line_after and line_end >= line_before:
            # The requested range ends inside this chunk.
            temp = chunk[0 : line_end - line_before]
            data = pd.concat([data, temp], ignore_index=True)
            return data
        elif line_beg < line_before and line_end > line_after:
            # The whole chunk lies inside the requested range.
            temp = chunk[0 : chunk_size]
            data = pd.concat([data, temp], ignore_index=True)

    return data

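【Comments】:

Can't you just do df = pd.read_csv(file_name, skiprows=50000000, nrows=1000)?
Oh... that seems to work. I'm new to pandas.
The title "How to get an exact one of python iterator?" doesn't make any sense to me. Could you rephrase it?
I mean that pandas.read_csv returns an iterator 'i' when it is given a chunksize, and I want the result of i.next().next().next()... (say, 500 next calls) without doing 500 iterations, fetching it directly the way you index an array...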
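
As the last comment points out, the reader is an iterator, so it offers no array-style indexing. Below is a minimal sketch of how `itertools.islice` can at least hide the loop. It assumes a small in-memory CSV as a stand-in for the real file, and `nth_chunk` is a hypothetical helper name. The preceding chunks are still read and parsed, they are only discarded:

```python
import io
from itertools import islice

import pandas as pd

# Hypothetical in-memory CSV standing in for the large file (100 data rows).
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1, 101))


def nth_chunk(source, chunk_index, chunk_size):
    """Return the chunk at 0-based position chunk_index, or None if it does not exist."""
    reader = pd.read_csv(source, chunksize=chunk_size)
    # islice advances the iterator lazily: earlier chunks are parsed
    # and thrown away, so this saves memory, not reading time.
    return next(islice(reader, chunk_index, chunk_index + 1), None)


chunk = nth_chunk(io.StringIO(csv_text), chunk_index=3, chunk_size=10)
print(chunk["a"].tolist())
```

Because the work of parsing the skipped chunks remains, this is mostly a readability improvement over a hand-written loop.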

Tags: python csv pandas dataframe io

【Solution 1】:

If you need to read a CSV file in chunks of varying size, you can use iterator=True:

Suppose we have a DF with 1000 rows (see the Setup section for how it was generated)

In [103]: reader = pd.read_csv(fn, iterator=True)

In [104]: reader.get_chunk(5)
Out[104]:
   a   b
0  1   8
1  2  28
2  3  85
3  4  56
4  5  29

In [105]: reader.get_chunk(3)
Out[105]:
   a   b
5  6  55
6  7  16
7  8  96

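Note: get_chunk cannot skip data; each call reads the next block of the requested size, continuing from where the previous call stopped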
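
The sequential behaviour described in the note can be checked with a small self-contained sketch. The in-memory CSV below is an assumption made for illustration and is not the answer's test file:

```python
import io

import pandas as pd

# Hypothetical 20-row CSV held in memory.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 21))

reader = pd.read_csv(io.StringIO(csv_text), iterator=True)

first = reader.get_chunk(5)   # data rows 1..5
second = reader.get_chunk(3)  # continues with data rows 6..8, nothing is skipped

print(first["a"].tolist())
print(second["a"].tolist())
print(second.index.tolist())  # the index keeps counting across chunks

reader.close()
```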

If you only want to read rows 100 - 109 (10 rows):

In [106]: cols = pd.read_csv(fn, nrows=1).columns.tolist()

In [107]: cols
Out[107]: ['a', 'b']

In [109]: pd.read_csv(fn, header=None, skiprows=100, nrows=10, names=cols)
Out[109]:
     a   b
0  100  52
1  101  15
2  102  74
3  103  10
4  104  35
5  105  73
6  106  48
7  107  49
8  108   1
9  109  56
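
The two-step approach above (read the column names first, then skip rows with header=None) can also be folded into a single call by passing a range to skiprows so that the header line is kept. This is a sketch under assumptions: `read_rows` is a hypothetical helper name, and the in-memory CSV stands in for the real file:

```python
import io

import pandas as pd


def read_rows(source, first_row, n_rows):
    """Read n_rows data rows starting at the 1-based data row first_row."""
    # File line 0 is the header, so data row k sits on file line k.
    # Skipping lines 1 .. first_row - 1 keeps the header and drops
    # only the data rows before the requested one.
    return pd.read_csv(source, skiprows=range(1, first_row), nrows=n_rows)


# Hypothetical 1000-row CSV similar in shape to the one in the Setup section.
csv_text = "a,b\n" + "\n".join(f"{i},{i % 7}" for i in range(1, 1001))

result = read_rows(io.StringIO(csv_text), first_row=100, n_rows=10)
print(result.columns.tolist())
print(result["a"].tolist())
```

The skipped lines still have to be scanned by the parser, and for a start position in the tens of millions the set of skipped line numbers may itself take noticeable memory.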

But if you can use the HDF5 format, this becomes much easier and faster:

Let's first save the data as HDF5:

In [110]: df.to_hdf('c:/temp/test.h5', 'mydf', format='t', data_columns=True, complib='blosc', complevel=9)

Now we can read it by index position, like this:

In [113]: pd.read_hdf('c:/temp/test.h5', 'mydf', start=99, stop=109)
Out[113]:
       a   b
99   100  52
100  101  15
101  102  74
102  103  10
103  104  35
104  105  73
105  106  48
106  107  49
107  108   1
108  109  56

Or with a query (SQL-like):

In [115]: pd.read_hdf('c:/temp/test.h5', 'mydf', where="a >= 100 and a <= 110")
Out[115]:
       a   b
99   100  52
100  101  15
101  102  74
102  103  10
103  104  35
104  105  73
105  106  48
106  107  49
107  108   1
108  109  56
109  110  23

Setup:

In [99]: df = pd.DataFrame({'a':np.arange(1, 1001), 'b':np.random.randint(0, 100, 1000)})

In [100]: fn = r'C:\Temp\test.csv'

In [101]: df.to_csv(fn, index=False)

In [102]: df.shape
Out[102]: (1000, 2)

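【Comments】:

Thanks. By the way, do you know how pandas.read_csv(skiprows=skiprows) works? Does it use the C engine?
@flyingrose, yes, it should use the C engine, unless it warns you that the C engine cannot be used because of "" ...
But how does it work? In what way does it ignore the first rows? By reading in chunks?
@flyingrose, sorry, I don't understand... Could you open a new question with a small reproducible example (including a small sample data set)?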
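
The thread leaves the question of how rows get skipped unanswered. Whatever the engine does internally, CSV rows have variable length, so reaching row N of a plain file requires scanning for line breaks at least once. The sketch below is not a description of pandas internals. It shows a different technique, a one-time byte-offset index that later lookups can seek into. The helper names `build_line_offsets` and `read_rows_at` are hypothetical, and the code assumes that no quoted field contains an embedded newline:

```python
import io
import os
import tempfile

import pandas as pd


def build_line_offsets(path):
    """Scan the file once and record the byte offset at which each line starts."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets


def read_rows_at(path, offsets, first_row, n_rows):
    """Read n_rows data rows starting at the 1-based data row first_row via seek."""
    with open(path, "rb") as f:
        header = f.readline()
        # Line 0 is the header, so data row k starts at offsets[k].
        f.seek(offsets[first_row])
        lines = [f.readline() for _ in range(n_rows)]
    return pd.read_csv(io.BytesIO(header + b"".join(lines)))


# Hypothetical 1000-row file written to a temporary location for the demo.
rows = "a,b\n" + "".join(f"{i},{i % 7}\n" for i in range(1, 1001))
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write(rows)
    path = tmp.name

offsets = build_line_offsets(path)  # one full scan, reusable afterwards
result = read_rows_at(path, offsets, first_row=100, n_rows=10)
os.remove(path)

print(result["a"].tolist())
```

The index costs one full pass over the file, so it only pays off when the same file is queried repeatedly; for a single lookup, skiprows with nrows is simpler.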