Why does Pandas skip the first set of chunks when iterating over a CSV in my code? (Answers)

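【Question title】: Why does Pandas skip first set of chunks when iterating over csv in my code
【Posted】: 2017-02-05 13:02:35
【Question description】:

I have a very large CSV file that I read iteratively using pandas' chunk functionality. The problem: with chunksize=2, for example, the first 2 rows are skipped and the first chunk I receive contains rows 3-4.

Basically, if I read the CSV with nrows=4, I get the first 4 rows, whereas chunking the same file with chunksize=2 gives me rows 3 and 4, then rows 5 and 6, ...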

#1. Read with nrows
import pandas as pd

#read first 4 rows of the csv file and merge the date and time columns to be used as index
reader = pd.read_csv('filename.csv', delimiter=',', parse_dates={"Datetime": [1, 2]}, index_col=[0], nrows=4)

print(reader)

01/01/2016 - 09:30 - A - 100
01/01/2016 - 13:30 - A - 110
01/01/2016 - 15:30 - A - 120
02/01/2016 - 10:30 - A - 115

#2. Iterate over csv file with chunks
#iterate over csv file in chunks and merge date and time column to be used as index
reader = pd.read_csv('filename.csv', delimiter=',', parse_dates={"Datetime" : [1,2]}, index_col=[0], chunksize=2)

for chunk in reader:
    #create a dataframe from chunks
    df = reader.get_chunk()
    print(df)

01/01/2016 - 15:30 - A - 120
02/01/2016 - 10:30 - A - 115

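Increasing the chunk size to 10 skips the first 10 rows.

Is there any way to fix this? I have already found a workaround that works, but I would like to know what I am doing wrong.

Any input is appreciated!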

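【Question discussion】:

Don't call get_chunk. Since you are iterating over the reader, you already have your chunks, i.e. chunk is your DataFrame. Calling print(chunk) in the loop should print the first two rows.
Thank you very much for the quick help, it works like a charm. So get_chunk basically already fetches the next chunk for me. Sorry for the newbie question, I did not understand this from the documentation. Would you like to post this as an answer so I can accept it and close the question?
@David, take a look at this example - it might help
@MaxU Thanks, that makes it clear what get_chunk is for.

Tags: python csv pandas chunks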

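【Solution 1】:

Don't call get_chunk. Since you are iterating over the reader, you already have your chunks, i.e. chunk is your DataFrame. Each pass of the for loop reads one chunk, and calling reader.get_chunk() inside the loop reads the next one, so the chunk bound to the loop variable is silently discarded. Call print(chunk) in your loop and you should see the expected output.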
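To make the difference concrete, here is a minimal sketch that runs both loops over a small in-memory CSV. The csv_text string and its id/value columns are made up for illustration and stand in for the asker's file; the parse_dates and index_col arguments from the question are left out because they do not affect the chunking behaviour.

```python
import io

import pandas as pd

# Hypothetical stand-in for the large CSV: a header plus eight data rows.
csv_text = "id,value\n" + "".join(f"{i},{100 + i}\n" for i in range(1, 9))

# Pattern from the question: the for loop already reads one chunk per
# iteration, and get_chunk() then reads another, so `chunk` is thrown away.
skipped = []
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
for chunk in reader:
    df = reader.get_chunk()
    skipped.append(df["id"].tolist())

# Fix: the loop variable is already the DataFrame for the current chunk.
seen = []
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
for chunk in reader:
    seen.append(chunk["id"].tolist())

print(skipped)  # [[3, 4], [7, 8]]
print(seen)     # [[1, 2], [3, 4], [5, 6], [7, 8]]
```

Note that in this sketch the first loop drops every other chunk, not only the first one: rows 1-2 and 5-6 are both consumed by the for loop and never used.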

As @MaxU pointed out in the comments, get_chunk is what you want when you need chunks of different sizes: reader.get_chunk(500), reader.get_chunk(100), etc.
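A minimal sketch of reading chunks of different sizes with get_chunk, again assuming a made-up in-memory CSV (csv_text and its columns are illustrative only). It passes iterator=True so that read_csv returns a reader without a fixed chunk size; the sizes 3, 5 and 30 are arbitrary.

```python
import io

import pandas as pd

# Hypothetical stand-in data: a header plus ten data rows.
csv_text = "id,value\n" + "".join(f"{i},{100 + i}\n" for i in range(1, 11))

# iterator=True returns a reader object instead of a DataFrame,
# so rows can be pulled on demand with get_chunk(size).
reader = pd.read_csv(io.StringIO(csv_text), iterator=True)

first = reader.get_chunk(3)    # rows 1-3
second = reader.get_chunk(5)   # rows 4-8
third = reader.get_chunk(30)   # fewer rows remain than requested: rows 9-10

sizes = [len(first), len(second), len(third)]
print(sizes)                   # [3, 5, 2]
print(second["id"].tolist())   # [4, 5, 6, 7, 8]
```

Each call continues where the previous one stopped, which is why mixing get_chunk with a for loop over the same reader makes rows appear to go missing.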

【Discussion】:

If you want to read chunks of different sizes, use get_chunk(): reader.get_chunk(100); ... reader.get_chunk(500); ... reader.get_chunk(30); ...
@MaxU: Thanks, that makes more sense. Updated the answer.