IPython in Jupyter notebooks: reading a large data file with pandas becomes very slow (high memory consumption?)

【Question title】: IPython in Jupyter notebooks: reading a large data file with pandas becomes very slow (high memory consumption?)
【Posted】: 2026-01-15 06:25:01
【Question description】:

I want to process a huge data file in a Jupyter notebook. I use pandas inside a for loop to select which rows I read from the file:

import gc

import numpy as np  # needed for the dtype mapping below
import pandas as pd
from tqdm import tqdm


# Create a training file with simple derived features
rowstoread = 150_000
chunks = 50

for segment in tqdm(range(chunks)):
    # Skip the data rows consumed by earlier segments (line 0 is the header)
    rowstoskip = range(1, segment*rowstoread + 1) if segment > 0 else 0
    chunk = pd.read_csv("datafile.csv", dtype={'attribute_1': np.int16, 'attribute_2': np.float64}, skiprows=rowstoskip, nrows=rowstoread)

    x = chunk['attribute_1'].values
    y = chunk['attribute_2'].values[-1]

    # process data here and try to get rid of memory afterwards

    del chunk, x, y
    gc.collect()

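Although I try to free the memory of each chunk after I have read it, the import starts out fast and then becomes very slow, and the slowdown grows with the index of the current chunk.

Am I missing something? Does anyone know the reason for this and how to fix it?

Thanks in advance, smaica

Edit: Thanks to @Wen-Ben, I can work around the problem by using the chunk method of pandas read_csv. I still don't understand why this happens, though.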

【Comments】:

pd.read_csv has a chunk method.
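
The "chunk method" mentioned here presumably refers to the chunksize parameter of pd.read_csv, which returns an iterator that walks through the file once. A likely reason for the slowdown in the original loop is that every read_csv call with skiprows starts again at the top of the file and has to scan past all skipped lines, so the work per iteration grows with the segment index. Below is a minimal sketch of the chunked variant, not the asker's actual fix. It assumes a small temporary file standing in for datafile.csv, and the names csv_path, rows_to_read, max_chunks, rows_seen and last_values are made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for "datafile.csv": a small file with the two
# columns named in the question.
rows_total = 1_000
csv_path = os.path.join(tempfile.mkdtemp(), "datafile.csv")
pd.DataFrame({
    "attribute_1": np.arange(rows_total, dtype=np.int16),
    "attribute_2": np.linspace(0.0, 1.0, rows_total),
}).to_csv(csv_path, index=False)

rows_to_read = 150  # the question uses 150_000
max_chunks = 5      # the question uses 50

rows_seen = 0
last_values = []

# chunksize makes read_csv return an iterator; the file is opened once and
# each iteration continues where the previous chunk ended.
with pd.read_csv(
    csv_path,
    dtype={"attribute_1": np.int16, "attribute_2": np.float64},
    chunksize=rows_to_read,
) as reader:
    for segment, chunk in enumerate(reader):
        if segment >= max_chunks:
            break
        x = chunk["attribute_1"].values
        y = chunk["attribute_2"].values[-1]

        # Placeholder for the real processing step.
        rows_seen += len(x)
        last_values.append(y)
```

Because each chunk is rebound on the next iteration, the previous DataFrame becomes unreachable without an explicit del or gc.collect(). The with form of the reader requires pandas 1.2 or later.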

Tags: python pandas memory garbage-collection jupyter-notebook

【Solution 1】:

In my experience, gc.collect() does not help much.

If you have a large file that fits on disk, you can use another library such as SFrame.

Here is an example that reads a CSV file:

sf = SFrame(data='~/mydata/foo.csv')

The API is very similar to that of pandas.

【Comments】: