Pandas - Huge memory consumption

【Question title】: Pandas - Huge memory consumption
【Posted】: 2018-07-03 12:11:22
【Question】:

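After loading a DataFrame of about 15 million rows (roughly 250 MB) from a pickle, I perform some search operations on it and then delete some rows. During these operations, memory usage spikes to 5 GB and sometimes 7 GB, which is annoying because of swapping (my laptop has only 8 GB of RAM).

The point is that this memory is not released when the operations finish (that is, once the last two lines of the code below have executed). The Python process still occupies 7 GB of memory.

Any idea why this happens? I am using Pandas 0.20.3.

A minimal example follows. The real "data" variable has about 15 million rows, but I don't know how to post that here.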

import datetime, pandas as pd

data = {'Time':['2013-10-29 00:00:00', '2013-10-29 00:00:08', '2013-11-14 00:00:00'], 'Watts': [0, 48, 0]}
df = pd.DataFrame(data, columns = ['Time', 'Watts'])
# Convert string to datetime
df['Time'] = pd.to_datetime(df['Time'])
# Make column Time as the index of the dataframe
df.index = df['Time']
# Delete the column time
df = df.drop('Time', axis=1)

# Get the difference in time between two consecutive data points
differences = df.index.to_series().diff()
# Keep only the differences > 60 mins
differences = differences[differences > datetime.timedelta(minutes=60)]
# Get the string of the day of the data points when the data gathering resumed
toRemove = [datetime.datetime.strftime(date, '%Y-%m-%d') for date in differences.index.date]

# Remove data points belonging to the day where the differences was > 60 mins
for dataPoint in toRemove:
    df.drop(df.loc[dataPoint].index, inplace=True)

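【Question discussion】:

Please show us a minimal example that can be used to reproduce your situation.
stackoverflow.com/help/mcve
I second @FlashTek. In any case, have you considered using Dask?
I would give Dask a shot. It is designed for handling large datasets.
Edited the original post; I hope it is clearer now.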

Tags: python pandas dataframe memory

【Solution 1】:

You might want to try invoking the garbage collector with gc.collect(). For more information, see How can I explicitly free memory in Python?
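
A minimal sketch of that suggestion, not taken from the original thread. The frame below is a small synthetic stand-in for the asker's data (its size and values are made up), and it assumes the intermediate objects are no longer needed. Whether the operating system then reports the memory as returned depends on the allocator and platform.

```python
import gc

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical size and values).
index = pd.date_range("2013-10-29", periods=100_000, freq=pd.Timedelta(seconds=8))
df = pd.DataFrame({"Watts": np.random.randint(0, 100, size=len(index))}, index=index)

# An intermediate result comparable to `differences` in the question.
differences = df.index.to_series().diff()

# Drop the reference to the object that is no longer needed ...
del differences

# ... then ask the collector to run a full collection right away.
# gc.collect() returns the number of unreachable objects it found.
unreachable = gc.collect()
print("unreachable objects found:", unreachable)
```

Calling gc.collect() only helps for objects that nothing references any more, so the `del` (or letting the name go out of scope) comes first.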

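【Discussion】:

It actually did free the memory. So the root of my problem is that the garbage collector isn't fast enough, and I need to call it manually to free memory?
What freed what memory? (I don't know what "it" refers to in your comment.) If you were freeing memory, you wouldn't see 7 GB of memory consumption. Just because you perform an operation like df.drop doesn't mean the memory has been reclaimed.
Sorry, I meant that gc.collect() freed my memory. After calling it, memory consumption dropped to ~290 MB, which is fine considering that the variable "data" alone takes up about 250 MB.
So is there anything else you need before accepting the answer?
Done! Can you confirm the doubt I raised in my first comment on your answer? Also, is it normal for operations on a 250 MB file to use that much memory?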
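
For the size figures discussed in this thread, a DataFrame's own footprint can be checked with DataFrame.memory_usage. The sketch below is an illustration on synthetic data, not part of the original answers; the column names follow the question, the row count and values are made up. It reports only what the frame itself holds, not temporary copies created while an operation runs, so peak process memory can be noticeably higher than this number.

```python
import numpy as np
import pandas as pd

n = 100_000  # hypothetical row count, far smaller than the real data

# Timestamps stored as strings, as in the question before pd.to_datetime.
times = pd.date_range("2013-10-29", periods=n, freq=pd.Timedelta(seconds=8))
df = pd.DataFrame({"Time": times.astype(str), "Watts": np.zeros(n, dtype="int64")})

# deep=True also counts the string objects held by the Time column.
as_strings = df.memory_usage(deep=True).sum()

# After conversion each timestamp is a fixed-width 8-byte value.
df["Time"] = pd.to_datetime(df["Time"])
as_datetimes = df.memory_usage(deep=True).sum()

print("bytes with string timestamps:  ", as_strings)
print("bytes with datetime timestamps:", as_datetimes)
```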