【Posted】: 2026-01-24 07:30:01
【Problem description】:
I just launched a Jupyter terminal and loaded an Excel file (~12 MB) into a pandas DataFrame.
Before loading the file:
>> import resource
>> print('Memory usage: %s (Mb)' % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss // 1024))
Memory usage: 40 (Mb)
After loading the file into a pandas DataFrame:
>> import pandas as pd
>> df = pd.read_excel('/var/www/temp_test_files/*_survey_2016.xlsx')
>> print('Memory usage: %s (Mb)' % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss // 1024))
Memory usage: 193 (Mb)
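As a side note on the measurement itself: ru_maxrss is the peak resident set size of the whole process (reported in kilobytes on Linux, hence the division by 1024), so the figures above include the interpreter, pandas itself, and any temporary buffers used while parsing the Excel file, not just the finished DataFrame. A minimal sketch of reading the current (rather than peak) RSS instead, assuming the third-party psutil package is installed:

import os
import psutil

# Current resident set size of this process, in MB (not the peak that ru_maxrss reports).
rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
print('Current RSS: %.1f MB' % rss_mb)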
Why does a 12 MB file take up more than 12 times its actual size (150+ MB) when loaded into pandas?
A detailed breakdown of the column dtypes is below. My guess is that the object dtype columns account for more memory than the reported per-column usage suggests? (There is a measurement sketch after the output below.)
>> df.info(memory_usage=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 56030 entries, 0 to 56029
Data columns (total 57 columns):
collector 56030 non-null object
country 55528 non-null object
un_subregion 55313 non-null object
so_region 55390 non-null object
age_range 55727 non-null object
age_midpoint 55336 non-null float64
gender 55586 non-null object
self_identification 54202 non-null object
occupation 49519 non-null object
occupation_group 46934 non-null object
experience_range 49520 non-null object
experience_midpoint 49520 non-null float64
salary_range 46121 non-null object
salary_midpoint 41742 non-null float64
programming_ability 46982 non-null float64
employment_status 49576 non-null object
industry 40110 non-null object
company_size_range 39932 non-null object
team_size_range 39962 non-null object
women_on_team 39808 non-null object
remote 40118 non-null object
job_satisfaction 40110 non-null object
job_discovery 40027 non-null object
commit_frequency 46598 non-null object
hobby 46673 non-null object
dogs_vs_cats 45239 non-null object
desktop_os 46451 non-null object
unit_testing 46657 non-null object
rep_range 46143 non-null object
visit_frequency 46154 non-null object
why_learn_new_tech 46145 non-null object
education 44955 non-null object
open_to_new_job 44380 non-null object
new_job_value 43658 non-null object
job_search_annoyance 42851 non-null object
interview_likelihood 42263 non-null object
star_wars_vs_star_trek 34398 non-null object
agree_tech 42662 non-null object
agree_notice 42755 non-null object
agree_problemsolving 42659 non-null object
agree_diversity 42505 non-null object
agree_adblocker 42627 non-null object
agree_alcohol 42692 non-null object
agree_loveboss 42096 non-null object
agree_nightcode 42613 non-null object
agree_legacy 42382 non-null object
agree_mars 42685 non-null object
important_variety 42628 non-null object
important_control 42572 non-null object
important_sameend 42531 non-null object
important_newtech 42604 non-null object
important_buildnew 42538 non-null object
important_buildexisting 42580 non-null object
important_promotion 42483 non-null object
important_companymission 42529 non-null object
important_wfh 42582 non-null object
important_ownoffice 42538 non-null object
dtypes: float64(4), object(53)
memory usage: 24.8+ MB
None
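Note that the "24.8+ MB" reported above is only the shallow size: for the 53 object columns pandas counts just the 8-byte PyObject pointers, not the Python string objects they point to (that is what the trailing '+' signals). Asking for deep introspection sizes the strings as well. A minimal sketch, assuming df is the survey DataFrame loaded above:

# Shallow vs. deep accounting for an object-heavy DataFrame.
shallow = df.memory_usage(index=True).sum()           # pointers only for object columns
deep = df.memory_usage(index=True, deep=True).sum()   # includes the string payloads

print('shallow: %.1f MB' % (shallow / 1024 ** 2))
print('deep:    %.1f MB' % (deep / 1024 ** 2))

# Per-column deep usage, largest first, to see which string columns dominate.
print(df.memory_usage(deep=True).sort_values(ascending=False).head(10))

# df.info(memory_usage='deep') prints the same deep total inline.
df.info(memory_usage='deep')

The deep figure should land much closer to the ~150 MB jump seen in ru_maxrss than the shallow 24.8 MB does.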
Are there any "best practice" approaches for reducing the actual memory footprint of a pandas DataFrame?
- Changing the dtypes?
- Categoricals?
(A conversion sketch follows this list.)
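One commonly suggested approach, sketched here rather than prescribed: convert low-cardinality object columns (country, gender, the agree_* / important_* answers, and similar) to the category dtype, which stores each distinct string once and keeps only small integer codes per row, and downcast the float columns. The 50% uniqueness threshold below is an arbitrary illustrative heuristic, and df is assumed to be the DataFrame from the question:

import pandas as pd

def shrink(frame, max_unique_ratio=0.5):
    out = frame.copy()
    for col in out.columns:
        if out[col].dtype == object:
            # Low-cardinality string columns: store each distinct value once,
            # with per-row integer codes instead of per-row Python strings.
            if out[col].nunique(dropna=True) / len(out) < max_unique_ratio:
                out[col] = out[col].astype('category')
        elif out[col].dtype == 'float64':
            # Downcast floats where the values allow it (float64 -> float32).
            out[col] = pd.to_numeric(out[col], downcast='float')
    return out

small = shrink(df)
print('before: %.1f MB' % (df.memory_usage(deep=True).sum() / 1024 ** 2))
print('after:  %.1f MB' % (small.memory_usage(deep=True).sum() / 1024 ** 2))

Whether category actually helps depends on cardinality: a column like occupation with many distinct free-text values may shrink little or even grow, since the category mapping itself has to be stored.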
【Discussion】:
- So most of your dtypes are object. That means the underlying numpy arrays have object dtype, and the reported memory usage only counts the PyObject pointers in those arrays; Python still has to store the actual data somewhere else. Presumably these are mostly string fields, yes? (There is a short sizing sketch after these comments.)
- Yes, they are indeed mostly string fields.
- I'm not an expert, but you may find these two links from Wes McKinney helpful: slideshare.net/wesm/practical-medium-data-analytics-with-python and wesmckinney.com/blog/apache-arrow-pandas-internals
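To put rough numbers on the pointer-versus-payload point from the first comment: in an object column each cell costs an 8-byte pointer inside the numpy array plus a full Python str object on the heap. A tiny illustrative check (exact byte counts vary by Python version and string contents; the sample values are made up):

import sys
import numpy as np

values = ['United States', 'Strongly agree', 'Full-time']

# The object array itself only holds pointers: 8 bytes per element on a 64-bit build.
arr = np.array(values, dtype=object)
print('array buffer: %d bytes for %d elements' % (arr.nbytes, arr.size))

# Each Python str carries its own object header on top of the characters,
# so even short survey answers cost tens of bytes apiece.
for v in values:
    print('%r -> %d bytes' % (v, sys.getsizeof(v)))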
Tags: python pandas memory dataframe categorical-data