[Posted]: 2013-07-04 14:47:56
[Problem description]:
[Using Python 3.3] I have a huge CSV file containing XX million rows and several columns. I want to read the file, add a few computed columns, and spit out several "segmented" CSV files. I tried the code below on a smaller test file and it does exactly what I want. But now that I'm loading the original CSV file (about 3.2 GB), I get a memory error. Is there a more memory-efficient way to write the code below?
Please note that I'm quite new to Python, so there is probably a lot I don't fully understand.
Sample input data:
email cc nr_of_transactions last_transaction_date timebucket total_basket
email1@email.com us 2 datetime value 1 20.29
email2@email.com gb 3 datetime value 2 50.84
email3@email.com ca 5 datetime value 3 119.12
... ... ... ... ... ...
Here is my code:
import csv
import scipy.stats as stats
import itertools
from operator import itemgetter
def add_rankperc(filename):
    '''
    Calculates the percentile rank of a user's (i.e. email's) total basket value
    within a country, then assigns the user to a rankbucket based on that
    percentile rank, using the following rules:
        Percentage rank between 75 and 100 -> top25
        Percentage rank between 25 and 74  -> mid50
        Percentage rank between 0 and 24   -> bottom25
    '''
    # Defining headers for ease of use/DictReader
    headers = ['email', 'cc', 'nr_transactions', 'last_transaction_date', 'timebucket', 'total_basket']
    groups = []
    with open(filename, encoding='utf-8', mode='r') as f_in:
        # Input file is tab-separated, hence dialect='excel-tab'
        r = csv.DictReader(f_in, dialect='excel-tab', fieldnames=headers)
        # DictReader reads all values as strings, so convert total_basket to a float
        dict_list = []
        for row in r:
            row['total_basket'] = float(row['total_basket'])
            # Append row to a list (of dictionaries) for further processing
            dict_list.append(row)
    # Group by cc; rows sorted by cc and total_basket
    for key, group in itertools.groupby(sorted(dict_list, key=itemgetter('cc', 'total_basket')), key=itemgetter('cc')):
        rows = list(group)
        for row in rows:
            # Calculates the percentile rank of each value within its country
            row['rankperc'] = stats.percentileofscore([r['total_basket'] for r in rows], row['total_basket'])
            # Percentage rank between 75 and 100 -> top25
            if 75 <= row['rankperc'] <= 100:
                row['rankbucket'] = 'top25'
            # Percentage rank between 25 and 74 -> mid50
            elif 25 <= row['rankperc'] < 75:
                row['rankbucket'] = 'mid50'
            # Percentage rank between 0 and 24 -> bottom25
            else:
                row['rankbucket'] = 'bottom25'
            # Appending all rows to a list so it can be returned and used by another function
            groups.append(row)
    return groups
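For reference, the percentile-and-bucket logic above can be illustrated standalone. This is a minimal sketch that mimics `scipy.stats.percentileofscore` (its default `kind='rank'` behavior) in pure Python; `percentile_rank` and `bucket` are hypothetical helper names, not part of the original code:

```python
def percentile_rank(values, score):
    # Mimics scipy.stats.percentileofscore(values, score) with kind='rank':
    # in case of ties, the percentage rankings of matching entries are averaged.
    n = len(values)
    strictly_below = sum(1 for v in values if v < score)
    at_or_below = sum(1 for v in values if v <= score)
    bump = 1 if at_or_below > strictly_below else 0
    return (strictly_below + at_or_below + bump) * 50.0 / n

def bucket(rankperc):
    # Same thresholds as in add_rankperc
    if 75 <= rankperc <= 100:
        return 'top25'
    elif 25 <= rankperc < 75:
        return 'mid50'
    else:
        return 'bottom25'

baskets = [20.29, 50.84, 119.12, 7.50]           # one country's total_basket values
print(percentile_rank(baskets, 119.12))          # 100.0
print(bucket(percentile_rank(baskets, 119.12)))  # top25
```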
def filter_n_write(data):
    '''
    Takes the input data, groups it by the specified keys, and writes only the
    e-mail addresses to CSV files, one file per grouping.
    '''
    # Creating group iterator based on keys
    for key, group in itertools.groupby(sorted(data, key=itemgetter('timebucket', 'rankbucket')), key=itemgetter('timebucket', 'rankbucket')):
        # List of email addresses for the current combination of grouping keys
        emails = [row['email'] for row in group]
        # Dynamically naming output file based on grouping keys
        f_out = 'output-{}-{}.csv'.format(key[0], key[1])
        with open(f_out, encoding='utf-8', mode='w') as fout:
            w = csv.writer(fout, dialect='excel', lineterminator='\n')
            # One address per row; wrapping each email in a list puts the full address in one cell
            w.writerows([email] for email in emails)
filter_n_write(add_rankperc('infile.tsv'))
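One detail worth noting about the code above: `itertools.groupby` only merges *adjacent* rows with equal keys, which is why both functions sort before grouping. A minimal illustration on toy data (not the real file):

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {'cc': 'us', 'email': 'email1@email.com'},
    {'cc': 'gb', 'email': 'email2@email.com'},
    {'cc': 'us', 'email': 'email3@email.com'},
]

# Without sorting first, 'us' would come out as two separate groups.
grouped = {}
for cc, group in groupby(sorted(rows, key=itemgetter('cc')), key=itemgetter('cc')):
    grouped[cc] = [r['email'] for r in group]

print(grouped)
# {'gb': ['email2@email.com'], 'us': ['email1@email.com', 'email3@email.com']}
```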
Thanks in advance!
[Comments]:
-
"I have a huge CSV file containing about 46 million rows and several columns" .... Why? That is about the least efficient way to store data... You should switch your storage method instead of trying to make CSV work for you... Why not try some SQL? (Or any other database or storage method actually meant for large amounts of data, unlike a CSV file.)
-
Because the CSV is an export from a database system. The reason I'm writing a Python script is the "grouping" and writing the output to multiple CSV files. True, I could do this in the database system, but it would require me to download each email-address list separately, up to maybe 180 CSV files. So I figured I'd write a script to do that work for me. Does that make more sense?
-
Why not interact with the database directly from Python? Then just pull exactly what you need and create the output/result files you want in the most efficient way.
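The suggestion above could be sketched with Python's built-in sqlite3 module. The table name, columns, and sample rows here are hypothetical, purely to show the shape of a database-side grouping:

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # a real file path would persist the data
conn.execute('CREATE TABLE transactions (email TEXT, cc TEXT, total_basket REAL)')
conn.executemany('INSERT INTO transactions VALUES (?, ?, ?)', [
    ('email1@email.com', 'us', 20.29),
    ('email2@email.com', 'gb', 50.84),
    ('email3@email.com', 'us', 119.12),
])

# The grouping happens in the database, not in Python memory.
for cc, avg_basket in conn.execute(
        'SELECT cc, AVG(total_basket) FROM transactions GROUP BY cc ORDER BY cc'):
    print(cc, avg_basket)
```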
Tags: python memory csv python-3.x