Iterate Pandas Chunks through preprocessing/cleaning before concatenating: answers

【Question title】: Iterate Pandas Chunks through preprocessing/cleaning before concatenating
【Posted】: 2021-12-22 07:22:39
【Question description】:

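Good day, Python/Pandas gurus:

I am running into memory issues when performing data analysis on my local machine. I typically work with data of shape (15000000+, 50+). I usually read the data in chunks with chunksize=1000000 in pd.read_csv(), which has always worked well for me.

I would like to know how to iterate each chunk through the entire data cleaning/preprocessing section, so that I do not have to run the whole dataframe through that part of the code. I am finding that I hit my system's limits and run out of memory.

I want to read the pandas chunks and iterate each chunk through a function, or simply through a series of steps that rename columns, filter the dataframe, and assign data types. Once this preprocessing is complete for all chunks, I want to concatenate the processed chunks to create the full dataframe.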

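import pandas as pd

df_chunks = pd.read_csv("File.path", chunksize=10000)

processed_chunks = []
for chunk in df_chunks:
    # Task 1: Rename Columns
    # Task 2: Filter(s)
    # Task 3: Assign data types to non-object fields
    processed_chunks.append(chunk)  # collect each chunk once it has been processed

# The reader is exhausted after the loop, so concatenate the collected chunks
processed_df = pd.concat(processed_chunks)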

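Below is an example of the code I currently run on the entire dataframe for preprocessing, but I hit system limits because of the amount of data I have: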

billing_docs_clean.columns = ['BillingDocument', 'BillingDocumentItem', 'BillingDocumentType', 'BillingCategory', 'DocumentCategory',
                'DocumentCurrency', 'SalesOrganization', 'DistributionChannel', 'PricingProcedure',
                'DocumentConditionNumber', 'ShippingConditions', 'BillingDate', 'CustomerGroup', 'Incoterms',
                'PostingStatus', 'PaymentTerms', 'DestinationCountry', 'Region', 'CreatedBy', 'CreationTime',
                'SoldtoNumber', 'Curr1', 'Divison', 'Curr2', 'ExchangeRate', 'BilledQuantitySUn', 'SalesUnits',
                'Numerator', 'Denominator', 'BilledQuantityBUn', 'BaseUnits', 'RequiredQuantity', 'BUn1', 'ExchangeRate2',
                'ItemNetValue', 'Curr3', 'ReferenceDocument', 'ReferenceDocumentItem', 'ReferencyDocumentCategory',
                'SalesDocument', 'SalesDocumentItem', 'Material', 'MaterialDescription', 'MaterialGroup',
                'SalesDocumentItemCategory', 'SalesProductHierarchy', 'ShippingPoint', 'Plant', 'PlantRegion',
                'SalesGroup', 'SalesOffice', 'Returns', 'Cost', 'Curr4', 'GrossValue', 'Curr5', 'NetValue', 'Curr6',
                'CashDiscount', 'Curr7', 'FreightCharges', 'Curr8', 'Rebate', 'Curr9', 'OVCFreight', 'Curr10', 'ProfitCenter',
                'CreditPrice', 'Curr11', 'SDDocumentCategory']

# Filter data to obtain US, Canada, and Mexico industrial sales for IFS Profit Center
billing_docs_clean = billing_docs_clean[
    (billing_docs_clean['DistributionChannel'] == '02') & 
    (billing_docs_clean['ProfitCenter'].str.startswith('00001', na=False)) &  
    (billing_docs_clean['ReferenceDocumentItem'].astype(float) < 900000) &   
    (billing_docs_clean['PostingStatus']=='C') &  
    (billing_docs_clean['PricingProcedure'] != 'ZEZEFD') & 
    (billing_docs_clean['SalesDocumentItemCategory'] != 'TANN')]


# Correct Field Formats and data types
Date_Fields_billing_docs_clean = ['BillingDate']
for datefields in Date_Fields_billing_docs_clean:
    billing_docs_clean[datefields] = pd.to_datetime(billing_docs_clean[datefields])

Trim_Zeros_billing_docs_clean = ['BillingDocument', 'BillingDocumentItem', 'ProfitCenter', 'Material', 'ReferenceDocument',
                      'ReferenceDocumentItem', 'SalesDocument', 'SalesDocumentItem']
for TrimFields in Trim_Zeros_billing_docs_clean:
    billing_docs_clean[TrimFields] = billing_docs_clean[TrimFields].str.lstrip('0')

Numeric_Fields_billing_docs_clean = ['ExchangeRate', 'BilledQuantitySUn', 'Numerator', 'Denominator', 'BilledQuantityBUn',
                          'RequiredQuantity', 'ExchangeRate2', 'ItemNetValue', 'Cost', 'GrossValue', 'NetValue',
                          'CashDiscount', 'FreightCharges', 'Rebate', 'OVCFreight', 'CreditPrice']
for NumericFields in Numeric_Fields_billing_docs_clean:
    billing_docs_clean[NumericFields] = billing_docs_clean[NumericFields].astype('str').str.replace(',','').astype(float)

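I am still fairly new to Python coding for data analytics, but eager to learn! So I appreciate any and all explanations of the code in this post, or any other suggestions.

Thank you!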

【Question comments】:

Tags: python pandas dataframe for-loop

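【Solution 1】:

Task 1: Rename columns

For this you can use the optional header and names arguments of pandas.read_csv: header=0 marks the first line of the file as the existing header to be replaced, and names supplies the new column names. Consider the following simple example. Let the content of file.csv be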

A,B,C
1,2,3
4,5,6

then

import pandas as pd
df = pd.read_csv("file.csv", header=0,names=["X","Y","Z"])
print(df)

outputs

   X  Y  Z
0  1  2  3
1  4  5  6
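
The same header/names arguments can be combined with chunksize, so that each chunk arrives already renamed and only the filtering and type conversion remain to be done per chunk. What follows is only a minimal sketch under stated assumptions, not a definitive implementation: the in-memory sample data, the column names Channel, Item and Value, the filter condition, and the helper clean_chunk are all hypothetical stand-ins for the real file and the real cleaning steps from the question.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the real CSV file.
csv_text = '''A,B,C
02,1,"1,000.5"
01,2,"2,000"
02,3,"3,500"
02,4,"10"
'''


def clean_chunk(chunk):
    """Apply the per-chunk steps (hypothetical helper)."""
    # Task 2: filter rows; the condition is modeled on the question's filter.
    # .copy() avoids modifying a view of the original chunk.
    chunk = chunk[chunk["Channel"] == "02"].copy()
    # Task 3: assign data types to the non-text fields.
    chunk["Item"] = pd.to_numeric(chunk["Item"])
    chunk["Value"] = (
        chunk["Value"].str.replace(",", "", regex=False).astype(float)
    )
    return chunk


reader = pd.read_csv(
    io.StringIO(csv_text),               # replace with the real file path
    header=0,                            # first line is the header to replace
    names=["Channel", "Item", "Value"],  # Task 1: rename columns while reading
    dtype=str,                           # read as text so "02" keeps its zero
    chunksize=2,                         # tiny value, only for this demo
)

# Clean each chunk as it is read, then concatenate the cleaned chunks.
processed_df = pd.concat(
    (clean_chunk(chunk) for chunk in reader), ignore_index=True
)
print(processed_df)
```

Because the filter runs on each chunk before concatenation, the rows it discards never have to be held in memory together. How much this helps depends on how many rows the real filter removes.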

【Comments】: