【Question title】: How to transpose values from top few rows in python dataframe into new columns
【Posted】: 2021-08-10 14:36:30
【Question】:

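I am trying to select the values from the top 3 records of each group in a sorted pandas dataframe and put them into new columns. I have a function that processes each group, but I am struggling to find the right way to extract and rename the Series, and then combine the results into a single Series to return.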

Below is a simplified example of the input dataframe (df_in) and the expected output (df_out):

import pandas as pd
data_in = { 'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
        'Price':  [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0], 
        'Qty': [15 , 13, 14, 10, 5, 2, 1]}
df_in = pd.DataFrame (data_in, columns = ['Product', 'Price', 'Qty'])

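Below are the 2 function examples I have tested. I am looking for a more efficient option, especially since I will have to handle more columns and records. The function best3_prices_v1 works, but every column or variable has to be named explicitly, which is a particular problem because I need to add more columns.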

def best3_prices_v1(x):
    d = {}

    # get best 3 records if records available, else set volumes as zeroes
    best_price_lv1 = x.iloc[0].copy()

    rec_with_zeroes = best_price_lv1.copy()
    rec_with_zeroes['Price'] = 0
    rec_with_zeroes['Qty'] = 0

    recs = len(x)  # number of records
    if recs == 1:
        # 2nd and 3rd records not available
        best_price_lv2 = rec_with_zeroes.copy()
        best_price_lv3 = rec_with_zeroes.copy()
    elif recs == 2:
        best_price_lv2 = x.iloc[1]
        # 3rd record not available
        best_price_lv3 = rec_with_zeroes.copy()
    else:
        best_price_lv2 = x.iloc[1]
        best_price_lv3 = x.iloc[2]

    # 1st best
    d['Price_1'] = best_price_lv1['Price']
    d['Qty_1'] = best_price_lv1['Qty']

    # 2nd best
    d['Price_2'] = best_price_lv2['Price']
    d['Qty_2'] = best_price_lv2['Qty']

    # 3rd best
    d['Price_3'] = best_price_lv3['Price']
    d['Qty_3'] = best_price_lv3['Qty']

    # return combined results as a series
    return pd.Series(d, index=['Price_1', 'Qty_1', 'Price_2', 'Qty_2', 'Price_3', 'Qty_3'])

Code that calls the function:

# sort dataframe by Product and Price
df_in.sort_values(by=['Product', 'Price'], ascending=True, inplace=True)
# get best 3 prices and qty as new columns
df_out = df_in.groupby(['Product']).apply(best3_prices_v1).reset_index()

A second attempt, meant to reduce the code and the explicit names for each variable. It is incomplete and does not work.

def best3_prices_v2(x):
    d = {}

    # get best 3 records if records available, else set volumes as zeroes
    best_price_lv1 = x.iloc[0].copy()

    rec_with_zeroes = best_price_lv1.copy()
    rec_with_zeroes['Price'] = 0
    rec_with_zeroes['Qty'] = 0

    recs = len(x)  # number of records
    if recs == 1:
        # 2nd and 3rd records not available
        best_price_lv2 = rec_with_zeroes.copy()
        best_price_lv3 = rec_with_zeroes.copy()
    elif recs == 2:
        best_price_lv2 = x.iloc[1]
        # 3rd record not available
        best_price_lv3 = rec_with_zeroes.copy()
    else:
        best_price_lv2 = x.iloc[1]
        best_price_lv3 = x.iloc[2]

    stats_columns = ['Price', 'Qty']

    # get records values for best 3 prices
    d_lv1 = best_price_lv1[stats_columns]
    d_lv2 = best_price_lv2[stats_columns]
    d_lv3 = best_price_lv3[stats_columns]

    # How to rename (keys?) or combine values to return?
    lv1_stats_columns = [c + '_1' for c in stats_columns]
    lv2_stats_columns = [c + '_2' for c in stats_columns]
    lv3_stats_columns = [c + '_3' for c in stats_columns]

    # return combined results as a series
    return pd.Series(d, index=lv1_stats_columns + lv2_stats_columns + lv3_stats_columns)
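
One way the renaming and combining step asked about above could be finished is with Series.add_suffix followed by pd.concat. The sketch below is only an illustration under assumptions: the helper name best_n_rows and its parameters n and stats_columns are hypothetical and not part of the original post, and short groups are padded with zeroes in the same spirit as rec_with_zeroes.

```python
import pandas as pd


def best_n_rows(group, stats_columns, n=3):
    """Flatten the first n rows of `group` into one Series (hypothetical helper)."""
    top = group[stats_columns].head(n).reset_index(drop=True)
    # Pad groups that have fewer than n rows with all-zero rows.
    top = top.reindex(range(n), fill_value=0)
    # Suffix each row's labels with its position, then join the rows end to end.
    parts = [top.iloc[i].add_suffix(f"_{i + 1}") for i in range(n)]
    return pd.concat(parts)


data_in = {'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
           'Qty': [15, 13, 14, 10, 5, 2, 1]}
df_in = pd.DataFrame(data_in)

stats_columns = ['Price', 'Qty']
df_sorted = df_in.sort_values(by=['Product', 'Price'])
# Selecting the stats columns first keeps the grouping column out of each group.
df_out = (df_sorted.groupby('Product')[stats_columns]
          .apply(best_n_rows, stats_columns=stats_columns)
          .reset_index())
print(df_out)
```

Adding a column then only means extending stats_columns. Note that taking a single row of a frame with mixed numeric types upcasts it, so Qty comes back as float in this sketch.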

【Comments on the question】:

    Tags: python pandas dataframe pandas-groupby series


    【Solution 1】:

    Let's use unstack():

    df_in=(df_in.set_index([df_in.groupby('Product').cumcount().add(1),'Product'])
                 .unstack(0,fill_value=0))
    df_in.columns=[f"{x}_{y}" for x,y in df_in]
    df_in=df_in.reset_index()
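
The unstack approach depends on the per-group counter produced by groupby(...).cumcount(). As a small illustration (the variable name rank_in_group is only for this sketch), here is that counter on the original three-column df_in, before any reshaping:

```python
import pandas as pd

data_in = {'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
           'Qty': [15, 13, 14, 10, 5, 2, 1]}
df_in = pd.DataFrame(data_in)

# cumcount() numbers the rows inside each group from 0; add(1) makes it 1-based.
rank_in_group = df_in.groupby('Product').cumcount().add(1)
print(df_in.assign(rank=rank_in_group))
```

Each distinct counter value becomes one set of new columns after unstacking. Because product A has four rows, the reshaped frame also carries Price_4 and Qty_4 unless the data is trimmed to three rows per group first.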
    

    Or via pivot():

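    # pivot() takes keyword-only arguments in pandas 2.x, and fillna's downcast
    # option is deprecated in recent releases, so Qty comes back as float here
    df_in=(df_in.assign(key=df_in.groupby('Product').cumcount().add(1))
          .pivot(index='Product',columns='key',values=['Price','Qty'])
          .fillna(0))
    df_in.columns=[f"{x}_{y}" for x,y in df_in]
    df_in=df_in.reset_index()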
    

    【Comments】:

    • To add to this: since the OP only wants the first three results for each Product (by index, presumably), you can use df_in.groupby('Product').head(3) to quickly get that DataFrame
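
For illustration, a minimal sketch of what that head(3) call returns on the question's sample data (the name top3 is only for this example):

```python
import pandas as pd

data_in = {'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
           'Qty': [15, 13, 14, 10, 5, 2, 1]}
df_in = pd.DataFrame(data_in)

# Keep at most the first three rows of every Product, in their original order.
top3 = df_in.groupby('Product').head(3)
print(top3)
```

Only the fourth row of product A (Price 61.5) is dropped. Groups with fewer than three rows are returned unchanged, so the frame should be sorted by Price first if "first" is meant to be "cheapest".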
    【Solution 2】:

    Building on @AnuragDabas's pivot solution above and the feedback from @ceruler, I can now extend this to a more general problem.

    New dataframe with more groups and columns:

    data_in = { 'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Model': ['A1', 'A1', 'A1', 'A2', 'B1', 'C1', 'C1'],
        'Price':  [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0], 
        'Qty': [15 , 13, 14, 10, 5, 2, 1],
        'Ratings': [9, 7, 8, 10, 6, 7, 8 ]}
    df_in = pd.DataFrame (data_in, columns = ['Product', 'Model' ,'Price', 'Qty', 'Ratings'])
    
    
    group_list = ['Product', 'Model']
    stats_list = ['Price','Qty', 'Ratings']
    df_out = df_in.groupby(group_list).head(3)
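    # pivot() takes keyword-only arguments in pandas 2.x, and fillna's downcast
    # option is deprecated in recent releases, so integer columns come back as float
    df_out=(df_out.assign(key=df_out.groupby(group_list).cumcount().add(1))
      .pivot(index=group_list,columns='key',values=stats_list)
      .fillna(0))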
    df_out.columns=[f"{x}_{y}" for x,y in df_out]
    df_out = df_out.reset_index()
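
One caveat with the pivot approach is that a rank column only appears if at least one group actually has that many rows. The sketch below wraps the same steps in a function and reindexes the columns so that every stat and rank pair exists. The function name top_n_wide and its parameter n are assumptions made for illustration, not part of the answer above.

```python
import pandas as pd


def top_n_wide(df, group_list, stats_list, n=3):
    """Spread the first n rows of each group into wide columns (hypothetical helper)."""
    top = df.groupby(group_list).head(n)
    key = top.groupby(group_list).cumcount().add(1)
    wide = top.assign(key=key).pivot(index=group_list, columns='key', values=stats_list)
    # Force every (stat, rank) pair to exist, even if no group has n rows.
    full = pd.MultiIndex.from_product([stats_list, range(1, n + 1)])
    wide = wide.reindex(columns=full).fillna(0)
    wide.columns = [f"{stat}_{rank}" for stat, rank in wide.columns]
    return wide.reset_index()


data_in = {'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Model': ['A1', 'A1', 'A1', 'A2', 'B1', 'C1', 'C1'],
           'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
           'Qty': [15, 13, 14, 10, 5, 2, 1],
           'Ratings': [9, 7, 8, 10, 6, 7, 8]}
df_in = pd.DataFrame(data_in)

# No (Product, Model) group has four rows, yet the *_4 columns are still created.
df_out = top_n_wide(df_in, ['Product', 'Model'], ['Price', 'Qty', 'Ratings'], n=4)
print(df_out)
```

The filled values are floats in this sketch; casting individual columns back to integers is left to the caller.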
    
