【Question title】: Pandas Dataframe - Group by column value and lookup values from other columns
【Posted】: 2020-12-03 13:16:42
【Question】:

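I have this dataframe:

Label | time_0 | value_0 | time_1 | value_1 | time_2 | value_2

The value_0 column corresponds to time_0, value_1 to time_1, and so on.

Each time_i is constant across rows: time_0 has the same value in every row, and the same holds for time_1 and time_2.

I want to achieve two things:

1- Group first by Label, then by time_i

The result should look like this: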

Label0  time_0(0)  value_0(0)
                value_0(1)
                ...
        time_1(0)  value_1(0)
                value_1(1)
                ..
         ...

Label1  time_0(0)  value_0(0)
                value_0(1)
                ...
        time_1(0)  value_1(0)
                value_1(1)
                ..
         ...   
Label2  time_0(0)  value_0(0)
                value_0(1)
                ...
        time_1(0)  value_1(0)
                value_1(1)
                ..
         ...   

2- Group first by Label, then by time_i, and sum all value_i

The result should look like this:

Label1  time_0(0)  sum(value_0)
        time_1(1)  sum(value_1)
        time_2(2)  sum(value_2)
       
Label2  time_0(0)  sum(value_0)
        time_1(1)  sum(value_1)
        time_2(2)  sum(value_2)
         ...   

I tried different combinations of pd.merge and groupby, but without success.

Here is a sample with values:

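【Comments】:

  • What are the column names? Is it possible to create some sample data?
  • For example, summing the value_0 values here is not possible, because they are strings.
  • @jezrael, I just added an example of the initial dataframe.

Tags: pandas dataframe merge aggregate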


【Solution 1】:

Use wide_to_long and aggregate with sum:

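import pandas as pd

# Reshape the time_i/value_i column pairs to long format,
# then sum the values per Label and time.
df = (pd.wide_to_long(df.reset_index(),
                     stubnames=['value','time'],
                     i=['index','Label'], j='tmp', sep='_')
        .groupby(['Label','time'])['value']
        .sum()
        .reset_index())
print (df)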
     Label        time      value
0  EUR/CHF  2020-12-04 -51.260248
1  EUR/CHF  2020-12-10  98.202053
2  USD/CHF  2020-12-04   0.134488
3  USD/CHF  2020-12-10   4.510396
4  USD/NOK  2020-12-04   0.395785
5  USD/NOK  2020-12-10  -0.801768
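
The snippet above can be tried end to end with a small frame. The following is only a sketch: the labels "A" and "B" and all numbers are made-up illustration data, not the asker's data, and the `apply(list)` step for the first request (listing the values per Label and time) is one possible reading of the desired output, not something the answer itself proposes.

```python
import pandas as pd

# Hypothetical sample data in the layout described in the question
# (labels and numbers are invented for illustration).
df = pd.DataFrame({
    "Label":   ["A", "A", "B"],
    "time_0":  ["2020-12-04"] * 3,
    "value_0": [1.0, 2.0, 3.0],
    "time_1":  ["2020-12-10"] * 3,
    "value_1": [10.0, 20.0, 30.0],
})

# Reshape: one row per (original row, Label, suffix) with columns 'value' and 'time'.
long_df = pd.wide_to_long(
    df.reset_index(),
    stubnames=["value", "time"],
    i=["index", "Label"],
    j="tmp",
    sep="_",
)

# Request 1 (assumed reading): collect the values for each Label and time.
listed = long_df.groupby(["Label", "time"])["value"].apply(list)

# Request 2: sum the values for each Label and time.
summed = long_df.groupby(["Label", "time"])["value"].sum().reset_index()

print(listed)
print(summed)
```

The row number is kept in `i` because `wide_to_long` needs the `i` columns to identify each row uniquely, and Label alone repeats.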

EDIT:

For column names such as veg__0_time and veg__0_value, use:

print (df)
      Label veg__0_time  veg__0_value veg__1_time  veg__1_value
0   USD/CHF  2020-12-04      0.000000  2020-12-10      0.000000
1   USD/CHF  2020-12-04     -0.439058  2020-12-10      1.392752
2   USD/CHF  2020-12-04     -0.012020  2020-12-10      0.043742
3   USD/CHF  2020-12-04      0.000000  2020-12-10      0.000000
4   USD/CHF  2020-12-04      0.000000  2020-12-10      0.000000
5   USD/CHF  2020-12-04     -0.525791  2020-12-10      1.273146
6   USD/CHF  2020-12-04      1.306578  2020-12-10      1.115313
7   USD/CHF  2020-12-04     -0.195221  2020-12-10      0.685444
8   USD/NOK  2020-12-04      0.395785  2020-12-10     -0.801768
9   EUR/CHF  2020-12-04    -29.385792  2020-12-10     45.951600
10  EUR/CHF  2020-12-04    -21.874456  2020-12-10     52.250453

df = df.set_index('Label')
df.columns = df.columns.str.split('_', expand=True).droplevel([0,1])
print (df)
                  0                      1           
               time      value        time      value
Label                                                
USD/CHF  2020-12-04   0.000000  2020-12-10   0.000000
USD/CHF  2020-12-04  -0.439058  2020-12-10   1.392752
USD/CHF  2020-12-04  -0.012020  2020-12-10   0.043742
USD/CHF  2020-12-04   0.000000  2020-12-10   0.000000
USD/CHF  2020-12-04   0.000000  2020-12-10   0.000000
USD/CHF  2020-12-04  -0.525791  2020-12-10   1.273146
USD/CHF  2020-12-04   1.306578  2020-12-10   1.115313
USD/CHF  2020-12-04  -0.195221  2020-12-10   0.685444
USD/NOK  2020-12-04   0.395785  2020-12-10  -0.801768
EUR/CHF  2020-12-04 -29.385792  2020-12-10  45.951600
EUR/CHF  2020-12-04 -21.874456  2020-12-10  52.250453

df = df.stack(0).groupby(['Label','time'])['value'].sum().reset_index()
print (df)
     Label        time      value
0  EUR/CHF  2020-12-04 -51.260248
1  EUR/CHF  2020-12-10  98.202053
2  USD/CHF  2020-12-04   0.134488
3  USD/CHF  2020-12-10   4.510396
4  USD/NOK  2020-12-04   0.395785
5  USD/NOK  2020-12-10  -0.801768
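
A possible variation of the stack approach keeps the original row number in the index next to Label, so that the index stays unique before stacking. This is a sketch on made-up data (the labels and numbers are hypothetical); whether it avoids the error reported in the comments depends on the asker's actual data, which is not shown here.

```python
import pandas as pd

# Hypothetical sample data with the veg__<n>_<name> column pattern.
df = pd.DataFrame({
    "Label":        ["A", "A", "B"],
    "veg__0_time":  ["2020-12-04"] * 3,
    "veg__0_value": [1.0, 2.0, 3.0],
    "veg__1_time":  ["2020-12-10"] * 3,
    "veg__1_value": [10.0, 20.0, 30.0],
})

# Keep the row number as the first index level; Label becomes the second level.
wide = df.set_index("Label", append=True)

# 'veg__0_time' splits into ('veg', '', '0', 'time'); drop the first two levels.
wide.columns = wide.columns.str.split("_", expand=True).droplevel([0, 1])

# Move the numeric suffix level from the columns into the index,
# leaving the columns 'time' and 'value'.
long_df = wide.stack(0)

result = long_df.groupby(["Label", "time"])["value"].sum().reset_index()
print(result)
```

Recent pandas versions may emit a FutureWarning about the implementation of `stack`; the result of this sketch does not depend on it.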

EDIT1: Solution with renaming the columns:

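def f(x):
    # 'veg__0_time' -> ['veg', '', '0', 'time'] -> 'time_0';
    # names without '_' (such as 'Label') are returned unchanged.
    splitted = x.split('_')
    if len(splitted) > 1:
        return f'{splitted[-1]}_{splitted[-2]}'
    else:
        return x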

df = df.rename(columns=f)
print (df)
      Label      time_0    value_0      time_1    value_1
0   USD/CHF  2020-12-04   0.000000  2020-12-10   0.000000
1   USD/CHF  2020-12-04  -0.439058  2020-12-10   1.392752
2   USD/CHF  2020-12-04  -0.012020  2020-12-10   0.043742
3   USD/CHF  2020-12-04   0.000000  2020-12-10   0.000000
4   USD/CHF  2020-12-04   0.000000  2020-12-10   0.000000
5   USD/CHF  2020-12-04  -0.525791  2020-12-10   1.273146
6   USD/CHF  2020-12-04   1.306578  2020-12-10   1.115313
7   USD/CHF  2020-12-04  -0.195221  2020-12-10   0.685444
8   USD/NOK  2020-12-04   0.395785  2020-12-10  -0.801768
9   EUR/CHF  2020-12-04 -29.385792  2020-12-10  45.951600
10  EUR/CHF  2020-12-04 -21.874456  2020-12-10  52.250453

df = (pd.wide_to_long(df.reset_index(), 
                     stubnames=['value','time'],
                     i=['index','Label'], j='tmp', sep='_')
        .groupby(['Label','time'])['value']
        .sum()
        .reset_index())
print (df)
     Label        time      value
0  EUR/CHF  2020-12-04 -51.260248
1  EUR/CHF  2020-12-10  98.202053
2  USD/CHF  2020-12-04   0.134488
3  USD/CHF  2020-12-10   4.510396
4  USD/NOK  2020-12-04   0.395785
5  USD/NOK  2020-12-10  -0.801768
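
The renaming function only looks at the last two parts of each name, so it should also cover multi-digit suffixes such as veg__22_time. A minimal sketch to check this, assuming made-up data and a hypothetical helper name `swap_suffix` (equivalent to `f` above):

```python
import pandas as pd

def swap_suffix(name):
    # 'veg__22_time' -> 'time_22'; names without '_' are kept as they are.
    parts = name.split("_")
    return f"{parts[-1]}_{parts[-2]}" if len(parts) > 1 else name

# Hypothetical sample data (labels and numbers are invented).
df = pd.DataFrame({
    "Label":         ["A", "A", "B"],
    "veg__0_time":   ["2020-12-04"] * 3,
    "veg__0_value":  [1.0, 2.0, 3.0],
    "veg__22_time":  ["2020-12-10"] * 3,
    "veg__22_value": [10.0, 20.0, 30.0],
})

renamed = df.rename(columns=swap_suffix)

# wide_to_long matches numeric suffixes of any length by default.
result = (
    pd.wide_to_long(
        renamed.reset_index(),
        stubnames=["value", "time"],
        i=["index", "Label"],
        j="tmp",
        sep="_",
    )
    .groupby(["Label", "time"])["value"]
    .sum()
    .reset_index()
)
print(result)
```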

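【Comments】:

  • Thank you. In my case the column names are more involved than in the example I provided: veg__0_value, veg__0_time, veg__1_value, veg__1_time. How would this work in that case?
  • @Crovish - Hmm, so the separators are a double __ and a single _
  • The second step raises this error: cannot reindex from a duplicate axis
  • @Crovish - The data has duplicates; I added the EDIT1 solution, which should work here.
  • I forgot to mention that I have more columns than in my example: veg__22_time, veg__22_value ..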