【Question title】: Pandas: normalize within the group
【Posted】: 2017-09-26 06:22:49
【Question】:

Suppose we have the following dataset:

import pandas as pd

data = [('apple', 'red', 155), ('apple', 'green', 102), ('apple', 'iphone', 48),
         ('tomato', 'red', 175), ('tomato', 'ketchup', 96), ('tomato', 'gun', 12)]

df = pd.DataFrame(data)
df.columns = ['word', 'rel_word', 'weight']

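I want to recompute the weights so that they sum to 1.0 within each group (apple and tomato in this example), while keeping the relative weights intact (e.g. the ratio of apple/red to apple/green should still be 155/102).

【Question discussion】:

  • Can you add the desired output?
  • Please show the expected output in a separate column so it is easier to understand.

Tags: python pandas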


【Solution 1】:

Use transform; it is faster than apply with a row-wise lookup.

In [3849]: df['weight'] / df.groupby('word')['weight'].transform('sum')
Out[3849]:
0    0.508197
1    0.334426
2    0.157377
3    0.618375
4    0.339223
5    0.042403
Name: weight, dtype: float64

In [3850]: df['norm_w'] = df['weight'] / df.groupby('word')['weight'].transform('sum')

In [3851]: df
Out[3851]:
     word rel_word  weight    norm_w
0   apple      red     155  0.508197
1   apple    green     102  0.334426
2   apple   iphone      48  0.157377
3  tomato      red     175  0.618375
4  tomato  ketchup      96  0.339223
5  tomato      gun      12  0.042403

Alternatively,

In [3852]: df.groupby('word')['weight'].transform(lambda x: x/x.sum())
Out[3852]:
0    0.508197
1    0.334426
2    0.157377
3    0.618375
4    0.339223
5    0.042403
Name: weight, dtype: float64
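
As a quick sanity check, the two requirements from the question (each group sums to 1.0, ratios inside a group are unchanged) can be asserted directly. This is only a sketch built on the question's sample data; the names `group_totals`, `ratio_before` and `ratio_after` are mine, not part of the answer.

```python
import numpy as np
import pandas as pd

data = [('apple', 'red', 155), ('apple', 'green', 102), ('apple', 'iphone', 48),
        ('tomato', 'red', 175), ('tomato', 'ketchup', 96), ('tomato', 'gun', 12)]
df = pd.DataFrame(data, columns=['word', 'rel_word', 'weight'])

# Divide each weight by the total weight of the group it belongs to.
df['norm_w'] = df['weight'] / df.groupby('word')['weight'].transform('sum')

# Requirement 1: the normalized weights of every group sum to 1.0.
group_totals = df.groupby('word')['norm_w'].sum()
assert np.allclose(group_totals, 1.0)

# Requirement 2: ratios inside a group are unchanged (apple/red vs. apple/green).
ratio_before = df.loc[0, 'weight'] / df.loc[1, 'weight']
ratio_after = df.loc[0, 'norm_w'] / df.loc[1, 'norm_w']
assert np.isclose(ratio_before, ratio_after)
```

Dividing every member of a group by the same constant cannot change the ratios, which is why a per-group sum is enough here.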

Timings

In [3862]: df.shape
Out[3862]: (12000, 4)

In [3864]: %timeit df['weight'] / df.groupby('word')['weight'].transform('sum')
100 loops, best of 3: 2.44 ms per loop

In [3866]: %timeit df.groupby('word')['weight'].transform(lambda x: x/x.sum())
100 loops, best of 3: 5.16 ms per loop

In [3868]: %%timeit
      ...: group_weights = df.groupby('word').aggregate(sum)
      ...: df.apply(lambda row: row['weight']/group_weights.loc[row['word']][0],axis=1)
1 loop, best of 3: 2.5 s per loop

【Discussion】:

    【Solution 2】:

    You can use groupby to compute the total weight of each group, then apply a lambda function that normalizes each row:

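    # Sum only the numeric weight column, giving one total per word.
    group_weights = df.groupby('word')['weight'].sum()
    # Divide each row's weight by the total of the group it belongs to.
    df['normalized_weights'] = df.apply(lambda row: row['weight'] / group_weights.loc[row['word']], axis=1)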
    

    Output:

        word    rel_word    weight  normalized_weights
    0   apple   red         155     0.508197
    1   apple   green       102     0.334426
    2   apple   iphone      48      0.157377
    3   tomato  red         175     0.618375
    4   tomato  ketchup     96      0.339223
    5   tomato  gun         12      0.042403
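
If the row-wise apply turns out to be too slow (see the timings in Solution 1), the same "compute group totals, then look them up" idea can be kept while doing the lookup column-wise with Series.map. This is a sketch of a variation, not what the answer above does:

```python
import pandas as pd

data = [('apple', 'red', 155), ('apple', 'green', 102), ('apple', 'iphone', 48),
        ('tomato', 'red', 175), ('tomato', 'ketchup', 96), ('tomato', 'gun', 12)]
df = pd.DataFrame(data, columns=['word', 'rel_word', 'weight'])

# One total per group, as a Series indexed by word.
group_weights = df.groupby('word')['weight'].sum()

# Look up each row's group total by label, then divide the whole column at once.
df['normalized_weights'] = df['weight'] / df['word'].map(group_weights)
print(df)
```

Here `df['word'].map(group_weights)` plays the role of `group_weights.loc[row['word']]`, but for all rows in a single call.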
    

    【Discussion】:

    • Nice solution that wraps imperative programming into the Pandas way of thinking. Thanks!

    【Solution 3】:

    Using np.bincount & pd.factorize
    This should be very fast and scalable

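    import numpy as np

    # f: integer group code for each row; u: the unique words
    f, u = pd.factorize(df.word.values)
    w = df.weight.values
    # np.bincount(f, w) sums the weights per code; indexing with f maps each total back to its rows
    df.assign(norm_w=w / np.bincount(f, w)[f])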
    
         word rel_word  weight    norm_w
    0   apple      red     155  0.508197
    1   apple    green     102  0.334426
    2   apple   iphone      48  0.157377
    3  tomato      red     175  0.618375
    4  tomato  ketchup      96  0.339223
    5  tomato      gun      12  0.042403
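
To make the one-liner above easier to follow, here is the same computation split into its intermediate steps on the question's data. The longer variable names (`codes`, `uniques`, `totals`, `per_row_total`) are mine; in the answer, `f` corresponds to `codes` and `u` to `uniques`.

```python
import numpy as np
import pandas as pd

data = [('apple', 'red', 155), ('apple', 'green', 102), ('apple', 'iphone', 48),
        ('tomato', 'red', 175), ('tomato', 'ketchup', 96), ('tomato', 'gun', 12)]
df = pd.DataFrame(data, columns=['word', 'rel_word', 'weight'])

# factorize assigns an integer code to each distinct word, in order of appearance.
codes, uniques = pd.factorize(df['word'].values)
# codes   -> [0, 0, 0, 1, 1, 1]
# uniques -> ['apple', 'tomato']

weights = df['weight'].values

# With weights, bincount adds up the weights that share a code:
# one total per group, here [305., 283.].
totals = np.bincount(codes, weights=weights)

# Indexing the totals with the codes spreads each group total back over its rows.
per_row_total = totals[codes]

result = df.assign(norm_w=weights / per_row_total)
print(result)
```

Everything after the factorize call is plain NumPy array arithmetic, with no per-group Python function calls.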
    

    【Discussion】:
