【Question title】: Python: find the mean of all rows by a column and then find the distance
【Posted】: 2020-06-04 08:09:18
【Question description】:

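I have a dataframe like the one below. I know that df.groupby("degree").mean() gives me the mean per degree. I want to take these means and find the distance between each data point and each of them. In this case, for every data point I want 3 distances, one from each of the means (the output of df.groupby("degree").mean()): (4, 40), (2, 80) and (4, 94), stored in 3 new columns. The distances should be computed with the formulas BCA_mean = (name-4)^3 + (score-40)^3, M.Tech_mean = (name-2)^3 + (score-80)^3, MBA_mean = (name-4)^3 + (score-94)^3.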

import pandas as pd

# dictionary of lists (named `data` so the built-in `dict` is not shadowed)
data = {'name': [5, 4, 2, 3],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score': [90, 40, 80, 98]}

# creating a dataframe from the dictionary
df = pd.DataFrame(data)

print(df)

   name  degree  score
0     5     MBA     90
1     4     BCA     40
2     2  M.Tech     80
3     3     MBA     98


df.groupby("degree").mean()    

        name  score
degree
BCA        4     40
M.Tech     2     80
MBA        4     94

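Update 1

My real dataset has more than 100 columns, so I would prefer something that scales to that. The logic stays the same: for each group mean, subtract the mean from each column, cube every cell, and sum across the row.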
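
Written out for an arbitrary number of columns, the rule described here might look as follows. The symbols are introduced only for illustration and are not from the original post: $x_{ij}$ is the value of row $i$ in column $j$, $\mu_{g,j}$ is the mean of column $j$ within group $g$, and $p$ is the number of numeric columns.

```latex
D_{i,g} = \sum_{j=1}^{p} \left( x_{ij} - \mu_{g,j} \right)^{3}
% Example with the data above: row 0 is (name, score) = (5, 90) and the
% BCA mean is (4, 40), so
% D_{0,\mathrm{BCA}} = (5-4)^3 + (90-40)^3 = 1 + 125000 = 125001
```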

I found something like the following, but I am not sure whether there is a more efficient way.

import numpy as np

y = df.groupby("degree").mean()
print(y)

# Sum of squared differences from the first group mean (BCA).
# Note: this squares the differences, whereas the formula above cubes them.
df["mean0"] = np.square(df[['name', 'score']].subtract(y.iloc[0, :], axis=1)).sum(axis=1)
df
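
One possible way to extend the snippet above to every group mean and every column is sketched below. It assumes that all columns other than `degree` are numeric features; the names `feature_cols` and `group_means` are illustrative, and the cube from the stated formula is used instead of the square.

```python
import pandas as pd

df = pd.DataFrame({
    "name": [5, 4, 2, 3],
    "degree": ["MBA", "BCA", "M.Tech", "MBA"],
    "score": [90, 40, 80, 98],
})

# Assumption: every column except the grouping key is a numeric feature.
feature_cols = [c for c in df.columns if c != "degree"]

# One row of means per degree, indexed by degree.
group_means = df.groupby("degree")[feature_cols].mean()

# For each group, subtract its mean row from every data row,
# cube each cell, and sum across the row.
for degree, mean_row in group_means.iterrows():
    df[f"{degree}_mean"] = ((df[feature_cols] - mean_row) ** 3).sum(axis=1)

print(df)
```

The loop runs once per group rather than once per row, so it should stay cheap with 100+ columns as long as the number of groups is small. Because an odd power keeps the sign, positive and negative deviations can cancel in the sum; squaring, as in the snippet above, avoids that.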

【Comments on the question】:

    Tags: python dataframe mean


    【Solution 1】:
    import pandas as pd

    # dictionary of lists (named `data` so the built-in `dict` is not shadowed)
    data = {'degree': ["MBA", "BCA", "M.Tech", "MBA", "BCA"],
            'name': [5, 4, 2, 3, 2],
            'score': [90, 40, 80, 98, 60],
            'game': [100, 200, 300, 100, 400],
            'money': [100, 200, 300, 100, 400],
            'loan': [100, 200, 300, 100, 400],
            'rent': [100, 200, 300, 100, 400],
            'location': [100, 200, 300, 100, 400]}

    # creating a dataframe from the dictionary
    df = pd.DataFrame(data)
    print(df)

    # per-degree mean of every numeric column
    dfx = df.groupby("degree").mean()
    print(dfx)

    def fun(x):
        # Subtract the mean row of this row's own degree from its numeric
        # columns. Looking the mean up by label replaces the three hard-coded
        # branches and the positional x[0] access.
        return x.drop('degree') - dfx.loc[x['degree']]

    df_added = df.apply(fun, axis=1)
    df_added
    

    Result

       degree  name  score  game  money  loan  rent  location
    0     MBA     5     90   100    100   100   100       100
    1     BCA     4     40   200    200   200   200       200
    2  M.Tech     2     80   300    300   300   300       300
    3     MBA     3     98   100    100   100   100       100
    4     BCA     2     60   400    400   400   400       400

    Group means (dfx):

            name  score  game  money  loan  rent  location
    degree                                                
    BCA        3     50   300    300   300   300       300
    M.Tech     2     80   300    300   300   300       300
    MBA        4     94   100    100   100   100       100

    df_added: difference of each element from its group's column mean

       name  score  game  money  loan  rent  location
    0     1     -4     0      0     0     0         0
    1     1    -10  -100   -100  -100  -100      -100
    2     0      0     0      0     0     0         0
    3    -1      4     0      0     0     0         0
    4    -1     10   100    100   100   100       100
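
For wide frames, the row-wise `apply` could probably be replaced by a grouped `transform`, which computes the same per-row difference from the row's own group mean without a Python-level loop over rows. This is a sketch rather than part of the original answer: it uses a shortened version of the data above, and the names `feature_cols` and `own_group_mean` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "degree": ["MBA", "BCA", "M.Tech", "MBA", "BCA"],
    "name": [5, 4, 2, 3, 2],
    "score": [90, 40, 80, 98, 60],
    "game": [100, 200, 300, 100, 400],
})

# Assumption: every column except the grouping key is a numeric feature.
feature_cols = [c for c in df.columns if c != "degree"]

# transform("mean") returns a frame shaped like df[feature_cols] in which
# each row holds the mean of that row's own degree group.
own_group_mean = df.groupby("degree")[feature_cols].transform("mean")

# Element-wise difference from the row's own group mean.
df_added = df[feature_cols] - own_group_mean
print(df_added)
```

Since `feature_cols` is derived from `df.columns`, the same code applies unchanged when the frame has 100+ columns.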
    

    【Comments】:

    • My real dataset has more than 100 columns. Is there a better way to do this?