【Question title】: Pandas fuzzy group summary statistics
【Posted】: 2016-10-06 07:48:25
【Question】:

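I have a DataFrame loaded from a CSV file and want to compute basic summary statistics, e.g. mean, variance, ..., over the training runs of every model.

Inserting a model number column and grouping by it works, but it does not seem like a good solution. How can I get summary statistics per model (for training runs only), given that grouping by modelName does not work because of the counter embedded in the names?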

df.groupby(['modelName', 'typeOfRun'])['kappa'].mean()

df[df.typeOfRun != 'validation'].describe()

Neither of these produces the desired result.

AUC_R,Accuracy,Error rate,False negative rate,False positive rate,Lift value,Precision J,Precision N,Rate of negative predictions,Rate of positive predictions,Sensitivity (true positives rate),Specificity (true negatives rate),f1_R,kappa,modelName,typeOfRun
0.7747622323007851,0.7182416731216111,0.28175832687838887,0.16519823788546256,0.28527729751296715,2.769918376242967,0.08117369886485329,0.9930703132218424,0.029305447973147433,0.3013813581203202,0.8348017621145375,0.7147227024870328,0.8312130234716368,0.09987857210248623,00_testing_1-training,training
0.7688154033277225,0.7295055512522592,0.27049444874774076,0.1894273127753304,0.27294188056922464,2.807689674786938,0.08228060368921185,0.9921956531603068,0.029305447973147433,0.28869739220242707,0.8105726872246696,0.7270581194307754,0.8391825769931881,0.10159217699431862,00_testing_2-training,training
0.7653761718477654,0.7217918925897238,0.2782081074102763,0.1883259911894273,0.2809216651150419,2.737743031677203,0.08023078597866318,0.9921552436003304,0.029305447973147433,0.29647560030983733,0.8116740088105727,0.7190783348849581,0.8338281219878937,0.09791120175612114,00_testing_3-training,training
0.7666987721022418,0.7202566535628756,0.2797433464371244,0.18396711202466598,0.2826353437708505,2.7358921138891255,0.08018987022168358,0.9923159476282464,0.02931031885891585,0.2982693958700465,0.816032887975334,0.7173646562291496,0.8327314318650539,0.097878484924986,00_testing-validation,validation
0.7776426005660843,0.7300542215336948,0.2699457784663052,0.17180616740088106,0.2729086314669504,2.8639238514789174,0.08392857142857142,0.9929168180167091,0.029305447973147433,0.28918151303898787,0.8281938325991189,0.7270913685330496,0.8394625719769673,0.10476961017159536,01_otherSet_1-training,training
0.7691501646636157,0.737412858249419,0.26258714175058095,0.197136563876652,0.2645631067961165,2.8639098209585327,0.08392816025788626,0.9919723742039644,0.029305447973147433,0.2803382390911438,0.802863436123348,0.7354368932038835,0.8446557452170924,0.1044486077353842,01_otherSet_2-training,training
0.770174515310113,0.7342176607281178,0.2657823392718823,0.19162995594713655,0.26802101343263735,2.847815513920855,0.08345650938032974,0.9921582766235522,0.029305447973147433,0.283856183836819,0.8083700440528634,0.7319789865673627,0.8424375777288816,0.10367514449353035,01_otherSet_3-training,training
0.7676347850606817,0.7317488289428102,0.26825117105718976,0.19424460431654678,0.2704858255620898,2.8156062097690264,0.08252631578947368,0.9920241385858671,0.02931031885891585,0.2861747473378218,0.8057553956834532,0.7295141744379102,0.8407546494992847,0.10196584743637081,01_otherSet-validation,validation

【Discussion】:

    Tags: python pandas group-by summary


    【Solution 1】:

    IIUC, you can use DataFrameGroupBy.describe:

    print (df.groupby(['modelName', 'typeOfRun']).describe())
    
                                                 f1_R     kappa  
    modelName              typeOfRun                             
    00_testing-validation  validation count  1.000000  1.000000  
                                      mean   0.832731  0.097878  
                                      std         NaN       NaN  
                                      min    0.832731  0.097878  
                                      25%    0.832731  0.097878  
                                      50%    0.832731  0.097878  
                                      75%    0.832731  0.097878  
                                      max    0.832731  0.097878  
    00_testing_1-training  training   count  1.000000  1.000000  
                                      mean   0.831213  0.099879  
                                      std         NaN       NaN  
                                      min    0.831213  0.099879  
                                      25%    0.831213  0.099879  
                                      50%    0.831213  0.099879  
                                      75%    0.831213  0.099879  
                                      max    0.831213  0.099879  
    00_testing_2-training  training   count  1.000000  1.000000  
                                      mean   0.839183  0.101592  
                                      std         NaN       NaN  
    ...
    ...                                  
    

    You can build the groupby key with Series.str.split and select the first item of each resulting list with str[0]:

    print (df.modelName.str.split('_').str[0])
    0    00
    1    00
    2    00
    3    00
    4    01
    5    01
    6    01
    7    01
    Name: modelName, dtype: object
    
    print (df.groupby([df.modelName.str.split('_').str[0]]).describe())
                        AUC_R  Accuracy  Error;rate  False;negative;rate  \
    modelName                                                              
    00        count  4.000000  4.000000    4.000000             4.000000   
              mean   0.768913  0.722449    0.277551             0.181730   
              std    0.004149  0.004924    0.004924             0.011270   
              min    0.765376  0.718242    0.270494             0.165198   
              25%    0.766368  0.719753    0.276280             0.179275   
              50%    0.767757  0.721024    0.278976             0.186147   
              75%    0.770302  0.723720    0.280247             0.188601   
              max    0.774762  0.729506    0.281758             0.189427   
    01        count  4.000000  4.000000    4.000000             4.000000   
              mean   0.771151  0.733358    0.266642             0.188704   
              std    0.004452  0.003198    0.003198             0.011488   
              min    0.767635  0.730054    0.262587             0.171806   
              25%    0.768771  0.731325    0.264984             0.186674   
              50%    0.769662  0.732983    0.267017             0.192937   
              75%    0.772042  0.735016    0.268675             0.194968   
              max    0.777643  0.737413    0.269946             0.197137   
              ...
              ...
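The two steps above (filter to training runs, then group by the leading part of modelName) can be combined into one sketch. This is only an illustration under stated assumptions: the inline CSV is a reduced, rounded copy of the question's data with just the kappa and f1_R columns, and the names csv_text, training, model_key and summary are made up for the example.

```python
import io

import pandas as pd

# Assumption: reduced, rounded copy of the question's data (two metric columns only).
csv_text = """kappa,f1_R,modelName,typeOfRun
0.0999,0.8312,00_testing_1-training,training
0.1016,0.8392,00_testing_2-training,training
0.0979,0.8338,00_testing_3-training,training
0.0979,0.8327,00_testing-validation,validation
0.1048,0.8395,01_otherSet_1-training,training
0.1044,0.8447,01_otherSet_2-training,training
0.1037,0.8424,01_otherSet_3-training,training
0.1020,0.8408,01_otherSet-validation,validation
"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep only the training rows, as asked in the question.
training = df[df["typeOfRun"] == "training"]

# Group key: the part of modelName before the first underscore ("00", "01").
model_key = training["modelName"].str.split("_").str[0].rename("model")

# Mean, variance and row count per model for the selected metrics.
summary = training.groupby(model_key)[["kappa", "f1_R"]].agg(["mean", "var", "count"])
print(summary)
```

Using agg with an explicit list of statistics, rather than describe, keeps the output limited to the statistics the question mentions; describe would work on the same grouped object if the quartiles are wanted as well.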
    

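    【Comments】:

    • Almost there. However, I do not want to group strictly by modelName, because it differs between rows: its counter increases with every fold. Instead, I only want to group by the first (constant) part of the model name.
    • Hmm, so you only need to groupby modelName, i.e. print (df.groupby(['modelName']).describe())? Or can you explain more?
    • I filter the data so that only the "training" rows are in df, and then I want to group by model name. But since the names look like 00_testing_1-training, 00_testing_2-training, I want to ignore the differing counters.
    • Sorry, I am a bit confused. Can you explain what you need? What does "ignore the different counters" mean? But I have an idea: can you change this small DataFrame df = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6], 'C':[7,8,9], 'D':[1,3,5], 'E':[5,3,6], 'F':[7,4,3]}) print (df) and explain what you need?
    • The point is that I want to group by a column containing a1 a2 a3 and b1 b2 b3, which would result in 6 groups. But I only want to get 2 groups, a and b.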
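One way to "ignore the counters" described in these comments is to strip them with a regular expression before grouping. The sketch below is an assumption about what the asker wants, not something confirmed in the thread: the patterns, the toy column names label and value, and the variables names, base, toy, toy_key and toy_means are all hypothetical.

```python
import pandas as pd

# Model names in the shape used in the question.
names = pd.Series([
    "00_testing_1-training",
    "00_testing_2-training",
    "00_testing-validation",
    "01_otherSet_1-training",
])

# Drop an optional "_<counter>" plus the "-training"/"-validation" suffix.
base = names.str.replace(r"(_\d+)?-(training|validation)$", "", regex=True)
print(base.tolist())

# Toy version of the a1..a3 / b1..b3 case from the last comment.
toy = pd.DataFrame({
    "label": ["a1", "a2", "a3", "b1", "b2", "b3"],
    "value": [1, 2, 3, 4, 5, 6],
})

# Remove the trailing digits so six labels collapse into two groups.
toy_key = toy["label"].str.replace(r"\d+$", "", regex=True)
toy_means = toy.groupby(toy_key)["value"].mean()
print(toy_means)
```

Compared with str.split('_').str[0], the regular expression keeps the descriptive part of the name (for example 00_testing rather than 00), which may be easier to read in the summary table.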