【Question title】: Pandas, groupby and finding maximum in groups, returning value and count
【Posted】: 2014-11-03 01:49:00
【Question】:

I have a pandas DataFrame with log data:

        host service
0   this.com    mail
1   this.com    mail
2   this.com     web
3   that.com    mail
4  other.net    mail
5  other.net     web
6  other.net     web

For each host, I want to find the service with the most errors, together with its count:

        host service  no
0   this.com    mail   2
1   that.com    mail   1
2  other.net     web   2

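The only solution I found was to group by host and service and then iterate over level 0 of the index.

Can anyone suggest a better, shorter version? One without the iteration?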

# Count the rows for each (host, service) pair; the counts go in column 'no'
df = df_logfile.groupby(['host','service']).size().to_frame('no')

df_count = pd.DataFrame()
df_count['host'] = df_logfile['host'].unique()
df_count['service']  = None
df_count['no']    = np.nan

# Iterate over level 0 of the index, i.e. one group per host
for h,data in df.groupby(level=0):
  i = data['no'].idxmax()    # (host, service) label of the largest count
  service = i[1]
  no = data.loc[i, 'no']
  df_count.loc[df_count['host'] == h, 'service'] = service
  df_count.loc[(df_count['host'] == h) & (df_count['service'] == service), 'no']   = no

Full code: https://gist.github.com/bjelline/d8066de66e305887b714

【Question comments】:

    Tags: python numpy pandas


    【Solution 1】:

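    Given df (the counts per host and service), the next step is to group by the host value alone and aggregate with idxmax. For each host, this gives you the index label of the row with the largest count. You can then use df.loc[...] to select the rows of df that hold those largest counts: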

    import numpy as np
    import pandas as pd
    
    df_logfile = pd.DataFrame({ 
        'host' : ['this.com', 'this.com', 'this.com', 'that.com', 'other.net', 
                  'other.net', 'other.net'],
        'service' : ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web' ] })
    
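    # Count the rows for each (host, service) pair; the counts go in column 'no'
    df = df_logfile.groupby(['host','service']).size().to_frame('no')
    # For each host, find the (host, service) label with the largest count
    mask = df.groupby(level=0).agg('idxmax')
    # Select those rows from the counts
    df_count = df.loc[mask['no'].tolist()]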
    df_count = df_count.reset_index()
    print("\nOutput\n{}".format(df_count))
    

    which yields the DataFrame

            host service  no
    0  other.net     web   2
    1   that.com    mail   1
    2   this.com    mail   2
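
    To make the idxmax step concrete, the following is a minimal, self-contained sketch of the same idea, assuming a recent pandas release. It is an illustration rather than the answer's original code: the helper name top_service_per_host is hypothetical, and the per-pair counts are built with groupby(...).size(). If two services tie within a host, idxmax keeps the first maximal label, which here is the alphabetically first service because the grouped index is sorted.

```python
import pandas as pd


def top_service_per_host(df_logfile):
    """Return one row per host: its most frequent service and the count.

    Hypothetical helper for illustration; expects 'host' and 'service' columns.
    """
    # Series of counts indexed by a (host, service) MultiIndex.
    counts = df_logfile.groupby(['host', 'service']).size().rename('no')
    # For each host, the (host, service) label that holds the largest count.
    best = counts.groupby(level='host').idxmax()
    # Select those labels, then turn the index back into ordinary columns.
    return counts.loc[best.tolist()].reset_index()


df_logfile = pd.DataFrame({
    'host': ['this.com', 'this.com', 'this.com', 'that.com', 'other.net',
             'other.net', 'other.net'],
    'service': ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web']})

df_count = top_service_per_host(df_logfile)
print(df_count)
```

    Keeping the counts as a Series with a MultiIndex means the labels returned by idxmax can be passed straight to .loc, so no explicit loop over hosts is needed.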
    

    【Comments】:
