【Question title】: Python Pandas | Find maximum value only from a specific part of a column
【Posted】: 2018-06-27 05:03:18
【Question】:

I have been trying to do this for a while. Pandas max() finds the maximum over the whole column. What I need is:

My input csv file:

Id  Param1          Param2              Val1
1  -5.00138282776   2.04990620034e-08   1.738e-05
1  -4.80147838593   2.01516989762e-08   1.628e-05
1  -4.60159301758   1.98263165885e-08   1.671e-05
1  -4.40133094788   1.94918392538e-08   1.576e-05
1  -4.20143127441   1.91767686175e-08   
2  -5.00141859055   6.88369405921e-09   5.512e-06
2  -4.80152130126   6.77335965093e-09   5.964e-06
2  -4.60163593292   6.65415056389e-09
3  -5.00138044357   1.16316911658e-08   4.008e-06
3  -4.80148792267   1.15515588206e-08   7.347e-06
3  -4.60160970681   1.14048361866e-08   8.446e-06
3  -4.40137386322   1.12357021465e-08   

Required output:

Id  Param1          Param2              Val1        Max_Val1_for_each_Id
1  -5.00138282776   2.04990620034e-08   1.738e-05   1.738e-05
1  -4.80147838593   2.01516989762e-08   1.628e-05
1  -4.60159301758   1.98263165885e-08   1.671e-05
1  -4.40133094788   1.94918392538e-08   1.576e-05
1  -4.20143127441   1.91767686175e-08   
2  -5.00141859055   6.88369405921e-09   5.512e-06   5.964e-06
2  -4.80152130126   6.77335965093e-09   5.964e-06
2  -4.60163593292   6.65415056389e-09
3  -5.00138044357   1.16316911658e-08   4.008e-06   8.446e-06
3  -4.80148792267   1.15515588206e-08   7.347e-06
3  -4.60160970681   1.14048361866e-08   8.446e-06
3  -4.40137386322   1.12357021465e-08 

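I am not sure how to select/group the values in the Val1 column that share the same Id and then find their maximum. In addition, there are some blanks in the Val1 column, which cause its dtype to be object. I don't know how to approach this. Any help would be welcome.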

    Tags: python python-2.7 pandas dataframe


    【Solution 1】:

    Use GroupBy.transform to create a new column holding the max value of each group:

    df['Max_Val1_for_each_Id'] = df.groupby('Id')['Val1'].transform('max')
    print (df)
        Id    Param1        Param2      Val1  Max_Val1_for_each_Id
    0    1 -5.001383  2.049906e-08  0.000017              0.000017
    1    1 -4.801478  2.015170e-08  0.000016              0.000017
    2    1 -4.601593  1.982632e-08  0.000017              0.000017
    3    1 -4.401331  1.949184e-08  0.000016              0.000017
    4    1 -4.201431  1.917677e-08       NaN              0.000017
    5    2 -5.001419  6.883694e-09  0.000006              0.000006
    6    2 -4.801521  6.773360e-09  0.000006              0.000006
    7    2 -4.601636  6.654151e-09       NaN              0.000006
    8    3 -5.001380  1.163169e-08  0.000004              0.000008
    9    3 -4.801488  1.155156e-08  0.000007              0.000008
    10   3 -4.601610  1.140484e-08  0.000008              0.000008
    11   3 -4.401374  1.123570e-08       NaN              0.000008
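
As a minimal, self-contained sketch of the same idea: the frame below is a shortened, hypothetical version of the question's data, not the asker's actual file. It shows that `transform('max')` returns one value per original row (so it can be assigned directly as a column), whereas a plain `max()` aggregation returns one row per `Id`.

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened data modelled on the question (NaN marks a blank Val1).
df = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 3, 3],
    "Val1": [1.738e-05, 1.628e-05, np.nan,
             5.512e-06, 5.964e-06,
             4.008e-06, 8.446e-06],
})

# transform('max') broadcasts each group's maximum back to every row of the
# group, keeping the original index; NaN values are skipped.
df["Max_Val1_for_each_Id"] = df.groupby("Id")["Val1"].transform("max")

# For comparison: aggregation collapses each group to a single row.
per_group = df.groupby("Id")["Val1"].max()
```

Here `per_group` has three entries (one per `Id`), while the new column has the same length as `df`.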
    

    Then, if the value is needed only on the first row of each group, add where with a mask created by duplicated; ~ inverts the mask:

    df['Max_Val1_for_each_Id'] = df['Max_Val1_for_each_Id'].where(~df['Id'].duplicated())
    print (df)
        Id    Param1        Param2      Val1  Max_Val1_for_each_Id
    0    1 -5.001383  2.049906e-08  0.000017              0.000017
    1    1 -4.801478  2.015170e-08  0.000016                   NaN
    2    1 -4.601593  1.982632e-08  0.000017                   NaN
    3    1 -4.401331  1.949184e-08  0.000016                   NaN
    4    1 -4.201431  1.917677e-08       NaN                   NaN
    5    2 -5.001419  6.883694e-09  0.000006              0.000006
    6    2 -4.801521  6.773360e-09  0.000006                   NaN
    7    2 -4.601636  6.654151e-09       NaN                   NaN
    8    3 -5.001380  1.163169e-08  0.000004              0.000008
    9    3 -4.801488  1.155156e-08  0.000007                   NaN
    10   3 -4.601610  1.140484e-08  0.000008                   NaN
    11   3 -4.401374  1.123570e-08       NaN                   NaN
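
To make the masking step easier to follow, here is a small sketch on made-up values (the numbers are illustrative only). `duplicated()` is `False` for the first occurrence of each `Id` and `True` afterwards, so `~duplicated()` selects exactly the first row per `Id`, and `where` replaces every other row with NaN.

```python
import pandas as pd

# Illustrative values only; the column already holds the per-group maximum.
df = pd.DataFrame({
    "Id": [1, 1, 2, 2, 2, 3],
    "Max_Val1_for_each_Id": [3.0, 3.0, 7.0, 7.0, 7.0, 5.0],
})

# False for the first occurrence of each Id, True for every later one.
dup = df["Id"].duplicated()

# where keeps values where the condition is True and puts NaN elsewhere.
first_only = df["Max_Val1_for_each_Id"].where(~dup)
```

Note that `duplicated()` looks at the whole column, so if the same `Id` appeared again further down in a non-contiguous block, only its very first occurrence would keep the value.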
    

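    Edit:

    If the blank entries in Val1 are not NaN values (the column has object dtype) and the solution above raises an error:

    TypeError: '>=' not supported between instances of 'float' and 'str'

    The first step is to convert the non-numeric values to NaN: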

    df['Val1'] = pd.to_numeric(df['Val1'], errors='coerce')
    df['Max_Val1_for_each_Id'] = df.groupby('Id')['Val1'].transform('max')
    df['Max_Val1_for_each_Id'] = df['Max_Val1_for_each_Id'].where(~df['Id'].duplicated())
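
The conversion step can be checked in isolation. The sketch below assumes the blanks arrived as whitespace strings, which is one way a column can end up with object dtype; the actual cause in the asker's file is not stated. With `errors='coerce'`, anything that cannot be parsed as a number becomes NaN and the result is a float column.

```python
import pandas as pd

# Assumption: blanks were read as strings, so the whole column is object dtype.
df = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "Val1": ["1.738e-05", " ", "5.512e-06", "5.964e-06"],
})

# Unparseable entries (here the whitespace string) are turned into NaN.
converted = pd.to_numeric(df["Val1"], errors="coerce")

df["Val1"] = converted
df["Max_Val1_for_each_Id"] = df.groupby("Id")["Val1"].transform("max")
df["Max_Val1_for_each_Id"] = df["Max_Val1_for_each_Id"].where(~df["Id"].duplicated())
```

Checking `df.dtypes` before and after the conversion is a quick way to confirm that `Val1` went from object to float64.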
    

      【Solution 2】:

      A fun way using Numpy

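      import numpy as np
      import pandas as pd

      # f: integer group code for each row, u: the unique Id values
      f, u = pd.factorize(df.Id)
      out = np.zeros(len(u))  # per-group maximum; starting at 0 assumes Val1 is non-negative
      whr = np.ones(len(u), np.int64) * len(f)  # will hold the first row position of each group

      mask = np.isnan(df.Val1)  # requires Val1 to be numeric already

      np.maximum.at(out, f[~mask], df.Val1[~mask])
      np.minimum.at(whr, f, np.arange(len(f)))

      df.assign(Max_Val1_for_each_Id=pd.Series(out, df.index[whr]))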
      
          Id    Param1        Param2      Val1  Max_Val1_for_each_Id
      0    1 -5.001383  2.049906e-08  0.000017              0.000017
      1    1 -4.801478  2.015170e-08  0.000016                   NaN
      2    1 -4.601593  1.982632e-08  0.000017                   NaN
      3    1 -4.401331  1.949184e-08  0.000016                   NaN
      4    1 -4.201431  1.917677e-08       NaN                   NaN
      5    2 -5.001419  6.883694e-09  0.000006              0.000006
      6    2 -4.801521  6.773360e-09  0.000006                   NaN
      7    2 -4.601636  6.654151e-09       NaN                   NaN
      8    3 -5.001380  1.163169e-08  0.000004              0.000008
      9    3 -4.801488  1.155156e-08  0.000007                   NaN
      10   3 -4.601610  1.140484e-08  0.000008                   NaN
      11   3 -4.401374  1.123570e-08       NaN                   NaN
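
A self-contained variant of the NumPy approach, offered as a sketch rather than as the answer's own code: it uses made-up data that includes negative values and starts the running maximum at `-inf` instead of 0, so that groups whose values are all negative are handled. The variable names are hypothetical. A group consisting only of NaN would be left at `-inf` here.

```python
import numpy as np
import pandas as pd

# Made-up data; group 2 contains only negative values.
df = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 3],
    "Val1": [2.0, 5.0, np.nan, -4.0, -1.0, 3.0],
})

codes, uniques = pd.factorize(df["Id"])  # codes: 0-based group number per row
vals = df["Val1"].to_numpy()
valid = ~np.isnan(vals)

# Unbuffered in-place maximum: each group slot ends up with its largest value.
group_max = np.full(len(uniques), -np.inf)
np.maximum.at(group_max, codes[valid], vals[valid])

# Unbuffered in-place minimum over row positions: first row of each group.
first_pos = np.full(len(uniques), len(codes), dtype=np.int64)
np.minimum.at(first_pos, codes, np.arange(len(codes)))

# The Series is indexed by the first row label of each group, so assign
# aligns it on the index and leaves all other rows as NaN.
result = df.assign(
    Max_Val1_for_each_Id=pd.Series(group_max, index=df.index[first_pos])
)
```

The `ufunc.at` methods are used because plain fancy-index assignment would not accumulate correctly when the same group code appears more than once.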
      
