【Question title】: How do you group by multiple columns in Pandas and add rows for missing groups?
【Posted】: 2022-09-27 21:33:50
【Question description】:

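Suppose my dataset has 3 nominal/categorical variables: year (2 unique values), gender (2 unique values), and country (2 unique values), plus 2 numeric variables: years of work experience and salary. Now imagine there is no data for females in the USA in 2010 (and there are several such missing groups). I want to:

  1. Group by year, gender, and country, and aggregate work experience and salary by their mean.
  2. Then, for the missing groups, add each possible missing group as a row, filling work experience and salary with, say, zero.

    I can do step 1 with pandas groupby. I need help with step 2. Or is there a better overall approach to this problem?

    Example: original data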

    Years  Gender  Country  Salary  Work ex
    2010   Male    USA      50      2
    2011   Female  India    30      1
    2011   Male    India    10      3
    2011   Male    USA      50      2
    2011   Female  USA      80      2
    2010   Male    USA      50      1

    After step 1:

    Years  Gender  Country  Mean Salary  Mean Work ex
    2010   Male    USA      50           1.5
    2011   Female  India    30           1
    2011   Male    India    10           3
    2011   Male    USA      50           2
    2011   Female  USA      80           2
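
For reference, the step 1 aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the sample rows are copied from the question (with "Ind" treated as "India"), and the name `df_grp` and the renamed output columns are only labels chosen to match the tables here.

```python
import pandas as pd

# Sample rows from the question ("Ind" is assumed to mean "India")
df = pd.DataFrame({
    'Years':   [2010, 2011, 2011, 2011, 2011, 2010],
    'Gender':  ['Male', 'Female', 'Male', 'Male', 'Female', 'Male'],
    'Country': ['USA', 'India', 'India', 'USA', 'USA', 'USA'],
    'Salary':  [50, 30, 10, 50, 80, 50],
    'Work ex': [2, 1, 3, 2, 2, 1],
})

# Step 1: mean of the numeric columns per (Years, Gender, Country) group
df_grp = (
    df.groupby(['Years', 'Gender', 'Country'], as_index=False)[['Salary', 'Work ex']]
      .mean()
      .rename(columns={'Salary': 'Mean Salary', 'Work ex': 'Mean Work ex'})
)
print(df_grp)
```

Only the five combinations that actually occur in the data appear in `df_grp`, which is exactly why step 2 is needed.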

    After step 2:

    Years  Gender  Country  Mean Salary  Mean Work ex
    2010   Male    USA      50           1.5
    2010   Male    India    NA           NA
    2010   Female  USA      NA           NA
    2010   Female  India    NA           NA
    2011   Female  India    30           1
    2011   Male    India    10           3
    2011   Male    USA      50           2
    2011   Female  USA      80           2

    Tags: python pandas group-by data-manipulation


    【Solution 1】:

    Suppose you have completed step 1 and call the result df_grp.

    Then build a DataFrame containing every possible combination of ['Years', 'Gender', 'Country'], for example:

    import pandas as pd

    # One row per (Years, Gender, Country) combination
    df_all = pd.MultiIndex.from_product([[2010, 2011], ['Male', 'Female'], ['India', 'USA']]).to_frame()
    df_all = df_all.reset_index(drop=True)
    df_all.columns = ['Years', 'Gender', 'Country']
    

    Then do an outer merge with df_grp:

    out = df_all.merge(df_grp, on=['Years', 'Gender', 'Country'], how='outer')
    

    Printing out gives:

       Years  Gender Country  Mean Salary  Mean Work ex.
    0   2010    Male   India          NaN            NaN
    1   2010    Male     USA         50.0            1.5
    2   2010  Female   India          NaN            NaN
    3   2010  Female     USA          NaN            NaN
    4   2011    Male   India         10.0            3.0
    5   2011    Male     USA         50.0            2.0
    6   2011  Female   India         30.0            1.0
    7   2011  Female     USA         80.0            2.0
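
As a variation on the approach above, the combinations do not have to be typed out by hand. The sketch below derives them from the values present in each key column and uses a left merge instead of an outer one (a swapped-in choice, not what the answer uses), so the row order of the full grid is kept. It assumes every value of interest appears at least once in the data; the names `keys`, `df_all`, and `out` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Years':   [2010, 2011, 2011, 2011, 2011, 2010],
    'Gender':  ['Male', 'Female', 'Male', 'Male', 'Female', 'Male'],
    'Country': ['USA', 'India', 'India', 'USA', 'USA', 'USA'],
    'Salary':  [50, 30, 10, 50, 80, 50],
    'Work ex': [2, 1, 3, 2, 2, 1],
})

keys = ['Years', 'Gender', 'Country']

# Step 1: per-group means (only groups that occur in the data)
df_grp = df.groupby(keys, as_index=False)[['Salary', 'Work ex']].mean()

# Every combination of the values seen in each key column
df_all = pd.MultiIndex.from_product(
    [df[k].unique() for k in keys], names=keys
).to_frame(index=False)

# Left merge: keep every combination, NaN where no group exists
out = df_all.merge(df_grp, on=keys, how='left')
print(out)
```

Because `df_all` already contains every key combination that `df_grp` can contain, a left merge and an outer merge return the same set of rows here.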
    


      【Solution 2】:

      Make sure the grouping variables are categorical, then use DataFrame.groupby():

      import pandas as pd

      df = pd.DataFrame({'Years': {0: 2010, 1: 2011, 2: 2011, 3: 2011, 4: 2011, 5: 2010},
                         'Gender': {0: 'Male', 1: 'Female', 2: 'Male', 3: 'Male', 4: 'Female', 5: 'Male'},
                         'Country': {0: 'USA', 1: 'India', 2: 'India', 3: 'USA', 4: 'USA', 5: 'USA'},
                         'Salary': {0: 50, 1: 30, 2: 10, 3: 50, 4: 80, 5: 50},
                         'Work ex': {0: 2, 1: 1, 2: 3, 3: 2, 4: 2, 5: 1}})
      
      # Categorical keys let groupby emit every combination of categories
      df[['Years', 'Gender', 'Country']] = df[['Years', 'Gender', 'Country']].astype('category')
      
      # observed=False keeps unobserved combinations; pass it explicitly, since the default differs across pandas versions
      df.groupby(['Years', 'Gender', 'Country'], observed=False)[['Salary', 'Work ex']].mean().reset_index()
      

      Output:

        Years  Gender Country  Salary  Work ex
      0  2010  Female   India     NaN      NaN
      1  2010  Female     USA     NaN      NaN
      2  2010    Male   India     NaN      NaN
      3  2010    Male     USA    50.0      1.5
      4  2011  Female   India    30.0      1.0
      5  2011  Female     USA    80.0      2.0
      6  2011    Male   India    10.0      3.0
      7  2011    Male     USA    50.0      2.0
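
The categorical approach produces rows for combinations that never occur because of the `observed` parameter of `groupby`. A small sketch of the difference, using a made-up three-row frame (not the question's data): with `observed=False` every combination of categories becomes a group, while `observed=True` keeps only the combinations that appear. The default for this parameter has differed between pandas versions, so passing it explicitly is the safer choice.

```python
import pandas as pd

# Tiny illustrative frame: 2 genders x 2 countries, but only 3 combinations occur
df = pd.DataFrame({
    'Gender':  pd.Categorical(['Male', 'Male', 'Female']),
    'Country': pd.Categorical(['USA', 'India', 'USA']),
    'Salary':  [50, 10, 80],
})

# All category combinations, including the unobserved (Female, India)
all_groups = df.groupby(['Gender', 'Country'], observed=False)['Salary'].mean()

# Only the combinations that actually appear in the data
seen_only = df.groupby(['Gender', 'Country'], observed=True)['Salary'].mean()

print(len(all_groups))  # 4
print(len(seen_only))   # 3
```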
      

      You can also set the missing values to zero by doing the following:

      df.groupby(['Years', 'Gender', 'Country'], observed=False)[['Salary', 'Work ex']].mean().fillna(0).reset_index()
      

      Output:

        Years  Gender Country  Salary  Work ex
      0  2010  Female   India     0.0      0.0
      1  2010  Female     USA     0.0      0.0
      2  2010    Male   India     0.0      0.0
      3  2010    Male     USA    50.0      1.5
      4  2011  Female   India    30.0      1.0
      5  2011  Female     USA    80.0      2.0
      6  2011    Male   India    10.0      3.0
      7  2011    Male     USA    50.0      2.0
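
A third option, not used in either answer, is to leave the key columns as plain values and reindex the grouped result against the full grid of combinations. The sketch below is one possible way to do this, assuming the set of values seen in each column is the set you want; `keys`, `means`, `full_index`, and `out` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Years':   [2010, 2011, 2011, 2011, 2011, 2010],
    'Gender':  ['Male', 'Female', 'Male', 'Male', 'Female', 'Male'],
    'Country': ['USA', 'India', 'India', 'USA', 'USA', 'USA'],
    'Salary':  [50, 30, 10, 50, 80, 50],
    'Work ex': [2, 1, 3, 2, 2, 1],
})

keys = ['Years', 'Gender', 'Country']

# Per-group means, indexed by the three key columns
means = df.groupby(keys)[['Salary', 'Work ex']].mean()

# Full grid of key combinations as a MultiIndex
full_index = pd.MultiIndex.from_product(
    [df[k].unique() for k in keys], names=keys
)

# Rows missing from `means` are created and filled with 0
out = means.reindex(full_index, fill_value=0).reset_index()
print(out)
```

Unlike `fillna(0)`, the `fill_value` argument of `reindex` only applies to rows that were absent from the grouped result, so a genuine NaN mean from an existing group would be left as is.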
      
