How to fix Index Error due to level in Pandas Groupby: answers

【Question title】: How to fix Index Error due to level in Pandas Groupby
【Posted】: 2020-09-18 06:51:10
【Question description】:

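I have the following DataFrame, badges. The UserId column contains multiple entries for the same user. For a given BadgeName, I want to get the minimum Date for each UserId. I wrote a function, user_badge_dt, to do this, but it raises an IndexError. One thing to note: although the dataset is the same for every badge, I get this error only for some badges and not for others. I don't know why this happens.

Part of the badges DataFrame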

    UserId    BadgeName            Date                   
0     23    Curious         2016-01-12T18:44:49.267 
1     22    Autobiographer  2017-01-12T18:44:49.267 
2     23    Curious         2018-01-12T18:44:49.267 
3     20    Autobiographer  2019-01-12T18:44:49.267 
4     22    Autobiographer  2020-01-12T18:44:49.267
5     30    Curious         2020-01-12T18:44:49.267

Function

#Function to obtain UserId with the date-time of obtaining given badge for the first time
def user_badge_dt(badge_name):
  
  #Creating DataFrame to obtain all UserId and date-Time of given badge
  df = badges[['UserId','Date']].loc[badges.BadgeName == badge_name]
  
  #Obtaining the first date-time of badge attainment
  v = df.groupby("UserId", group_keys=False)['Date'].nsmallest(1)
  v.index = v.index.droplevel(1)

  df['date'] = df['UserId'].map(v)
  df.drop(columns='Date',inplace=True)
  
  #Removing all duplicate values of Users
  df.drop_duplicates(subset='UserId',  inplace=True )

  return df

Error

IndexError: Too many levels: Index has only 1 level, not 2

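Note
On further inspection, I found that the error is raised by this line: v.index = v.index.droplevel(1)

This is because the preceding line of code gives differently shaped results for different badge names:

Case 1: the code works correctly for the given badge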

df = badges[['UserId','Date']].loc[badges.BadgeName == 'Autobiographer']
v = df.groupby("UserId", group_keys=False)['Date'].nsmallest(1)
print(v)

o/p:

    1   22    2017-01-12T18:44:49.267 
    3   20    2019-01-12T18:44:49.267

(This output has the index, the UserId, and the minimum Date for the given badge.)

Case 2: the code fails for the given badge

df = badges[['UserId','Date']].loc[badges.BadgeName == 'Curious']
v = df.groupby("UserId", group_keys=False)['Date'].nsmallest(1)
print(v)

o/p:

      23   2016-01-12T18:44:49.267 
      30   2020-01-12T18:44:49.267

(This output has no index level, which is why the code fails on the next line. I don't know how this happens.)
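
One way to tell the two cases apart is to look at `v.index.nlevels` before calling `droplevel`. The sketch below does not reproduce the groupby result itself. It builds two hypothetical indexes by hand (`two_level` and `one_level` are illustrative names, not part of the question's code) that mimic the two shapes printed above, and shows that `droplevel(1)` works on the first and raises an IndexError on the second.

```python
import pandas as pd

# Hypothetical stand-in for Case 1: two levels (UserId plus original row label).
two_level = pd.MultiIndex.from_arrays([[22, 20], [1, 3]], names=["UserId", "row"])

# Hypothetical stand-in for Case 2: a single-level index.
one_level = pd.Index([23, 30], name="UserId")

print(two_level.nlevels)  # 2
print(one_level.nlevels)  # 1

# Dropping level 1 is fine when two levels exist.
dropped = two_level.droplevel(1)
print(list(dropped))

# On a single-level index there is no level 1 to drop, so pandas raises IndexError.
raised = False
try:
    one_level.droplevel(1)
except IndexError as exc:
    raised = True
    print(exc)
```

If the shape can vary, guarding the call with a check such as `if v.index.nlevels > 1:` would avoid the exception, although it does not explain why the shape differs between badges.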

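For any badge_name input, the function should return a DataFrame containing each UserId and the minimum Date for the given badge. If my function is unclear, please suggest another way to achieve this with a new function.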

【Question discussion】:

Tags: python python-3.x pandas dataframe pandas-groupby

【Solution 1】:

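I cannot reproduce your error, but I think your solution can be simplified with DataFrame.sort_values: sort by Date, then keep the first row for each user, which is the one with the minimum date: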

import pandas as pd

# Parse the date strings so they sort chronologically
badges['Date'] = pd.to_datetime(badges['Date'])

def user_badge_dt(badge_name):
  
  #Creating DataFrame to obtain all UserId and date-Time of given badge
  return  (badges.loc[badges.BadgeName == badge_name, ['UserId','Date']]
                 .sort_values('Date')
                 .drop_duplicates(subset='UserId'))
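
For comparison, here is a self-contained sketch of the same idea written with `groupby(...).min()` instead of sorting and dropping duplicates. It rebuilds the six sample rows from the question; the function name `first_badge_dates` and the choice to pass `badges` as an argument are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question.
badges = pd.DataFrame({
    "UserId": [23, 22, 23, 20, 22, 30],
    "BadgeName": ["Curious", "Autobiographer", "Curious",
                  "Autobiographer", "Autobiographer", "Curious"],
    "Date": ["2016-01-12T18:44:49.267", "2017-01-12T18:44:49.267",
             "2018-01-12T18:44:49.267", "2019-01-12T18:44:49.267",
             "2020-01-12T18:44:49.267", "2020-01-12T18:44:49.267"],
})
badges["Date"] = pd.to_datetime(badges["Date"])


def first_badge_dates(badges, badge_name):
    """Return one row per UserId with the earliest Date for badge_name."""
    # Keep only the rows for the requested badge.
    subset = badges.loc[badges["BadgeName"] == badge_name, ["UserId", "Date"]]
    # as_index=False keeps UserId as an ordinary column in the result.
    return subset.groupby("UserId", as_index=False)["Date"].min()


curious = first_badge_dates(badges, "Curious")
autobiographer = first_badge_dates(badges, "Autobiographer")
print(curious)
print(autobiographer)
```

Both versions return one row per user. The groupby version orders the rows by UserId, while the sort_values version orders them by Date.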

【Discussion】: