Get longest streak of consecutive weeks by group in pandas

【Question title】: Get longest streak of consecutive weeks by group in pandas
【Posted】: 2023-03-12 06:30:02
【Question】:

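I'm currently working with weekly data for several subjects, but there can be long gaps with no data, so what I want is to keep, for each id, only the longest streak of consecutive weeks. My data looks like this:

My expected output is:

I got fairly close by flagging rows with 1 where week == week.shift() + 1. The problem is that this approach doesn't flag the first row of each streak, and I also can't filter down to the longest one: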

df.loc[ (df['id'] == df['id'].shift())&(df['week'] == df['week'].shift()+1),'streak']=1

For my example, this gives:

id    week  streak
1      8     nan
1      15    nan
1      60    nan
1      61    1
1      62    1
2      10    nan
2      11    1
2      12    1
2      13    1
2      25    nan
2      26    1

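Any ideas on how to achieve what I want?

【Comments】:

You can get another column (streak1) with week == week.shift(-1) - 1, which lets you identify the first row of each streak as well. You may need to xor streak and streak1 to get the final result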
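
A minimal sketch of the idea in this comment, using the example values from the question. The two flags are combined here with a logical OR, which is an assumption on my part: with XOR, rows in the middle of a streak (where both flags are true) would be unflagged. The column name `in_streak` is made up for illustration, and the shifts are done per `id` so a streak cannot run across two ids.

```python
import pandas as pd

# Example data taken from the table in the question.
df = pd.DataFrame({
    "id":   [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "week": [8, 15, 60, 61, 62, 10, 11, 12, 13, 25, 26],
})

week_by_id = df.groupby("id")["week"]

# True when this week directly follows the previous row's week (same id).
follows_prev = df["week"].eq(week_by_id.shift() + 1)
# True when this week directly precedes the next row's week (same id).
precedes_next = df["week"].eq(week_by_id.shift(-1) - 1)

# A row belongs to some streak of length >= 2 if either flag is set.
df["in_streak"] = follows_prev | precedes_next
print(df)
```

This only marks membership in a streak; it does not yet say which streak is the longest, which is what the answers below address.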

Tags: python pandas time-series

【Solution 1】:

Try this:

df['consec'] = df.groupby(['id',df['week'].diff(-1).ne(-1).shift().bfill().cumsum()]).transform('count')

df[df.groupby('id')['consec'].transform('max') == df.consec]

Output:

   id  week  consec
2   1    60       3
3   1    61       3
4   1    62       3
5   2    10       4
6   2    11       4
7   2    12       4
8   2    13       4
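
To see what the grouping key in the one-liner does, it may help to build it step by step. The sketch below follows the same idea but swaps `shift().bfill()` for `shift(fill_value=True)`, which keeps the boolean dtype; that substitution and the variable names are mine, not part of the answer.

```python
import pandas as pd

# Example data taken from the table in the question.
df = pd.DataFrame({
    "id":   [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "week": [8, 15, 60, 61, 62, 10, 11, 12, 13, 25, 26],
})

# Step 1: True on the last row of each run (the next week is not week + 1).
ends_run = df["week"].diff(-1).ne(-1)

# Step 2: move the mark down one row, so it sits on the first row of each run.
starts_run = ends_run.shift(fill_value=True)

# Step 3: a running total of run starts gives every row a run label.
run_label = starts_run.cumsum().rename("run")

# Step 4: size of each (id, run) group, broadcast back onto the rows.
run_size = df.groupby(["id", run_label])["week"].transform("count")

print(pd.DataFrame({"week": df["week"], "run": run_label, "size": run_size}))
```

Rows whose `size` equals the per-id maximum are the ones kept by the second line of the answer.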

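【Comments】:

Wow, great. The only problem is that I get this error when running the first line: ValueError: Wrong number of items passed 30, placement implies 1. Do you know what's going on?
Try upgrading pandas first.
Just a small suggestion on your second group key: I think it's safer to build it with groupby. df.groupby('id').week.apply(lambda x : x.diff().ne(1).cumsum())
Tried updating pandas, and now I can't import it. I get this error: AttributeError: module 'numpy.core.umath' has no attribute 'divmod'
Edit: updated numpy and now the import works. Now I get ValueError: Wrong number of items passed 25, placement implies 1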
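
The thread never confirms the cause of this ValueError. One plausible reading is that the real frame has many more columns than the example: `transform('count')` on the whole frame returns one column per non-key column, and those cannot all be assigned to the single `consec` column. A sketch under that assumption selects `week` before transforming and builds the run key per `id`, in the spirit of the groupby suggestion above. The `value` column is hypothetical and only stands in for the extra columns.

```python
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "week": [8, 15, 60, 61, 62, 10, 11, 12, 13, 25, 26],
    "value": list(range(11)),  # hypothetical extra column, not in the question
})

# Per-id differences, so a run can never continue across two ids.
# cumsum() turns each "new run starts here" flag into a run label.
run_id = df.groupby("id")["week"].diff().ne(1).cumsum().rename("run_id")

# Selecting ["week"] makes transform return a single Series,
# so the assignment works no matter how many columns df has.
df["consec"] = df.groupby(["id", run_id])["week"].transform("count")

# Keep the rows of the longest run within each id.
result = df[df["consec"] == df.groupby("id")["consec"].transform("max")]
print(result)
```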

【Solution 2】:

Not as concise as @ScottBoston's answer, but I like this approach

import numpy as np
import pandas as pd

def max_streak(s):
  a = s.values    # Let's deal with an array

  # I need to know where the differences are not `1`.
  # Also, because I plan to use `diff` again, I'll wrap
  # the boolean array with `True` to make things cleaner
  b = np.concatenate([[True], np.diff(a) != 1, [True]])

  # Tell the locations of the breaks in streak
  c = np.flatnonzero(b)

  # `diff` again tells me the length of the streaks
  d = np.diff(c)

  # `argmax` will tell me the location of the largest streak
  e = d.argmax()

  return c[e], d[e]

def make_thing(df):
  start, length = max_streak(df.week)
  return df.iloc[start:start + length].assign(consec=length)

pd.concat([
  make_thing(g) for _, g in df.groupby('id')    
])

   id  week  consec
2   1    60       3
3   1    61       3
4   1    62       3
5   2    10       4
6   2    11       4
7   2    12       4
8   2    13       4
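
For anyone following the array steps, here is a small walk-through of the same logic on the weeks of id 2 from the question, with descriptive variable names of my own choosing. It assumes the weeks are already sorted within the id.

```python
import numpy as np

weeks = np.array([10, 11, 12, 13, 25, 26])  # weeks for id 2 in the question

# True at every position where a new streak begins, plus a closing sentinel.
breaks = np.concatenate([[True], np.diff(weeks) != 1, [True]])

# Positions of the streak boundaries: streaks are [0, 4) and [4, 6).
edges = np.flatnonzero(breaks)

# Distance between consecutive boundaries is the streak length.
lengths = np.diff(edges)

# Index of the longest streak; argmax returns the first one on ties.
best = lengths.argmax()
start, length = edges[best], lengths[best]

print(weeks[start:start + length])
```

Because `argmax` returns the first maximum, this approach keeps only the earliest streak when two streaks tie for longest, whereas the filter in Solution 1 keeps every row whose streak length equals the maximum.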

【Comments】: