Running total of consecutive identical values: answers

【Question title】: Running total of consecutive identical values
【Posted】: 2016-03-26 02:38:19
【Question】:

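How can I get a running count of consecutive 1s in a pandas Series?

For example, given s = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4]), I want to get pd.Series([0, 1, 0, 1, 2, 0, 0, 1, 2, 3, 0]).

(pandas 0.18.0)

【Question comments】:

【Solution 1】:

You can use groupby with cumcount, where the group key is the cumsum of the comparison s1 != 1: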

s1 = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4])  # the series from the question (named s there)
print(s1.groupby((s1 != 1).cumsum()).cumcount())
0     0
1     1
2     0
3     1
4     2
5     0
6     0
7     1
8     2
9     3
10    0
dtype: int64

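A more detailed explanation:

# Note: the output below comes from a 12-element series (see the 'orig' column),
# not the 11-element series in the question.
s1 = pd.Series([5, 1, 3, 4, 1, 1, 2, 3, 1, 1, 1, 4])
df = s1.to_frame('orig')
df['not1'] = s1 != 1
df['cumsum'] = (s1 != 1).cumsum()
df['cumcount'] = s1.groupby((s1 != 1).cumsum()).cumcount()
# s1.groupby((s1 != 1).cumsum()).cumcount() is the same as:
df['cumcount1'] = df.groupby('cumsum')['orig'].cumcount()
print(df)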
    orig   not1  cumsum  cumcount  cumcount1
0      5   True       1         0          0
1      1  False       1         1          1
2      3   True       2         0          0
3      4   True       3         0          0
4      1  False       3         1          1
5      1  False       3         2          2
6      2   True       4         0          0
7      3   True       5         0          0
8      1  False       5         1          1
9      1  False       5         2          2
10     1  False       5         3          3
11     4   True       6         0          0

Or:

s1 = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4])  # the series from the question
print((s1 == 1) * (s1.groupby((s1 != s1.shift()).cumsum()).cumcount() + 1))
0     0
1     1
2     0
3     1
4     2
5     0
6     0
7     1
8     2
9     3
10    0
dtype: int64

Explanation:

# Note: the output below comes from a 12-element series (see the 'orig' column).
s1 = pd.Series([5, 1, 3, 4, 1, 1, 2, 3, 1, 1, 1, 4])
df = s1.to_frame('orig')
df['compare_shift'] = s1 != s1.shift()
df['cumsum'] = (s1 != s1.shift()).cumsum()
df['cumcount'] = s1.groupby((s1 != s1.shift()).cumsum()).cumcount() + 1
df['cumcount1'] = df.groupby('cumsum')['orig'].cumcount() + 1
df['is1'] = (s1 == 1)
# True is converted to 1, False to 0
df['fin'] = (s1 == 1) * (s1.groupby((s1 != s1.shift()).cumsum()).cumcount() + 1)
print(df)
    orig compare_shift  cumsum  cumcount  cumcount1    is1  fin
0      5          True       1         1          1  False    0
1      1          True       2         1          1   True    1
2      3          True       3         1          1  False    0
3      4          True       4         1          1  False    0
4      1          True       5         1          1   True    1
5      1         False       5         2          2   True    2
6      2          True       6         1          1  False    0
7      3          True       7         1          1  False    0
8      1          True       8         1          1   True    1
9      1         False       8         2          2   True    2
10     1         False       8         3          3   True    3
11     4          True       9         1          1  False    0
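
The shift-based variant can be wrapped in a small helper that works for any target value. The sketch below is an illustration, not part of the original answer; the name `count_consecutive` and the series `lead` are hypothetical. It also shows one difference between the two variants: when the series starts with the target value, grouping on `(s != 1).cumsum()` counts the first run from 0, while the shift-based version counts it from 1.

```python
import pandas as pd


def count_consecutive(s, value=1):
    """Running length of each run of `value`; 0 everywhere else."""
    # A new run starts wherever an element differs from its predecessor.
    run_id = (s != s.shift()).cumsum()
    # Position inside the run, counted from 1.
    pos_in_run = s.groupby(run_id).cumcount() + 1
    # Keep the position only where the element equals the target value.
    return pos_in_run.where(s == value, 0)


s = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4])
print(count_consecutive(s).tolist())
# [0, 1, 0, 1, 2, 0, 0, 1, 2, 3, 0]

# Hypothetical series that starts with the target value.
lead = pd.Series([1, 1, 5, 1])
print(count_consecutive(lead).tolist())
# [1, 2, 0, 1]
print(lead.groupby((lead != 1).cumsum()).cumcount().tolist())
# [0, 1, 0, 1]
```

If the data can start with the target value, the shift-based grouping is the safer of the two.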

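【Comments】:

I think this needs one full pass over the rows for (s1 != 1), one for cumsum, one for groupby, and one for cumcount. Does the fact that it takes 4 passes slow it down compared with a (hypothetical) dedicated pandas method? (Of course, I know that even so it is still much faster than a pure Python loop.)
I think it is faster/better because it uses pandas functions, even though it makes 4 passes.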
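
One rough way to check this on your own machine is to time the vectorized expression against a plain Python loop. The sketch below is only an illustration: the helper names `loop_count` and `vectorized_count`, the random test data, and the repetition count are all assumptions made here. It uses the shift-based expression so that both helpers agree on every input. Timings depend on the pandas version and the hardware, so no numbers are quoted.

```python
import timeit

import numpy as np
import pandas as pd


def loop_count(values, target=1):
    """Pure Python baseline: one pass, resetting the counter on a miss."""
    out = []
    run = 0
    for v in values:
        run = run + 1 if v == target else 0
        out.append(run)
    return out


def vectorized_count(s, target=1):
    """Shift-based groupby/cumcount expression from the answer."""
    run_id = (s != s.shift()).cumsum()
    return (s == target) * (s.groupby(run_id).cumcount() + 1)


# Hypothetical test data: 100,000 random integers in [1, 3].
rng = np.random.default_rng(0)
big = pd.Series(rng.integers(1, 4, size=100_000))

# Both approaches should agree before their speed is compared.
assert vectorized_count(big).tolist() == loop_count(big.tolist())

t_loop = timeit.timeit(lambda: loop_count(big.tolist()), number=5)
t_vec = timeit.timeit(lambda: vectorized_count(big), number=5)
print(f"loop: {t_loop:.3f}s  vectorized: {t_vec:.3f}s")
```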

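【Solution 2】:

Not the prettiest way (and probably not the optimal one), but the following gets the job done (and is about 4.5 times faster than the other loop-based answer):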

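import pandas as pd

s = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4])

def consecutive_n(s, n=1):
    # reindex (rather than [s.index]) so that missing labels become NaN
    a = s[s == n].cumsum().reindex(s.index).fillna(0) / n
    b = a[a.diff() > 1]
    c = (b.rank() - b).reindex(s.index).fillna(0).cumsum()
    # replace negative values with 0
    return (a + c).apply(lambda x: max(x, 0)).astype(int)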

>>> consecutive_n(s, n=1)
0     0
1     1
2     0
3     1
4     2
5     0
6     0
7     1
8     2
9     3
10    0
dtype: int64

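Some explanation of the intermediate values:
a: the running occurrence number of 1 (or n) across the whole series, that is, which occurrence each 1 is.
c: how much has to be added to a to "reset" the occurrence count when a different number appears between the 1s (or n's).
Return value: the lambda is applied to drop negative values, and the result has the form a + c.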
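
To make the role of `c` concrete: at the start of each later run, `a + c` has to equal 1, so the cumulative offset there must be `1 - b`. The rank-based expression gives that for the examples shown here, but when traced by hand on an input such as `[1, 0, 1, 1, 0, 1, 1, 0, 1]` the last element comes out as 0 rather than 1. The sketch below is one possible variant under that reading, not the original author's code. The name `consecutive_n_alt` and the series `u` are hypothetical, and `clip` replaces the lambda.

```python
import pandas as pd


def consecutive_n_alt(s, n=1):
    """Hypothetical variant of consecutive_n with a diff-based offset."""
    # a: running count of n over the whole series, 0 where s != n.
    a = s[s == n].cumsum().reindex(s.index).fillna(0) / n
    # b: value of a at the start of every run after the first one.
    b = a[a.diff() > 1]
    # The cumulative offset at each run start must be 1 - b, so each step
    # subtracts the growth of b since the previous run start.
    step = -b.diff().fillna(b - 1)
    c = step.reindex(s.index).fillna(0).cumsum()
    # Rows outside runs are 0 or negative; clip them to 0.
    return (a + c).clip(lower=0).astype(int)


s = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4])
print(consecutive_n_alt(s).tolist())
# [0, 1, 0, 1, 2, 0, 0, 1, 2, 3, 0]

# Hypothetical input with three runs after the first one.
u = pd.Series([1, 0, 1, 1, 0, 1, 1, 0, 1])
print(consecutive_n_alt(u).tolist())
# [1, 0, 1, 2, 0, 1, 2, 0, 1]
```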

Edit: changed the code slightly so that it works for any positive integer. Example:

>>> t = pd.Series([1, 2, 3, 1, 4, 2, 2, 3, 2, 2, 2, 1])
>>> consecutive_n(t, 2)
0     0
1     1
2     0
3     0
4     0
5     1
6     2
7     0
8     1
9     2
10    3
11    0
dtype: int64

【Comments】: