Pandas dataframe: split or groupby dataframe at each occurrence of value (True) in column (answers)

【Question title】: Pandas dataframe split or groupby dataframe at each occurrence of value (True) in column
【Posted】: 2021-09-18 22:12:29
【Question】:

I have a df like this:

import pandas as pd

df = pd.DataFrame({
    'words': ['hi', 'this', 'is', 'a', 'sentence', 'this', 'is', 'another', 'sentence'],
    'indicator': [1, 0, 0, 0, 0, 1, 0, 0, 0],
})

This gives me:

    words  indicator
0        hi          1
1      this          0
2        is          0
3         a          0
4  sentence          0
5      this          1
6        is          0
7   another          0
8  sentence          0

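Now I want to merge all values of the column 'words' that follow a '1' in the 'indicator' column, up to (but not including) the row where the next '1' appears. Something like this would be the desired result: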

                      words  indicator  counter
0     hi this is a sentence          1        5
1  this is another sentence          1        4

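This is not easy to explain, which is why I am relying on the example. I tried groupby and split but could not find a solution. My last attempt was to set up some kind of df.iterrows() loop, but I would like to avoid that because the real df is very large.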

Thanks in advance for your help!

【Question comments】:

Tags: python pandas dataframe group-by

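【Solution 1】:

You can take the cumulative sum of the indicator column. Because each 1 marks the start of a sentence, the running total is the same for every row of one sentence and increases by one at the next. Grouping by it lets you join all the words of a sentence with a single space and count the number of words in each sentence.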

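# Turn the 0/1 flags into a sentence label: 1, 1, 1, 1, 1, 2, 2, 2, 2
df["indicator"] = df["indicator"].cumsum()
# One row per label: join the words with spaces and count the rows
df = df.groupby(
    "indicator", as_index=False
).agg(
    words=("words", " ".join),
    counter=("indicator", "size")
)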
#    indicator                     words  counter
# 0          1     hi this is a sentence        5
# 1          2  this is another sentence        4
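
Note that the code above overwrites 'indicator' with the sentence label, so the output shows 1 and 2 there rather than the 1 and 1 in the desired result. If the original column should be kept, a minimal sketch of one possible variant is below. It assumes every sentence starts with a row where indicator is 1; the names sentence_id and result are only illustrative and do not come from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "words": ["hi", "this", "is", "a", "sentence", "this", "is", "another", "sentence"],
    "indicator": [1, 0, 0, 0, 0, 1, 0, 0, 0],
})

# Running total of the flags: every row of a sentence gets the same label.
# Renamed so it cannot be confused with the "indicator" column itself.
sentence_id = df["indicator"].cumsum().rename("sentence_id")

result = (
    df.groupby(sentence_id)
    .agg(
        words=("words", " ".join),         # join the words of one sentence
        indicator=("indicator", "first"),  # keep the original flag of the first row
        counter=("words", "size"),         # number of rows (words) per sentence
    )
    .reset_index(drop=True)                # drop the helper label from the index
)
print(result)
```

Any rows that come before the first 1 would get the label 0 and end up as their own group, so they may need to be filtered out separately.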

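【Comments】:

Thanks a lot! Performing a cumulative sum was new to me, and that did the trick.
I just updated my answer because I realized you can get the size and join all the words in a single groupby operation. Let me know if any of it doesn't make sense.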