Create grouping based on cumulative sum and another group

[Question title]: Create grouping based on cumulative sum and another group
[Posted]: 2020-04-04 00:08:00
[Question description]:

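This question is nearly identical to: Create new group based on cumulative sum and group

However, when I apply the accepted solution to my data, it does not produce the expected result.

In short, I have data with two variables: domain and value. domain is a grouping variable with multiple observations, and value is a continuous value that I want to accumulate by domain and by a new group variable, newgroup. There are three main rules:

1. I only accumulate within each domain. When I reach the end of a domain, the accumulation resets.
2. Once the cumulative sum reaches at least 1.0, the observations whose values add up to at least 1.0 are assigned a distinct value of group1. Note that a single observation can be enough to satisfy this rule.
3. If the last group in a domain has a cumulative sum below 1.0, it is merged with the second-to-last group in the same domain. This is reflected in the variable group2.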
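
To make the three rules concrete, here is a minimal base R sketch that reads them as a plain loop. The function name assign_groups and the small input vectors are hypothetical and are not part of the original post. The sketch assumes that rows of the same domain are contiguous, and that a domain whose only group stays below 1.0 is left as its own group (the rules do not cover that case). It is a reference for checking results, not a vectorized solution.

```r
# Hypothetical helper (not from the post): a plain-loop reading of the three rules.
assign_groups <- function(domain, value, threshold = 1) {
  n <- length(value)
  group1 <- integer(n)
  group2 <- integer(n)
  id1 <- 0L  # last group1 id used so far, across domains
  id2 <- 0L  # last group2 id used so far, across domains
  for (d in unique(domain)) {
    idx <- which(domain == d)
    # Rules 1 and 2: accumulate inside the domain and close a group
    # as soon as the running sum reaches the threshold.
    local_grp <- integer(length(idx))
    grp <- 1L
    run_sum <- 0
    for (k in seq_along(idx)) {
      run_sum <- run_sum + value[idx[k]]
      local_grp[k] <- grp
      if (run_sum >= threshold) {
        grp <- grp + 1L
        run_sum <- 0
      }
    }
    # Rule 3: fold a trailing group whose sum is below the threshold
    # into the previous group of the same domain, if there is one.
    last_grp <- max(local_grp)
    merged <- local_grp
    if (last_grp > 1L && sum(value[idx][local_grp == last_grp]) < threshold) {
      merged[local_grp == last_grp] <- last_grp - 1L
    }
    group1[idx] <- id1 + local_grp
    group2[idx] <- id2 + merged
    id1 <- id1 + last_grp
    id2 <- id2 + max(merged)
  }
  data.frame(domain, value, group1, group2)
}

# Made-up input: domain 1 ends with a short group, domain 2 never reaches 1.0.
res <- assign_groups(domain = c(1, 1, 1, 1, 2, 2),
                     value  = c(1, 0.4, 0.7, 0.2, 0.3, 0.2))
res$group1  # 1 2 2 3 4 4
res$group2  # 1 2 2 2 3 3
```

Because it loops row by row, this is likely to be slow on 10^5 to 10^6 rows; its main use is as a yardstick for faster approaches.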

The data below has been simplified. The real data typically has 10^5 to 10^6 rows, so a vectorized solution would be ideal.

Example data

domain <- c(rep(1,5),rep(2,8))
value <- c(1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)


 domain value
      1   1.0
      1   0.0
      1   2.0
      1   2.5
      1   0.1
      2   0.1
      2   0.5
      2   0.0
      2   0.2
      2   0.6
      2   0.0
      2   0.0
      2   0.1

Desired output

cumsum_val <- c(1,0,2,2.5,0.1,0.1,0.6,0.6,0.8,1.4,0,0,0.1)
group1 <- c(1,2,2,3,4,5,5,5,5,5,6,6,6)
group2 <- c(1,2,2,3,3,4,4,4,4,4,4,4,4) #Satisfies Rule #3
df_want <- data.frame(domain,value,cumsum_val,group1,group2)

 domain value cumsum_val group1 group2
      1   1.0        1.0      1      1
      1   0.0        0.0      2      2
      1   2.0        2.0      2      2
      1   2.5        2.5      3      3
      1   0.1        0.1      4      3
      2   0.1        0.1      5      4
      2   0.5        0.6      5      4
      2   0.0        0.6      5      4
      2   0.2        0.8      5      4
      2   0.6        1.4      5      4
      2   0.0        0.0      6      4
      2   0.0        0.0      6      4
      2   0.1        0.1      6      4

I used the following code:

sum0 <- function(x, y) { if (x + y >= 1.0) 0 else x + y }
is_start <- function(x) head(c(TRUE, Reduce(sum0, init=0, x, acc = TRUE)[-1] == 0), -1)
cumsum(ave(df_raw$value, df_raw$domain, FUN = is_start))
## 1 2 3 4 5 6 6 6 6 6 7 8 9

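But the last line produces values that differ from group1 above. Generating the group1 output is the main thing causing me trouble. Can someone help me understand what is_start does and how it is supposed to produce the grouping?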
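
Tracing is_start by hand on domain 1 suggests where the zeros cause trouble. sum0 sets the running total to 0 when it reaches 1.0, and is_start then treats every running total of 0 as a reset. A total that is still 0 because only zeros have been added therefore looks like the start of a new group. The sketch below reproduces that trace and then tries a variant, is_start2 (a hypothetical name, not from the linked answer), which keeps the total that crossed 1.0 and resets on the following row, so that resets are detected with >= 1 rather than == 0. It has only been checked against the sample data in this post.

```r
domain <- c(rep(1, 5), rep(2, 8))
value  <- c(1, 0, 2, 2.5, 0.1, 0.1, 0.5, 0, 0.2, 0.6, 0, 0, 0.1)

# Helper from the post: the total becomes 0 as soon as it reaches 1.
sum0 <- function(x, y) { if (x + y >= 1.0) 0 else x + y }
trace_d1 <- Reduce(sum0, value[domain == 1], init = 0, accumulate = TRUE)[-1]
trace_d1  # 0 0 0 0 0.1: the 0 at row 2 comes from adding 0, not from a reset

# Hypothetical variant: keep the crossing total and reset one row later.
is_start2 <- function(x, threshold = 1) {
  running <- Reduce(function(acc, y) if (acc >= threshold) y else acc + y,
                    x, accumulate = TRUE)
  # A row starts a group if it is first in the domain or follows a crossing.
  c(TRUE, head(running >= threshold, -1))
}

group1 <- cumsum(ave(value, domain, FUN = is_start2))
group1  # 1 2 2 3 4 5 5 5 5 5 6 6 6
```

Reduce still walks the vector element by element in R, so this explains the grouping logic but does not speed it up.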

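Edit: akrun provided some working code in the comments for the simplified example above. However, there are still cases where it does not work. For example,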

domain <- c(rep(1,7),rep(2,8))
value <- c(1,0,1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)

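The output looks like this, where new comes from akrun's code, and group1 and group2 are the desired groupings based on Rules #2 and #3. The differences between new and group2 occur mainly in the first 3 rows.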

 domain value new group1 group2
      1   1.0   1      1      1
      1   0.0   2      2      2
      1   1.0   3      2      2
      1   0.0   4      3      3
      1   2.0   4      3      3
      1   2.5   5      4      4
      1   0.1   5      5      4
      2   0.1   6      6      5
      2   0.5   6      6      5
      2   0.0   6      6      5
      2   0.2   6      6      5
      2   0.6   6      6      5
      2   0.0   6      7      5
      2   0.0   6      7      5
      2   0.1   6      7      5

Edit 2: I have added an answer that works.

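[Question comments]:

@akrun Yes, I've updated the post to make it clearer what I'm asking. I changed the data example to the "swapped" version.
Yes, ultimately, but generating group1 is what is causing me trouble.
At row 2, 1 + 0 = 1 satisfies >= 1, so it is assigned a new ID in group1. At the third row, isn't it still 0 + 2 = 2? That satisfies >= 1, so group2 = 3?
Maybe df_want %>% group_by(domain) %>% mutate(new = cumsum(c(0, abs(diff(value)))<= 1), new = if(n_distinct(new) == n()) 1 else new) %>% ungroup %>% mutate(new = rleid(new))
Since row 1 satisfies Rule #2, row 2 should get new values of group1 and group2. It should not accumulate with row 1. Row 2 on its own satisfies no condition, but when it is added to row 3, Rule #2 is satisfied. Since rows 2 and 3 are summed to satisfy Rule #2, they should be in the same group1 and group2.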

Tags: r dplyr cumsum

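[Solution 1]:

This works! It combines purrr's accumulate (similar to cumsum but more general) with cumsum and appropriate use of group_by to get what you want. I added comments to indicate what each part does. I'll note that next_group2 is a bit of a misnomer (it's more like not_next_group2), but hopefully the rest is clear.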

library(tidyverse)

domain <- c(rep(1,5),rep(2,8))
value <- c(1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)

## Modified from: https://stackoverflow.com/questions/49076769/dplyr-r-cumulative-sum-with-reset
sum_reset_at = function(val_col, threshold, include.equals = TRUE) {
  if (include.equals) {
    purrr::accumulate({{val_col}}, ~if_else(.x>=threshold , .y, .x+.y))
  } else {
    purrr::accumulate({{val_col}}, ~if_else(.x>threshold , .y, .x+.y))
  }
}

df_raw %>% 
  group_by(domain) %>% 
  mutate(cumsum_val = sum_reset_at(value, 1)) %>% 
  mutate(next_group1 = ifelse(lag(cumsum_val) >= 1 | row_number() == 1, 1, 0)) %>% ## binary interpretation of whether there should be a new group
  ungroup %>% 
  mutate(group1 = cumsum(next_group1)) %>% ## generate new groups
  group_by(domain, group1) %>%
  mutate(next_group2 = ifelse(max(cumsum_val) < 1 & row_number() == 1, 1, 0)) %>% ## similar to above, but grouped by your new group1; we ask it only to transition at the first value of the group that doesn't reach 1
  ungroup %>% 
  mutate(group2 = cumsum(next_group1 - next_group2)) %>% ## cancel out the next_group1 binary if it meets the conditions of next_group2
  select(-starts_with("next_"))

As described, this produces:

# A tibble: 13 x 5
   domain value cumsum_val group1 group2
    <dbl> <dbl>      <dbl>  <dbl>  <dbl>
 1      1   1          1        1      1
 2      1   0          0        2      2
 3      1   2          2        2      2
 4      1   2.5        2.5      3      3
 5      1   0.1        0.1      4      3
 6      2   0.1        0.1      5      4
 7      2   0.5        0.6      5      4
 8      2   0          0.6      5      4
 9      2   0.2        0.8      5      4
10      2   0.6        1.4      5      4
11      2   0          0        6      4
12      2   0          0        6      4
13      2   0.1        0.1      6      4
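
The step that turns group1 into group2 can be looked at in isolation. The sketch below writes out by hand the two flag columns that the pipeline above would produce for domain 1 (rows 1 to 5), and shows how subtracting next_group2 cancels the group increment for a trailing group that never reaches 1.0. The flag values are derived from the cumsum_val column in the output above and use base R only.

```r
# Flags for domain 1 (rows 1-5), derived by hand from cumsum_val = 1, 0, 2, 2.5, 0.1.
next_group1 <- c(1, 1, 0, 1, 1)  # 1 where a new group1 starts
next_group2 <- c(0, 0, 0, 0, 1)  # 1 at the first row of a group whose max stays below 1

group1 <- cumsum(next_group1)                # 1 2 2 3 4
group2 <- cumsum(next_group1 - next_group2)  # 1 2 2 3 3

# Row 5 opens group1 = 4, but its increment is cancelled in group2,
# so it stays in the same group2 as row 4.
data.frame(next_group1, next_group2, group1, group2)
```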

[Comments]:

[Solution 2]:

The following solution is adapted from Group vector on conditional sum.

Helper Rcpp function

library(Rcpp)
cppFunction('
IntegerVector CreateGroup(NumericVector x, double cutoff) {
    IntegerVector groupVec (x.size());
    int group = 1;        // current group id
    double runSum = 0;    // running sum since the last reset
    for (int i = 0; i < x.size(); i++) {
        runSum += x[i];
        groupVec[i] = group;

        if (runSum >= cutoff) {
            group++;
            runSum = 0;
        }
    }
    return groupVec;
}
')

Main code

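library(dplyr)

domain <- c(rep(1,7),rep(2,8))
value <- c(1,0,1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)

df_raw %>%
  group_by(domain) %>%
  mutate(group1 = CreateGroup(value,1),  # per-domain group ids (Rules #1 and #2)
         # Rule #3: fold the last group into the previous one when the final value is below 1
         group1 = ifelse(group1==max(group1) & last(value) < 1,
                        max(group1)-1,group1)) %>%
  ungroup() %>%
  mutate(group2 = data.table::rleid(group1))  # rleid() comes from data.table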

 domain value group1 group2
      1   1.0      1      1
      1   0.0      2      2
      1   1.0      2      2
      1   0.0      3      3
      1   2.0      3      3
      1   2.5      4      4
      1   0.1      4      4
      2   0.1      1      5
      2   0.5      1      5
      2   0.0      1      5
      2   0.2      1      5
      2   0.6      1      5
      2   0.0      1      5
      2   0.0      1      5
      2   0.1      1      5
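
One thing that may be worth checking before applying this to other data: the merge condition looks at last(value) rather than at the sum of the last group. On the data above the two agree, but they can differ when the last group reaches 1.0 only by accumulation. The sketch below uses a made-up three-row domain and a hand-traced group1 (what CreateGroup(value, 1) should return for it) to compare the two conditions. The variable names are hypothetical.

```r
value  <- c(1, 0.5, 0.6)
group1 <- c(1, 2, 2)  # hand-traced result of CreateGroup(value, 1)

is_last <- group1 == max(group1)
merge_by_last_value <- tail(value, 1) < 1       # TRUE: 0.6 < 1
merge_by_group_sum  <- sum(value[is_last]) < 1  # FALSE: 0.5 + 0.6 = 1.1

# Condition used in the answer: the last group is merged although it sums to 1.1.
by_value <- ifelse(is_last & merge_by_last_value, max(group1) - 1, group1)
by_value  # 1 1 1

# Condition based on the group sum: the last group is kept, as Rule #3 suggests.
by_sum <- ifelse(is_last & merge_by_group_sum, max(group1) - 1, group1)
by_sum    # 1 2 2
```

If that case can occur in the real data, comparing the sum of the last group against 1 is one possible adjustment.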

[Comments]: