【Question title】: dplyr: using column created by mutate in the mutation itself
【Posted】: 2018-01-02 12:16:22
【Question description】:

I have a data frame that looks like this:

> df
# A tibble: 5,427 x 3
    cond desired   inc
   <chr>   <dbl> <dbl>
 1  <NA>       0     0
 2  <NA>       5     5
 3     X      10     5
 4     X       7     7
 5  <NA>      16    16
 6  <NA>      21     5
 7  <NA>      26     5
 8  <NA>      31     5
 9     X      37     6
10  <NA>       5     5

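This already contains my desired output. What I want to do is sum up the values of inc, but reset the sum whenever the cond column of the previous row contains an X. For example, in row 9 I take the desired value from the previous row (31) and add the inc value of row 9 (6), which gives 37. In row 10 I take only the inc value, because the cond column of the previous row is X. I solved this with a loop, but I would like a vectorized solution. So far I have this: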

df$test <- 0
df <- df %>% mutate(test = ifelse(is.na(lag(df$cond)), lag(test) + inc, inc))

After running the second line once, I get this:

> df
# A tibble: 5,427 x 4
    cond desired   inc  test
   <chr>   <dbl> <dbl> <dbl>
 1  <NA>       0     0    NA
 2  <NA>       5     5     5
 3     X      10     5     5
 4     X       7     7     7
 5  <NA>      16    16    16
 6  <NA>      21     5     5
 7  <NA>      26     5     5
 8  <NA>      31     5     5
 9     X      37     6     6
10  <NA>       5     5     5

After running it a second time, it looks like this:

> df
# A tibble: 5,427 x 4
    cond desired   inc  test
   <chr>   <dbl> <dbl> <dbl>
 1  <NA>       0     0    NA
 2  <NA>       5     5    NA
 3     X      10     5    10
 4     X       7     7     7
 5  <NA>      16    16    16
 6  <NA>      21     5    21
 7  <NA>      26     5    10
 8  <NA>      31     5    10
 9     X      37     6    11
10  <NA>       5     5     5
# ... with 5,417 more rows

After the third time:

> df
# A tibble: 5,427 x 4
    cond desired   inc  test
   <chr>   <dbl> <dbl> <dbl>
 1  <NA>       0     0    NA
 2  <NA>       5     5    NA
 3     X      10     5    NA
 4     X       7     7     7
 5  <NA>      16    16    16
 6  <NA>      21     5    21
 7  <NA>      26     5    26
 8  <NA>      31     5    15
 9     X      37     6    16
10  <NA>       5     5     5

And finally, after the fifth time:

> df
# A tibble: 5,427 x 4
    cond desired   inc  test
   <chr>   <dbl> <dbl> <dbl>
 1  <NA>       0     0    NA
 2  <NA>       5     5    NA
 3     X      10     5    NA
 4     X       7     7     7
 5  <NA>      16    16    16
 6  <NA>      21     5    21
 7  <NA>      26     5    26
 8  <NA>      31     5    31
 9     X      37     6    37
10  <NA>       5     5     5

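Inside the mutate command I am using the very column that mutate is creating, and I guess that is what causes this behavior. Is there any way to get the result I want? Thanks in advance!

The data frame: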

structure(list(cond = c(NA, NA, "X", "X", NA, NA, NA, NA, "X", 
NA, NA, NA, NA, NA, NA, NA, NA, NA, "X", NA, NA, NA, NA, "X", 
NA, NA, NA, NA, NA, NA, "X", NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, "X", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, "X", NA, NA, NA, NA, NA, NA, NA, "X", NA, NA, "X", 
NA, NA, NA, NA, NA, "X", NA, NA, NA, NA, NA, NA, NA, "X", NA, 
NA, NA, NA, NA, NA, NA, NA, "X", NA, NA, NA, NA, NA, NA, NA, 
NA, "X", NA, NA, NA, NA, NA, NA, "X", NA, NA, NA, NA, NA, NA, 
NA, NA, NA, "X", NA, NA, NA, "X", NA, NA, NA, NA, "X", NA, NA, 
NA, NA, NA, NA, NA, NA, "X", NA, NA, "X", NA, NA, NA, NA, "X", 
NA, NA, NA, NA, NA, NA, NA, NA, "X", NA, NA, NA, NA, NA, NA, 
NA, "X", NA, "X", NA, NA, NA, NA, NA, NA, NA, NA, "X", NA, NA, 
NA, NA, NA, NA, NA, "X", NA, NA, NA, "X", "X", NA, NA, NA, NA, 
NA, NA, NA, NA, "X", "X", NA, "X", NA, NA, NA, NA, NA, NA, NA, 
NA, "X", NA, NA, NA, "X", NA, NA, NA, NA, NA, NA, NA, NA, "X", 
NA, NA, NA, NA, NA, "X", NA, NA, NA, NA, "X", NA, NA, NA, NA, 
"X", NA, NA, NA, NA, NA, "X", NA, NA, NA, NA, NA, NA, NA, NA, 
"X", NA, NA, NA, NA, NA, NA, "X", NA, NA, NA, NA, "X", NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "X", NA, "X", 
NA, "X", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, "X", NA, NA, NA), desired = c(0, 5, 10, 7, 16, 21, 26, 
31, 37, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 5, 10, 15, 20, 
30, 7, 15, 21, 25, 40, 45, 55, 12, 20, 25, 30, 35, 40, 45, 50, 
55, 60, 65, 70, 75, 5, 10, 15, 20, 22, 30, 35, 45, 50, 55, 60, 
65, 70, 75, 9, 14, 19, 24, 29, 34, 39, 44, 5, 7, 10, 2, 7, 12, 
17, 22, 27, 5, 10, 15, 20, 25, 30, 35, 38, 4, 7, 12, 17, 22, 
27, 32, 37, 39, 13, 18, 23, 28, 33, 38, 43, 48, 53, 5, 10, 15, 
20, 25, 30, 35, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 5, 10, 
15, 20, 2, 10, 15, 20, 25, 5, 10, 15, 20, 25, 30, 35, 40, 45, 
5, 8, 12, 5, 10, 14, 19, 24, 5, 10, 15, 20, 25, 30, 35, 40, 45, 
5, 10, 15, 20, 25, 28, 33, 38, 5, 11, 5, 10, 15, 20, 25, 30, 
35, 40, 45, 12, 17, 22, 27, 32, 37, 42, 47, 5, 10, 15, 20, 5, 
5, 10, 15, 20, 25, 30, 35, 40, 45, 5, 5, 10, 5, 10, 15, 20, 25, 
30, 35, 40, 45, 5, 10, 15, 20, 5, 10, 15, 20, 25, 30, 34, 39, 
44, 5, 10, 15, 20, 25, 30, 5, 10, 15, 20, 25, 5, 10, 15, 20, 
25, 5, 10, 15, 20, 25, 29, 5, 10, 15, 20, 23, 25, 30, 35, 40, 
5, 15, 20, 25, 30, 35, 40, 5, 10, 15, 20, 25, 5, 10, 15, 20, 
25, 28, 33, 38, 43, 48, 53, 58, 71, 76, 81, 5, 10, 5, 10, 5, 
10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 5, 
10, 15), inc = c(0, 5, 5, 7, 16, 5, 5, 5, 6, 5, 5, 5, 5, 5, 5, 
5, 5, 5, 5, 5, 5, 5, 5, 10, 7, 8, 6, 4, 15, 5, 10, 12, 8, 5, 
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 2, 8, 5, 10, 5, 5, 
5, 5, 5, 5, 9, 5, 5, 5, 5, 5, 5, 5, 5, 2, 3, 2, 5, 5, 5, 5, 5, 
5, 5, 5, 5, 5, 5, 5, 3, 4, 3, 5, 5, 5, 5, 5, 5, 2, 13, 5, 5, 
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 
5, 5, 5, 5, 5, 5, 2, 8, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 
3, 4, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 
3, 5, 5, 5, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 12, 5, 5, 5, 5, 5, 
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5, 5, 
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 
5, 5, 5, 5, 3, 2, 5, 5, 5, 5, 10, 5, 5, 5, 5, 5, 5, 5, 5, 5, 
5, 5, 5, 5, 5, 5, 3, 5, 5, 5, 5, 5, 5, 13, 5, 5, 5, 5, 5, 5, 
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5)), .Names = c("cond", 
"desired", "inc"), row.names = c(NA, -300L), class = c("tbl_df", 
"tbl", "data.frame"))
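
For reference, the rule described above can be written as an explicit loop. This is only a sketch of what such a loop might look like, not the asker's actual code; `running_reset` is a hypothetical name, and the input is just the first ten rows of the data frame shown above.

```r
# Hypothetical helper (not from the original post): running sum of `inc`
# that restarts on the row following an "X" in `cond`.
running_reset <- function(cond, inc) {
  out <- numeric(length(inc))
  for (i in seq_along(inc)) {
    # Restart on the first row and on any row whose previous cond is "X".
    restart <- i == 1 || (!is.na(cond[i - 1]) && cond[i - 1] == "X")
    out[i] <- if (restart) inc[i] else out[i - 1] + inc[i]
  }
  out
}

# First ten rows of the data frame shown above
cond <- c(NA, NA, "X", "X", NA, NA, NA, NA, "X", NA)
inc  <- c(0, 5, 5, 7, 16, 5, 5, 5, 6, 5)
result <- running_reset(cond, inc)
```

Here `result` should reproduce the first ten values of the desired column, which makes the loop a handy baseline for checking any vectorized version.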

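【Comments on the question】:

  • The cond column is X in row 9 as well, so by the rule you state the sum should be reset there too. Why is row 9 treated differently from rows 3 and 4?
  • An X affects the next row, so the X in row 9 resets the sum and in row 10 inc becomes the sum. The same holds for rows 3 and 4: in rows 4 and 5 the desired value equals that row's inc.

Tags: r dplyr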


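【Solution 1】:

Here is an example that uses the ave() function and the df structure above. For clarity I show every step separately, but they can be condensed if needed.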

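library(dplyr)
df %>% 
  # cond of the previous row (NA for the first row)
  mutate(prevcond = lag(cond)) %>%
  # 1 where the previous row has an X, i.e. where the sum restarts
  mutate(flag = ifelse(is.na(prevcond) | prevcond != 'X', 0, 1)) %>% 
  # running count of restarts, used as a group id
  mutate(counter = cumsum(flag)) %>% 
  # cumulative sum of inc within each group
  mutate(desired2 = ave(inc, counter, FUN = cumsum))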
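
As a rough sketch of how these steps might be condensed: the previous-row check and the counter can be folded into one expression, and a single mutate() call can refer to a column defined earlier in that same call. The names `df_small` and `res` are made up for illustration, and the data is just the first ten rows of the question's data frame.

```r
library(dplyr)

# First ten rows of the question's data (cond and inc only)
df_small <- tibble(
  cond = c(NA, NA, "X", "X", NA, NA, NA, NA, "X", NA),
  inc  = c(0, 5, 5, 7, 16, 5, 5, 5, 6, 5)
)

res <- df_small %>%
  mutate(
    # TRUE where the previous row holds an X; the NA comparisons become FALSE
    counter  = cumsum(coalesce(lag(cond) == "X", FALSE)),
    # cumulative sum of inc within each run between restarts
    desired2 = ave(inc, counter, FUN = cumsum)
  )
```

Note that `counter` is usable on the next line because mutate() evaluates its arguments in order. What it cannot do is refer to the row-by-row evolving value of the column currently being computed, which is why the attempt in the question needs several runs to converge.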

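【Comments】:

    【Solution 2】:

    To get your desired output, we first have to create a grouping column that starts a new group every time the previous row equals X. To do that, we use row_number() in combination with zoo::na.locf(). After that we can simply use cumsum().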

    library(dplyr)
    library(zoo)
    df %>% group_by(grp = na.locf(row_number(cond), 
                                  fromLast = TRUE, 
                                  na.rm = FALSE)) %>%
      mutate(test = cumsum(inc))
    #    cond desired   inc   grp  test
    #   <chr>   <dbl> <dbl> <int> <dbl>
    # 1  <NA>       0     0     1     0
    # 2  <NA>       5     5     1     5
    # 3     X      10     5     1    10
    # 4     X       7     7     2     7
    # 5  <NA>      16    16     3    16
    # 6  <NA>      21     5     3    21
    # 7  <NA>      26     5     3    26
    # 8  <NA>      31     5     3    31
    # 9     X      37     6     3    37
    #10  <NA>       5     5     4     5
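
To see how the grouping column comes about, the two calls can be run on their own. This is a small sketch on the first ten cond values only; `idx` and `grp` are illustrative names.

```r
library(dplyr)
library(zoo)

# First ten cond values from the question
cond <- c(NA, NA, "X", "X", NA, NA, NA, NA, "X", NA)

# row_number() ranks the non-NA entries (ties broken by position) and keeps NA,
# so every X receives its own increasing index.
idx <- row_number(cond)

# Filling backwards gives each row the index of the next X at or below it.
grp <- na.locf(idx, fromLast = TRUE, na.rm = FALSE)
```

In this ten-row excerpt the last row has no X after it, so its group stays NA because of na.rm = FALSE; in the full data it gets the index of the next X (4 in the output above). group_by() treats NA as a group of its own, so trailing rows are still summed together.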
    

    【Comments】:
