【Question title】: Math operations between groups with dplyr and tidyr
【Posted】: 2023-12-27 09:55:01
【Question description】:

When I have tidy data like this dummy example:

temp <- structure(list(year = c(2017L, 2018L, 2019L, 2020L, 2017L, 2018L, 
2019L, 2020L), figure = c("income", "income", "income", "income", 
"expenses", "expenses", "expenses", "expenses"), value = c(10, 
11, 10, 13, 5, 4, 4, 4)), row.names = c(NA, -8L), .Names = c("year", 
"figure", "value"), class = "data.frame")

That is:

  year   figure value
1 2017   income    10
2 2018   income    11
3 2019   income    10
4 2020   income    13
5 2017 expenses     5
6 2018 expenses     4
7 2019 expenses     4
8 2020 expenses     4

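I want to compute the profit for each year (income - expenses), and I use the following approach:

library(dplyr)
library(tidyr)

temp %>% 
  spread(figure, value) %>%              # long -> wide: one column per figure
  mutate(profit = income - expenses) %>% 
  gather(figure, value, -year)           # wide -> long again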

The output is:

   year   figure value
1  2017 expenses     5
2  2018 expenses     4
3  2019 expenses     4
4  2020 expenses     4
5  2017   income    10
6  2018   income    11
7  2019   income    10
8  2020   income    13
9  2017   profit     5
10 2018   profit     7
11 2019   profit     6
12 2020   profit     9

I convert the table to wide format, perform the operation between columns, and then convert the data back to long format.
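For reference, the same wide-then-long round trip can be sketched with `pivot_wider()` and `pivot_longer()`, which tidyr (1.0.0 and later) offers as the successors to `spread()` and `gather()`. This is only an illustration of the workflow described above, not part of the original post; `result` is a name chosen here. The rows come back ordered by year rather than by figure, so the layout differs from the `gather()` output while the values are the same.

```r
library(dplyr)
library(tidyr)

# Same dummy data as in the question
temp <- data.frame(
  year   = rep(2017:2020, times = 2),
  figure = rep(c("income", "expenses"), each = 4),
  value  = c(10, 11, 10, 13, 5, 4, 4, 4)
)

result <- temp %>%
  # long -> wide: one column per figure
  pivot_wider(names_from = figure, values_from = value) %>%
  # arithmetic between the new columns
  mutate(profit = income - expenses) %>%
  # wide -> long again, keeping year as the identifier
  pivot_longer(-year, names_to = "figure", values_to = "value")

print(result)
```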

Is there a way to do the same thing with group_by(), without converting to wide format and then back to long?

Edit:

I have the following data.frame:

temp <- structure(list(year = c(2017L, 2018L, 2019L, 2020L, 2017L, 2018L, 
2019L, 2020L, 2017L, 2018L, 2019L, 2020L, 2017L, 2018L, 2019L, 
2020L), figure = c("income", "income", "income", "income", "expenses", 
"expenses", "expenses", "expenses", "income", "income", "income", 
"income", "expenses", "expenses", "expenses", "expenses"), value = c(10, 
11, 10, 13, 5, 4, 4, 4, 10, 11, 10, 13, 5, 4, 4, 4), company = c("A", 
"A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", 
"B", "B")), .Names = c("year", "figure", "value", "company"), row.names = c(NA, 
-16L), class = "data.frame")

I do:

temp %>% 
  filter(company == "A") %>% 
  group_by(year, company) %>% 
  summarise(value = value[figure == 'income'] - value[figure == 'expenses'], 
           figure = 'profit') %>%
  bind_rows(temp, .)

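The final output contains both company "A" and company "B", whereas it should contain only company "A" (the one kept by the filter). The example shows that if the data is modified before summarising, binding the result to the original data.frame is not a good idea.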

【Question discussion】:

    Tags: r dplyr tidyr


    【Solution 1】:

    For each year, you can subtract the "expenses" value from the "income" value and bind the result to the original data frame (called df here).

    library(dplyr)
    
    df %>%
      group_by(year) %>%
      summarise(value = value[figure == 'income'] - value[figure == 'expenses'], 
                figure = 'profit') %>%
      bind_rows(df, .)
    
    #   year   figure value
    #1  2017   income    10
    #2  2018   income    11
    #3  2019   income    10
    #4  2020   income    13
    #5  2017 expenses     5
    #6  2018 expenses     4
    #7  2019 expenses     4
    #8  2020 expenses     4
    #9  2017   profit     5
    #10 2018   profit     7
    #11 2019   profit     6
    #12 2020   profit     9
    

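    We can also use diff to subtract the values after arranging the data by year and figure. Because "expenses" sorts before "income", diff returns income - expenses within each year.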

    df %>%
      arrange(year, figure) %>%
      group_by(year) %>%
      summarise(value = diff(value),
                figure = 'profit') %>%
      bind_rows(df, .)
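
Both snippets above build the profit rows separately and then bind them to df. As a variant, here is a minimal sketch that appends the profit row inside each group, so no separate `bind_rows()` step is needed. It is not from the original answer: it assumes `dplyr::group_modify()` together with `tibble::add_row()`, and `with_profit` is a name chosen here. It also assumes each year has exactly one "income" and one "expenses" row.

```r
library(dplyr)

# Same dummy data as in the question (called df in this answer)
df <- data.frame(
  year   = rep(2017:2020, times = 2),
  figure = rep(c("income", "expenses"), each = 4),
  value  = c(10, 11, 10, 13, 5, 4, 4, 4)
)

with_profit <- df %>%
  group_by(year) %>%
  # .x is the per-year slice without the grouping column;
  # append one "profit" row to it
  group_modify(~ tibble::add_row(
    .x,
    figure = "profit",
    value  = .x$value[.x$figure == "income"] -
             .x$value[.x$figure == "expenses"]
  )) %>%
  ungroup()

print(with_profit)
```

The rows come back grouped by year (income, expenses, profit for 2017, then 2018, and so on) rather than with all profit rows at the end.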
    

    【Discussion】:

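    • Nice approach, but if I have some filter(), bind_rows(), etc. before summarise(), then I can't bind the rows to the original data.frame.
    • @FrancescVE filter should not be a problem, since filter does not change the number of columns. summarise may or may not change the number of columns, but bind_rows should not be a problem either, because it fills in NA for any column that is missing (see bind_rows(iris, mtcars)). Could you describe specifically what you are trying to do and whether it gives you any error?
    • I have edited the original question with a better example.
    • @FrancescVE What is the expected output for the updated data frame? If it is the same as with your gather approach, then it has columns with mixed values (A, B and numbers) and also duplicated rows.
    • @FrancescVE I am confused about what the question is now. Your original question had an input as well as a clear expected output, and my answer works for it. I am not sure what you are looking for in the updated question.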
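
The thread never settles the expected output for the edited question. If the intent is that the result should contain only the rows that survive the filter() (an assumption), one possible sketch is to store the filtered data in an intermediate object and bind the summary to that object instead of to the original temp. The names `company_a`, `profit_a` and `result_a` are chosen here for illustration, and the `.groups` argument requires dplyr 1.0.0 or later.

```r
library(dplyr)

# Same data as in the edited question: identical figures for companies A and B
temp <- data.frame(
  year    = rep(rep(2017:2020, times = 2), times = 2),
  figure  = rep(rep(c("income", "expenses"), each = 4), times = 2),
  value   = rep(c(10, 11, 10, 13, 5, 4, 4, 4), times = 2),
  company = rep(c("A", "B"), each = 8)
)

# Keep the filtered rows so they can be reused when binding
company_a <- temp %>%
  filter(company == "A")

# One profit row per year and company
profit_a <- company_a %>%
  group_by(year, company) %>%
  summarise(value  = value[figure == "income"] - value[figure == "expenses"],
            figure = "profit",
            .groups = "drop")

# Bind to the filtered data, not to temp, so company B never reappears
result_a <- bind_rows(company_a, profit_a)

print(result_a)
```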