Math operations between groups with dplyr and tidyr: answers

【Question title】: Math operations between groups with dplyr and tidyr
【Posted】: 2023-12-27 09:55:01
【Question description】:

When I have tidy data like this dummy example:

temp <- structure(list(year = c(2017L, 2018L, 2019L, 2020L, 2017L, 2018L, 
2019L, 2020L), figure = c("income", "income", "income", "income", 
"expenses", "expenses", "expenses", "expenses"), value = c(10, 
11, 10, 13, 5, 4, 4, 4)), row.names = c(NA, -8L), .Names = c("year", 
"figure", "value"), class = "data.frame")

That is:

  year   figure value
1 2017   income    10
2 2018   income    11
3 2019   income    10
4 2020   income    13
5 2017 expenses     5
6 2018 expenses     4
7 2019 expenses     4
8 2020 expenses     4

I want to compute the profit for each year (income - expenses), and I use the following approach:

library(dplyr)
library(tidyr)

temp %>% 
  spread(figure, value) %>% 
  mutate(profit = income - expenses) %>% 
  gather(figure, value, -year)

The output is:

   year   figure value
1  2017 expenses     5
2  2018 expenses     4
3  2019 expenses     4
4  2020 expenses     4
5  2017   income    10
6  2018   income    11
7  2019   income    10
8  2020   income    13
9  2017   profit     5
10 2018   profit     7
11 2019   profit     6
12 2020   profit     9

I convert the table to wide format, do the operation between columns, and then convert the data back to long format.
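For reference, the same wide-then-long round trip can be sketched with tidyr's newer pivot_wider() and pivot_longer() functions, which supersede spread() and gather(). This is only an illustration of the approach described above; the data frame is rebuilt with data.frame() for brevity, and the name result is an assumption of this sketch.

```r
library(dplyr)
library(tidyr)

# Same dummy data as above, rebuilt compactly
temp <- data.frame(
  year   = rep(2017:2020, times = 2),
  figure = rep(c("income", "expenses"), each = 4),
  value  = c(10, 11, 10, 13, 5, 4, 4, 4)
)

result <- temp %>%
  # long -> wide: one column per figure
  pivot_wider(names_from = figure, values_from = value) %>%
  # arithmetic between the new columns
  mutate(profit = income - expenses) %>%
  # wide -> long again
  pivot_longer(-year, names_to = "figure", values_to = "value")

result
```

The rows come back grouped by year rather than by figure, so the row order differs from the gather() output even though the values are the same.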

Is there a way to do the same thing with group_by(), without converting to wide format and then back to long format?

Edit:

I have the following data.frame:

temp <- structure(list(year = c(2017L, 2018L, 2019L, 2020L, 2017L, 2018L, 
2019L, 2020L, 2017L, 2018L, 2019L, 2020L, 2017L, 2018L, 2019L, 
2020L), figure = c("income", "income", "income", "income", "expenses", 
"expenses", "expenses", "expenses", "income", "income", "income", 
"income", "expenses", "expenses", "expenses", "expenses"), value = c(10, 
11, 10, 13, 5, 4, 4, 4, 10, 11, 10, 13, 5, 4, 4, 4), company = c("A", 
"A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", 
"B", "B")), .Names = c("year", "figure", "value", "company"), row.names = c(NA, 
-16L), class = "data.frame")

I run:

temp %>% 
  filter(company == "A") %>% 
  group_by(year, company) %>% 
  summarise(value = value[figure == 'income'] - value[figure == 'expenses'], 
           figure = 'profit') %>%
  bind_rows(temp, .)

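The final output contains both company "A" and company "B", but it should contain only the filtered company. The example shows that binding to the original data.frame is not a good idea when the data has been modified before summarising.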

【Question comments】:

Tags: r dplyr tidyr

【Solution 1】:

For each year, you can subtract the "expenses" value from the "income" value and bind the result to the original data frame.

library(dplyr)

# df is the data frame from the question
df %>%
  group_by(year) %>%
  summarise(value = value[figure == 'income'] - value[figure == 'expenses'], 
            figure = 'profit') %>%
  bind_rows(df, .)

#   year   figure value
#1  2017   income    10
#2  2018   income    11
#3  2019   income    10
#4  2020   income    13
#5  2017 expenses     5
#6  2018 expenses     4
#7  2019 expenses     4
#8  2020 expenses     4
#9  2017   profit     5
#10 2018   profit     7
#11 2019   profit     6
#12 2020   profit     9
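For the edited data with a company column, one possible sketch is to keep the filtered data in an intermediate object and bind the profit rows to that object instead of the original data frame. This assumes the intended output contains only the rows of company "A"; the names only_a, profit_a and result are illustrative and not from the question.

```r
library(dplyr)

# Edited data from the question: two companies with the same figures
temp <- data.frame(
  year    = rep(2017:2020, times = 4),
  figure  = rep(rep(c("income", "expenses"), each = 4), times = 2),
  value   = rep(c(10, 11, 10, 13, 5, 4, 4, 4), times = 2),
  company = rep(c("A", "B"), each = 8)
)

# Keep the filtered rows so the profit rows are bound to them, not to temp
only_a <- temp %>% filter(company == "A")

profit_a <- only_a %>%
  group_by(year, company) %>%
  summarise(value = value[figure == "income"] - value[figure == "expenses"],
            figure = "profit",
            .groups = "drop")

# bind_rows() matches columns by name, so the column order does not matter
result <- bind_rows(only_a, profit_a)
result
```

Dropping the filter() step and binding to temp instead would give the profit per year for every company, because the grouping already includes company.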

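We can also arrange the data by year and figure and then subtract the values with diff. After sorting, "expenses" comes before "income" within each year, so diff returns income minus expenses.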

df %>%
  arrange(year, figure) %>%
  group_by(year) %>%
  summarise(value = diff(value),
            figure = 'profit') %>%
  bind_rows(df, .)
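The diff() version depends on the row order within each year, so a quick check can help. The sketch below rebuilds the question's data with data.frame() and compares the result with the explicit subtraction; the names sorted, by_diff and by_name are illustrative.

```r
library(dplyr)

df <- data.frame(
  year   = rep(2017:2020, times = 2),
  figure = rep(c("income", "expenses"), each = 4),
  value  = c(10, 11, 10, 13, 5, 4, 4, 4)
)

# Within each year the rows are sorted as "expenses", then "income"
sorted <- df %>% arrange(year, figure)
head(sorted, 2)

# diff(c(expenses, income)) is income - expenses
by_diff <- sorted %>%
  group_by(year) %>%
  summarise(value = diff(value), figure = "profit")

# Explicit subtraction by figure name, for comparison
by_name <- df %>%
  group_by(year) %>%
  summarise(value = value[figure == "income"] - value[figure == "expenses"],
            figure = "profit")

# Check that both approaches give the same values
identical(by_diff$value, by_name$value)
```

The diff() approach only works when each group has exactly two rows and the alphabetical order of the figure labels matches the intended direction of the subtraction; the explicit subtraction by name does not rely on either.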

【Comments】:

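Nice approach, but if I have some filter(), bind_rows(), etc. before summarise(), then I can't bind the rows to the original data.frame.
@FrancescVE filter should not be a problem, since filter does not change the number of columns. summarise may or may not change the number of columns, but bind_rows should not be a problem either, because it fills in NA for columns that do not exist (see bind_rows(iris, mtcars)). Could you say specifically what you are trying to do and whether it gives you any error?
I have edited the original question to give a better example.
@FrancescVE What is the expected output for the updated data frame? If it is the same as with your gather approach, then it has a column with mixed values (A, B and numbers) and also duplicated rows.
@FrancescVE I am confused about the question now. Your original question had an input and a clear expected output, for which my answer works. I am not sure what you are looking for in the updated question.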