Compare values in a column within groups in dplyr: answers

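【Question Title】: Compare value in a column within groups in dplyr
【Posted】: 2017-06-27 22:26:15
【Question Description】:

I want to use dplyr to compare values within a grouped data.frame and create a dummy variable, or something similar, that indicates which one is larger. I can't figure it out!

Here is some reproducible code: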

table <- structure(list(species = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("Adelophryne adiastola", 
"Adelophryne gutturosa"), class = "factor"), scenario = structure(c(3L, 
1L, 2L, 3L, 1L, 2L), .Label = c("future1", "future2", "present"
), class = "factor"), amount = c(5L, 3L, 2L, 50L, 60L, 40L)), .Names = c("species", 
"scenario", "amount"), class = "data.frame", row.names = c(NA, 
-6L))
> table
                species scenario amount
1 Adelophryne adiastola  present      5
2 Adelophryne adiastola  future1      3
3 Adelophryne adiastola  future2      2
4 Adelophryne gutturosa  present     50
5 Adelophryne gutturosa  future1     60
6 Adelophryne gutturosa  future2     40

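I would group the df by species. I want to create a new column, perhaps increase_amount, in which the amount for each "future" scenario is compared with the "present" one. I would get 1 when the value increases and 0 when it decreases.

I have been trying a for loop that goes through each species, but the df contains more than 50,000 species, and it takes too long every time I have to redo the operation...

Does anyone know a way to do this? Thanks a lot!

【Question Discussion】:

Tags: r dplyr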

【Solution 1】:

You can do it like this:

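library(dplyr)

table %>% 
  group_by(species) %>% 
  # tmp holds the group's "present" amount (assumes exactly one "present" row per species)
  mutate(tmp = amount[scenario == "present"]) %>% 
  # 1 if the amount exceeds the present value, otherwise 0
  mutate(increase_amount = ifelse(amount > tmp, 1, 0))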
# Source: local data frame [6 x 5]
# Groups: species [2]
# 
#                 species scenario amount   tmp increase_amount
#                  <fctr>   <fctr>   <int> <int>           <dbl>
# 1 Adelophryne adiastola  present      5     5               0
# 2 Adelophryne adiastola  future1      3     5               0
# 3 Adelophryne adiastola  future2      2     5               0
# 4 Adelophryne gutturosa  present     50    50               0
# 5 Adelophryne gutturosa  future1     60    50               1
# 6 Adelophryne gutturosa  future2     40    50               0
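
As a hedged variant of the approach above, the sketch below marks the "present" rows as NA instead of 0 (a baseline cannot increase relative to itself) and yields NA rather than an error for a species that has no "present" row. The names `tbl`, `baseline`, and `result` are chosen here for illustration and are not part of the original answer; the data are rebuilt from the question's example.

```r
library(dplyr)

# Data from the question, rebuilt under the illustrative name `tbl`
# so that base::table() is not masked.
tbl <- data.frame(
  species  = rep(c("Adelophryne adiastola", "Adelophryne gutturosa"), each = 3),
  scenario = rep(c("present", "future1", "future2"), times = 2),
  amount   = c(5L, 3L, 2L, 50L, 60L, 40L)
)

result <- tbl %>%
  group_by(species) %>%
  mutate(
    # Indexing with [1] returns NA when a group has no "present" row
    baseline = amount[scenario == "present"][1],
    # NA for the baseline row itself, otherwise 1 (increase) or 0 (no increase)
    increase_amount = if_else(scenario == "present",
                              NA_integer_,
                              as.integer(amount > baseline))
  ) %>%
  ungroup()

result
```

Whether the "present" rows should carry 0 or NA depends on how the column is used later; the original answer's 0 is equally valid if those rows are filtered out anyway.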

【Discussion】:

【Solution 2】:

We can do this with ave from base R:

table$increase_amount <-  with(table, as.integer(amount > ave(amount * 
         (scenario == "present"), species, FUN = function(x) x[x!=0])))
table$increase_amount
#[1] 0 0 0 0 1 0

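【Discussion】:

【Solution 3】:

It sounds like you could use lag() to quickly find the differences over time. I suggest restructuring your scenario (time) variable so that R functions can order it intuitively (i.e., arrange() would sort your scenario variable alphabetically as future1, future2, present, which does not work in this case).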

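library(dplyr)

# Example data: 26 species, each with scenarios 1 to 3 coded as integers
df <- data.frame(species=rep(letters,3),
                 scenario=rep(1:3,26),
                 amount=runif(78))
summary(df)
glimpse(df)
df %>% count(species,scenario) # each species/scenario pair appears once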

df %>% 
  arrange(species,scenario) %>% # arrange scenario by ascending order
  group_by(species) %>% 
  mutate(diff1=amount-lag(amount), # calculate difference from time 1 -> 2, and time 2 -> 3
         diff2=amount-lag(amount,2)) # calculate difference from time 1 -> 3

lag() produces NA for the first scenario value in each group, but the result can easily be adjusted with an ifelse() statement or filter().

df %>% 
  arrange(species,scenario) %>% group_by(species) %>% 
  mutate(diff1=amount-lag(amount)) %>% 
  filter(diff1>0)

df %>% 
  arrange(species,scenario) %>% group_by(species) %>% 
  mutate(diff1=amount-lag(amount)) %>% 
  mutate(diff.incr=ifelse(diff1>0,'increase','no increase'))
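
The restructuring suggested above can also be sketched on the question's own data without recoding the scenarios as integers: converting scenario to a factor with explicit levels makes arrange() sort "present" first. This is a minimal sketch under that assumption; the names `tbl`, `diff_prev`, `diff_present`, and `result` are illustrative and not from the original answer.

```r
library(dplyr)

# Question's data, deliberately stored with "present" not first in each group
tbl <- data.frame(
  species  = rep(c("Adelophryne adiastola", "Adelophryne gutturosa"), each = 3),
  scenario = rep(c("future1", "future2", "present"), times = 2),
  amount   = c(3L, 2L, 5L, 60L, 40L, 50L)
)

result <- tbl %>%
  # Explicit levels define the time order; arrange() sorts factors by level
  mutate(scenario = factor(scenario,
                           levels = c("present", "future1", "future2"))) %>%
  arrange(species, scenario) %>%
  group_by(species) %>%
  mutate(
    diff_prev    = amount - lag(amount),   # change from the previous scenario
    diff_present = amount - first(amount)  # change relative to "present"
  ) %>%
  ungroup()

result
```

Here diff_present compares every scenario with the "present" row directly, which matches the comparison asked for in the question, while diff_prev reproduces the step-by-step lag() differences.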

【Discussion】: