How to pass a multivariate vector-valued function (with variable-length output) to aggregate

[Question title]: How to pass a multivariate vector valued function (with variable length output) to aggregate
[Posted]: 2019-07-07 18:35:08
[Question]:

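I have a data frame in R that I want to aggregate. The summary function I want to apply to each subset is a custom function that takes several variables (columns) as input and returns a vector or list of variable length. As output, I want a data frame with one column for the grouping variable and another column containing the output vectors (of variable length).

As a mock example, suppose I have the following data frame: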

df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
 time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
 c("B","C","A","A")), energy = round(runif(12,0,10)))

> df
   particle time state energy
1         X    1     A      9
2         X    2     A      8
3         X    3     B      7
4         X    4     C      5
5         X    5     A      0
6         Y    1     A      1
7         Y    2     B      7
8         Y    3     B      7
9         Z    1     B      3
10        Z    2     C      9
11        Z    3     A      5
12        Z    4     A      6

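For each particle, I want the list of energies it had each time it changed state (starting with its initial state). The output I am looking for looks like this: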

>
   particle      energy
1         X      c(9,7,5,0)
2         Y      c(1,7)
3         Z      c(3,9,5)

To do this, I would define the following function:

myfun <- function(state, energy){
   tempstate <- state[1]
   energyvec <- energy[1]
   for(i in seq_along(state)[-1]){  # unlike 2:length(state), safe for a one-row group
      if(state[i] != tempstate){
         energyvec <- c(energyvec, energy[i])
         tempstate <- state[i]
      }
   }
   return(energyvec)
}
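
As a quick sanity check, the function can be called on the rows of a single particle. This is only a sketch: the energy values are hard-coded from the printed data frame above (runif would otherwise give different numbers on each run), the names x_state, x_energy and res_x are made up for illustration, and the loop is written with seq_along(state)[-1] so that a one-row group does not break it.

```r
# myfun as described above, with a loop bound that also works for one-row groups
myfun <- function(state, energy) {
  tempstate <- state[1]
  energyvec <- energy[1]
  for (i in seq_along(state)[-1]) {
    if (state[i] != tempstate) {
      energyvec <- c(energyvec, energy[i])
      tempstate <- state[i]
    }
  }
  energyvec
}

# Rows of particle X from the example (hard-coded for reproducibility)
x_state  <- c("A", "A", "B", "C", "A")
x_energy <- c(9, 8, 7, 5, 0)

res_x <- myfun(x_state, x_energy)
print(res_x)
#> [1] 9 7 5 0
```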

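and then try to pass it to the aggregation somehow.

The two data structures I have tried for this are data.frame and data.table.

With data.frame, using a custom function that returns a vector seems to give the output format I am looking for: the output column is actually a list, and each row contains a list holding the function's output. However, when aggregating this way, I cannot seem to pass several columns to the function.

With data.table, aggregation is easier when the function takes several variables. However, I cannot seem to get the output I am looking for. Indeed,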

library(data.table)
dt <- data.table(df)
dt[,myfun(state, energy), by= particle]

returns only the first element of energyvec (rather than the vector), and

dt <- data.table(df)
dt[,as.list(myfun(state, energy)), by= particle]

does not work, because the outputs have different lengths.

Is there another way to do this?

Thank you very much for all your help!

[Comments]:

Tags: r dataframe data.table aggregate

[Solution 1]:

Here is a tidyverse approach:

library(tidyverse)

df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
                  time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
                                                   c("B","C","A","A")), energy = round(runif(12,0,10)))

# Hard-code energy to make this reproducible
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)

df %>%
  group_by(particle) %>%
  mutate(
    changed_state = coalesce(state != lag(state, 1), TRUE)
  ) %>%
  filter(changed_state) %>%
  summarise(
    string = toString(energy)
  )
#> # A tibble: 3 x 2
#>   particle string    
#>   <fct>    <chr>     
#> 1 X        9, 7, 5, 0
#> 2 Y        1, 7      
#> 3 Z        3, 9, 5

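I would run each line of the pipeline separately to see what it does. Basically, the changed_state variable is created by checking whether "this" state differs from the previous state, lag(state, 1); coalesce(..., TRUE) turns the NA produced for the first row of each group into TRUE, so the initial state is kept. Because we only care about the rows where a change happens, we filter where this is TRUE (the more verbose line would be filter(changed_state == TRUE)). The toString function collapses the energy rows as desired, and we are already "grouped by" particle.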
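
To make the changed_state step concrete, here is a minimal base R sketch of the same comparison on the rows of one particle. It is not the answer's code: it replaces lag() and coalesce() with plain vector indexing, and the variable kept is a name chosen here for illustration.

```r
# Rows of particle X from the example
state  <- c("A", "A", "B", "C", "A")
energy <- c(9, 8, 7, 5, 0)

# Compare each element with its predecessor. The first row has no predecessor,
# so it is marked TRUE by hand (the role coalesce(..., TRUE) plays in the pipeline).
changed_state <- c(TRUE, state[-1] != state[-length(state)])
print(changed_state)
#> [1]  TRUE FALSE  TRUE  TRUE  TRUE

# Keep only the energies recorded at a change of state
kept <- energy[changed_state]
print(kept)
#> [1] 9 7 5 0
```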

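[Comments]:

I "borrowed" your sample data for my answer ;-)
Thank you very much! That works. I just changed the toString function to list(), because it makes the result easier to manipulate later (in particular, the length is the number of state changes). It is worth noting that the other solutions below, which use data.table, also work well.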
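
The list() variant mentioned in this comment might look like the sketch below. The exact code the commenter used is not shown in the thread, so this is an assumption: it folds the mutate step into filter and stores the energies in a list column named energy.

```r
library(dplyr)

df <- data.frame(
  particle = c(rep("X", 5), rep("Y", 3), rep("Z", 4)),
  state    = c("A", "A", "B", "C", "A", "A", "B", "B", "B", "C", "A", "A"),
  energy   = c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
)

res <- df %>%
  group_by(particle) %>%
  # keep the first row of each group and every row where the state differs
  # from the previous row
  filter(coalesce(state != lag(state, 1), TRUE)) %>%
  # wrap the surviving energies in a list: one variable-length vector per particle
  summarise(energy = list(energy))

# Number of energies kept per particle (the initial state counts as one entry)
print(lengths(res$energy))
#> [1] 4 2 3
```

Each element of res$energy is an ordinary numeric vector, so it can be processed later with lapply or sapply.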

[Solution 2]:

A data.table approach

Sample data

#stolen from JasonAizkalns's answer
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
                  time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
                                                   c("B","C","A","A")), energy = round(runif(12,0,10)))

df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)

Code

library( data.table )
#create data.table
dt <- as.data.table(df)

#use `uniqlist` to get rownumbers where the value of `state` changes, 
# then get these rows into a subset
result <- dt[ data.table:::uniqlist(dt[, c("particle", "state")]), ]

#split the resulting `energy`-column by the contents of the `particle`-column
l <- split( result$energy, result$particle)
# $X
# [1] 9 7 5 0
# 
# $Y
# [1] 1 7
# 
# $Z
# [1] 3 9 5

#create final output
data.table( particle = names(l), energy = l )
#    particle  energy
# 1:        X 9,7,5,0
# 2:        Y     1,7
# 3:        Z   3,9,5
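
Since data.table:::uniqlist is an internal (unexported) function, a variant built only on exported functions may be preferable. The sketch below is an assumption, not part of the answer: it uses rleid() to number consecutive runs of identical (particle, state) pairs and keeps the first row of each run, which should select the same rows for this example.

```r
library(data.table)

dt <- data.table(
  particle = c(rep("X", 5), rep("Y", 3), rep("Z", 4)),
  state    = c("A", "A", "B", "C", "A", "A", "B", "B", "B", "C", "A", "A"),
  energy   = c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
)

# rleid() assigns one id per run of consecutive identical (particle, state) pairs;
# !duplicated() keeps the first row of each run, i.e. the rows where the state changes
result <- dt[!duplicated(rleid(particle, state))]

# Collect the energies of each particle into a list column
out <- result[, .(energy = list(energy)), by = particle]
print(out)
```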

[Comments]:

[Solution 3]:

Another possible data.table approach (DF is defined in the data section below):

library(data.table)
setDT(DF)[, .(energy=.(.SD[, first(energy), by=.(rleid(state))]$V1)), by=.(particle)]

Output:

   particle  energy
1:        X 9,4,6,9
2:        Y     2,9
3:        Z   7,6,1

Data:

set.seed(0L)
DF <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
    time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
        c("B","C","A","A")), energy = round(runif(12,0,10)))
DF
#    particle time state energy
# 1         X    1     A      9
# 2         X    2     A      3
# 3         X    3     B      4
# 4         X    4     C      6
# 5         X    5     A      9
# 6         Y    1     A      2
# 7         Y    2     B      9
# 8         Y    3     B      9
# 9         Z    1     B      7
# 10        Z    2     C      6
# 11        Z    3     A      1
# 12        Z    4     A      2
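
The same list-wrapping idea also lets the question's own myfun be passed directly to data.table. The sketch below assumes the energy values printed above (hard-coded rather than regenerated with set.seed) and writes myfun with seq_along(state)[-1] so one-row groups are handled. Wrapping the call in list() inside .() makes data.table store one variable-length vector per group instead of expanding it into rows.

```r
library(data.table)

# The question's function, taking two columns and returning a variable-length vector
myfun <- function(state, energy) {
  tempstate <- state[1]
  energyvec <- energy[1]
  for (i in seq_along(state)[-1]) {
    if (state[i] != tempstate) {
      energyvec <- c(energyvec, energy[i])
      tempstate <- state[i]
    }
  }
  energyvec
}

DF <- data.frame(
  particle = c(rep("X", 5), rep("Y", 3), rep("Z", 4)),
  time     = c(1:5, 1:3, 1:4),
  state    = c("A", "A", "B", "C", "A", "A", "B", "B", "B", "C", "A", "A"),
  energy   = c(9, 3, 4, 6, 9, 2, 9, 9, 7, 6, 1, 2)
)

# list(...) keeps the whole vector as a single cell of a list column
res <- setDT(DF)[, .(energy = list(myfun(state, energy))), by = particle]
print(res)
```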

[Comments]: