Aggregate function by time and by group: answers

【Question title】: aggregate function by time and by group
【Posted】: 2016-09-11 14:54:31
【Question description】:

I am trying to build a stacked bar chart of flight time per year and per type. The head of my data frame mat looks like this:

head(mat)

  year flights.type flights.duration
1 2000         HR20         01:12:00
2 2000         HR20         02:00:00
3 2000           L4         00:54:00
4 2000           L4         00:42:00
5 2000           L4         00:22:00
6 2000         HR20         00:24:00

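I want to sum flights.duration by year and type, and then build a stacked bar chart.

I tried the aggregate function, but it does not work properly with times. Can anyone help me? My attempt at the sum by year and type looks like this: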

aggregate(mat$flights.duration,format(.POSIXct(mat$flights.duration,tz="GMT"), "%H:%M:%S"),FUN=sum, by=list(mat$year))

【Comments】:

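One of your problems is that you are not breaking "01:12:00" and similar values into proper time components. Two approaches I have used: when every duration is under 24 hours, attach a date and use the POSIX functions to take the difference from midnight; or split the variable apart and do the calculation yourself. The time series packages may offer a cleaner way.
Thank you all for the great comments and support :)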
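
The first approach from the comment above (attach a dummy date, then measure the offset from midnight) can be sketched in base R as follows. The data frame re-creates the six rows shown in the question; the dummy date 1970-01-01 and the column name duration.mins are assumptions made for illustration, and the approach only holds while every duration is under 24 hours.

```r
# Re-create the six rows from the question (durations as "HH:MM:SS" strings)
mat <- data.frame(
  year = rep(2000L, 6),
  flights.type = c("HR20", "HR20", "L4", "L4", "L4", "HR20"),
  flights.duration = c("01:12:00", "02:00:00", "00:54:00",
                       "00:42:00", "00:22:00", "00:24:00"),
  stringsAsFactors = FALSE
)

# Attach an arbitrary dummy date so the strings parse as date-times (UTC avoids DST gaps)
stamp <- as.POSIXct(paste("1970-01-01", mat$flights.duration), tz = "UTC")
midnight <- as.POSIXct("1970-01-01 00:00:00", tz = "UTC")

# Offset from midnight, stored as plain numeric minutes
mat$duration.mins <- as.numeric(difftime(stamp, midnight, units = "mins"))

# Sum the numeric minutes by year and type
totals <- aggregate(duration.mins ~ year + flights.type, data = mat, FUN = sum)
print(totals)
```

Because the column is an ordinary numeric vector by the time aggregate() sees it, none of the class-handling problems discussed in the answers arise.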

Tags: r time aggregate

【Solution 1】:

A solution using the data.table package and the as.difftime() function:

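library(data.table)
# Parse "HH:MM:SS" strings as difftime values; fixing units = "mins" keeps the
# result in minutes even when every duration in a group exceeds one hour
setDT(mat)[, .(flights.duration.minutes = sum(as.difftime(as.character(flights.duration),
                                                          format = "%H:%M:%S", units = "mins"))),
           by = .(year, flights.type)]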

   year flights.type flights.duration.minutes
1: 2000         HR20                 216 mins
2: 2000           L4                 118 mins


【Solution 2】:

You can convert the flights.duration column to a numeric value in minutes as follows:

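# Convert each "HH:MM:SS" string to minutes; seconds count as 1/60 of a minute
df$flights.duration <- sapply(strsplit(as.character(df$flights.duration), ':'),
                              function(parts) sum(as.numeric(parts) * c(60, 1, 1/60)))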

Then use a grouping function, for example the one from the dplyr package, as follows:

library(dplyr)
df %>% group_by(year, flights.type) %>% summarise(flights.duration = sum(flights.duration))

The output is:

Source: local data frame [2 x 3]
Groups: year [?]

   year flights.type flights.duration
  <int>        <chr>            <dbl>
1  2000         HR20              216
2  2000           L4              118

Edit: adding another option. Using separate() from the tidyr package may be faster than the row-by-row conversion above:

library(tidyr)
library(dplyr)
df %>%
  separate(flights.duration, c('hours', 'mins', 'seconds'), ':') %>%
  group_by(year, flights.type) %>%
  summarise(flights.duration = sum(60 * as.numeric(hours) +
                                   as.numeric(mins) +
                                   as.numeric(seconds) / 60))

The result is the same as before:

Source: local data frame [2 x 3]
Groups: year [?]

   year flights.type flights.duration
  <int>        <chr>            <dbl>
1  2000         HR20              216
2  2000           L4              118
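
Neither answer so far reaches the stacked bar chart the question asks for. A minimal sketch with base R's barplot() follows, assuming the summarised minutes shown above; the data frame name totals, the matrix name m, and the axis labels are illustrative choices, not part of the original answers.

```r
# Summarised minutes per year and type, copied from the output above
totals <- data.frame(
  year = c(2000L, 2000L),
  flights.type = c("HR20", "L4"),
  flights.duration = c(216, 118)
)

# Reshape to a type-by-year table: rows become stacked segments, columns become bars
m <- xtabs(flights.duration ~ flights.type + year, data = totals)

# barplot() stacks the rows of a matrix by default (beside = FALSE)
barplot(m, legend.text = rownames(m),
        xlab = "Year", ylab = "Total flight time (minutes)")
```

With more years in the data, each year becomes its own bar and each flight type one segment within it.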


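【Solution 3】:

The lubridate package is widely regarded as the best date/time package available for R. It builds on R's Date and POSIXct types and adds its own Interval, Duration, and Period types.

The data type best suited to pure hh:mm:ss times is Period. In theory, it should be possible to parse your string times into Period values and then group them directly using sum() with aggregate().

Unfortunately, this is much harder than one would hope. I eventually got it working, sort of, but it required some contortions.

First, here is how to parse the string times into Period values. lubridate provides a convenient hms() function for this: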

df <- data.frame(year=c(2000L,2000L,2000L,2000L,2000L,2000L),flights.type=c('HR20','HR20','L4','L4','L4','HR20'),flights.duration=c('01:12:00','02:00:00','00:54:00','00:42:00','00:22:00','00:24:00'),stringsAsFactors=F);

library(lubridate);
df$flights.duration <- hms(df$flights.duration);

df;
##   year flights.type flights.duration
## 1 2000         HR20        1H 12M 0S
## 2 2000         HR20         2H 0M 0S
## 3 2000           L4           54M 0S
## 4 2000           L4           42M 0S
## 5 2000           L4           22M 0S
## 6 2000         HR20           24M 0S

Second, unfortunately, lubridate does not seem to provide a sum() method for the Period type:

sum(df$flights.duration);
## [1] 0

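(In case you are wondering why it returns zero: the Period type is implemented by storing the seconds field as the payload of the vector, of type double, while the remaining fields (minute, hour, day, month, year) are stored as slots, also of type double. Every value in df$flights.duration has zero seconds, and the base sum() function sees only the vector payload, so the sum is zero.)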
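
The payload-versus-slot behaviour described above can be reproduced without lubridate. The sketch below uses a hypothetical toy class, ToyPeriod, that only mimics the layout (a numeric payload standing in for seconds plus a minute slot); it is not lubridate's actual class definition.

```r
library(methods)

# Hypothetical toy class (not lubridate's real definition): the numeric payload
# plays the role of the seconds field, and minutes live in a separate slot
setClass("ToyPeriod", contains = "numeric", slots = c(minute = "numeric"))

# Three values with zero seconds and 12, 54 and 42 minutes
p <- new("ToyPeriod", c(0, 0, 0), minute = c(12, 54, 42))

# No sum() method exists for ToyPeriod, so base sum() adds up only the payload
payload.total <- sum(p)

# The slot has to be summed explicitly
minute.total <- sum(p@minute)

print(payload.total)
print(minute.total)
```

The minutes are never touched by the plain sum() call, which is the same reason the Period column above sums to zero.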

I tried to fill this gap myself with an S3 method, but quickly found that it does not work, because Period is an S4 type. So I wrote this S4 method:

setMethod('sum',signature(x='Period',na.rm='logical'),function(x,na.rm=FALSE) period(seconds=sum(as.double(x),na.rm=na.rm),minutes=sum(x@minute,na.rm=na.rm),hours=sum(x@hour,na.rm=na.rm),days=sum(x@day,na.rm=na.rm),months=sum(x@month,na.rm=na.rm),years=sum(x@year,na.rm=na.rm)));
## [1] "sum"

sum(df$flights.duration);
## [1] "3H 154M 0S"

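Unfortunately, there is one more problem: by default aggregate() tries to simplify the aggregation result, which flattens the S4 result into a non-S4 object, losing the slots and corrupting the data: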

res <- aggregate(flights.duration~year+flights.type,df,sum);
res;
## Error in paste(x@year, "y ", x@month, "m ", x@day, "d ", x@hour, "H ",  :
##   trying to get slot "year" from an object (class "Period") that is not an S4 object
traceback();
## 8: paste(x@year, "y ", x@month, "m ", x@day, "d ", x@hour, "H ",
##        x@minute, "M ", x@.Data, "S", sep = "")
## 7: format.Period(x[[i]], ..., justify = justify)
## 6: format(x[[i]], ..., justify = justify)
## 5: format.data.frame(x, digits = digits, na.encode = FALSE)
## 4: as.matrix(format.data.frame(x, digits = digits, na.encode = FALSE))
## 3: print.data.frame(list(year = c(2000L, 2000L), flights.type = c("HR20",
##    "L4"), flights.duration = c(0, 0)))
## 2: print(list(year = c(2000L, 2000L), flights.type = c("HR20", "L4"
##    ), flights.duration = c(0, 0)))
## 1: print(list(year = c(2000L, 2000L), flights.type = c("HR20", "L4"
##    ), flights.duration = c(0, 0)))
res$flights.duration;
## [1] 0 0
## attr(,"class")
## [1] "Period"
## attr(,"class")attr(,"package")
## [1] "lubridate"
isS4(res$flights.duration);
## [1] FALSE

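As you can see, the aggregate() call succeeds, but the object is corrupted. The print.data.frame() method fails on that column because it calls format() on it, which dispatches to the S3 method format.Period(), a private method in the lubridate namespace. That method fails on the corrupted object.

We can prevent the simplification: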

res <- aggregate(flights.duration~year+flights.type,df,sum,simplify=F);
res;
##   year flights.type flights.duration
## 1 2000         HR20                0
## 2 2000           L4                0
res$flights.duration;
## $`1`
## [1] "3H 36M 0S"
##
## $`4`
## [1] "118M 0S"
##

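So technically it works, but the column is now a list, which is not ideal. It also no longer displays well: when printed as part of the data.frame, we just see a zero.

We can fix this by manually transforming the column to combine the list components. Unfortunately, the obvious approaches of unlist() or do.call(c,...) do not work: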

res <- transform(aggregate(flights.duration~year+flights.type,df,sum,simplify=F),flights.duration=do.call(c,flights.duration));
res;
##   year flights.type flights.duration
## 1 2000         HR20                0
## 2 2000           L4                0
res$flights.duration;
## [1] 0 0
isS4(res$flights.duration);
## [1] FALSE

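The list of Period values is flattened to a plain vector, similar to the simplification performed by aggregate().

The problem appears to lie with the list names, which prevent the c() call from behaving as expected. We can work around this with unname(). So here is the final solution: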

res <- transform(aggregate(flights.duration~year+flights.type,df,sum,simplify=F),flights.duration=do.call(c,unname(flights.duration)));
res;
##   year flights.type flights.duration
## 1 2000         HR20        3H 36M 0S
## 2 2000           L4          118M 0S

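So although we got there in the end, I do not recommend this solution. There is too much complexity, too many feature gaps, and too many uncoordinated interactions between the different factions of the R ecosystem.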
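
If lubridate is still wanted for parsing, one way to sidestep the S4 problems above is to reduce each Period to plain seconds before aggregating and convert back only for display. This is a sketch, assuming lubridate's period_to_seconds() and seconds_to_period() behave as documented; the column name secs is an illustrative choice and not part of the original answer.

```r
library(lubridate)

# Same six rows as in the answer above
df <- data.frame(
  year = rep(2000L, 6),
  flights.type = c("HR20", "HR20", "L4", "L4", "L4", "HR20"),
  flights.duration = c("01:12:00", "02:00:00", "00:54:00",
                       "00:42:00", "00:22:00", "00:24:00"),
  stringsAsFactors = FALSE
)

# Parse with hms(), then immediately reduce each Period to numeric seconds
df$secs <- period_to_seconds(hms(df$flights.duration))

# aggregate() now works on an ordinary numeric column, so nothing is flattened
res <- aggregate(secs ~ year + flights.type, data = df, FUN = sum)

# Convert back to a Period only for display
res$flights.duration <- seconds_to_period(res$secs)
print(res)
```

Keeping the arithmetic in a numeric column means no custom sum() method, simplify = F, or unname() workaround is needed.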
