【Question title】: Hourly mean of multiple variables in R data.frame?
【Posted】: 2020-12-10 01:04:57
【Question】:

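I have the following code and am trying to find the hourly mean of each of the variables (i.e., X, Y, and Z). My output should be a data.frame with an hourlyDate column and the mean hourly data for all variables. Any way forward would be appreciated.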

library(lubridate)

set.seed(123)

T <- data.frame(Datetime = seq(ymd_hms("2011-01-01 00:00:00"), to= ymd_hms("2011-12-31 00:00:00"), by = "5 min"),
                X = runif(104833, 5,10),Y = runif(104833, 5,10), Z = runif(104833, 5,10))
T$Date <- format(T$Datetime, format="%Y-%m-%d")
T$Hour <- format(T$Datetime, format = "%H")
T$Mints <- format(T$Datetime, format = "%M")

【Comments】:

    Tags: r dataframe aggregate mean lubridate


    【Solution 1】:

    Try:

    library(lubridate)
    library(dplyr)
    
    set.seed(123)
    
    T <- data.frame(Datetime = seq(ymd_hms("2011-01-01 00:00:00"), to= ymd_hms("2011-12-31 00:00:00"), by = "5 min"),
                    X = runif(104833, 5,10),Y = runif(104833, 5,10), Z = runif(104833, 5,10))
    
    
    
    T %>% mutate(hourlyDate = floor_date(Datetime,unit='hour')) %>%
          select(-Datetime) %>% group_by(hourlyDate) %>% 
          summarize(across(everything(),mean)) %>%
          ungroup()
    #> `summarise()` ungrouping output (override with `.groups` argument)
    #> # A tibble: 8,737 x 4
    #>    hourlyDate              X     Y     Z
    #>    <dttm>              <dbl> <dbl> <dbl>
    #>  1 2011-01-01 00:00:00  8.00  7.90  6.90
    #>  2 2011-01-01 01:00:00  7.93  7.47  7.90
    #>  3 2011-01-01 02:00:00  7.83  6.89  7.67
    #>  4 2011-01-01 03:00:00  6.61  7.92  7.18
    #>  5 2011-01-01 04:00:00  7.27  7.20  6.48
    #>  6 2011-01-01 05:00:00  7.88  6.80  7.69
    #>  7 2011-01-01 06:00:00  7.07  8.05  7.52
    #>  8 2011-01-01 07:00:00  7.40  7.92  6.99
    #>  9 2011-01-01 08:00:00  7.97  7.76  7.26
    #> 10 2011-01-01 09:00:00  7.57  7.47  6.94
    #> # ... with 8,727 more rows
    

    Created on 2020-08-20 by the reprex package (v0.3.0)

    【Comments】:

      【Solution 2】:

      Here is a tidyverse approach:

      library(dplyr)
      
      group_by(T, Date, Hour) %>% 
        summarize(X = mean(X), Y = mean(Y), Z = mean(Z)) %>%
        transmute(Date = as.POSIXct(paste0(Date, " ", Hour, ":00:00")), X, Y, Z)
      
      #> # A tibble: 8,737 x 4
      #> # Groups:   Date [8,714]
      #>    Date                    X     Y     Z
      #>    <dttm>              <dbl> <dbl> <dbl>
      #>  1 2011-01-01 00:00:00  8.00  7.90  6.90
      #>  2 2011-01-01 01:00:00  7.93  7.47  7.90
      #>  3 2011-01-01 02:00:00  7.83  6.89  7.67
      #>  4 2011-01-01 03:00:00  6.61  7.92  7.18
      #>  5 2011-01-01 04:00:00  7.27  7.20  6.48
      #>  6 2011-01-01 05:00:00  7.88  6.80  7.69
      #>  7 2011-01-01 06:00:00  7.07  8.05  7.52
      #>  8 2011-01-01 07:00:00  7.40  7.92  6.99
      #>  9 2011-01-01 08:00:00  7.97  7.76  7.26
      #> 10 2011-01-01 09:00:00  7.57  7.47  6.94
      #> # ... with 8,727 more rows
      

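      【Comments】:

        【Solution 3】:

        lubridate has a floor_date function that rounds your datetime column down to the specified unit.

        Then simply summarise the variables you want by the hourly timestamp: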

        library(dplyr)
        library(lubridate)
        
        T %>%
            group_by(hourlyDate = lubridate::floor_date(Datetime, unit = 'hours')) %>%
            summarise(across(.cols = c(X,Y,Z), .fns = ~mean(.x, na.rm=TRUE), .names = "meanHourlyData_{.col}"))
        

        As an aside, I'd recommend not using T as a variable name, since it is also shorthand for TRUE and may lead to some unexpected behaviour...
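To make the warning about `T` concrete, here is a minimal base R sketch. The intermediate names (`shorthand_before`, `shorthand_after`, `restored`) are made up for this illustration and are not part of the original answer.

```r
# In a fresh session, T is an ordinary variable defined in base R as TRUE.
shorthand_before <- isTRUE(T)   # TRUE

# Assigning to T masks base::T in the current environment.
T <- data.frame(x = 1:3)
shorthand_after <- isTRUE(T)    # FALSE: T is now a data.frame

# TRUE itself is a reserved word and cannot be reassigned,
# so spelling it out is always safe.
keyword_ok <- isTRUE(TRUE)      # TRUE

# Removing the user-defined T makes the base definition visible again.
rm(T)
restored <- isTRUE(T)           # TRUE
```

Any code in the same environment that writes `T` where it means `TRUE` (for example `na.rm = T`) would pick up the data.frame instead, which is why a name such as `dat` is a safer choice.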

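        【Comments】:

        • This solution gives the error `Error: Problem with `summarise()` input `..1`. x glue cannot interpolate functions into strings. * object '.col' is a function. i Input `..1` is `across(.cols = c(X, Y, Z), .fns = ~mean(.x, na.rm = TRUE), .names = "meanHourlyData_{.col}")`. i The error occurred in group 1: hourlyDate = 2011-01-01. Run `rlang::last_error()` to see where the error occurred.`
        • Perhaps you are using an older version of dplyr. Does summarise_at(.vars = vars(X,Y,Z), .funs = ~mean(.x, na.rm=TRUE)) work?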
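The `summarise_at()` fallback suggested in the comment can be sketched as a full pipeline. This is only an illustration on a small made-up data frame (`dat`, two hours of 5-minute readings with one missing value), not the asker's data.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: 24 readings at 5-minute steps, i.e. two full hours.
dat <- data.frame(
  Datetime = seq(ymd_hms("2011-01-01 00:00:00"), by = "5 min", length.out = 24),
  X = 1:24,
  Y = c(NA, 2:24),   # one missing value to exercise na.rm = TRUE
  Z = 24:1
)

# Group by the hour each timestamp falls in, then average X, Y and Z.
hourly <- dat %>%
  group_by(hourlyDate = floor_date(Datetime, unit = "hour")) %>%
  summarise_at(.vars = vars(X, Y, Z), .funs = ~mean(.x, na.rm = TRUE))

hourly
```

Unlike the `across()` call above, this version keeps the original column names X, Y and Z, because no naming template is supplied.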
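
        【Solution 4】:

        Three base R solutions are to use split, tapply, or rowsum combined with table. The latter is particularly fast (roughly 8.5 times faster than the dplyr answer in the timings below).

        The tl;dr is that you get the following computation times: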

        #R> Unit: milliseconds
        #R>            expr   min    lq  mean median    uq   max neval
        #R>  split + sapply 563.9 577.4 636.1  649.8 680.7 697.1    10
        #R> tapply + sapply 108.0 117.3 134.0  120.2 124.4 205.1    10
        #R>  rowsum + table  21.3  21.3  21.5   21.3  21.6  21.9    10
        #R>           dplyr 172.4 176.6 182.3  180.9 185.9 203.4    10
        

        Here are the solutions:

        # create date-hour column
        T$DateH <-  format(T$Datetime, format="%Y-%m-%d-%H")
        
        # using split + sapply
        options(digits = 3)
        out_1 <- sapply(split(T[, c("X", "Y", "Z")], T$DateH), colMeans) 
        head(t(out_1), 5)
        #R>                  X    Y    Z
        #R> 2011-01-01-00 8.00 7.90 6.90
        #R> 2011-01-01-01 7.93 7.47 7.90
        #R> 2011-01-01-02 7.83 6.89 7.67
        #R> 2011-01-01-03 6.61 7.92 7.18
        #R> 2011-01-01-04 7.27 7.20 6.48
        
        # using tapply + sapply
        out_2 <- sapply(c("X", "Y", "Z"), 
                        function(var) c(tapply(T[[var]], T$DateH, mean)))
        head(out_2)
        #R>                  X    Y    Z
        #R> 2011-01-01-00 8.00 7.90 6.90
        #R> 2011-01-01-01 7.93 7.47 7.90
        #R> 2011-01-01-02 7.83 6.89 7.67
        #R> 2011-01-01-03 6.61 7.92 7.18
        #R> 2011-01-01-04 7.27 7.20 6.48
        
        # check that we get the same
        all.equal(t(out_1), out_2, check.attributes = FALSE)
        #R> [1] TRUE
        
        # with rowsum + table
        out_3 <- as.matrix(rowsum(T[, c("X", "Y", "Z")], group = T$DateH)) / 
          rep(table(T$DateH), 3)
        
        # check that we get the same
        all.equal(out_2, out_3)
        #R> [1] TRUE
        
        # compare with dplyr solution
        # (stored under a new name so the rowsum result in out_3 is not overwritten)
        library(dplyr)
        out_4 <- group_by(T, Date, Hour) %>% 
          summarize(X = mean(X), Y = mean(Y), Z = mean(Z)) %>%
          transmute(Date = as.POSIXct(paste0(Date, " ", Hour, ":00:00")), X, Y, Z)
        
        
        # check that we get the same
        all.equal(out_2, as.matrix(out_4[, c("X", "Y", "Z")]),
                  check.attributes = FALSE)
        #R> [1] TRUE
        
        # check computation time
        library(microbenchmark)
        microbenchmark(
          `split + sapply` = 
            sapply(split(T[, c("X", "Y", "Z")], T$DateH), colMeans), 
          `tapply + sapply` = 
            sapply(c("X", "Y", "Z"), 
                   function(var) c(tapply(T[[var]], T$DateH, mean))), 
          `rowsum + table` = 
            as.matrix(rowsum(T[, c("X", "Y", "Z")], group = T$DateH)) / 
            rep(table(T$DateH), 3),
          `dplyr` = 
            group_by(T, Date, Hour) %>% 
            summarize(X = mean(X), Y = mean(Y), Z = mean(Z)) %>%
            transmute(Date = as.POSIXct(paste0(Date, " ", Hour, ":00:00")), 
                      X, Y, Z), times = 10)
        #R> Unit: milliseconds
        #R>            expr   min    lq  mean median    uq   max neval
        #R>  split + sapply 563.9 577.4 636.1  649.8 680.7 697.1    10
        #R> tapply + sapply 108.0 117.3 134.0  120.2 124.4 205.1    10
        #R>  rowsum + table  21.3  21.3  21.5   21.3  21.6  21.9    10
        #R>           dplyr 172.4 176.6 182.3  180.9 185.9 203.4    10
        

        I suspect you can also get the result quickly with data.table. Finally, do not use T as a variable name. T is shorthand for TRUE!
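The answer mentions data.table without showing code. One possible version is sketched below; it was not part of the original answer and is not benchmarked here. It rebuilds the question's data under the name `dat` (instead of `T`) and reuses lubridate's `floor_date` for the hourly key.

```r
library(data.table)
library(lubridate)

set.seed(123)

# Same simulated data as in the question, stored as a data.table named dat.
dat <- data.table(
  Datetime = seq(ymd_hms("2011-01-01 00:00:00"),
                 to = ymd_hms("2011-12-31 00:00:00"), by = "5 min"),
  X = runif(104833, 5, 10),
  Y = runif(104833, 5, 10),
  Z = runif(104833, 5, 10)
)

# Group by the hour each timestamp falls in and average the chosen columns.
hourly <- dat[,
  lapply(.SD, mean),
  by = .(hourlyDate = floor_date(Datetime, unit = "hour")),
  .SDcols = c("X", "Y", "Z")
]

head(hourly)
```

Here `.SD` is the per-group subset restricted to the columns named in `.SDcols`, so further variables can be added by extending that vector.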

        【Comments】:
