【Question title】: How to calculate mean and standard deviation in a dataframe/matrix grouped by column name
【Posted】: 2021-11-02 00:50:16
【Question description】:

Example data:

  sun sun  sun sky sky
1 1.0 2.0  1.1 4.0 9.8
2 3.7 1.0  1.0 3.3  NA
3 1.5 0.4  2.1 3.3 6.0
4 3.7  NA  3.6 3.1 5.6
5 2.9 1.1 10.0 7.1 7.7
6 7.0 4.9  6.9 5.4 4.9

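I want to calculate the mean and standard deviation for each unique column name (ignoring NAs), pooling all values that share a name, to get output like this:

    mean        sd
sun 3.170588235 2.677630647
sky 5.472727273 2.102422845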

Reproducible data:

df <- data.frame(c(1, 3.7, 1.5, 3.7, 2.9, 7),
                 c(2, 1, 0.4, NA, 1.1, 4.9),
                 c(1.1, 1, 2.1, 3.6, 10, 6.9),
                 c(4, 3.3, 3.3, 3.1, 7.1, 5.4),
                 c(9.8, NA, 6, 5.6, 7.7, 4.9))
names(df) <- c("sun", "sun", "sun", "sky", "sky")

The closest I have got is

#for mean
sapply(split.default(df, names(df)), rowMeans, na.rm = TRUE) 

#for sd
sapply(split.default(df, names(df)), function(x) apply(x, 1, sd, na.rm=TRUE))

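which I got from this post, but I can't work out how to adapt it to get what I want. I know I could take the mean of the row means to get the mean of each group, but that doesn't work for the standard deviation.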
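As a side check on the row-means idea, here is a minimal sketch using only the three "sun" columns from the example data. When NAs are present, rows contribute different numbers of values, so the mean of the row means is generally not the same as the mean over all pooled values. The names `sun`, `row_means` and `pooled` are illustrative only.

```r
# The three "sun" columns from the example data, as a matrix
sun <- cbind(c(1, 3.7, 1.5, 3.7, 2.9, 7),
             c(2, 1, 0.4, NA, 1.1, 4.9),
             c(1.1, 1, 2.1, 3.6, 10, 6.9))

# Mean of the per-row means: every row gets equal weight,
# even row 4, which has only two non-missing values
row_means <- mean(rowMeans(sun, na.rm = TRUE))

# Mean over all 17 non-missing values pooled together
pooled <- mean(sun, na.rm = TRUE)

c(row_means = row_means, pooled = pooled)
```

Here `pooled` reproduces the desired 3.170588, while `row_means` comes out slightly higher (about 3.197), so pooling the values first seems the safer route for both the mean and the standard deviation.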

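【Comments】:

    Tags: r dataframe mean standard-deviation


    【Solution 1】:

    We can split the columns by name, pool each group's values with unlist(), and then compute both statistics: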

    t(sapply(split.default(df, names(df)), function(x)  {
        x1 <- unlist(x)
        data.frame(mean = mean(x1, na.rm = TRUE), sd = sd(x1, na.rm = TRUE))}))
    

    Output

           mean     sd      
    sky 5.472727 2.102423
    sun 3.170588 2.677631
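Because each call above returns a one-row data.frame, `sapply()` ends up building a matrix whose cells are lists rather than plain numbers. If a numeric matrix is preferred, one possible variant (a sketch, not part of the original answer) returns a named numeric vector per group and uses `vapply()` to enforce that shape. The name `res` is illustrative.

```r
df <- data.frame(c(1, 3.7, 1.5, 3.7, 2.9, 7),
                 c(2, 1, 0.4, NA, 1.1, 4.9),
                 c(1.1, 1, 2.1, 3.6, 10, 6.9),
                 c(4, 3.3, 3.3, 3.1, 7.1, 5.4),
                 c(9.8, NA, 6, 5.6, 7.7, 4.9))
names(df) <- c("sun", "sun", "sun", "sky", "sky")

# Split the columns by name, pool each group's values, and return
# a named numeric vector; vapply() checks that every result has
# exactly this shape (two numbers named mean and sd)
res <- t(vapply(split.default(df, names(df)), function(x) {
  x1 <- unlist(x)
  c(mean = mean(x1, na.rm = TRUE), sd = sd(x1, na.rm = TRUE))
}, c(mean = 0, sd = 0)))

res
```

The result has one row per column name and can be indexed directly, for example `res["sun", "mean"]`.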
    

    Or using data.table

    library(data.table)
    melt(setDT(df), measure = patterns("^sun", "^sky"), 
      value.name = c("sun", "sky"))[, c(list(categ = c("mean", "sd")), 
        lapply(.SD, function(x) c(mean = mean(x, na.rm = TRUE), 
         sd = sd(x, na.rm = TRUE)))), .SDcols = sun:sky]
       categ      sun      sky
    1:  mean 3.170588 5.472727
    2:    sd 2.677631 2.102423
    

    【Comments】:

      【Solution 2】:

      Here is a tidyverse solution

      library(tidyverse)
      

      Sample data

      df <- data.frame(c(1, 3.7, 1.5, 3.7, 2.9, 7),
                       c(2, 1, 0.4, NA, 1.1, 4.9),
                       c(1.1, 1, 2.1, 3.6, 10, 6.9),
                       c(4, 3.3, 3.3, 3.1, 7.1, 5.4),
                       c(9.8, NA, 6, 5.6, 7.7, 4.9))
      names(df) <- c("sun", "sun", "sun", "sky", "sky")
      

      Code

      df %>%
        # Pivot to long format: one row per value, with the column name in `name`
        pivot_longer(cols = everything()) %>%
        # Group by sun/sky
        group_by(name) %>% 
        # Calculate mean and sd within each group, ignoring NAs
        summarise(
          mean = mean(value, na.rm = TRUE),
          sd = sd(value, na.rm = TRUE)
        )
      

      Output

        name   mean    sd
        <chr> <dbl> <dbl>
      1 sky    5.47  2.10
      2 sun    3.17  2.68
      

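      【Comments】:

        【Solution 3】:

        Here is another approach in dplyr: gather the values of the similarly named columns into one column per name, then compute mean and sd for each of them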

        library(dplyr)
        library(tidyr)
        
        df %>%
          pivot_longer(cols = everything(), 
                       names_to = '.value') %>%
          summarise(across(.cols = everything(), .fns = list(mean = ~mean(., na.rm = TRUE), 
                                      sd = ~sd(., na.rm = TRUE))))
        
        #  sun_mean sun_sd sky_mean sky_sd
        #     <dbl>  <dbl>    <dbl>  <dbl>
        #1     3.17   2.68     5.47   2.10
        

        If you want the mean and sd values in separate columns, you can append this to the answer above:

         %>% pivot_longer(cols = everything(), names_to = c('col', '.value'), names_sep = '_')
        
        #  col    mean    sd
        #  <chr> <dbl> <dbl>
        #1 sun    3.17  2.68
        #2 sky    5.47  2.10
        

        【Comments】:

          【Solution 4】:

          You can use the following solution:

          t(as.data.frame(split.default(df, names(df)) |>
            sapply(\(x) {unlist(data.frame(mean = mean(unlist(x), na.rm = TRUE),
                                    sd = sd(unlist(x), na.rm = TRUE)))}))) |> 
            as.data.frame()
          
                  mean       sd
          sky 5.472727 2.102423
          sun 3.170588 2.677631
          

          【Comments】:

            【Solution 5】:
            df = data.frame(c(1, 3.7, 1.5, 3.7, 2.9, 7),
            c(2, 1, 0.4, NA, 1.1, 4.9),
            c(1.1, 1, 2.1, 3.6, 10, 6.9),
            c(4, 3.3, 3.3, 3.1, 7.1, 5.4),
            c(9.8, NA, 6, 5.6, 7.7, 4.9))
            names(df) <- c("sun1", "sun2", "sun3", "sky1", "sky2") # it's good to have unique names
            

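            We need to do some reshaping (to long format). As a base-R advocate, I would use stats::reshape

            However, for reshape to work we need to add one more sky column (filled with NAs) to the data.frame, so that sun and sky have the same number of columns. This has no effect on the later calculations, because we will use na.rm=T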

            df[, 'sky3'] = rep(NA, nrow(df))
            
            df_long = reshape(df, direction = 'long', varying = c(1:3, 4:6), sep="", times=1:3)
            
            df_long 
            
                time  sun sky  id
            1.1    1  1.0 4.0  1
            2.1    1  3.7 3.3  2
            3.1    1  1.5 3.3  3
            4.1    1  3.7 3.1  4
            5.1    1  2.9 7.1  5
            6.1    1  7.0 5.4  6
            1.2    2  2.0 9.8  1
            2.2    2  1.0  NA  2
            3.2    2  0.4 6.0  3
            4.2    2   NA 5.6  4
            5.2    2  1.1 7.7  5
            6.2    2  4.9 4.9  6
            1.3    3  1.1  NA  1
            2.3    3  1.0  NA  2
            3.3    3  2.1  NA  3
            4.3    3  3.6  NA  4
            5.3    3 10.0  NA  5
            6.3    3  6.9  NA  6
            
            
            lapply(df_long[, c('sun', 'sky')],
             \(x, na.rm=T) list(mean=mean(x, na.rm=na.rm), sd=sd(x, na.rm=na.rm))) |> 
            do.call(what = rbind)
                mean     sd      
            sun 3.170588 2.677631
            sky 5.472727 2.102423
            

            【Comments】:
