Group by columns, then compute mean and sd of every other column in R (answers)

【Question title】: Group by columns, then compute mean and sd of every other column in R
【Posted】: 2016-09-24 06:13:42
【Question】:

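How do I group by one column and then compute the mean and standard deviation of every other column in R?

Take the well-known iris dataset as an example. I want to group by Species and then compute the mean and sd of the petal/sepal length/width measurements. I know this has something to do with split-apply-combine, but I am not sure how to proceed from there.

What I have come up with so far: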

require(plyr)

x <- ddply(iris, .(Species), summarise,
    Sepal.Length.Mean = mean(Sepal.Length),
    Sepal.Length.Sd = sd(Sepal.Length),
    Sepal.Width.Mean = mean(Sepal.Width),
    Sepal.Width.Sd = sd(Sepal.Width),
    Petal.Length.Mean = mean(Petal.Length),
    Petal.Length.Sd = sd(Petal.Length),
    Petal.Width.Mean = mean(Petal.Width),
    Petal.Width.Sd = sd(Petal.Width))

     Species Sepal.Length.Mean Sepal.Length.Sd Sepal.Width.Mean Sepal.Width.Sd
1     setosa             5.006       0.3524897            3.428      0.3790644
2 versicolor             5.936       0.5161711            2.770      0.3137983
3  virginica             6.588       0.6358796            2.974      0.3224966
  Petal.Length.Mean Petal.Length.Sd Petal.Width.Mean Petal.Width.Sd
1             1.462       0.1736640            0.246      0.1053856
2             4.260       0.4699110            1.326      0.1977527
3             5.552       0.5518947            2.026      0.2746501

Desired output:

z <- data.frame(setosa = c(5.006, 0.3524897, 3.428, 0.3790644,
                           1.462, 0.1736640, 0.246, 0.1053856),
                versicolor = c(5.936, 0.5161711, 2.770, 0.3137983,
                               4.260, 0.4699110, 1.326, 0.1977527),
                virginica = c(6.588, 0.6358796, 2.974, 0.3224966,
                              5.552, 0.5518947, 2.026, 0.2746501))
rownames(z) <- c('Sepal.Length.Mean', 'Sepal.Length.Sd',
                 'Sepal.Width.Mean', 'Sepal.Width.Sd',
                 'Petal.Length.Mean', 'Petal.Length.Sd',
                 'Petal.Width.Mean', 'Petal.Width.Sd')
                     setosa versicolor virginica
Sepal.Length.Mean 5.0060000  5.9360000 6.5880000
Sepal.Length.Sd   0.3524897  0.5161711 0.6358796
Sepal.Width.Mean  3.4280000  2.7700000 2.9740000
Sepal.Width.Sd    0.3790644  0.3137983 0.3224966
Petal.Length.Mean 1.4620000  4.2600000 5.5520000
Petal.Length.Sd   0.1736640  0.4699110 0.5518947
Petal.Width.Mean  0.2460000  1.3260000 2.0260000
Petal.Width.Sd    0.1053856  0.1977527 0.2746501

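【Question comments】:

I think the value of the cell "setosa"/"Sepal.Length.Mean" should be 5.006 rather than the 0.5006 given in the "desired output" (it looks like a typo). If nobody objects, I will edit the question to fix this.

Tags: r split-apply-combine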

【Solution 1】:

We can try dplyr:

library(dplyr)
res <- iris %>% 
         group_by(Species) %>% 
         summarise_each(funs(mean, sd))
`colnames<-`(t(res[-1]), as.character(res$Species))
#                     setosa versicolor virginica
#Sepal.Length_mean 5.0060000  5.9360000 6.5880000
#Sepal.Width_mean  3.4280000  2.7700000 2.9740000
#Petal.Length_mean 1.4620000  4.2600000 5.5520000
#Petal.Width_mean  0.2460000  1.3260000 2.0260000
#Sepal.Length_sd   0.3524897  0.5161711 0.6358796
#Sepal.Width_sd    0.3790644  0.3137983 0.3224966
#Petal.Length_sd   0.1736640  0.4699110 0.5518947
#Petal.Width_sd    0.1053856  0.1977527 0.2746501

Or, as @Steven Beaupre mentioned in the comments, the output can be obtained by reshaping with gather and spread from tidyr:

library(tidyr)
iris %>% 
   group_by(Species) %>% 
   summarise_each(funs(mean, sd)) %>% 
   gather(key, value, -Species) %>% 
   spread(Species, value)
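
A note for readers on newer package versions: summarise_each() and funs() have since been deprecated in dplyr, and gather()/spread() are superseded in tidyr. The following is a minimal sketch of the same pipeline, assuming dplyr >= 1.0.0 and tidyr >= 1.0.0 are installed. It is not part of the original answer; the column name key is carried over from the code above, and the .Mean/.Sd suffixes are chosen only to match the row names asked for in the question.

```r
library(dplyr)
library(tidyr)

res <- iris %>%
  group_by(Species) %>%
  # Apply mean and sd to every non-grouping column; name results "<column>.<function>"
  summarise(across(everything(), list(Mean = mean, Sd = sd),
                   .names = "{.col}.{.fn}")) %>%
  # Long form: one row per (Species, statistic) pair
  pivot_longer(-Species, names_to = "key", values_to = "value") %>%
  # Wide form again, this time with one column per species
  pivot_wider(names_from = Species, values_from = value)

res
```

The result is a tibble with the statistic names in the key column rather than in row names, since tibbles do not carry row names.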

【Comments】:

To avoid the matrix transpose, you could do: iris %>% group_by(Species) %>% summarise_each(funs(mean, sd)) %>% gather(key, value, -Species) %>% spread(Species, value)

【Solution 2】:

This is the traditional plyr approach. It uses colwise to compute the summary statistics for all columns.

library(plyr)
means <- ddply(iris, .(Species), colwise(mean))
sds <- ddply(iris, .(Species), colwise(sd))
merge(means, sds, by = "Species", suffixes = c(".mean", ".sd"))
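
For comparison, the same "one summary table per statistic, then merge" idea can be sketched in base R without plyr. This is a swapped-in technique using aggregate(), not the answerer's code; the final transpose is added only to reach the one-column-per-species layout from the question.

```r
# One row per species, one column per measurement, for each statistic
means <- aggregate(. ~ Species, data = iris, FUN = mean)
sds   <- aggregate(. ~ Species, data = iris, FUN = sd)

# Join on Species; the clashing measurement names get the given suffixes
wide <- merge(means, sds, by = "Species", suffixes = c(".mean", ".sd"))

# Transpose so that species become columns and statistics become row names
out <- t(wide[-1])
colnames(out) <- as.character(wide$Species)
out
```

Note that out is a numeric matrix, not a data frame; wrap it in as.data.frame() if a data frame is needed.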

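【Comments】:

【Solution 3】:

If you want to use data.table for performance reasons, you can try this (don't be scared: there are more comments than code ;-)). I have tried to optimize all the performance-critical points.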

library(data.table)
dt <- as.data.table(iris)

# Helper function similar to "colwise" of package "plyr":
# Apply a function "func" to each column of the data.table "data"
# and append the "suffix" string to the result column name.
colwise.dt <- function( data, func, suffix )
{
  result <- lapply(data, func)                                      # apply the function to each column of the data table
  setDT(result)                                                     # convert the result list into a data table efficiently ("by ref")
  setnames(result, names(result), paste0(names(result), suffix))    # append suffix to each column name efficiently ("by ref"). "setnames" requires a data.table
}

wide.result <- dt[, c(colwise.dt(.SD, mean, ".mean"), colwise.dt(.SD, sd, ".sd")), by=.(Species)]
# Note: .SD is a data.table containing the subset of dt's data for each group (Species), excluding any columns used in "by" (here: Species column)

# Now transpose the result
long.result <- melt(wide.result, id.vars="Species")

# Now transform into one column per group
final.result <- dcast(long.result, variable ~ Species, value.var = "value")

wide.result is:

      Species Sepal.Length.mean Sepal.Width.mean Petal.Length.mean Petal.Width.mean Sepal.Length.sd Sepal.Width.sd Petal.Length.sd Petal.Width.sd
1:     setosa             5.006            3.428             1.462            0.246       0.3524897      0.3790644       0.1736640      0.1053856
2: versicolor             5.936            2.770             4.260            1.326       0.5161711      0.3137983       0.4699110      0.1977527
3:  virginica             6.588            2.974             5.552            2.026       0.6358796      0.3224966       0.5518947      0.2746501

long.result is:

       Species          variable     value
 1:     setosa Sepal.Length.mean 5.0060000
 2: versicolor Sepal.Length.mean 5.9360000
 3:  virginica Sepal.Length.mean 6.5880000
 4:     setosa  Sepal.Width.mean 3.4280000
 5: versicolor  Sepal.Width.mean 2.7700000
 6:  virginica  Sepal.Width.mean 2.9740000
 7:     setosa Petal.Length.mean 1.4620000
 8: versicolor Petal.Length.mean 4.2600000
 9:  virginica Petal.Length.mean 5.5520000
10:     setosa  Petal.Width.mean 0.2460000
11: versicolor  Petal.Width.mean 1.3260000
12:  virginica  Petal.Width.mean 2.0260000
13:     setosa   Sepal.Length.sd 0.3524897
14: versicolor   Sepal.Length.sd 0.5161711
15:  virginica   Sepal.Length.sd 0.6358796
16:     setosa    Sepal.Width.sd 0.3790644
17: versicolor    Sepal.Width.sd 0.3137983
18:  virginica    Sepal.Width.sd 0.3224966
19:     setosa   Petal.Length.sd 0.1736640
20: versicolor   Petal.Length.sd 0.4699110
21:  virginica   Petal.Length.sd 0.5518947
22:     setosa    Petal.Width.sd 0.1053856
23: versicolor    Petal.Width.sd 0.1977527
24:  virginica    Petal.Width.sd 0.2746501

final.result is:

            variable    setosa versicolor virginica
1: Sepal.Length.mean 5.0060000  5.9360000 6.5880000
2:  Sepal.Width.mean 3.4280000  2.7700000 2.9740000
3: Petal.Length.mean 1.4620000  4.2600000 5.5520000
4:  Petal.Width.mean 0.2460000  1.3260000 2.0260000
5:   Sepal.Length.sd 0.3524897  0.5161711 0.6358796
6:    Sepal.Width.sd 0.3790644  0.3137983 0.3224966
7:   Petal.Length.sd 0.1736640  0.4699110 0.5518947
8:    Petal.Width.sd 0.1053856  0.1977527 0.2746501

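The only difference from the desired output is that final.result holds the value names in a first column called variable instead of storing them as row names. This can be addressed by setting the row names from the first column and then dropping that column...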
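
That last step could look roughly like the sketch below. It is not from the original answer: the first few lines only rebuild final.result in a compact way so the example runs on its own, and the names cols and out are made up for illustration. Because a data.table does not keep row names, the result is converted to a plain data.frame first.

```r
library(data.table)

dt <- as.data.table(iris)
cols <- setdiff(names(dt), "Species")   # the four measurement columns

# Compact stand-in for the wide.result / long.result / final.result steps above
wide.result <- dt[, c(setNames(lapply(.SD, mean), paste0(cols, ".mean")),
                      setNames(lapply(.SD, sd),   paste0(cols, ".sd"))),
                  by = Species]
long.result  <- melt(wide.result, id.vars = "Species")
final.result <- dcast(long.result, variable ~ Species, value.var = "value")

# Move the "variable" column into the row names of a plain data.frame
out <- as.data.frame(final.result)
rownames(out) <- as.character(out$variable)
out$variable <- NULL
out
```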

【Comments】:

【Solution 4】:

Inspired by the other answers, I came up with a solution that works just as well and uses only dplyr and tidyr functions.

require(tidyr)
require(dplyr)

x <- iris %>%
    gather(var, value, -Species)
print(tbl_df(x))

# Compute the mean and sd for each dimension
x <- x %>%
    group_by(Species, var) %>%
    summarise(mean = mean(value), sd = sd(value)) %>%
    ungroup
print(tbl_df(x))

# Convert the data frame from wide form to long form
x <- x %>%
    gather(stat, value, mean:sd)
print(tbl_df(x))

# Combine the variables "var" and "stat" into a single variable
x <- x %>%
    unite(var, var, stat, sep = '.')
print(tbl_df(x))

# Convert the data frame from long form to wide form
x <- x %>%
    spread(Species, value)
print(tbl_df(x))

【Comments】: