【Question title】: Means of blocks of n columns in a data.frame
【Posted】: 2018-11-27 20:00:45
【Question】:

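I have been searching and trying several different ways to average every 10 columns in a data.frame. The dataset is 52 rows x 60 columns. The data.frame is named data, and its first 2 rows look like this: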

X1  X2  X3  X4  X5  X6  X7  X8  X9  X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31 X32 X33 X34 X35 X36 X37 X38 X39 X40 X41 X42 X43 X44 X45 X46 X47 X48 X49 X50 X51 X52 X53 X54 X55 X56 X57 X58 X59 X60
4   14.7637 14.2117 14.1237 13.6637 12.9837 13.3237 13.8877 15.0997 15.5717 16.5157 15.0597 13.5317 13.6957 13.2637 13.5117 13.4237 14.1277 13.8437 12.8357 13.6277 13.2077 14.9837 16.1277 15.6197 15.7517 16.8557 15.9757 15.9677 16.1677 17.1557 16.1157 16.3557 16.2037 16.8077 16.6757 16.4837 16.7877 16.1037 16.3117 16.0637 16.1077 16.2477 17.1917 18.1236 18.5036 18.2956 20.9516 18.0636 18.5516 19.1756 19.5996 19.2036 18.1996 16.7117 16.7037 16.7877 16.5837 17.6636 18.8596 18.3356
5   16.9597 15.9037 15.3917 15.6797 15.6797 15.8397 17.1517 18.0796 18.6236 20.4796 18.8796 16.2877 16.7997 15.6157 16.9917 16.8317 16.9917 17.5356 16.3517 15.1357 16.5437 17.4077 18.4316 17.0557 17.3117 19.1676 18.2396 16.7037 17.2157 19.1676 18.2076 16.7677 18.7196 19.4236 18.2716 17.5356 18.7196 17.8876 17.2477 16.9597 17.2797 18.3996 19.5516 19.2636 20.0956 20.4476 21.5356 18.4316 20.7356 22.1436 21.6636 20.7676 19.7436 18.5596 17.9516 17.8876 18.1116 19.2956 20.3516 19.4876

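(The row labels 4 and 5 and the header row are just placeholders in the file.)

The data is read and extracted from a .txt file, and I want to average every 10 columns so that the 60 columns become 6. Here is some additional information that I have seen people ask for before: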

> class(data)
[1] "data.frame"

> str(data)
'data.frame':   52 obs. of  60 variables:
$ X1 : Factor w/ 53 levels "0","0.0319994",..: 31 32 34 30 51 48 45 39 36 28 ...
$ X2 : Factor w/ 48 levels "0","0.0319994",..: 27 30 29 26 46 42 39 31 23 19 ...

Most recently I tried:

dataMean <- data.frame(Means=rowMeans(data), ncol=10)

dataMean <- rowMeans(data.frame(data, ncol=10))

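Both give the same error saying that 'x' must be numeric. Any help anyone can offer would be greatly appreciated.

Thanks in advance!

Edit: The desired result would look like this, where the number of columns has been reduced and each value is the arithmetic mean of a block of 10 columns: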

  X1       X2       X3       X4       X5       X6
4 14.4145  13.6921  15.7813  16.3909  18.12123 17.86484
5 16.97887 16.74208 17.72446 17.97403 19.78841 19.382

Edit 2:

 > dput(df)
 structure(list(X1X2X3X4X5X6X7X8X9X10X11X12X13X14X15X16X17X18X19X20X21X22X23X24X25X26X27X28X29X30X31X32X33X34X35X36X37X38X39X40X41X42X43X44X45X46X47X48X49X50X51X52X53X54X55X56X57X58X59X60 = c("414.763714.211714.123713.663712.983713.323713.887715.099715.571716.515715.059713.531713.695713.263713.511713.423714.127713.843712.835713.627713.207714.983716.127715.619715.751716.855715.975715.967716.167717.155716.115716.355716.203716.807716.675716.483716.787716.103716.311716.063716.107716.247717.191718.123618.503618.295620.951618.063618.551619.175619.599619.203618.199616.711716.703716.787716.583717.663618.859618.3356", 

 ="516.959715.903715.391715.679715.679715.839717.151718.079618.623620.479618.879616.287716.799715.615716.991716.831716.991717.535616.351715.135716.543717.407718.431617.055717.311719.167618.239616.703717.215719.167618.207616.767718.719619.423618.271617.535618.719617.887617.247716.959717.279718.399619.551619.263620.095620.447621.535618.431620.735622.143621.663620.767619.743618.559617.951617.887618.111619.295620.351619.4876"
)), class = "data.frame", row.names = c(NA, -2L))

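【Comments】:

  • I think this is a duplicate. You use "[" with something like c(rep(FALSE, n-1), TRUE) in the j position to select every nth column. The recycling rule applies, so it should repeat across the full width of the data frame.
  • I don't think this is a duplicate, but maybe that's because I'm not sure what the OP is asking. Can you give us a clear example of the behavior you're looking for? What does "average every 10 columns" mean?
  • However, since the example lacks the correct result, there is no way to know whether you mean that you want the means of columns grouped 10 at a time.

Tags: r dataframe mean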
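The first comment's indexing idiom can be sketched on toy data. The data frame `toy` and the value `n <- 3` below are made up for illustration and are not the asker's data. Note that this idiom selects every nth column; it does not average blocks of columns, which is what the other two comments are trying to pin down.

```r
# Toy data (hypothetical): 2 rows, 6 columns named X1..X6
toy <- data.frame(matrix(1:12, nrow = 2))
n <- 3

# A logical index of length n in the j position is recycled across
# all 6 columns, so only columns 3 and 6 are kept
idx <- c(rep(FALSE, n - 1), TRUE)
every_nth <- toy[, idx, drop = FALSE]

names(every_nth)
```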


【Solution 1】:

We can use split and rowMeans:

as.data.frame(sapply(
  split(seq_along(df),(seq_along(df)-1) %/%10),
  function(x) rowMeans(df[x])
))
#          0        1        2        3        4        5
# 4 14.41450 13.69210 15.78130 16.39090 18.12123 17.86484
# 5 16.97887 16.74208 17.72446 17.97403 19.78841 19.38200
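To see how the grouping in this answer works, the index arithmetic can be run on its own. This is a small sketch that assumes 60 columns, as in the question; `idx`, `grp` and `groups` are names introduced here for illustration.

```r
# Zero-based column position divided (integer division) by 10:
# columns 1-10 fall in group 0, 11-20 in group 1, ..., 51-60 in group 5
idx <- seq_len(60)
grp <- (idx - 1) %/% 10

# split() returns one vector of column positions per group; these are
# what the answer passes to rowMeans(df[x])
groups <- split(idx, grp)

names(groups)     # group labels, which become the result's column names
groups[["1"]]     # the column positions in the second block
```

The group labels "0" to "5" explain the column names in the output above; they could be replaced afterwards with names() if X1 to X6 are preferred.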

Data

df <- read.table(header=TRUE,stringsAsFactors=FALSE,text="X1  X2  X3  X4  X5  X6  X7  X8  X9  X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31 X32 X33 X34 X35 X36 X37 X38 X39 X40 X41 X42 X43 X44 X45 X46 X47 X48 X49 X50 X51 X52 X53 X54 X55 X56 X57 X58 X59 X60
4   14.7637 14.2117 14.1237 13.6637 12.9837 13.3237 13.8877 15.0997 15.5717 16.5157 15.0597 13.5317 13.6957 13.2637 13.5117 13.4237 14.1277 13.8437 12.8357 13.6277 13.2077 14.9837 16.1277 15.6197 15.7517 16.8557 15.9757 15.9677 16.1677 17.1557 16.1157 16.3557 16.2037 16.8077 16.6757 16.4837 16.7877 16.1037 16.3117 16.0637 16.1077 16.2477 17.1917 18.1236 18.5036 18.2956 20.9516 18.0636 18.5516 19.1756 19.5996 19.2036 18.1996 16.7117 16.7037 16.7877 16.5837 17.6636 18.8596 18.3356
           5   16.9597 15.9037 15.3917 15.6797 15.6797 15.8397 17.1517 18.0796 18.6236 20.4796 18.8796 16.2877 16.7997 15.6157 16.9917 16.8317 16.9917 17.5356 16.3517 15.1357 16.5437 17.4077 18.4316 17.0557 17.3117 19.1676 18.2396 16.7037 17.2157 19.1676 18.2076 16.7677 18.7196 19.4236 18.2716 17.5356 18.7196 17.8876 17.2477 16.9597 17.2797 18.3996 19.5516 19.2636 20.0956 20.4476 21.5356 18.4316 20.7356 22.1436 21.6636 20.7676 19.7436 18.5596 17.9516 17.8876 18.1116 19.2956 20.3516 19.4876")

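【Comments】:

  • I still get the same error: "Error in rowMeans(df[x]) : 'x' must be numeric"
  • Start with df[] <- lapply(df, function(x) as.numeric(as.character(x)))
  • But it would be better to fix the problem upstream. You have doubles stored as factors, which means someone messed up an import or reformatting step somewhere.
  • The error is still the same. Do you think the problem is in the earlier code? I didn't write that part; the person who did warned me that it wasn't well written but did what they needed. Should I try to reformat the data into a numeric matrix, or try to fix this some other way?
  • Please add the output of dput(df) to the question, it will make this much easier
【Solution 2】: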
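The conversion suggested in the comments goes through as.character() for a reason: calling as.numeric() directly on a factor returns the internal level codes, not the printed values. A minimal sketch on made-up factor columns (`f` and `toy` are hypothetical stand-ins for the asker's data):

```r
f <- factor(c("14.7637", "16.9597", "13.6637"))
codes  <- as.numeric(f)                # level codes: 2 3 1
values <- as.numeric(as.character(f))  # the numbers themselves

# Same idea applied to every column of a data frame; df[] <- keeps the
# data.frame structure while replacing the columns
toy <- data.frame(X1 = factor(c("14.7637", "16.9597")),
                  X2 = factor(c("14.2117", "15.9037")))
toy[] <- lapply(toy, function(x) as.numeric(as.character(x)))

sapply(toy, is.numeric)
rowMeans(toy)
```

If a column contains entries that are not valid numbers, as.numeric() turns them into NA with a warning, so checking anyNA(toy) after the conversion is a cheap way to spot leftover import problems.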

Here is a tidyverse option:

library(tidyverse)
df %>%
    rowid_to_column("row") %>%
    gather(k, v, -row) %>%
    mutate(group = (as.numeric(sub("X", "", k)) - 1) %/% 10) %>%
    group_by(group, row) %>%
    summarise(v.mean = mean(v)) %>%
    spread(group, v.mean) %>%
    select(-row)
## A tibble: 2 x 6
#    `0`   `1`   `2`   `3`   `4`   `5`
#  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1  14.4  13.7  15.8  16.4  18.1  17.9
#2  17.0  16.7  17.7  18.0  19.8  19.4

Update

This works just the same if you have more than 2 rows. Here is an example using a 50x60 data.frame.

ncol <- 60;
nrow <- 50;
df <- data.frame(matrix(runif(nrow * ncol), ncol = ncol))

df %>%
    rowid_to_column("row") %>%
    gather(k, v, -row) %>%
    mutate(group = (as.numeric(sub("X", "", k)) - 1) %/% 10) %>%
    group_by(group, row) %>%
    summarise(v.mean = mean(v)) %>%
    spread(group, v.mean) %>%
    select(-row)
## A tibble: 50 x 6
#     `0`   `1`   `2`   `3`   `4`   `5`
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0.372 0.514 0.400 0.565 0.489 0.412
# 2 0.344 0.465 0.625 0.421 0.602 0.519
# 3 0.393 0.389 0.465 0.607 0.504 0.539
# 4 0.545 0.599 0.530 0.552 0.661 0.568
# 5 0.589 0.456 0.590 0.557 0.441 0.494
# 6 0.588 0.602 0.362 0.524 0.526 0.644
# 7 0.432 0.624 0.457 0.539 0.530 0.481
# 8 0.494 0.519 0.661 0.568 0.709 0.610
# 9 0.397 0.413 0.398 0.370 0.720 0.570
#10 0.639 0.495 0.551 0.717 0.721 0.496
## ... with 40 more rows

Sample data

df <- read.table(text =
    "X1  X2  X3  X4  X5  X6  X7  X8  X9  X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31 X32 X33 X34 X35 X36 X37 X38 X39 X40 X41 X42 X43 X44 X45 X46 X47 X48 X49 X50 X51 X52 X53 X54 X55 X56 X57 X58 X59 X60
4   14.7637 14.2117 14.1237 13.6637 12.9837 13.3237 13.8877 15.0997 15.5717 16.5157 15.0597 13.5317 13.6957 13.2637 13.5117 13.4237 14.1277 13.8437 12.8357 13.6277 13.2077 14.9837 16.1277 15.6197 15.7517 16.8557 15.9757 15.9677 16.1677 17.1557 16.1157 16.3557 16.2037 16.8077 16.6757 16.4837 16.7877 16.1037 16.3117 16.0637 16.1077 16.2477 17.1917 18.1236 18.5036 18.2956 20.9516 18.0636 18.5516 19.1756 19.5996 19.2036 18.1996 16.7117 16.7037 16.7877 16.5837 17.6636 18.8596 18.3356
5   16.9597 15.9037 15.3917 15.6797 15.6797 15.8397 17.1517 18.0796 18.6236 20.4796 18.8796 16.2877 16.7997 15.6157 16.9917 16.8317 16.9917 17.5356 16.3517 15.1357 16.5437 17.4077 18.4316 17.0557 17.3117 19.1676 18.2396 16.7037 17.2157 19.1676 18.2076 16.7677 18.7196 19.4236 18.2716 17.5356 18.7196 17.8876 17.2477 16.9597 17.2797 18.3996 19.5516 19.2636 20.0956 20.4476 21.5356 18.4316 20.7356 22.1436 21.6636 20.7676 19.7436 18.5596 17.9516 17.8876 18.1116 19.2956 20.3516 19.4876", header = T)

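【Comments】:

  • I get the error "Error in library(tidyverse) : there is no package called 'tidyverse'". As with the zoo package, do I need to download this package separately?
  • @TCBatUGA Yes, the easiest way is install.packages("tidyverse"), which installs the relevant tidyverse packages. Same for zoo: install.packages("zoo")
  • So when I run this after installing the package, I get: # A tibble: 52 x 1 <NA> 1 NA 2 NA followed by more NAs on every row
  • @TCBatUGA My example is fully reproducible and based on the sample data you provided. You should confirm that you can reproduce the output from the sample data. It sounds like your actual data may differ from the sample data you gave (your sample data has 60 columns, which are collapsed into a 6-column dataset, so the 52 columns you mention in your last comment ring alarm bells).
  • Thanks for your efforts. Whenever I run your code on the sample data, it works fine. However, when I bring in the other 50 rows (52 in total), it breaks and gives NA results.
【Solution 3】: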
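The thread does not settle why the real data produces NA, so the following is only a sketch of two plausible causes, both reproducible in base R. It assumes the real columns are still factors, as the str() output in the question shows, or that the real column names are not of the form X followed by digits; the name "V11" is a made-up example.

```r
# (1) mean() does not accept factors: it returns NA with a warning
f <- factor(c("14.7637", "16.9597"))
mean_of_factor  <- suppressWarnings(mean(f))
mean_of_numbers <- mean(as.numeric(as.character(f)))

# (2) the group index used in this answer is NA whenever stripping "X"
#     from the column name does not leave a number
group_ok  <- (as.numeric(sub("X", "", "X11")) - 1) %/% 10
group_bad <- suppressWarnings((as.numeric(sub("X", "", "V11")) - 1) %/% 10)

c(mean_of_factor, mean_of_numbers, group_ok, group_bad)
```

Checking str(data) and names(data) on the full 52-row data before running the pipeline would show which of these, if either, applies.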

rollmean from zoo might help here:

library(zoo)

m <- apply(df,1,rollmean,10) 
t(m[seq(nrow(m)) %% 10 ==1,])

#         X5      X15      X25      X35      X45      X55
# 4 14.41450 13.69210 15.78130 16.39090 18.12123 17.86484
# 5 16.97887 16.74208 17.72446 17.97403 19.78841 19.38200

I reused df from Moody_Mudskipper's answer.

【Comments】:
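Why keeping every 10th rolling mean gives the block means can be checked without zoo. The sketch below re-creates a width-10 rolling mean by hand in base R (it is an illustration, not zoo's implementation) on a made-up vector `x` of 60 values, and compares the result with direct block means.

```r
x <- as.numeric(1:60)   # hypothetical stand-in for one row of the data
k <- 10

# Hand-rolled rolling mean: one value per window start, 51 windows in total
roll <- sapply(seq_len(length(x) - k + 1),
               function(i) mean(x[i:(i + k - 1)]))

# Windows starting at positions 1, 11, ..., 51 do not overlap,
# so they are exactly the means of the six blocks of 10
block_from_roll <- roll[seq_along(roll) %% k == 1]

# Direct block means for comparison
block_direct <- colMeans(matrix(x, nrow = k))

all.equal(block_from_roll, block_direct)
```

The rolling approach computes 51 means per row and discards 45 of them, which is fine for 52 rows but is more work than averaging the blocks directly.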

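  • When I try this, I get an error: "Error in library(zoo) : there is no package called 'zoo'". Do I have to download that package separately?
  • Yes, install.packages('zoo')
  • When I try this code now, I get an error that the function 'transpose' could not be found
  • It should be t()
【Solution 4】: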

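I posted this question last night, but eventually found a solution through more searching. I found that I had to convert the data.frame to a matrix and then transpose that matrix, so that it could be reshaped into columns of 10 values each and every column averaged. Then I reshaped the result back into the form I wanted.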

y <- apply(as.matrix(data), 2, as.numeric)
z <- t(y)
n=10
MatrixMeanD <- colMeans(matrix(z, nrow=10))   
#dont know why but rowMeans didnt work for me, while colMeans did?

x <- t(MatrixMeanD)
MatrixMean <- t(matrix(x,,52))
write.csv(MatrixMean,"file")

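Thanks to everyone who gave me advice and tried to help me fix my code!

【Comments】:

  • Correction: rowMeans works, but it does not give me the correct means. For example, the first mean came out as 15.1667 rather than 14.4145.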
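The reshape in this answer relies on R storing matrices column by column. A minimal sketch on a made-up 2 x 6 matrix with blocks of n = 3 (the names `m`, `blocks` and `block_means` are illustrative) shows why colMeans gives the block means, and suggests why rowMeans on the same reshaped matrix would not:

```r
# Toy data (hypothetical): 2 rows, 6 columns, blocks of n = 3 columns
m <- matrix(c(1, 2, 3, 10, 20, 30,
              4, 5, 6, 40, 50, 60), nrow = 2, byrow = TRUE)
n <- 3

# t(m) lays out each original row contiguously in memory, so reshaping
# to n rows puts exactly one block of n values in each column
blocks <- matrix(t(m), nrow = n)

# One mean per block, refilled row by row into a 2 x 2 result
block_means <- matrix(colMeans(blocks), nrow = nrow(m), byrow = TRUE)
block_means

# rowMeans(blocks) would instead average the 1st, 2nd and 3rd element
# of every block across all blocks and rows, which is a different quantity
rowMeans(blocks)
```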