Summarize data already grouped in R: answers

【Question title】: Summarize data already grouped in R
【Posted】: 2013-08-09 17:30:46
【Question description】:

I am working in R with the following dataset, where ID = Custid

ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1  NA  On-line  1      New         5         0       1
1  NA  On-line  1      Stream      5         0       1
3  EU  Tele     2       Stream     5         1       0

I want to convert the dataset so that its columns follow this format

ID Geo Brand Neworstream OnlineRevQ112 TeleRevQ112 OnlineRevQ212 TeleRevQ212

What is the best way to do this? I can't figure out the right command in R.

Thanks in advance

【Question comments】:

You can use reshape from base R, or dcast from the reshape2 package.

Tags: r reshape summary

【Solution 1】:

You can use the reshape2 package and its melt and dcast functions to restructure your data.

data <- structure(list(ID = c(1L, 1L, 3L), Geo = structure(c(NA, NA, 
1L), .Label = "EU", class = "factor"), Channel = structure(c(1L, 
1L, 2L), .Label = c("On-line", "Tele"), class = "factor"), Brand = c(1L, 
1L, 2L), Neworstream = structure(c(1L, 2L, 2L), .Label = c("New", 
"Stream"), class = "factor"), RevQ112 = c(5L, 5L, 5L), RevQ212 = c(0L, 
0L, 1L), RevQ312 = c(1L, 1L, 0L)), .Names = c("ID", "Geo", "Channel", 
"Brand", "Neworstream", "RevQ112", "RevQ212", "RevQ312"), class = "data.frame", row.names = c(NA, 
-3L)) 

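library(reshape2)

## melt to long format: one row per ID combination, Channel, and revenue column
df_long <- melt(data, id.vars = c("ID", "Geo", "Channel", "Brand", "Neworstream"))

## cast back to wide format: one column per Channel/quarter combination,
## summing the values within each cell
dcast(df_long, ... ~ Channel + variable, fun.aggregate = sum)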
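
The cast result labels each value column with the Channel level joined to the original column name. As a rough sketch of getting closer to the headers asked for in the question (OnlineRevQ112, TeleRevQ112, ...), the names can be cleaned up with gsub(). The name vector below is typed by hand and assumes labels of the form On-line_RevQ112, so check names() on your own result first; tidy_names is a hypothetical helper, not part of reshape2.

```r
# Assumed column names of the cast result (typed by hand for illustration;
# inspect names() on your own result, the separator may differ).
cast_names <- c("ID", "Geo", "Brand", "Neworstream",
                "On-line_RevQ112", "On-line_RevQ212",
                "Tele_RevQ112", "Tele_RevQ212")

# Hypothetical helper: drop "-" and "_" from every name except the id columns.
tidy_names <- function(nms, id_cols) {
  is_value <- !(nms %in% id_cols)
  nms[is_value] <- gsub("[-_]", "", nms[is_value])
  nms
}

new_names <- tidy_names(cast_names, c("ID", "Geo", "Brand", "Neworstream"))
print(new_names)
```

On a real result the same cleanup would be applied by assigning the return value back to names() of the data frame.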

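【Discussion】:

Right, I missed that. I used a different name when I imported the data locally.
+1, although I'm not sure whether reshape is still being developed. Maybe you should use reshape2.
+1. As @Arun mentioned, updating to reshape2 is probably advisable, although the only change this solution needs is replacing cast with dcast.
@Arun, I changed the dependency to reshape2 as you suggested.

【Solution 2】:

Update / facepalm

The "NA" in your dataset is probably not a missing NA value but the string "NA", an abbreviation for North America or something similar.

If you set na.strings when reading in the data, reshape should work without problems, as I originally indicated: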

mydf <- read.table(header = TRUE, na.strings = "", 
text = 'ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1  NA  On-line  1      New         5         0       1
1  NA  On-line  1      Stream      5         0       1
3  EU  Tele     2       Stream     5         1       0')

reshape(mydf, direction = "wide",
        idvar = c("ID", "Geo", "Brand", "Neworstream"),
        timevar = "Channel")

(That said, for readability and to avoid confusion, I would probably suggest changing the abbreviation!)
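
To make the effect of na.strings concrete, here is a minimal sketch on a two-column excerpt of the data (the excerpt and the variable names are made up for illustration). It only compares how the text NA is read with the default setting and with na.strings = "".

```r
# Two-column excerpt of the question's data, typed inline for illustration.
txt <- "ID Geo
1 NA
3 EU"

# Default na.strings = "NA": the text NA is read as a missing value.
geo_default <- read.table(text = txt, header = TRUE)$Geo

# na.strings = "": the text NA is kept as an ordinary value.
geo_literal <- read.table(text = txt, header = TRUE, na.strings = "")$Geo

print(is.na(geo_default))
print(is.na(geo_literal))
```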

Original answer (kept because it still contains some interesting points about `reshape`)

This should do it:

reshape(mydf, direction = "wide", 
        idvar = c("ID", "Geo", "Brand", "Neworstream"), 
        timevar = "Channel")
#   ID  Geo Brand Neworstream RevQ112.On-line RevQ212.On-line RevQ312.On-line
# 1  1 <NA>     1         New               5               0               1
# 3  3   EU     2      Stream              NA              NA              NA
#   RevQ112.Tele RevQ212.Tele RevQ312.Tele
# 1           NA           NA           NA
# 3            5            1            0

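Update (trying to salvage the answer a bit)

As @Arun pointed out, the above is not quite right. The culprit is interaction(), which reshape() uses to create a new temporary ID variable when more than one ID variable is specified.

Here is the line from reshape(), followed by what it produces when applied to our "mydf" object: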

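# the relevant line from the reshape() source (not meant to be run on its own):
data[, tempidname] <- interaction(data[, idvar], drop = TRUE)
# the same call applied to the ID columns of mydf:
interaction(mydf[c(1, 2, 4, 5)], drop = TRUE)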
# [1] <NA>          <NA>          3.EU.2.Stream
# Levels: 3.EU.2.Stream

Hmm. That seems to collapse to just two IDs, NA and 3.EU.2.Stream.
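
A small self-contained sketch of why that matters, using a hand-typed stand-in for the ID columns (the ids data frame is made up for illustration). It builds the same kind of combined key as the reshape() line quoted above and checks which rows share a key. My reading, which I have not traced through the full reshape() source, is that rows with the same NA key look like repeats of one another, which would explain the row that went missing.

```r
# Small stand-in for the ID columns of mydf (values typed by hand).
ids <- data.frame(ID = c(1, 1, 3),
                  Geo = c(NA, NA, "EU"),
                  Neworstream = c("New", "Stream", "Stream"))

# Same kind of combined key as in the reshape() line quoted above.
key <- interaction(ids, drop = TRUE)

print(is.na(key))       # which rows ended up without a usable key
print(duplicated(key))  # rows that look like repeats of an earlier key
```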

What happens if we replace NA with ""?

mydf$Geo <- as.character(mydf$Geo)
mydf$Geo[is.na(mydf$Geo)] <- ""
interaction(mydf[c(1, 2, 4, 5)], drop = TRUE)
# [1] 1..1.New      1..1.Stream   3.EU.2.Stream
# Levels: 1..1.New 1..1.Stream 3.EU.2.Stream

Ahh. That's better. We now have three unique IDs, and reshape() seems to work.

reshape(mydf, direction = "wide", 
        idvar=names(mydf)[c(1, 2, 4, 5)], 
        timevar="Channel")
#   ID Geo Brand Neworstream RevQ112.On-line RevQ212.On-line
# 1  1         1         New               5               0
# 2  1         1      Stream               5               0
# 3  3  EU     2      Stream              NA              NA
#   RevQ312.On-line RevQ112.Tele RevQ212.Tele RevQ312.Tele
# 1               1           NA           NA           NA
# 2               1           NA           NA           NA
# 3              NA            5            1            0
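
The same workaround can be wrapped up so it applies to whichever ID columns happen to contain NA. This is only a sketch: fill_id_na and the "(missing)" placeholder are hypothetical names chosen for illustration, and the affected columns are converted to character along the way.

```r
# Hypothetical helper: in the id columns that contain NA, replace NA with a
# visible placeholder (those columns become character).
fill_id_na <- function(df, idvar, placeholder = "(missing)") {
  for (col in idvar) {
    if (anyNA(df[[col]])) {
      values <- as.character(df[[col]])
      values[is.na(values)] <- placeholder
      df[[col]] <- values
    }
  }
  df
}

# The question's data, read with the default na.strings so Geo contains NA.
mydf <- read.table(header = TRUE, text = "ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1 NA On-line 1 New 5 0 1
1 NA On-line 1 Stream 5 0 1
3 EU Tele 2 Stream 5 1 0")

idvar <- c("ID", "Geo", "Brand", "Neworstream")
wide <- reshape(fill_id_na(mydf, idvar), direction = "wide",
                idvar = idvar, timevar = "Channel")
print(wide)
```

A visible placeholder, rather than an empty string, makes it easier to spot afterwards which rows originally had a missing Geo.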

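【Discussion】:

Ananda, isn't this missing ID=1, Neworstream=Stream? Or am I missing something?
@Arun, no, you're not. Let me take another look.
@Arun, I think it's because of the NA values in the "Geo" column.
Oh wow.. that's interesting. I've never run into this problem with base reshape before...
@Arun, a quick look shows that interaction is the culprit here.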
