Calculating "group characteristics" without ddply and merge: answers

【Question title】: Calculate "group characteristics" without ddply and merge
【Posted】: 2013-03-17 22:49:55
【Question】:

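I am wondering whether there is a more straightforward way to calculate a certain type of variable than the approach I normally take.

The example below probably explains it best. I have a data frame with 2 columns (the fruit, and whether that fruit is rotten). For each row I would like to add, for example, the percentage of fruit in the same category that is rotten. For instance, there are 4 entries for Apple, 2 of which are rotten, so every Apple row should read 0.5. The target values (purely as illustration) are included in the "desired outcome" column.

I have previously approached this problem by:
* using the "ddply" command on the Fruit variable (with sum/length as the function), creating a new 3*2 data frame;
* using the "merge" command to link these values back into the old data frame.

This feels like a roundabout way, and I wonder whether there is a better/faster way of doing it! Ideally a generic approach that is easy to adjust if, instead of a percentage, one needs to determine whether all fruits are rotten, whether any fruit is rotten, and so on.

Many thanks,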

    Fruit Rotten Desired_Outcome_PercRotten
1   Apple      1                        0.5
2   Apple      1                        0.5
3   Apple      0                        0.5
4   Apple      0                        0.5
5    Pear      1                       0.75
6    Pear      1                       0.75
7    Pear      1                       0.75
8    Pear      0                       0.75
9  Cherry      0                          0
10 Cherry      0                          0
11 Cherry      0                          0

# create example data frame; the desired outcome column is inserted purely to illustrate the target values
Fruit=c(rep("Apple",4),rep("Pear",4),rep("Cherry",3))
Rotten=c(1,1,0,0,1,1,1,0,0,0,0)
Desired_Outcome_PercRotten=c(0.5,0.5,0.5,0.5,0.75,0.75,0.75,0.75,0,0,0)
df=as.data.frame(cbind(Fruit,Rotten,Desired_Outcome_PercRotten))        
df

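【Comments on the question】:

Related discussion on the first part of your question: stackoverflow.com/q/11562656/636656. The answers below are better, though, because they combine the split-apply-combine operation and the merge in a single step.
user1885116, create the data.frame from scratch with df <- data.frame(Fruit, Rotten, Desired_Outcome_PercRotten) instead of using as.data.frame and cbind. The latter turns the Rotten column into a factor, which is undesirable.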
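
The point of that last comment can be checked with a minimal sketch. It rebuilds only the Fruit and Rotten columns from the question; the names df_cbind and df_direct are made up for this illustration. Depending on the R version, the cbind route yields a factor (before R 4.0) or a character column, but in neither case a numeric one.

```r
Fruit  <- c(rep("Apple", 4), rep("Pear", 4), rep("Cherry", 3))
Rotten <- c(1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)

# cbind() first builds a character matrix, so Rotten loses its numeric type
df_cbind <- as.data.frame(cbind(Fruit, Rotten))

# data.frame() keeps each vector's own type
df_direct <- data.frame(Fruit, Rotten)

is.numeric(df_cbind$Rotten)   # FALSE
is.numeric(df_direct$Rotten)  # TRUE
```

This is why some of the answers below convert Rotten with as.numeric(as.character(...)) before computing anything.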

Tags: r merge plyr

【Solution 1】:

You can do it with just ddply and mutate:

library(plyr)
# Rotten must be numeric here, e.g. df <- data.frame(Fruit, Rotten)
# changed summarise to transform on joran's suggestion
# changed transform to mutate on mnel's suggestion :)
ddply(df, .(Fruit), mutate, Perc = sum(Rotten)/length(Rotten))

#     Fruit Rotten Perc
# 1   Apple      1 0.50
# 2   Apple      1 0.50
# 3   Apple      0 0.50
# 4   Apple      0 0.50
# 5  Cherry      0 0.00
# 6  Cherry      0 0.00
# 7  Cherry      0 0.00
# 8    Pear      1 0.75
# 9    Pear      1 0.75
# 10   Pear      1 0.75
# 11   Pear      0 0.75

【Comments】:

I would also suggest mutate (plyr's implementation of transform, which lets you refer to columns you have just created), e.g. ddply(df, .(Fruit), mutate, percR = sum(Rotten) / length(Rotten), pp = Rotten * percR), compared with ddply(df, .(Fruit), transform, percR = sum(Rotten) / length(Rotten), pp = Rotten * percR).
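
The difference described in this comment can be seen without ddply at all. The sketch below is only an illustration on a tiny made-up data frame d, and it assumes that no variable called percR exists in the calling environment: mutate evaluates its arguments one after another, while base transform evaluates them all against the original columns.

```r
library(plyr)

d <- data.frame(Rotten = c(1, 1, 0, 0))

# mutate: pp can use percR, which was created one argument earlier
m <- mutate(d, percR = mean(Rotten), pp = Rotten * percR)

# transform: percR is not yet a column when pp is evaluated, so this fails;
# the error message is captured instead of stopping the script
t_err <- tryCatch(
  transform(d, percR = mean(Rotten), pp = Rotten * percR),
  error = function(e) conditionMessage(e)
)

m$pp   # 0.5 0.5 0.0 0.0
```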

【Solution 2】:

data.table is very fast because it updates by reference. Here is how to use it:

library(data.table)

dt=data.table(Fruit,Rotten,Desired_Outcome_PercRotten)

dt[,test:=sum(Rotten)/.N,by="Fruit"]
#dt
#     Fruit Rotten Desired_Outcome_PercRotten test
# 1:  Apple      1                       0.50 0.50
# 2:  Apple      1                       0.50 0.50
# 3:  Apple      0                       0.50 0.50
# 4:  Apple      0                       0.50 0.50
# 5:   Pear      1                       0.75 0.75
# 6:   Pear      1                       0.75 0.75
# 7:   Pear      1                       0.75 0.75
# 8:   Pear      0                       0.75 0.75
# 9: Cherry      0                       0.00 0.00
#10: Cherry      0                       0.00 0.00
#11: Cherry      0                       0.00 0.00
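
The question also asks for a generic approach that covers "any rotten" or "all rotten". A possible sketch along the same lines follows; the table name dt2 and the new column names are invented for this example. Note that `:=` modifies dt2 in place, so no reassignment is needed.

```r
library(data.table)

dt2 <- data.table(
  Fruit  = c(rep("Apple", 4), rep("Pear", 4), rep("Cherry", 3)),
  Rotten = c(1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)
)

# Several group-level columns added by reference in one call
dt2[, `:=`(PercRotten = mean(Rotten),
           AnyRotten  = any(Rotten == 1),
           AllRotten  = all(Rotten == 1)),
    by = Fruit]
```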

【Comments】:

【Solution 3】:

A solution in base R is to use ave.

within(df, {
  ## Because of how you've created your data.frame
  ##   Rotten is actually a factor. So, we need to
  ##   convert it to numeric before we can use mean
  Rotten <- as.numeric(as.character(Rotten))
  NewCol <- ave(Rotten, Fruit)
})
    Fruit Rotten Desired_Outcome_PercRotten NewCol
1   Apple      1                        0.5   0.50
2   Apple      1                        0.5   0.50
3   Apple      0                        0.5   0.50
4   Apple      0                        0.5   0.50
5    Pear      1                       0.75   0.75
6    Pear      1                       0.75   0.75
7    Pear      1                       0.75   0.75
8    Pear      0                       0.75   0.75
9  Cherry      0                          0   0.00
10 Cherry      0                          0   0.00
11 Cherry      0                          0   0.00

Or, shorter:

transform(df, desired = ave(Rotten == 1, Fruit))

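The default function applied by ave is mean, which is why it is not spelled out here. If you want to do something different, you can specify another function by appending FUN = some-function-here.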
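
As a minimal sketch of what passing FUN looks like for the "any rotten" and "all rotten" variants mentioned in the question (df2 and the new column names are made up for this example, and Rotten is assumed to be numeric). FUN has to be passed by name, because any unnamed argument after the first is treated as a grouping variable.

```r
df2 <- data.frame(
  Fruit  = c(rep("Apple", 4), rep("Pear", 4), rep("Cherry", 3)),
  Rotten = c(1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)
)

df2$PercRotten <- ave(df2$Rotten, df2$Fruit)                  # default FUN = mean
df2$NRotten    <- ave(df2$Rotten, df2$Fruit, FUN = sum)       # count per group
df2$AnyRotten  <- ave(df2$Rotten == 1, df2$Fruit, FUN = any)  # TRUE if at least one is rotten
df2$AllRotten  <- ave(df2$Rotten == 1, df2$Fruit, FUN = all)  # TRUE only if every one is rotten
```

Each result has one value per row of df2, so no merge step is needed.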

【Comments】:

【Solution 4】:

Since ave has already been posted, let me add a solution using my base R function of choice: aggregate.

You can get the desired data with:

aggregate(as.numeric(as.character(Rotten)) ~ Fruit, df, mean)

However, you would still need to merge afterwards (or do it all in one go):

merge(df, aggregate(as.numeric(as.character(Rotten)) ~ Fruit, df, mean))
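
One thing to be aware of is that merge sorts the result by the key column by default, so the rows come back in a different order than in the original data. If the original order matters, a lookup with match is one possible alternative; this is a swapped-in technique rather than part of the answer above, and df4, agg and PercRotten are names made up for this sketch (Rotten is assumed to be numeric).

```r
df4 <- data.frame(
  Fruit  = c(rep("Apple", 4), rep("Pear", 4), rep("Cherry", 3)),
  Rotten = c(1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)
)

# One row per fruit with the group mean
agg <- aggregate(Rotten ~ Fruit, df4, mean)
names(agg)[2] <- "PercRotten"

# merge: correct values, but rows come back sorted by Fruit (Apple, Cherry, Pear)
merged <- merge(df4, agg)

# match: look up each row's group value, keeping the original row order
df4$PercRotten <- agg$PercRotten[match(df4$Fruit, agg$Fruit)]
```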

【Comments】: