Wrong count when summarizing dataframe with ddply

【Question title】: Wrong count when summarizing dataframe with ddply
【Posted】: 2014-01-03 19:45:52
【Question description】:

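In my dataset I have a continuous variable mag, from which I derived a categorical variable mag.cat with four categories: 1 for values of mag between 0 and 1, 2 for values between 1 and 2, 3 for values between 2 and 3, and 4 for values greater than 3. A subset of the data looks like this: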

   location mag depth mag.cat
1     Assen 1.8   1.0       2
2 Hooghalen 2.5   1.5       3
3 Purmerend 0.7   1.2       1
4     Emmen 2.2   3.0       3
5 Geelbroek 3.6   3.0       4
6   Eleveld 2.7   3.0       3
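
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the six rows above and derives mag.cat. The use of cut() and the exact break points are assumptions (the post does not say how mag.cat was created); they are chosen so that the result matches the mag.cat column shown in the table.

```r
# Rebuild the six sample rows shown above
df <- data.frame(
  location = c("Assen", "Hooghalen", "Purmerend", "Emmen", "Geelbroek", "Eleveld"),
  mag      = c(1.8, 2.5, 0.7, 2.2, 3.6, 2.7),
  depth    = c(1.0, 1.5, 1.2, 3.0, 3.0, 3.0)
)

# Assumed binning: (0,1] -> 1, (1,2] -> 2, (2,3] -> 3, (3,Inf] -> 4
# labels = FALSE makes cut() return integer codes instead of a factor
df$mag.cat <- cut(df$mag, breaks = c(0, 1, 2, 3, Inf), labels = FALSE)
df
```

With these breaks the intervals are closed on the right, so a mag of exactly 0 would become NA; add include.lowest = TRUE if that case matters.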

I want to summarize this data frame into a new data frame with only one row per location.

This is what I did:

df.new <- ddply(df, .(location), summarise, n.tot = as.numeric(length(location)), 
                gem.mag = round(mean(mag),1), 
                n.1 = as.numeric(length(location[mag == 1])),
                n.2 = as.numeric(length(location[mag == 2])),
                n.3 = as.numeric(length(location[mag == 3])),
                n.4 = as.numeric(length(location[mag == 4]))
                )

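The variables n.1, n.2, n.3 and n.4 should contain the count of each category for each location. The sum of these variables should logically equal n.tot, but it does not. This can be seen in the head of the new data frame: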

      location  n.tot gem.mag n.1 n.2 n.3 n.4
1   Achterdiep      5     1.1   2   0   0   0
2      Alkmaar      4     3.2   0   0   1   0
3       Altena      1     1.3   0   0   0   0
4 Amelanderwad      2     1.8   0   0   0   0
5         Amen      6     1.1   0   0   0   0
6     Amerbrug      1     0.9   0   0   0   0

What I expected is something like this:

      location  n.tot gem.mag n.1 n.2 n.3 n.4
1   Achterdiep      5     1.1   2   2   0   1
2      Alkmaar      4     3.2   0   3   1   0
3       Altena      1     1.3   0   1   0   0
4 Amelanderwad      2     1.8   0   1   1   0
5         Amen      6     1.1   3   2   0   1
6     Amerbrug      1     0.9   1   0   0   0

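What am I doing wrong?

【Question discussion】:

It should probably be: n.1 = as.numeric(length(location[mag.cat == 1])) and so on?
Maybe I'm misunderstanding you, but why are you subsetting on mag rather than mag.cat when calculating n.x? mag still contains the decimal values, doesn't it?
...also, I'm not sure why you feel as.numeric is necessary here.
@Rguy @Mark Arghh, I see my mistake now. Using mag.cat gives the desired result :-)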
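
To make the point of the comments above concrete, here is a small sketch using the six sample values from the question (the vectors are retyped here for illustration). Comparing the continuous mag with a whole number only matches magnitudes that are exactly that number, whereas comparing mag.cat counts the rows in that category.

```r
# Sample values copied from the question's data subset
mag     <- c(1.8, 2.5, 0.7, 2.2, 3.6, 2.7)
mag.cat <- c(2, 3, 1, 3, 4, 3)

# Comparing the continuous variable: no magnitude is exactly 3
n.wrong <- sum(mag == 3)

# Comparing the categorical variable: counts the rows in category 3
n.right <- sum(mag.cat == 3)

c(n.wrong = n.wrong, n.right = n.right)
```

This also explains why the few non-zero counts in the question's output appear only where a magnitude happens to be a whole number.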

Tags: r plyr

【Solution 1】:

Why not:

res <- xtabs( ~ location + mag.cat, data=df)
res

If you want the totals as a column, then cbind(tot.n= rowSums(res), res).

The mag means: with(df, tapply(mag, location, mean))

Everything together:

 cbind( gem.mag= with(df, tapply(mag, location, mean)),
        tot.n= rowSums(res), 
        res)
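
As a self-contained sketch of the base R approach above, the following rebuilds the six sample rows from the question and runs the same calls. The name out is only introduced here for illustration.

```r
# Six sample rows from the question (depth omitted, it is not needed here)
df <- data.frame(
  location = c("Assen", "Hooghalen", "Purmerend", "Emmen", "Geelbroek", "Eleveld"),
  mag      = c(1.8, 2.5, 0.7, 2.2, 3.6, 2.7),
  mag.cat  = c(2, 3, 1, 3, 4, 3)
)

# Cross-tabulate: one row per location, one column per category
res <- xtabs(~ location + mag.cat, data = df)

# Combine mean magnitude, total count and per-category counts
out <- cbind(gem.mag = with(df, tapply(mag, location, mean)),
             tot.n   = rowSums(res),
             res)
out
```

Both xtabs and tapply order the locations by sorted factor level, so the rows line up when they are bound together. The category columns are named after the values of mag.cat ("1" to "4") rather than n.1 to n.4.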

I suppose the answer would be incomplete without a plyr version:

 require(plyr)
 df.new <- ddply(df, .(location), summarise, n.tot = as.numeric(length(location)), 
                gem.mag = round(mean(mag),1), 
                n.1 = sum(mag.cat==1),
                n.2 = sum(mag.cat==2),
                n.3 = sum(mag.cat==3),
                n.4 = sum(mag.cat==4)
                )
> df.new
   location n.tot gem.mag n.1 n.2 n.3 n.4
1     Assen     1     1.8   0   1   0   0
2   Eleveld     1     2.7   0   0   1   0
3     Emmen     1     2.2   0   0   1   0
4 Geelbroek     1     3.6   0   0   0   1
5 Hooghalen     1     2.5   0   0   1   0
6 Purmerend     1     0.7   1   0   0   0
