Multiple aggregation and variable calculation: answers

【Question title】: Multiple aggregation and variable calculation
【Posted】: 2021-01-17 13:01:53
【Question description】:

I have data like this:

mydata=structure(list(id = c(15010124001, 15010153006, 15010169005, 
15010228019, 15010229028, 6010001012, 6010012023, 6010014015, 
6010015008, 6020001014, 6020002037), sqr = c("14", "9", "2", 
"21", "13", "26", "17,2", "21,7", "4,7", "32,2", "36,1"), por = c("alpin", 
"alpin", "alpin", "alpin", "alpin", "Yornik birch", "Yornik birch", 
"Yornik birch", "Yornik birch", "Yornik birch", "Yornik birch"
), zap = c("2100", "1100", "1700", "1000", "1300", "200", "197,6744186", 
"170,5069124", "212,7659574", "301,242236", "398,8919668"), zappor = c("1260", 
"330", "850", "1000", "910", "200", "197,6744186", "170,5069124", 
"212,7659574", "301,242236", "398,8919668"), zapvyd = c(2940L, 
990L, 340L, 2100L, 1690L, 520L, 340L, 370L, 100L, 970L, 1440L
), coef = c(6L, 3L, 5L, 10L, 7L, 10L, 10L, 10L, 10L, 10L, 10L
), age = c(130L, 100L, 130L, 150L, 120L, 15L, 15L, 10L, 15L, 
20L, 20L), vys = c(21L, 17L, 19L, 17L, 18L, 2L, 2L, 1L, 2L, 2L, 
2L), diam = c(26L, 18L, 24L, 28L, 22L, 2L, 2L, 2L, 2L, 2L, 2L
), polnot = c("0,6", "0,4", "0,6", "0,4", "0,5", "0,7", "0,8", 
"0,7", "0,7", "0,5", "0,6"), BON = c(4L, 4L, 4L, 5L, 4L, 4L, 
4L, 4L, 5L, 4L, 4L), clust = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 
2L, 2L, 2L)), class = "data.frame", row.names = c(NA, -11L))

I need to aggregate sqr by sum for each cluster (clust) within each por (a categorical variable). Of course, I can do this:

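mydata$sqr <- as.numeric(sub(",", ".", mydata$sqr, fixed = TRUE))  # sqr is text with decimal commas
ag <- aggregate(sqr ~ clust + por, data = mydata, FUN = sum)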

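But it is not that simple, because I then need to compute a percentage from sqr for each cluster within each por. For example, when I do the sums by hand:

por            clust   sum
alpin          1       25 (14+9+2)
alpin          2       34 (21+13)
Yornik birch   1       43,2
Yornik birch   2       94,7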

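However, I need a more complex aggregation and I don't understand how to do it. For each cluster within a given category of por, I need the number of observations as a percentage of that cluster's summed sqr. For example, for the first cluster of por = alpin, the sum of sqr is 25 and the cluster contains 3 observations: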

3/25 = 0.12 (12%)

As an output table:

por     clust   sum           percent
alpin   1       25 (14+9+2)   12
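
The 12% figure can be reproduced from the posted data. The following is only a minimal base R sketch: the data frame `d` copies just the three relevant columns from the question, and the names `res`, `sum`, `n` and `percent` are illustrative choices rather than anything the question prescribes.

```r
# Only the columns needed here, copied from the question's data
d <- data.frame(
  por   = rep(c("alpin", "Yornik birch"), times = c(5, 6)),
  clust = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L),
  sqr   = c("14", "9", "2", "21", "13", "26", "17,2", "21,7", "4,7", "32,2", "36,1"),
  stringsAsFactors = FALSE
)

# sqr uses decimal commas, so turn it into a number first
d$sqr <- as.numeric(sub(",", ".", d$sqr, fixed = TRUE))

# Sum of sqr per (por, clust) group
res <- aggregate(sqr ~ por + clust, data = d, FUN = sum)
names(res)[names(res) == "sqr"] <- "sum"

# Number of observations per group (same grouping, so same row order)
res$n <- aggregate(sqr ~ por + clust, data = d, FUN = length)$sqr

# Observations divided by the group sum, expressed in percent
res$percent <- 100 * res$n / res$sum
print(res)
```

For the first alpin cluster this gives 3 observations over a sum of 25, which is the 12% computed by hand.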

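After that I need to compute a new variable. First, compute the total of sqr over all por categories and all clusters.

Then divide the number of observations in each cluster of each por by this total. For example, for the first cluster of the alpin category: 3 (obs. in the first cluster) / 169.9 = 0.017657446 (1.7%). The final table would look like this (shown for the first cluster of por = alpin); this is the desired output: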

por clust sum percent percent1
alpin 1     25  12      1.7

How can I perform such a transformation?
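
The second ratio can be checked against the posted data in a few lines. This is a minimal sketch; `sqr_raw`, `total`, `n_first` and `percent1` are illustrative names. Note that the sqr values as posted add up to 196.9 rather than 169.9, so the ratio for the first alpin cluster comes out near 1.5% rather than 1.7%.

```r
# sqr values exactly as posted in the question (decimal commas, stored as text)
sqr_raw <- c("14", "9", "2", "21", "13", "26", "17,2", "21,7", "4,7", "32,2", "36,1")

# Convert to numeric and take the overall total
sqr   <- as.numeric(sub(",", ".", sqr_raw, fixed = TRUE))
total <- sum(sqr)

# The first alpin cluster has 3 observations
n_first  <- 3
percent1 <- 100 * n_first / total

print(total)               # overall sum of sqr
print(round(percent1, 2))  # share of the overall total, in percent
```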

【Question comments】:

Tags: r dplyr tidyr

【Solution 1】:

I think this is easier if you break the problem into steps and code each step with dplyr.

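1. Convert the values to numeric
2. Group the data, since the calculations are done per group
3. Compute the per-group values
4. Ungroup, so that the overall total can be computed
5. Compute the second percentage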


library(dplyr)

mydata %>%
  mutate(sqr = as.numeric(gsub(",", ".", sqr))) %>% # --> sqr is a string with decimal commas, so convert it to numeric
  group_by(por, clust) %>% # --> group by the variables of interest
  mutate(
    pct = n() / sum(sqr), # --> first percentage: observations in the group / group sum of sqr
    pct2 = n() # --> group size only for now; completed after ungrouping
  ) %>%
  ungroup() %>% # --> drop the grouping so that sum(sqr) covers all rows
  mutate(pct2 = pct2 / sum(sqr)) # --> second percentage: group size / overall sum of sqr
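
The pipeline above uses mutate, so it returns all 11 rows with the group values repeated on each row, and the ratios are fractions rather than percentages. If one row per por and clust is wanted, as in the question's expected output, a variant with summarise might look like the sketch below. It assumes dplyr 1.0.0 or later (for the `.groups` argument); the data frame `d` copies only the relevant columns, and the column names `sqr_sum`, `n`, `percent` and `percent1` are illustrative.

```r
library(dplyr)

# Only the columns needed here, copied from the question's data
d <- data.frame(
  por   = rep(c("alpin", "Yornik birch"), times = c(5, 6)),
  clust = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L),
  sqr   = c("14", "9", "2", "21", "13", "26", "17,2", "21,7", "4,7", "32,2", "36,1"),
  stringsAsFactors = FALSE
)

out <- d %>%
  # decimal commas -> numeric
  mutate(sqr = as.numeric(sub(",", ".", sqr, fixed = TRUE))) %>%
  group_by(por, clust) %>%
  # one row per group: sum of sqr and number of observations
  summarise(sqr_sum = sum(sqr), n = n(), .groups = "drop") %>%
  mutate(
    percent  = 100 * n / sqr_sum,       # observations / group sum, in percent
    percent1 = 100 * n / sum(sqr_sum)   # observations / overall sum, in percent
  )

print(out)
```

Because the result is ungrouped after summarise, `sum(sqr_sum)` in the last step is the total over all four groups, which is the same overall sum of sqr used in the pipeline above.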

【Comments】: