[Question title]: Multiple aggregation and variable calculation
[Posted]: 2021-01-17 13:01:53
[Question description]:

I have data like this

mydata=structure(list(id = c(15010124001, 15010153006, 15010169005, 
15010228019, 15010229028, 6010001012, 6010012023, 6010014015, 
6010015008, 6020001014, 6020002037), sqr = c("14", "9", "2", 
"21", "13", "26", "17,2", "21,7", "4,7", "32,2", "36,1"), por = c("alpin", 
"alpin", "alpin", "alpin", "alpin", "Yornik birch", "Yornik birch", 
"Yornik birch", "Yornik birch", "Yornik birch", "Yornik birch"
), zap = c("2100", "1100", "1700", "1000", "1300", "200", "197,6744186", 
"170,5069124", "212,7659574", "301,242236", "398,8919668"), zappor = c("1260", 
"330", "850", "1000", "910", "200", "197,6744186", "170,5069124", 
"212,7659574", "301,242236", "398,8919668"), zapvyd = c(2940L, 
990L, 340L, 2100L, 1690L, 520L, 340L, 370L, 100L, 970L, 1440L
), coef = c(6L, 3L, 5L, 10L, 7L, 10L, 10L, 10L, 10L, 10L, 10L
), age = c(130L, 100L, 130L, 150L, 120L, 15L, 15L, 10L, 15L, 
20L, 20L), vys = c(21L, 17L, 19L, 17L, 18L, 2L, 2L, 1L, 2L, 2L, 
2L), diam = c(26L, 18L, 24L, 28L, 22L, 2L, 2L, 2L, 2L, 2L, 2L
), polnot = c("0,6", "0,4", "0,6", "0,4", "0,5", "0,7", "0,8", 
"0,7", "0,7", "0,5", "0,6"), BON = c(4L, 4L, 4L, 5L, 4L, 4L, 
4L, 4L, 5L, 4L, 4L), clust = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 
2L, 2L, 2L)), class = "data.frame", row.names = c(NA, -11L))

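I need to aggregate sqr by sum for each cluster within each por (a categorical variable). Of course, I can do it like this

# sqr is stored as text with decimal commas, so it must be converted to numeric first
ag <- aggregate(sqr ~ clust + por, data = mydata, sum)

But it is not that simple, because I then need to compute a percentage of sqr for each cluster within each por. For example, when I do it by hand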

por           clust  sum
alpin         1      25 (14+9+2)
alpin         2      34 (21+13)
Yornik birch  1      43,2
Yornik birch  2      94,7
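
The hand-computed sums above can be reproduced with a short base R sketch. It assumes a trimmed copy of the posted data (the name `d` and the reduced column set are illustrative, not from the original post) and converts the decimal commas before summing. With these values the group sums come out as 25, 34, 43.2 and 94.7.

```r
# Trimmed, illustrative copy of the posted data: only the columns used here.
d <- data.frame(
  por   = rep(c("alpin", "Yornik birch"), times = c(5, 6)),
  clust = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L),
  sqr   = c("14", "9", "2", "21", "13", "26",
            "17,2", "21,7", "4,7", "32,2", "36,1"),
  stringsAsFactors = FALSE
)

# sqr is text with decimal commas, so replace "," by "." and convert.
d$sqr <- as.numeric(sub(",", ".", d$sqr, fixed = TRUE))

# Sum sqr for every clust/por combination.
ag <- aggregate(sqr ~ clust + por, data = d, FUN = sum)
ag
```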

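But I need a more complex aggregation, and I do not understand how to do it. For each cluster within a given category of the por variable, I need the number of observations as a percentage of the total sqr. Take the first cluster of por = alpin as an example: sqr = 25, and the number of observations in cluster 1 is 3 (obs.)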

3/25 = 0.12 (12%)

As an output table

por    clust  sum           percent
alpin  1      25 (14+9+2)   12
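
One possible way to get this ratio in base R is to aggregate twice (once for the sum, once for the count) and merge the results. This is only a sketch on a trimmed, illustrative copy of the data; the names `d`, `sums`, `counts`, `res` and the column name `n` are assumptions made for the example.

```r
# Trimmed, illustrative copy of the posted data: only the columns used here.
d <- data.frame(
  por   = rep(c("alpin", "Yornik birch"), times = c(5, 6)),
  clust = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L),
  sqr   = c("14", "9", "2", "21", "13", "26",
            "17,2", "21,7", "4,7", "32,2", "36,1"),
  stringsAsFactors = FALSE
)
d$sqr <- as.numeric(sub(",", ".", d$sqr, fixed = TRUE))

# Group sum and group size, computed separately.
sums   <- aggregate(sqr ~ por + clust, data = d, FUN = sum)
counts <- aggregate(sqr ~ por + clust, data = d, FUN = length)
names(sums)[names(sums) == "sqr"]     <- "sum"
names(counts)[names(counts) == "sqr"] <- "n"

# Join the two tables and express observations / sum as a percentage.
res <- merge(sums, counts, by = c("por", "clust"))
res$percent <- 100 * res$n / res$sum
res
```

For the first alpin cluster this gives 100 * 3 / 25 = 12, matching the hand calculation.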

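After that I need to compute a new variable. First, sum sqr over all por categories and all clusters:

    14
    9
    2
    21
    13
    26
    17,2
    21,7
    4,7
    32,2
    36,1
sum 196,9

Then divide the number of observations in each cluster of each por by this total. For example, for the first cluster of the alpin category: 3 (obs. in the first cluster) / 196.9 = 0.015236 (1.5%). The final table should look like this (shown for the first cluster of por = alpin); this is the desired output:

por    clust  sum  percent  percent1
alpin  1      25   12       1.5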
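
The grand total and the resulting share are easy to check directly. The snippet below is a small sketch using only the sqr values from the post; the variable names are illustrative.

```r
# sqr values exactly as posted (text with decimal commas).
sqr_raw <- c("14", "9", "2", "21", "13", "26",
             "17,2", "21,7", "4,7", "32,2", "36,1")
sqr <- as.numeric(sub(",", ".", sqr_raw, fixed = TRUE))

# Grand total over all por categories and clusters.
total <- sum(sqr)

# Share for the first alpin cluster, which holds 3 observations.
n_alpin_1 <- 3
share <- 100 * n_alpin_1 / total

total
share
```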

How can I perform such a transformation?

[Question discussion]:

    Tags: r dplyr tidyr


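    [Solution 1]:

    I think it is easier if you break the problem into steps and code each step with dplyr.

    1. Convert sqr to numeric values
    2. Group by por and clust
    3. Compute the first percentage within each group
    4. Ungroup, because the overall sum is taken over all rows
    5. Compute the second percentage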
    
    library(dplyr)

    mydata %>%
      mutate(sqr = as.numeric(gsub(",", ".", sqr))) %>% # sqr is a string with decimal commas: convert to numeric
      group_by(por, clust) %>% # one group per por/clust combination
      mutate(
        pct = n() / sum(sqr), # first ratio: observations in the group / group sum of sqr
        pct2 = n() # group size for now; completed after ungrouping
      ) %>%
      ungroup() %>% # drop the grouping so that sum(sqr) covers all rows
      mutate(pct2 = pct2 / sum(sqr)) # second ratio: group size / overall sum of sqr
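
The pipeline above keeps one row per observation and returns fractions. If a single row per por/clust group is wanted, as in the desired output table, a variant with `summarise()` is one option. This is a sketch on a trimmed, illustrative copy of the data; the names `d`, `result`, `n` and `sum_sqr` are assumptions, and `sum_sqr` is used instead of `sum` only to avoid shadowing the function name.

```r
library(dplyr)

# Trimmed, illustrative copy of the posted data: only the columns used here.
d <- data.frame(
  por   = rep(c("alpin", "Yornik birch"), times = c(5, 6)),
  clust = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L),
  sqr   = c("14", "9", "2", "21", "13", "26",
            "17,2", "21,7", "4,7", "32,2", "36,1"),
  stringsAsFactors = FALSE
)

result <- d %>%
  mutate(sqr = as.numeric(sub(",", ".", sqr, fixed = TRUE))) %>%
  group_by(por, clust) %>%
  # Collapse each group to one row: its size and its sum of sqr.
  summarise(n = n(), sum_sqr = sum(sqr), .groups = "drop") %>%
  mutate(
    percent  = 100 * n / sum_sqr,      # observations / group sum, in percent
    percent1 = 100 * n / sum(sum_sqr)  # observations / overall sum, in percent
  )

result
```

Because `.groups = "drop"` removes the grouping, `sum(sum_sqr)` in the last step runs over all four group rows and therefore equals the overall sum of sqr.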
    

    [Discussion]:
