Calculate the percent occurrence of a variable in multiple groups (answers)

【Question title】: Calculate the percent occurrence of a variable in multiple groups
【Posted】: 2018-03-25 13:02:33
【Question description】:

Sample data

set.seed(123)
df <- data.frame(loc.id = rep(1:1000, each = 35),
                 year = rep(1980:2014, times = 1000),
                 month.id = sample(c(1:4, 8:10, 12), 35 * 1000, replace = TRUE))

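The data frame holds 1000 locations x 35 years of data for a variable called month.id, which is simply a month of the year. For each year, I want to calculate the percent occurrence of each month. For example, for 1980: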

month.vec <- df[df$year == 1980,]
table(month.vec$month.id)
1   2   3   4   8   9  10  12 
106 132 116 122 114 130 141 139

To calculate the percent occurrence of each month:

table(month.vec$month.id)/length(month.vec$month.id) * 100
1    2    3    4    8    9   10   12 
10.6 13.2 11.6 12.2 11.4 13.0 14.1 13.9

I want a table like this:

    year month percent
    1980     1    10.6
    1980     2    13.2
    1980     3    11.6
    1980     4    12.2
    1980     5      NA
    1980     6      NA
    1980     7      NA
    1980     8    11.4
    1980     9    13.0
    1980    10    14.1
    1980    11      NA
    1980    12    13.9

Since months 5, 6, 7 and 11 are missing, I just want to add extra rows with NA for those months. If possible, I would like a dplyr solution along these lines:

   library(dplyr)
   df %>% group_by(year) %>% summarise(percentage.contri = table(month.id)/length(month.id)*100)

【Comments】:

Tags: r dplyr data.table tidyverse purrr

【Solution 1】:

A solution using dplyr and tidyr:

# To get month as integer use (or add as.integer to mutate):
# df$month.id <- as.integer(df$month.id)

library(dplyr)
library(tidyr)
df %>%
    group_by(year, month.id) %>%
    # Count occurrences per year & month
    summarise(n = n()) %>%
    # Get percent per month (still grouped by year here, so sum(n) is the yearly total)
    mutate(percent = n / sum(n) * 100) %>%
    # Ungroup first: recent tidyr versions refuse to complete() a grouping column
    ungroup() %>%
    # Fill in missing months (omit `fill` to get NA instead of 0)
    complete(year, month.id = 1:12, fill = list(percent = 0)) %>%
    select(year, month.id, percent)

    year month.id percent
   <int>    <dbl>   <dbl>
 1  1980     1.00    10.6
 2  1980     2.00    13.2
 3  1980     3.00    11.6
 4  1980     4.00    12.2
 5  1980     5.00     0  
 6  1980     6.00     0  
 7  1980     7.00     0  
 8  1980     8.00    11.4
 9  1980     9.00    13.0
10  1980    10.0     14.1
11  1980    11.0      0  
12  1980    12.0     13.9
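
The question asked for NA rather than 0 in the missing months. A minimal sketch of that variant, assuming the sample data from the question and a reasonably recent dplyr/tidyr; the name `result` and the renamed `month` column are only illustrative:

```r
library(dplyr)
library(tidyr)

# Sample data from the question
set.seed(123)
df <- data.frame(loc.id = rep(1:1000, each = 35),
                 year = rep(1980:2014, times = 1000),
                 month.id = sample(c(1:4, 8:10, 12), 35 * 1000, replace = TRUE))

result <- df %>%
    # Rows per year and month
    count(year, month.id) %>%
    group_by(year) %>%
    # Share of each month within its own year
    mutate(percent = n / sum(n) * 100) %>%
    ungroup() %>%
    # No `fill` argument, so months that never occur get NA
    complete(year, month.id = 1:12) %>%
    select(year, month = month.id, percent)
```

Because `complete()` fills with NA by default, the only change from the answer above is dropping the `fill` argument.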

【Comments】:

【Solution 2】:

A base R solution (using month.vec, the 1980 subset from the question):

tab <- table(month.vec$year, factor(month.vec$month.id, levels = 1:12))/length(month.vec$month.id) * 100
dfnew <- as.data.frame(tab)

which gives:

> dfnew
   Var1 Var2 Freq
1  1980    1 10.6
2  1980    2 13.2
3  1980    3 11.6
4  1980    4 12.2
5  1980    5  0.0
6  1980    6  0.0
7  1980    7  0.0
8  1980    8 11.4
9  1980    9 13.0
10 1980   10 14.1
11 1980   11  0.0
12 1980   12 13.9
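
The table above covers 1980 only, since month.vec is a single-year subset. One possible way to extend the same idea to every year is `prop.table()` with `margin = 1`, which divides each row (year) by its own total. This is a sketch assuming the sample data from the question; `counts` and `pct` are illustrative names:

```r
# Sample data from the question
set.seed(123)
df <- data.frame(loc.id = rep(1:1000, each = 35),
                 year = rep(1980:2014, times = 1000),
                 month.id = sample(c(1:4, 8:10, 12), 35 * 1000, replace = TRUE))

# Year-by-month counts; the factor levels force all 12 months to appear
counts <- table(year = df$year, month = factor(df$month.id, levels = 1:12))

# margin = 1 normalises each year separately
pct <- as.data.frame(prop.table(counts, margin = 1) * 100,
                     responseName = "percent", stringsAsFactors = FALSE)
pct$year  <- as.integer(pct$year)
pct$month <- as.integer(pct$month)

# Report months with no occurrences as NA, as the question asks
pct$percent[pct$percent == 0] <- NA
pct <- pct[order(pct$year, pct$month), ]
```

Note that the last step turns every zero into NA, so a month that happens to be absent in just one year would also show NA for that year.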

Or with data.table:

library(data.table)

setDT(month.vec)[, .N, by = .(year, month.id)
                 ][.(year = 1980, month.id = 1:12), on = .(year, month.id)
                   ][, N := 100 * N/sum(N, na.rm = TRUE)][]
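
The data.table version can likewise be stretched over all years by joining against every year/month combination built with `CJ()` and computing the percentage per year. A sketch under the assumption that the sample data from the question is used; `dt`, `grid` and `res` are illustrative names:

```r
library(data.table)

# Sample data from the question
set.seed(123)
df <- data.frame(loc.id = rep(1:1000, each = 35),
                 year = rep(1980:2014, times = 1000),
                 month.id = sample(c(1:4, 8:10, 12), 35 * 1000, replace = TRUE))

dt <- as.data.table(df)   # copy, so df itself is left untouched

# All year/month combinations; month.id is numeric to match the column type in df
grid <- CJ(year = unique(dt$year), month.id = as.numeric(1:12))

res <- dt[, .N, by = .(year, month.id)     # counts per year and month
          ][grid, on = .(year, month.id)   # months without rows get N = NA
            ][, percent := 100 * N / sum(N, na.rm = TRUE), by = year][]
```

Months that never occur keep NA in both `N` and `percent`, which matches the layout requested in the question.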

【Comments】: