R中不同变量分类变量分组的百分比答案

【问题标题】：Percentage of categorical variable grouping by different variables in RR中不同变量分类变量分组的百分比
【发布时间】：2022-01-14 11:28:02
【问题描述】：

我有一个这样的数据集：

id = rep(paste("id",1:7,sep=""),c(5,2,4,2,3,1,4))
county=rep(c("A","B","C","B","A","D","C"),c(5,2,4,2,3,1,4))
year =c(2011:2015,2012:2013,2011:2014,2013:2014,2011:2013,2014,2012:2015)
age= sample(30:50,size=length(id),replace=T)
race=sample(c("white","Black","Other"),size=length(id),replace=T)
df=data.frame(id=id,county=county,year=year,age=age,race=race)

    id county year age  race
1  id1      A 2011  32 white
2  id1      A 2012  48 Black
3  id1      A 2013  50 Other
4  id1      A 2014  37 white
5  id1      A 2015  32 white
6  id2      B 2012  48 Black
7  id2      B 2013  48 Other
8  id3      C 2011  40 Other
9  id3      C 2012  33 Other
10 id3      C 2013  42 white
11 id3      C 2014  33 Other
12 id4      B 2013  43 Other
13 id4      B 2014  33 Black
14 id5      A 2011  50 Black
15 id5      A 2012  43 Other
16 id5      A 2013  41 white
17 id6      D 2014  37 Black
18 id7      C 2012  32 white
19 id7      C 2013  31 Other
20 id7      C 2014  34 Other
21 id7      C 2015  35 Other

我想按县、年分组并获得平均年龄和种族类别的百分比。对于年龄，它可以通过

library(dplyr)
df %>% group_by(county,year) %>% summarise(avgage=mean(age))

但是如何按组获得 3 类种族的百分比？

TIA！

【问题讨论】：

标签： r dplyr

【解决方案1】：

首先我建议使用pivot_wider为每场比赛创建一个变量。然后我们可以使用summarise 和mean 作为年龄平均值。对于种族的比例，使用scales::percent(mean(variable))across all race。

library(tidyr)
library(dplyr)

df %>% pivot_wider(names_from = race,
                   values_from = race,
                   values_fn = \(x)1,
                   values_fill = 0)%>%
    group_by(county, year)%>%
    summarise(mean_age = mean(age),
              across(Black:Other, ~scales::percent(mean(.x), 0.01)))

# A tibble: 14 × 6
# Groups:   county [4]
   county  year mean_age Black   white   Other  
   <chr>  <dbl>    <dbl> <chr>   <chr>   <chr>  
 1 A       2011     41   50.00%  0.00%   50.00% 
 2 A       2012     43   0.00%   100.00% 0.00%  
 3 A       2013     42   0.00%   0.00%   100.00%
 4 A       2014     36   0.00%   0.00%   100.00%
 5 A       2015     49   0.00%   100.00% 0.00%  
 6 B       2012     36   0.00%   0.00%   100.00%
 7 B       2013     42   50.00%  0.00%   50.00% 
 8 B       2014     45   0.00%   0.00%   100.00%
 9 C       2011     41   100.00% 0.00%   0.00%  
10 C       2012     39   0.00%   50.00%  50.00% 
11 C       2013     42   0.00%   50.00%  50.00% 
12 C       2014     34.5 0.00%   50.00%  50.00% 
13 C       2015     33   0.00%   0.00%   100.00%
14 D       2014     36   0.00%   0.00%   100.00%

我们可以通过将pivot_wider 替换为另一个包中的虚拟函数来缩短第一步。这里我用fastDummies：

library(fastDummies)
library(dplyr)

df %>%
    dummy_cols('race', remove_selected_columns = TRUE)%>%
    group_by(county, year)%>%
    summarise(mean_age = mean(age),
              across(starts_with('race_'),
                     ~scales::percent(mean(.x), 0.01)))

【讨论】：