【问题标题】:Percentage of categorical variable grouping by different variables in RR中不同变量分类变量分组的百分比
【发布时间】:2022-01-14 11:28:02
【问题描述】:

我有一个这样的数据集:

id = rep(paste("id",1:7,sep=""),c(5,2,4,2,3,1,4))
county=rep(c("A","B","C","B","A","D","C"),c(5,2,4,2,3,1,4))
year =c(2011:2015,2012:2013,2011:2014,2013:2014,2011:2013,2014,2012:2015)
age= sample(30:50,size=length(id),replace=T)
race=sample(c("white","Black","Other"),size=length(id),replace=T)
df=data.frame(id=id,county=county,year=year,age=age,race=race)

    id county year age  race
1  id1      A 2011  32 white
2  id1      A 2012  48 Black
3  id1      A 2013  50 Other
4  id1      A 2014  37 white
5  id1      A 2015  32 white
6  id2      B 2012  48 Black
7  id2      B 2013  48 Other
8  id3      C 2011  40 Other
9  id3      C 2012  33 Other
10 id3      C 2013  42 white
11 id3      C 2014  33 Other
12 id4      B 2013  43 Other
13 id4      B 2014  33 Black
14 id5      A 2011  50 Black
15 id5      A 2012  43 Other
16 id5      A 2013  41 white
17 id6      D 2014  37 Black
18 id7      C 2012  32 white
19 id7      C 2013  31 Other
20 id7      C 2014  34 Other
21 id7      C 2015  35 Other

我想按县、年分组并获得平均年龄和种族类别的百分比。 对于年龄,它可以通过

library(dplyr)
df %>% group_by(county,year) %>% summarise(avgage=mean(age))

但是如何按组获得 3 类种族的百分比?

TIA!

【问题讨论】:

    标签: r dplyr


    【解决方案1】:

    首先我建议使用pivot_wider为每场比赛创建一个变量。 然后我们可以使用summarisemean 作为年龄平均值。对于种族的比例,使用scales::percent(mean(variable))across all race。

    library(tidyr)
    library(dplyr)
    
    df %>% pivot_wider(names_from = race,
                       values_from = race,
                       values_fn = \(x)1,
                       values_fill = 0)%>%
        group_by(county, year)%>%
        summarise(mean_age = mean(age),
                  across(Black:Other, ~scales::percent(mean(.x), 0.01)))
    
    # A tibble: 14 × 6
    # Groups:   county [4]
       county  year mean_age Black   white   Other  
       <chr>  <dbl>    <dbl> <chr>   <chr>   <chr>  
     1 A       2011     41   50.00%  0.00%   50.00% 
     2 A       2012     43   0.00%   100.00% 0.00%  
     3 A       2013     42   0.00%   0.00%   100.00%
     4 A       2014     36   0.00%   0.00%   100.00%
     5 A       2015     49   0.00%   100.00% 0.00%  
     6 B       2012     36   0.00%   0.00%   100.00%
     7 B       2013     42   50.00%  0.00%   50.00% 
     8 B       2014     45   0.00%   0.00%   100.00%
     9 C       2011     41   100.00% 0.00%   0.00%  
    10 C       2012     39   0.00%   50.00%  50.00% 
    11 C       2013     42   0.00%   50.00%  50.00% 
    12 C       2014     34.5 0.00%   50.00%  50.00% 
    13 C       2015     33   0.00%   0.00%   100.00%
    14 D       2014     36   0.00%   0.00%   100.00%
    

    我们可以通过将pivot_wider 替换为另一个包中的虚拟函数来缩短第一步。这里我用fastDummies

    library(fastDummies)
    library(dplyr)
    
    df %>%
        dummy_cols('race', remove_selected_columns = TRUE)%>%
        group_by(county, year)%>%
        summarise(mean_age = mean(age),
                  across(starts_with('race_'),
                         ~scales::percent(mean(.x), 0.01)))
    

    【讨论】:

      猜你喜欢
      • 1970-01-01
      • 2022-12-09
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2021-07-17
      • 1970-01-01
      相关资源
      最近更新 更多