【Question title】: Running multiple chi-squared tests for different categories
【Posted】: 2020-05-20 12:19:17
【Question】:

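I have binary data recording whether each individual passed or failed a test, together with characteristics (e.g. gender) and the department they belong to (e.g. x, y, z), all in a data frame (data):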

head(data,9)
department  gender   pass 
x           Male     1               
y           Female   1             
y           Male     0         
y           Male     1              
x           Female   1              
z           Female   0            
z           Male     1
x           Male     0
z           Female   0

I can easily run a chi-squared test of the relationship between gender and pass:

chisq.test(data$gender, data$pass)

But is there a way to run this separately for each value of "department" (x, y, z) without manually subsetting the data every time?

I can create a new data frame that breaks down the overall pass rate for each department using tapply:

as.data.frame(tapply(data$pass, data$department,mean))

But is there a way to add a new variable that reports the result of the test above (say, the p-value)?

【Discussion】:

    Tags: r for-loop chi-squared


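    【Solution 1】:

    Using broom and dplyr is an elegant way to do this. First, we group by the department variable and nest the data frame. Then we run chisq.test on each "subset". Finally, to pull out the relevant statistics (for example p.value), we use broom::tidy. Because these results are all nested alongside each subset, we unnest whichever components we ultimately want to see.

    For details, see this vignette.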

    library(tidyverse)
    library(broom)
    
    df <- data.frame(
      stringsAsFactors = FALSE,
            department = c("x", "y", "y", "y", "x", "z", "z", "x", "z"),
                gender = c("Male","Female","Male",
                           "Male","Female","Female","Male","Male","Female"),
                  pass = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L)
    )
    
    
    df %>%
      group_by(department) %>%
      nest() %>% 
      mutate(
        chi_test = map(data, ~ chisq.test(.$gender, .$pass)),
        tidied = map(chi_test, tidy)
      ) %>% 
      unnest(tidied)
    
    #> # A tibble: 3 x 7
    #> # Groups:   department [3]
    #>   department data      chi_test statistic p.value parameter method              
    #>   <chr>      <list>    <list>       <dbl>   <dbl>     <int> <chr>               
    #> 1 x          <tibble ~ <htest>   4.62e-32   1.00          1 Pearson's Chi-squar~
    #> 2 y          <tibble ~ <htest>   4.62e-32   1.00          1 Pearson's Chi-squar~
    #> 3 z          <tibble ~ <htest>   1.88e- 1   0.665         1 Pearson's Chi-squar~
    

    Created on 2020-05-20 by the reprex package (v0.3.0)

    If you would rather use base R, you can combine split and lapply like this:

    lapply(split(df, df$department), function(x) { chisq.test(x$gender, x$pass)$p.value })
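
The split/lapply idea can also be extended to return the pass rate and the p-value side by side, which is what the question asks for. The following is only a sketch under a few assumptions: the column names pass_rate and p_value and the object names per_dept and result are made up for illustration, and the warning about small cell counts is silenced purely because the 9-row sample is tiny.

```r
# Sample data from the question
df <- data.frame(
  department = c("x", "y", "y", "y", "x", "z", "z", "x", "z"),
  gender = c("Male", "Female", "Male", "Male", "Female",
             "Female", "Male", "Male", "Female"),
  pass = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# One small data frame per department: pass rate plus chi-squared p-value
per_dept <- lapply(split(df, df$department), function(d) {
  # With so few rows chisq.test warns that the approximation may be
  # incorrect; the warning is suppressed here only to keep the output short
  test <- suppressWarnings(chisq.test(d$gender, d$pass))
  data.frame(
    department = d$department[1],
    pass_rate = mean(d$pass),      # hypothetical column name
    p_value = test$p.value,        # hypothetical column name
    stringsAsFactors = FALSE
  )
})

# Stack the per-department rows into a single data frame
result <- do.call(rbind, per_dept)
print(result)
```

Note that chisq.test stops with an error if, within a department, gender or pass takes only one distinct value, so real data may need a guard for that case.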
    

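    【Discussion】:

      【Solution 2】:

      This is not a completely different answer to your question, but rather an answer in case you are trying to ask a different question. @JasonAizkalns gave you an elegant answer for each department separately, but if you are interested in comparing the departments against each other, you need to adjust for multiple comparisons. So it might look something like this.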

      library(dplyr)
      library(rcompanion)
      
      df <- data.frame(
        stringsAsFactors = FALSE,
        department = c("x", "y", "y", "y", "x", "z", "z", "x", "z"),
        gender = c("Male","Female","Male",
                   "Male","Female","Female","Male","Male","Female"),
        pass = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L)
      )
      
      df %>%
        group_by(department, gender) %>%
        summarise(Freq = n()) %>%
        xtabs(formula = Freq ~ ., data = .) %>% 
        pairwiseNominalIndependence(x = ., method = "holm", gtest = FALSE)
      
      #> Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
      
      #> Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
      
      #> Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
      #>   Comparison p.Fisher p.adj.Fisher p.Chisq p.adj.Chisq
      #> 1      x : y        1            1       1           1
      #> 2      x : z        1            1       1           1
      #> 3      y : z        1            1       1           1
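
The method = "holm" argument above asks for a Holm correction of the pairwise p-values. As a rough illustration of what that adjustment does, base R's p.adjust applies the same kind of correction to any vector of p-values. The three raw p-values below are made-up numbers chosen for illustration, not results from this data, and the names raw_p and adjusted are hypothetical.

```r
# Hypothetical raw p-values for three pairwise comparisons (illustrative only)
raw_p <- c("x : y" = 0.01, "x : z" = 0.04, "y : z" = 0.03)

# Holm's step-down method: the smallest p-value is multiplied by 3,
# the next by 2, the largest by 1; the results are then forced to be
# non-decreasing in that order and capped at 1
adjusted <- p.adjust(raw_p, method = "holm")
print(adjusted)
```

In the 9-row sample every adjusted p-value is already 1, so the effect of the correction only becomes visible with larger data.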
      

      【Discussion】:

        【Solution 3】:

        Yes, there is! Use by:

        res <- do.call(rbind, by(dat, dat$department, function(x) {
          c(M=unname(tapply(x$pass, x$department, mean)),
            p=chisq.test(x$gender, x$pass)$p.value)
        }))
        res
        #           M            p
        # x 0.6788732 1.484695e-18
        # y 0.6516517 3.045009e-22
        # z 0.3205128 7.945768e-69
        

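        Data (the 9 sample rows, resampled with replacement to 1000 rows; define this before running the code above):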

        dat <- read.table(text="department  gender   pass 
        x           Male     1               
        y           Female   1             
        y           Male     0         
        y           Male     1              
        x           Female   1              
        z           Female   0            
        z           Male     1
        x           Male     0
        z           Female   0", header=T)
        set.seed(42)
        dat <- dat[sample(1:nrow(dat), 1000, replace=T), ]
        

        【Discussion】:
