【Question title】: Aggregating a categorical table in R (with percentages)
【Posted】: 2018-08-25 07:59:24
【Question】:

I have the following table in R:

Sample             Cluster  CellType  Condition  Genotype  Lane
Sample1            1        A         Mut        XXXX      1
Sample2            2        B         Mut        YYYY      1
Sample3            2        A         Mut        YYYY      2
Sample4            1        A         Mut        ZZZZ      1
Sample5            2        B         Mut        YYYY      3
Sample6            1        B         Mut        YYYY      1
Sample7            1        A         Mut        XXXX      2

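I would like to:

  • aggregate the table by the Cluster column,
  • have every other column yield the dominant (most frequent) value for each cluster,
  • along with a "confidence": the percentage of rows in that cluster that carry the dominant value.

Like this: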

Cluster      CellType  Condition  Genotype     Lane
1            A (75%)   Mut (100%) XXXX (50%)   1 (75%)
2            B (66%)   Mut (100%) YYYY (100%)  1 (33%)

I tried the aggregate function as shown below. It gets close, but it returns only the dominant value, not the percentage:

Mode <- function(x) {
 ux <- unique(x)
 ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr)
aggregate(. ~ Cluster, clustering_report, Mode)

【Comments on the question】:

    Tags: r dplyr aggregation


    【Solution 1】:

    Here is a base R solution:

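    # For each cluster, find the most frequent value of every column
    # and its share of the cluster's rows.
    m1 <- do.call(rbind,
            lapply(split(df, df$Cluster),
                   function(i) sapply(i[3:6],
                                      function(j) {
                                        t1 <- prop.table(table(j))
                                        nms <- names(t1)[which.max(t1)]
                                        paste0(nms, ' (', round(max(t1) * 100), '%)')
                                      })))

    # split() orders groups by sorted cluster value, so sort the labels to match
    cbind.data.frame(Cluster = sort(unique(df$Cluster)), m1)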
    

    which gives:

      Cluster CellType  Condition    Genotype    Lane
    1       1  A (75%) Mut (100%)  XXXX (50%) 1 (75%)
    2       2  B (67%) Mut (100%) YYYY (100%) 1 (33%)
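
The same idea can also be packaged as a small helper and passed to `aggregate()`, which is closer to the attempt in the question. This is only a sketch: `dominant_with_share` is a hypothetical name, not part of the answer above, and the data frame is rebuilt from the question's table. Note that `round()` gives 67% for cluster 2, whereas the 66% shown in the question would need `floor()`.

```r
# Sample data rebuilt from the question's table
df <- data.frame(
  Sample    = paste0("Sample", 1:7),
  Cluster   = c(1L, 2L, 2L, 1L, 2L, 1L, 1L),
  CellType  = c("A", "B", "A", "A", "B", "B", "A"),
  Condition = rep("Mut", 7),
  Genotype  = c("XXXX", "YYYY", "YYYY", "ZZZZ", "YYYY", "YYYY", "XXXX"),
  Lane      = c(1L, 1L, 2L, 1L, 3L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent value plus its share of the group
dominant_with_share <- function(x) {
  shares <- prop.table(table(x))
  top <- which.max(shares)  # on ties, the first value in sorted order wins
  paste0(names(shares)[top], " (", round(shares[[top]] * 100), "%)")
}

# Summarise the four categorical columns (3 to 6) within each cluster
res <- aggregate(df[3:6], by = list(Cluster = df$Cluster),
                 FUN = dominant_with_share)
print(res)
```

The data-frame interface of `aggregate()` is used here rather than the formula `. ~ Cluster`, so the `Sample` column is left out explicitly and the character columns are passed to the helper unchanged.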
    

    【Comments】:

      【Solution 2】:
      library(dplyr)
      
      # funs() was deprecated in dplyr 0.8.0; a named list of formulas replaces it
      df %>%
        group_by(Cluster) %>%
        summarise_at(vars(CellType:Lane),
                     list(val  = ~ names(which(table(.) == max(table(.)))[1]),
                          rate = ~ (max(table(.))[1] / n()) * 100))
      

      The output is:

        Cluster CellType_val Condition_val Genotype_val Lane_val CellType_rate Condition_rate Genotype_rate Lane_rate
      1       1 A            Mut           XXXX         1                 75.0            100          50.0      75.0
      2       2 B            Mut           YYYY         1                 66.7            100         100        33.3
      

      Or, to combine the value and the percentage in a single column:

      df %>%
        group_by(Cluster) %>%
        summarise_at(vars(CellType:Lane),
                     ~ paste0(names(which(table(.) == max(table(.)))[1]),
                              " (",
                              round((max(table(.))[1] / n()) * 100),
                              "%)"))
      
      #  Cluster CellType Condition  Genotype    Lane   
      #1       1 A (75%)  Mut (100%) XXXX (50%)  1 (75%)
      #2       2 B (67%)  Mut (100%) YYYY (100%) 1 (33%)
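
With dplyr 1.0.0 or later, the same summary can be sketched with `across()`, which supersedes the `summarise_at()` family. This assumes a recent dplyr is installed; the data frame is rebuilt from the question's table, and the local names `counts` and `top` are illustrative only.

```r
library(dplyr)

# Sample data rebuilt from the question's table
df <- data.frame(
  Sample    = paste0("Sample", 1:7),
  Cluster   = c(1L, 2L, 2L, 1L, 2L, 1L, 1L),
  CellType  = c("A", "B", "A", "A", "B", "B", "A"),
  Condition = rep("Mut", 7),
  Genotype  = c("XXXX", "YYYY", "YYYY", "ZZZZ", "YYYY", "YYYY", "XXXX"),
  Lane      = c(1L, 1L, 2L, 1L, 3L, 1L, 2L),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(Cluster) %>%
  summarise(across(CellType:Lane, ~ {
    counts <- table(.x)
    top <- which.max(counts)  # on ties, the first value in sorted order wins
    # Share of the group's rows that carry the most frequent value
    paste0(names(counts)[top], " (",
           round(counts[[top]] / length(.x) * 100), "%)")
  }))
print(res)
```

`length(.x)` is used for the group size instead of `n()` so the lambda depends only on the column it receives.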
      

      Sample data:

      df <- structure(list(Sample = c("Sample1", "Sample2", "Sample3", "Sample4", 
      "Sample5", "Sample6", "Sample7"), Cluster = c(1L, 2L, 2L, 1L, 
      2L, 1L, 1L), CellType = c("A", "B", "A", "A", "B", "B", "A"), 
          Condition = c("Mut", "Mut", "Mut", "Mut", "Mut", "Mut", "Mut"
          ), Genotype = c("XXXX", "YYYY", "YYYY", "ZZZZ", "YYYY", "YYYY", 
          "XXXX"), Lane = c(1L, 1L, 2L, 1L, 3L, 1L, 2L)), .Names = c("Sample", 
      "Cluster", "CellType", "Condition", "Genotype", "Lane"), class = "data.frame", row.names = c(NA, 
      -7L))
      

      【Comments】:

      • @Chen Just wondering whether you ran into any problems with this approach?