Aggregating a categorical table in R (with percentages): answers

【Question title】: Aggregating a categorical table in R (with percentages)
【Posted】: 2018-08-25 07:59:24
【Question】:

I have the following table in R:

Sample             Cluster  CellType  Condition  Genotype  Lane
Sample1            1        A         Mut        XXXX      1
Sample2            2        B         Mut        YYYY      1
Sample3            2        A         Mut        YYYY      2
Sample4            1        A         Mut        ZZZZ      1
Sample5            2        B         Mut        YYYY      3
Sample6            1        B         Mut        YYYY      1
Sample7            1        A         Mut        XXXX      2

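I would like to:

aggregate the table by the Cluster column,
have every other column yield the dominant (most frequent) value within each cluster,
along with a "confidence": the percentage of rows in that cluster that carry the dominant value.

Like this: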

Cluster      CellType  Condition  Genotype     Lane
1            A (75%)   Mut (100%) XXXX (50%)   1 (75%)
2            B (66%)   Mut (100%) YYYY (100%)  1 (33%)
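
For reference, the percentages can be checked by hand for a single column. The following is a minimal sketch using only the CellType values of cluster 2 (Sample2, Sample3 and Sample5 in the table above); the variable names are illustrative. Note that 2/3 rounds to 67%, whereas the table above shows it truncated to 66%.

```r
# CellType values of the three samples in cluster 2 (Sample2, Sample3, Sample5).
cell_type <- c("B", "A", "B")

counts <- table(cell_type)      # counts per value: A = 1, B = 2
shares <- prop.table(counts)    # proportions per value: A = 1/3, B = 2/3

# Dominant value and its share of the cluster, rounded and truncated.
dominant      <- names(shares)[which.max(shares)]
pct_rounded   <- round(max(shares) * 100)
pct_truncated <- floor(max(shares) * 100)

cat(dominant, " (", pct_rounded, "%)\n", sep = "")
```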

I tried the aggregate function as shown below. It gets close, but it is not quite there yet:

Mode <- function(x) {
 ux <- unique(x)
 ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr)
aggregate(. ~ Cluster, clustering_report, Mode)

【Comments】:

Tags: r dplyr aggregation

【Solution 1】:

Here is a base R solution:

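# For each cluster, find the most frequent value in columns 3:6
# (CellType, Condition, Genotype, Lane) and its share of the cluster's rows.
m1 <- do.call(rbind,
        lapply(split(df, df$Cluster),
               function(i) sapply(i[3:6],
                                  function(j) {
                                    t1 <- prop.table(table(j))
                                    nms <- names(t1[which.max(t1)])
                                    paste0(nms, ' (', round(max(t1) * 100), '%)')
                                  })))

# split() orders the groups by sorted cluster value, so sort the keys to match.
cbind.data.frame(Cluster = sort(unique(df$Cluster)), m1)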

which gives:

  Cluster CellType  Condition    Genotype    Lane
1       1  A (75%) Mut (100%)  XXXX (50%) 1 (75%)
2       2  B (67%) Mut (100%) YYYY (100%) 1 (33%)
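
The same idea can also be written with aggregate(), which stays close to the attempt in the question. This is a minimal sketch rather than part of the original answer: the helper name dominant_label is hypothetical, and the data frame is rebuilt from the values in the question so the block runs on its own.

```r
# Sample data with the same values as the table in the question.
df <- data.frame(
  Sample    = paste0("Sample", 1:7),
  Cluster   = c(1L, 2L, 2L, 1L, 2L, 1L, 1L),
  CellType  = c("A", "B", "A", "A", "B", "B", "A"),
  Condition = rep("Mut", 7),
  Genotype  = c("XXXX", "YYYY", "YYYY", "ZZZZ", "YYYY", "YYYY", "XXXX"),
  Lane      = c(1L, 1L, 2L, 1L, 3L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the most frequent value of x and its share, e.g. "A (75%)".
# table() sorts the values, and which.max() keeps the first maximum, so on a
# tie the first value in sorted order is reported.
dominant_label <- function(x) {
  shares <- prop.table(table(x))
  top <- which.max(shares)
  paste0(names(shares)[top], " (", round(shares[[top]] * 100), "%)")
}

# Apply the helper to every column of interest, grouped by Cluster.
res <- aggregate(df[c("CellType", "Condition", "Genotype", "Lane")],
                 by = list(Cluster = df$Cluster),
                 FUN = dominant_label)
print(res)
```

Because ties are resolved by taking the first maximum, a three-way tie such as Lane in cluster 2 is reported as a single value with a low percentage, which may or may not be the desired behaviour.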

【Comments】:

【Solution 2】:

library(dplyr)

df %>%
  group_by(Cluster) %>%
  summarise_at(vars(CellType:Lane), funs(val=names(which(table(.) == max(table(.)))[1]),
                                         rate=(max(table(.))[1]/n())*100))

The output is:

  Cluster CellType_val Condition_val Genotype_val Lane_val CellType_rate Condition_rate Genotype_rate Lane_rate
1       1 A            Mut           XXXX         1                 75.0            100          50.0      75.0
2       2 B            Mut           YYYY         1                 66.7            100         100        33.3

Or perhaps:

df %>%
  group_by(Cluster) %>%
  summarise_at(vars(CellType:Lane), funs(paste0(names(which(table(.) == max(table(.)))[1]), 
                                                " (",
                                                round((max(table(.))[1]/n())*100), 
                                                "%)")))

#  Cluster CellType Condition  Genotype    Lane   
#1       1 A (75%)  Mut (100%) XXXX (50%)  1 (75%)
#2       2 B (67%)  Mut (100%) YYYY (100%) 1 (33%)
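
funs() has been deprecated since dplyr 0.8.0, and across() has been available since dplyr 1.0.0. As a hedged sketch of how the second variant might look with across() (an adaptation, not the answerer's code), assuming dplyr 1.0.0 or later and the sample data from this answer:

```r
library(dplyr)

# Sample data with the same values as the dput() output in this answer.
df <- data.frame(
  Sample    = paste0("Sample", 1:7),
  Cluster   = c(1L, 2L, 2L, 1L, 2L, 1L, 1L),
  CellType  = c("A", "B", "A", "A", "B", "B", "A"),
  Condition = rep("Mut", 7),
  Genotype  = c("XXXX", "YYYY", "YYYY", "ZZZZ", "YYYY", "YYYY", "XXXX"),
  Lane      = c(1L, 1L, 2L, 1L, 3L, 1L, 2L),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(Cluster) %>%
  summarise(across(CellType:Lane, function(x) {
    counts <- table(x)
    top <- which.max(counts)  # the first maximum wins on ties
    # length(x) counts every row in the group, including any NA values,
    # which table() drops by default.
    paste0(names(counts)[top], " (",
           round(counts[[top]] / length(x) * 100), "%)")
  }))

print(res)
```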

Sample data:

df <- structure(list(Sample = c("Sample1", "Sample2", "Sample3", "Sample4", 
"Sample5", "Sample6", "Sample7"), Cluster = c(1L, 2L, 2L, 1L, 
2L, 1L, 1L), CellType = c("A", "B", "A", "A", "B", "B", "A"), 
    Condition = c("Mut", "Mut", "Mut", "Mut", "Mut", "Mut", "Mut"
    ), Genotype = c("XXXX", "YYYY", "YYYY", "ZZZZ", "YYYY", "YYYY", 
    "XXXX"), Lane = c(1L, 1L, 2L, 1L, 3L, 1L, 2L)), .Names = c("Sample", 
"Cluster", "CellType", "Condition", "Genotype", "Lane"), class = "data.frame", row.names = c(NA, 
-7L))

【Comments】:

@Chen Just wondering whether you ran into any problems with this approach?