Determining the percentage of values in each column for each cluster: answers

【Question title】: Determining the percentage of values in each column for each cluster
【Posted】: 2020-10-28 16:42:28
【Question】:

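I need to determine, for each cluster, the percentage of values in each column that satisfy a condition. A reproducible example follows. I have a table like this: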

> tab
            GI     RT     TR    VR Cluster_number
1   1000086986 0.5814 0.5814 0.628              1
10  1000728257 0.5814 0.5814 0.628              1
13  1000074769 0.7879 0.7879 0.443              2
14  1000498642 0.7879 0.7879 0.443              2
22  1000074765 0.7941 0.3600 0.533              3
26  1000597385 0.7941 0.3600 0.533              3
31  1000502373 0.5000 0.5000 0.607              4
32  1000532631 0.6875 0.7059 0.607              4
33  1000597694 0.5000 0.5000 0.607              4
34  1000598724 0.5000 0.5000 0.607              4

I need a table like this:

> tab1
   Cluster_number RT_cond TR_cond VR_cond
1               1 0        0        100
2               2 100      100      0  
3               3 100      0        0
4               4 25       25       100

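Here each value is the percentage of GIs in the corresponding cluster for which RT >= 0.6, TR >= 0.6, or VR >= 0.6, respectively. For example, in the first cluster every RT is 0.5814, which is below 0.6, so the corresponding value in the final table is 0; in the fourth cluster one of the four RT values (0.6875) is at least 0.6, so the value is 25. How can I do this?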

【Comments】:

Tags: r dataframe dplyr tibble summarize

【Solution 1】:

You can group_by Cluster_number and use across to compute the percentages:

library(dplyr)
df %>%
  group_by(Cluster_number) %>%
  summarise(across(RT:VR, ~mean(. >= 0.6) * 100, .names = '{col}_cond'))
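  # In dplyr versions before 1.0.0 (which lack across()), use summarise_at instead;
  # note that it keeps the original column names rather than adding "_cond":
  # summarise_at(vars(RT:VR), ~mean(. >= 0.6) * 100)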


#  Cluster_number RT_cond TR_cond VR_cond
#           <int>   <dbl>   <dbl>   <dbl>
#1              1       0       0     100
#2              2     100     100       0
#3              3     100       0       0
#4              4      25      25     100
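
The `mean(. >= 0.6) * 100` idiom works because the comparison returns a logical vector, and `mean()` counts TRUE as 1 and FALSE as 0, so the result is the share of values meeting the condition. A minimal sketch of that arithmetic, using the RT values of cluster 4 from the question (the variable names are only for illustration):

```r
# RT values for cluster 4, copied from the question's table
rt_cluster4 <- c(0.5, 0.6875, 0.5, 0.5)

# The comparison yields a logical vector: FALSE TRUE FALSE FALSE
meets_cond <- rt_cluster4 >= 0.6

# mean() treats TRUE as 1 and FALSE as 0, so this is the share of TRUE values
pct <- mean(meets_cond) * 100
pct
# [1] 25
```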

In base R, we can use aggregate:

aggregate(cbind(RT, TR, VR)~Cluster_number, df, function(x) mean(x >= 0.6) * 100)
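
Note that the aggregate call above keeps the original column names (RT, TR, VR). If the `_cond` suffix from the requested table is wanted, one possible way is to rename the columns afterwards. The sketch below rebuilds the question's data inline so it runs on its own; `res` and `cols` are illustrative names, and GI is left out because the calculation does not use it:

```r
# Same values as the question's data (GI omitted because it is not used here)
df <- data.frame(
  RT = c(0.5814, 0.5814, 0.7879, 0.7879, 0.7941, 0.7941, 0.5, 0.6875, 0.5, 0.5),
  TR = c(0.5814, 0.5814, 0.7879, 0.7879, 0.36, 0.36, 0.5, 0.7059, 0.5, 0.5),
  VR = c(0.628, 0.628, 0.443, 0.443, 0.533, 0.533, 0.607, 0.607, 0.607, 0.607),
  Cluster_number = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L)
)

# Percentage of values >= 0.6 per cluster, for each of the three columns
res <- aggregate(cbind(RT, TR, VR) ~ Cluster_number, df,
                 function(x) mean(x >= 0.6) * 100)

# Append "_cond" to every column except the grouping column
cols <- names(res) != "Cluster_number"
names(res)[cols] <- paste0(names(res)[cols], "_cond")
res
#   Cluster_number RT_cond TR_cond VR_cond
# 1              1       0       0     100
# 2              2     100     100       0
# 3              3     100       0       0
# 4              4      25      25     100
```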

Data

df <- structure(list(GI = c(1000086986L, 1000728257L, 1000074769L, 
1000498642L, 1000074765L, 1000597385L, 1000502373L, 1000532631L, 
1000597694L, 1000598724L), RT = c(0.5814, 0.5814, 0.7879, 0.7879, 
0.7941, 0.7941, 0.5, 0.6875, 0.5, 0.5), TR = c(0.5814, 0.5814, 
0.7879, 0.7879, 0.36, 0.36, 0.5, 0.7059, 0.5, 0.5), VR = c(0.628, 
0.628, 0.443, 0.443, 0.533, 0.533, 0.607, 0.607, 0.607, 0.607
), Cluster_number = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L)), 
class = "data.frame", row.names = c("1", "10", "13", "14", "22", 
 "26", "31", "32", "33", "34"))

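【Comments】:

Damn, I have to delete my answer because of those 20 seconds. You are always as fast as lightning :D

【Solution 2】:

With the dplyr package, you can use group_by followed by summarise, and then rename the columns of interest with the newer rename_with function: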

library(dplyr)

tab %>% 
  group_by(Cluster_number) %>% 
  summarise(across(c(RT, TR, VR), ~mean(. >= 0.6)*100)) %>% 
  rename_with(~paste0(., "_cond"), c(RT, TR, VR))

# A tibble: 4 x 4
#   Cluster_number RT_cond TR_cond VR_cond
#            <int>   <dbl>   <dbl>   <dbl>
# 1              1       0       0     100
# 2              2     100     100       0
# 3              3     100       0       0
# 4              4      25      25     100

【Comments】: