【Question title】: Determining the percentage of values in each column for each cluster
【Posted】: 2020-10-28 16:42:28
【Question】:

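For each cluster, I need to determine the percentage of values in each column that satisfy a condition. A reproducible example follows. I have a table like this: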

> tab
            GI     RT     TR    VR Cluster_number
1   1000086986 0.5814 0.5814 0.628              1
10  1000728257 0.5814 0.5814 0.628              1
13  1000074769 0.7879 0.7879 0.443              2
14  1000498642 0.7879 0.7879 0.443              2
22  1000074765 0.7941 0.3600 0.533              3
26  1000597385 0.7941 0.3600 0.533              3
31  1000502373 0.5000 0.5000 0.607              4
32  1000532631 0.6875 0.7059 0.607              4
33  1000597694 0.5000 0.5000 0.607              4
34  1000598724 0.5000 0.5000 0.607              4

I need a table like this:

> tab1
   Cluster_number RT_cond TR_cond VR_cond
1               1       0       0     100
2               2     100     100       0
3               3     100       0       0
4               4      25      25     100

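Each value is the percentage of GIs in the corresponding cluster for which the condition holds: RT >= 0.6, TR >= 0.6, or VR >= 0.6, respectively. For example, in the first cluster every RT is below 0.6, so the corresponding value in the final table is 0, while in the fourth cluster only one of the four RT values is >= 0.6, so the corresponding value is 25. How can I do this?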

【Comments】:

    Tags: r dataframe dplyr tibble summarize


    【Solution 1】:

    You can group_by Cluster_number and use across to calculate the percentages:

    library(dplyr)
    df %>%
      group_by(Cluster_number) %>%
      summarise(across(RT:VR, ~mean(. >= 0.6) * 100, .names = '{col}_cond'))
      # across() requires dplyr >= 1.0.0; in older versions use summarise_at
      # (this keeps the original column names, without the _cond suffix):
      # summarise_at(vars(RT:VR), ~mean(. >= 0.6) * 100)
    
    
    #  Cluster_number RT_cond TR_cond VR_cond
    #           <int>   <dbl>   <dbl>   <dbl>
    #1              1       0       0     100
    #2              2     100     100       0
    #3              3     100       0       0
    #4              4      25      25     100
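
Why `mean(. >= 0.6) * 100` yields a percentage can be seen on a single group. The following is a minimal base R sketch using the RT values of cluster 4 from the example table; the variable names `rt_cluster4`, `meets_cond`, and `pct` are made up for illustration.

```r
# RT values of cluster 4, copied from the example table
rt_cluster4 <- c(0.5, 0.6875, 0.5, 0.5)

# The comparison returns a logical vector: FALSE TRUE FALSE FALSE
meets_cond <- rt_cluster4 >= 0.6

# mean() coerces TRUE/FALSE to 1/0, so the mean is the share of TRUE values
pct <- mean(meets_cond) * 100
print(pct)  # 25
```

One of the four values meets the condition, which gives the 25 shown for cluster 4 in the RT_cond column.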
    

    In base R, we can use aggregate:

    aggregate(cbind(RT, TR, VR)~Cluster_number, df, function(x) mean(x >= 0.6) * 100)
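
The aggregate call keeps the original column names (RT, TR, VR). If the `_cond` suffix from the desired table is needed, one possible approach is to rename the columns afterwards. The sketch below assumes the same example data, rebuilt here without the GI column so the block is self-contained; `res` and `cols` are illustrative names.

```r
# Example data from the question (GI omitted, it is not used in the calculation)
df <- data.frame(
  Cluster_number = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L),
  RT = c(0.5814, 0.5814, 0.7879, 0.7879, 0.7941, 0.7941, 0.5, 0.6875, 0.5, 0.5),
  TR = c(0.5814, 0.5814, 0.7879, 0.7879, 0.36, 0.36, 0.5, 0.7059, 0.5, 0.5),
  VR = c(0.628, 0.628, 0.443, 0.443, 0.533, 0.533, 0.607, 0.607, 0.607, 0.607)
)

# Percentage of values >= 0.6 per cluster, for each of the three columns
res <- aggregate(cbind(RT, TR, VR) ~ Cluster_number, df,
                 function(x) mean(x >= 0.6) * 100)

# Append the "_cond" suffix to the summarised columns only
cols <- c("RT", "TR", "VR")
names(res)[match(cols, names(res))] <- paste0(cols, "_cond")

print(res)
```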
    

    Data

    df <- structure(list(GI = c(1000086986L, 1000728257L, 1000074769L, 
    1000498642L, 1000074765L, 1000597385L, 1000502373L, 1000532631L, 
    1000597694L, 1000598724L), RT = c(0.5814, 0.5814, 0.7879, 0.7879, 
    0.7941, 0.7941, 0.5, 0.6875, 0.5, 0.5), TR = c(0.5814, 0.5814, 
    0.7879, 0.7879, 0.36, 0.36, 0.5, 0.7059, 0.5, 0.5), VR = c(0.628, 
    0.628, 0.443, 0.443, 0.533, 0.533, 0.607, 0.607, 0.607, 0.607
    ), Cluster_number = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L)), 
    class = "data.frame", row.names = c("1", "10", "13", "14", "22", 
     "26", "31", "32", "33", "34"))
    

    【Comments】:

    • Damn, I have to delete my answer because of those 20 seconds. You are always as fast as lightning :D
    【Solution 2】:

    With the dplyr package, you can use a group_by statement followed by summarise, then rename the columns of interest with the new rename_with function:

    library(dplyr)
    
    tab %>% 
      group_by(Cluster_number) %>% 
      summarise(across(c(RT, TR, VR), ~mean(. >= 0.6)*100)) %>% 
      rename_with(~paste0(., "_cond"), c(RT, TR, VR))
    
    # A tibble: 4 x 4
    #   Cluster_number RT_cond TR_cond VR_cond
    #            <int>   <dbl>   <dbl>   <dbl>
    # 1              1       0       0     100
    # 2              2     100     100       0
    # 3              3     100       0       0
    # 4              4      25      25     100
    

    【Comments】:
