【Question title】: Count unique occurrences of factor levels and numeric values with dplyr, on data in a long format
【Posted】: 2025-12-24 02:10:07
【Question description】:

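I have repeated-measures data for 8 patients; the number of repeated measurements of the same variables differs from patient to patient. The measured variables are sex, blood pressure (sys_bp), and the number of CT scans a person has received (ct_amount):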

library(dplyr)
library(magrittr)

questiondata <- structure(list(id = c(2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 
4, 7, 7, 8, 8, 8, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 20, 
20, 20), time = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 
1L, 2L, 3L, 4L, 5L, 1L, 6L, 1L, 2L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 
2L, 3L, 4L, 5L, 1L, 2L, 4L), .Label = c("T0", "T1M0", "T1M6", 
"T1M12", "T2M0", "FU1"), class = "factor"), sys_bp = c(116, 125.8, 
NA, NA, NA, 113.2, NA, NA, NA, NA, 146, NA, NA, NA, NA, NA, NA, 
125, NA, NA, 164.5, NA, NA, NA, NA, 150.5, NA, NA, NA, NA, 158, 
NA), sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L), .Label = c("female", "male"), class = "factor"), 
    ct_amount = c(4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
    5L, 5L, 5L, 2L, 2L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
    5L, 5L, 5L, 3L, 3L, 3L)), row.names = c(NA, -32L), class = c("tbl_df", 
"tbl", "data.frame"))

questiondata

      id time  sys_bp sex    ct_amount
   <dbl> <fct>  <dbl> <fct>      <int>
 1     2 T0      116  female         4
 2     2 T1M0    126. female         4
 3     2 T1M6     NA  female         4
 4     2 T1M12    NA  female         4
 5     3 T0       NA  female         5
 6     3 T1M0    113. female         5
 7     3 T1M6     NA  female         5
 8     3 T1M12    NA  female         5
 9     3 T2M0     NA  female         5
10     4 T0       NA  male           5
11     4 T1M0    146  male           5
12     4 T1M6     NA  male           5
13     4 T1M12    NA  male           5
14     4 T2M0     NA  male           5
15     7 T0       NA  female         2
16     7 FU1      NA  female         2
17     8 T0       NA  female         3
18     8 T1M0    125  female         3
19     8 T2M0     NA  female         3
20    13 T0       NA  female         5
21    13 T1M0    164. female         5
22    13 T1M6     NA  female         5
23    13 T1M12    NA  female         5
24    13 T2M0     NA  female         5
25    14 T0       NA  male           5
26    14 T1M0    150. male           5
27    14 T1M6     NA  male           5
28    14 T1M12    NA  male           5
29    14 T2M0     NA  male           5
30    20 T0       NA  female         3
31    20 T1M0    158  female         3
32    20 T1M12    NA  female         3

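I am trying to count how many people (1) are male/female and (2) received 1/2/3/4/5 CT scans.

The expected output would therefore be (1) 6 females and 2 males, and (2) 1 person with 2 CTs, 2 people with 3 CTs, 1 person with 4 CTs, and 4 people with 5 CTs.

I have tried many combinations of group_by, summarise, and count, but I can't seem to get it right. Any help?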

【Question comments】:

    Tags: r dplyr count summarize


    【Solution 1】:

    You can first keep only one row per id (distinct keeps the first row it finds for each id), then use count to get the output.

    library(dplyr)
    
    unique_data <- questiondata %>% distinct(id, .keep_all = TRUE)
    
    unique_data %>% count(sex)
    # A tibble: 2 x 2
    #  sex        n
    #  <fct>  <int>
    #1 female     6
    #2 male       2
    
    unique_data %>% count(ct_amount)
    
    # A tibble: 4 x 2
    #  ct_amount     n
    #      <int> <int>
    #1         2     1
    #2         3     2
    #3         4     1
    #4         5     4
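
Keeping the first row per id relies on sex and ct_amount being constant within each patient. The following is a minimal sketch, not part of the original answer, of how that assumption could be checked, together with an alternative that counts distinct ids per group with n_distinct. The object questiondata_small is a hypothetical compact rebuild of the question's data (time and sys_bp are omitted because they are not needed here).

```r
library(dplyr)

# Hypothetical compact rebuild of the question's data: one entry per patient,
# repeated once per visit (time and sys_bp left out on purpose).
visits <- c(4, 5, 5, 2, 3, 5, 5, 3)
questiondata_small <- tibble(
  id        = rep(c(2, 3, 4, 7, 8, 13, 14, 20), times = visits),
  sex       = factor(rep(c("female", "female", "male", "female",
                           "female", "female", "male", "female"),
                         times = visits)),
  ct_amount = rep(c(4L, 5L, 5L, 2L, 3L, 5L, 5L, 3L), times = visits)
)

# Check the assumption: each id should have exactly one sex and one ct_amount.
per_id <- questiondata_small %>%
  group_by(id) %>%
  summarise(n_sex = n_distinct(sex), n_ct = n_distinct(ct_amount))
constant_within_id <- all(per_id$n_sex == 1 & per_id$n_ct == 1)

# Alternative to distinct() + count(): count distinct ids within each group.
sex_counts <- questiondata_small %>%
  group_by(sex) %>%
  summarise(n = n_distinct(id))   # female 6, male 2

ct_counts <- questiondata_small %>%
  group_by(ct_amount) %>%
  summarise(n = n_distinct(id))   # 2 -> 1, 3 -> 2, 4 -> 1, 5 -> 4
```

If constant_within_id were FALSE, the two approaches could disagree: distinct would use only the first row of each id, while n_distinct would count a patient once in every group they appear in.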
    

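    【Comments】:

    • Thanks! That is really helpful. Do you also know how to compute this separately for each time? So, for example, at T0 there would be 1 person with 2 CTs, 2 people with 3 CTs, 1 person with 4 CTs, and 4 people with 5 CTs.
    • Each id has a different value for each time, so maybe you need questiondata %>% count(id, time)
    • Oh.. I think you mean questiondata %>% count(time, ct_amount)
    • @RonakShah: If you have time, could you take a look at this question? *.com/questions/68386199/… Thank you very much!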
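
The per-time count suggested in the comments can be sketched as follows. This is an illustration under stated assumptions rather than the commenters' exact code: questiondata_small is a hypothetical compact rebuild of the question's id, time, and ct_amount columns. Because each patient appears at most once per time point in this data, the row count from count(time, ct_amount) equals the number of patients.

```r
library(dplyr)

# Hypothetical compact rebuild of the question's data (sex and sys_bp omitted).
time_levels <- c("T0", "T1M0", "T1M6", "T1M12", "T2M0", "FU1")
visits   <- c(4, 5, 5, 2, 3, 5, 5, 3)
time_idx <- c(1:4, 1:5, 1:5, 1, 6, 1, 2, 5, 1:5, 1:5, 1, 2, 4)

questiondata_small <- tibble(
  id        = rep(c(2, 3, 4, 7, 8, 13, 14, 20), times = visits),
  time      = factor(time_levels[time_idx], levels = time_levels),
  ct_amount = rep(c(4L, 5L, 5L, 2L, 3L, 5L, 5L, 3L), times = visits)
)

# One row per (time, ct_amount) combination; n is the number of rows,
# which here is also the number of patients.
per_time <- questiondata_small %>% count(time, ct_amount)

# The T0 slice matches the example in the comment:
# 2 CTs -> 1, 3 CTs -> 2, 4 CTs -> 1, 5 CTs -> 4
t0_counts <- per_time %>% filter(time == "T0")
```

If a patient could have several rows at the same time point, counting distinct ids per group (for example with n_distinct(id) inside summarise) would be the safer choice.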
    【Solution 2】:

    We can use duplicated together with filter to keep the first row of each id:

    library(dplyr)
    questiondata %>%
         filter(!duplicated(id)) %>%
         count(ct_amount)
    # A tibble: 4 x 2
    #   ct_amount     n
    #       <int> <int>
    # 1         2     1
    # 2         3     2
    # 3         4     1
    # 4         5     4
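
To see why this works: base R's duplicated() returns TRUE for every element that has already appeared earlier in the vector, so !duplicated(id) is TRUE only for the first row of each id. The sketch below uses a small made-up tibble (toy, not the question's data) to show that this selects the same rows as distinct(id, .keep_all = TRUE), and that the same pattern also works for sex.

```r
library(dplyr)

# Made-up toy data for illustration only.
toy <- tibble(
  id  = c(2, 2, 2, 3, 3, 4),
  sex = c("female", "female", "female", "female", "female", "male")
)

# TRUE marks the first occurrence of each id.
keep <- !duplicated(toy$id)   # TRUE FALSE FALSE TRUE FALSE TRUE

# Both pipelines keep the first row per id.
via_filter   <- toy %>% filter(!duplicated(id))
via_distinct <- toy %>% distinct(id, .keep_all = TRUE)

# Counting a per-patient attribute on the reduced data.
sex_counts <- via_filter %>% count(sex)   # female 2, male 1
```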
    

    【Comments】: