【Question title】: Count number of each factor grouping by another factor
【Posted】: 2019-09-06 22:01:27
【Question】:

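I know the answer to this question is probably simple, but I have searched the forums extensively and have not been able to find a solution.

I have a column called Data_source, which is the factor I want to group my variables by.

I have a series of symptom* variables that I want counted by Data_source.

For some reason I cannot figure out how to do this. The usual group_by function does not seem to work properly.

Here is the data frame in question: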

 df <- wrapr::build_frame(
   "Data_source"  , "Sex"   , "symptoms_decLOC", "symptoms_nausea_vomitting" |
     "1"          , "Female", NA_character_    , NA_character_               |
     "1"          , "Female", NA_character_    , NA_character_               |
     "1"          , "Female", "No"             , NA_character_               |
     "1"          , "Female", "Yes"            , "No"                        |
     "1"          , "Female", "Yes"            , "No"                        |
     "1"          , "Female", "Yes"            , "No"                        |
     "1"          , "Male"  , "Yes"            , "No"                        |
     "1"          , "Female", "Yes"            , "No"                        |
     "2"          , "Female", NA_character_    , NA_character_               |
     "2"          , "Male"  , NA_character_    , NA_character_               |
     "2"          , "Male"  , NA_character_    , NA_character_               |
     "2"          , "Female", "Yes"            , "No"                        |
     "2"          , "Female", "Yes"            , "No"                        |
     "2"          , "Male"  , NA_character_    , NA_character_               |
     "2"          , "Male"  , NA_character_    , NA_character_               |
     "2"          , "Male"  , NA_character_    , NA_character_               |
     "2"          , "Female", NA_character_    , NA_character_               |
     "2"          , "Female", NA_character_    , NA_character_               |
     "2"          , "Male"  , NA_character_    , NA_character_               |
     "2"          , "Female", NA_character_    , NA_character_               )

Note that Sex and the symptom variables are all factors and contain NA values. I tried the following:

df %>% na.omit() %>% group_by(Data_source) %>% count("symptoms_decLOC")

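This does not work, and it is not optimal because I would have to repeat it for every column. Ideally I would use something like lapply(df, count), but that does not give me a breakdown for each group.

Edit

In response to the comment below, I have added the expected output. I edited it in Excel and colour-coded the group_by groups for clarity.

Note how each possible answer is broken down. When I run it with dplyr, this is the output: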

> df %>% na.omit() %>% group_by(Data_source) %>% count("symptoms_decLOC")
# A tibble: 2 x 3
# Groups:   Data_source [2]
  Data_source `"symptoms_decLOC"`     n
  <chr>       <chr>               <int>
1 1           symptoms_decLOC         5
2 2           symptoms_decLOC         2
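
The output above is consistent with the quoted string being treated as a constant: every row gets the value "symptoms_decLOC", so n is simply the number of rows left in each Data_source group after na.omit(). A minimal sketch of the difference, assuming a small stand-in data frame (df_small and its values are hypothetical, not the data from the question):

```r
library(dplyr)

# Hypothetical stand-in data with the same column names as the question
df_small <- data.frame(
  Data_source     = c("1", "1", "2"),
  symptoms_decLOC = c("Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Unquoted column name: counts each response within each Data_source
by_name <- df_small %>% dplyr::count(Data_source, symptoms_decLOC)
by_name
```

Passing the grouping column directly to count() also makes the separate group_by() step unnecessary.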

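【Comments】:

  • What is your desired output?
  • Thanks for the comment. I should have put that in the original question. I have edited it to clarify what I am looking for.

Tags: r count dplyr factors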


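【Solution 1】:

This gets most of the way there: I haven't figured out how to include zero-count groups. Supposedly adding .drop=FALSE takes care of this, but it isn't working for me (using dplyr v. 0.8.0.9001).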

library(dplyr)
library(tidyr)
(df
    %>% tidyr::gather(var,val,-Data_source)
    %>% count(Data_source,var,val, .drop=FALSE)
    %>% na.omit()
)

Result:

  Data_source var                       val        n
  <chr>       <chr>                     <chr>  <int>
1 1           Sex                       Female     7
2 1           Sex                       Male       1
3 1           symptoms_decLOC           No         1
4 1           symptoms_decLOC           Yes        5
5 1           symptoms_nausea_vomitting No         5
6 2           Sex                       Female     6
7 2           Sex                       Male       6
8 2           symptoms_decLOC           Yes        2
9 2           symptoms_nausea_vomitting No         2
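
One possible reason .drop=FALSE has no visible effect here is that it concerns unused factor levels, while the gathered val column is character. A different technique that may help is tidyr::complete(), which adds missing combinations explicitly. The sketch below assumes a small stand-in data frame (df_small is hypothetical) and only symptom columns; with Sex included, complete() would also generate combinations such as Sex/Yes, so that column would need separate handling.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in data; Data_source "2" has no "No" response
df_small <- data.frame(
  Data_source     = c("1", "1", "1", "2", "2"),
  symptoms_decLOC = c("Yes", "No", "Yes", "Yes", NA),
  stringsAsFactors = FALSE
)

counts <- df_small %>%
  pivot_longer(-Data_source, names_to = "var", values_to = "val") %>%
  filter(!is.na(val)) %>%
  dplyr::count(Data_source, var, val) %>%
  # Add every Data_source/var/val combination that is missing, with n = 0
  complete(Data_source, var, val, fill = list(n = 0L))

counts
```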

【Comments】:

  • With this syntax I get different output from yours. I get Error in count(., Data_source, var, val, .drop = FALSE) : unused arguments (val, .drop = FALSE)
  • What are the results of find("count") (and packageVersion("dplyr"))?
  • Here is a link to the output: > find("count") [1] "package:plyr" "package:dplyr" > packageVersion("dplyr") [1] ‘0.8.0.1’
  • OK, since both packages are loaded, if I change your code to dplyr::count, I get the expected output! Thanks
  • When you load plyr after dplyr, you probably get a warning that some functions are masked...
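
The masking issue from the comments can be sidestepped by calling the function through its namespace, as this sketch shows. It assumes a small stand-in data frame (df_small is hypothetical) and does not load plyr itself.

```r
library(dplyr)

# Hypothetical stand-in data
df_small <- data.frame(
  Data_source = c("1", "1", "2"),
  Sex         = c("Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# The dplyr:: prefix selects dplyr's count() regardless of package load order
sex_counts <- df_small %>% dplyr::count(Data_source, Sex)
sex_counts

# find() lists every attached package that provides a function called count
find("count")
```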
【Solution 2】:

This uses @Ben Bolker's answer to get the counts for each group, then spread and gather to include the zero-count groups.

dplyr

library(dplyr)
library(tidyr)

# Count number of occurrences by Data_source
df2 <- 
  df %>% 
  gather(variable, value, -Data_source) %>% 
  count(Data_source, variable, value, name = "counter") %>%
  na.omit() 

# For variable = "Sex", leave as is
# For everything else (here the symptom* variables), convert value into a factor so zero-count groups are included
# Then spread to wide with NAs filled with 0, and gather back to long so the rows can be bound
bind_rows(df2 %>%
            filter(variable == "Sex"), 

          df2 %>%
            filter(variable != "Sex") %>%
            mutate(value = factor(value, levels = c("Yes", "No"))) %>%
            spread(key = value, value = counter, fill = 0) %>%
            gather(value, counter, -Data_source, -variable))  %>%

  arrange(Data_source, variable)

data.table

library(data.table)
dt <- data.table(df)

# Melt data by Data source
dt_melt <- melt(dt, id.vars = "Data_source", value.factor = FALSE, variable.factor = FALSE)

# Add counter, if NA then 0 else 1
dt_melt[, counter := 0]
dt_melt[!is.na(value), counter := 1]

# Sum number of occurrences
dt_count <- dt_melt[,list(counter = sum(counter)), by = c("Data_source", "variable", "value")]

# Split into two dt
dt2a <- dt_count[variable == "Sex", ]
dt2b <- dt_count[variable != "Sex" ,]

# only on symptoms variables
# Convert into factor variable
dt2b$value <- factor(dt2b$value, levels = c("Yes", "No"))
dt2b_dcast <- dcast(data = dt2b, formula = Data_source + variable ~ value, value.var = "counter", fill = 0, drop = FALSE)
dt2b_melt <- melt(dt2b_dcast, id.vars = c("Data_source", "variable"), variable.name = "value", value.name = "counter") 

# combine
combined_d <- rbind(dt2a, dt2b_melt)
combined_d[order(Data_source, variable), ]

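【Comments】:

  • Yes, I still can't get the dplyr approach to work. The data.table one does work, but it adds a lot of code.
【Solution 3】:

I don't quite understand what you are asking, but I assume you want to count the number of non-NA values in each symptom_* column.

Here is a data.table solution: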

# load library

library(data.table)

# Suppose the table is called "dt". Convert it to a data.table:

setDT(dt)

# convert the wide table to a long one, filter the values that
# aren't NA and count both, by Data_source and by variable
# (variable is the created column with the symptom_* names)

melt(dt, id.vars = 1:2)[!is.na(value), 
                        .N, 
                         by = .(Data_source, variable)]

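What each part of the code does:

melt(dt, id.vars = 1:2) converts dt from wide to long, keeping columns 1 and 2 (Data_source and Sex) fixed.

!is.na(value) keeps the values that are not NA (those that previously sat under each symptom_* header).

.N counts the number of rows.

by = .(Data_source, variable) is the grouping used for counting. variable is the name of the column that holds the symptom_* names after reshaping.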
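
To get a count for each individual response rather than one total per symptom, one option is to add value to the grouping. This is a sketch of that variation, assuming a small stand-in table (dt_small is hypothetical); it does not add zero-count rows.

```r
library(data.table)

# Hypothetical stand-in data with the same layout as the question
dt_small <- data.table(
  Data_source     = c("1", "1", "1", "2"),
  Sex             = c("Female", "Female", "Male", "Female"),
  symptoms_decLOC = c("Yes", "No", "Yes", NA)
)

# Wide to long, keeping Data_source and Sex as identifier columns
long <- melt(dt_small, id.vars = c("Data_source", "Sex"),
             variable.factor = FALSE)

# Drop NA responses, then count rows per Data_source, symptom and response
per_response <- long[!is.na(value), .N, by = .(Data_source, variable, value)]
per_response
```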

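【Comments】:

  • When I run this I get the following: Data_source variable N 1: 1 symptoms_decLOC 6 2: 2 symptoms_decLOC 2 3: 1 symptoms_nausea_vomitting 5 4: 2 symptoms_nausea_vomitting 2. This does not give me counts for each individual response.
【Solution 4】:

Of course, the hard part is keeping the combinations that do not exist in the data... Here is a two-step solution: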

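1. Prepare a data frame of all combinations, without the counts

You can do this any way you like, but I chose to build two chunks because the variable Sex has different levels. There is no need to bind the chunks at this point.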

chunk1 <- expand.grid(
  Data_source = c("1", "2"),
  name = c("symptoms_decLOC", "symptoms_nausea_vomitting"),
  value = c("Yes", "No"),
  stringsAsFactors = FALSE
)

chunk2 <- expand.grid(
  Data_source = c("1", "2"),
  name = "Sex",
  value = c("Female", "Male"),
  stringsAsFactors = FALSE
)

2. Do the requested work

library(dplyr)
library(tidyr)

df %>% 
  pivot_longer(cols = c("Sex", "symptoms_decLOC", "symptoms_nausea_vomitting"))%>%
  group_by(Data_source, name, value) %>%
  summarise(count = n()) %>%
  right_join(bind_rows(chunk1, chunk2), by = c("Data_source", "name", "value")) %>%
  arrange(Data_source, name) %>%
  mutate(count = zoo::na.fill(count, 0))

And the result:

# A tibble: 12 x 4
# Groups:   Data_source, name [6]
   Data_source name                      value  count
   <chr>       <chr>                     <chr>  <int>
 1 1           Sex                       Female     7
 2 1           Sex                       Male       1
 3 1           symptoms_decLOC           Yes        5
 4 1           symptoms_decLOC           No         1
 5 1           symptoms_nausea_vomitting Yes        0
 6 1           symptoms_nausea_vomitting No         5
 7 2           Sex                       Female     6
 8 2           Sex                       Male       6
 9 2           symptoms_decLOC           Yes        2
10 2           symptoms_decLOC           No         0
11 2           symptoms_nausea_vomitting Yes        0
12 2           symptoms_nausea_vomitting No         2

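It is not that short, but it uses simple functions. The process is similar to what you could do in Excel: prepare the structure, then fill in the counts.

I hope it helps ;-)

【Comments】: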
