【Question title】: Use dplyr to calculate percentage and frequency of occurrence of two groups
【Posted】: 2020-10-29 04:40:40
【Question description】:

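I am learning dplyr and have searched similar posts for a solution, but found none that covers this particular combination of problems.

Here is an example data frame: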

set.seed(1)
df <- data.frame(sampleID = c(rep("sample1",2),
                              rep("sample2",3),
                              rep("sample3",4)),
                 species = c("clover","nettle",
                             "clover","nettle","vine",
                             "clover","clover","nettle","vine"),
                 type = c("vegetation","seed",
                          "vegetation","vegetation","vegetation",
                          "seed","vegetation","seed","vegetation"),
                 mass = sample(1:9))

> df
  sampleID species       type mass
1  sample1  clover vegetation    9
2  sample1  nettle       seed    4
3  sample2  clover vegetation    7
4  sample2  nettle vegetation    1
5  sample2    vine vegetation    2
6  sample3  clover       seed    6
7  sample3  clover vegetation    3
8  sample3  nettle       seed    8
9  sample3    vine vegetation    5

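I need to return a data frame that gives the percentage of mass for each unique species/type combination, and I also need the percentage frequency with which each species/type combination occurs across sampleIDs.

So for the vine/vegetation species/type combination in this example, the percent mass would be (5+2)/(sum(mass)) and the percent frequency would be 2/3, because this combination does not occur in sample1.

First I tried different combinations, such as: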

df %>%
  group_by(species,type) %>%
  summarize(totmass = sum(mass))  %>%
  mutate(percmass = totmass/sum(totmass))

But this gives 100% for the vine/vegetation mass? Also, I don't know where to go from there to get the percentage frequency based on sampleID.

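【Comments】:

  • It's not clear to me what your denominator is. When you say "percentage of mass for each unique species/type combination" and give the 5 + 2 example, the numerator is clear. The denominator is the sum of mass... within the sample? Within the same type? Within the same species? Across the whole data frame?
  • Ditto for "percent frequency". You say "and the percent frequency would be 2/3, because this combination does not occur in sample1." So is the percent frequency the number of samples with a particular species:type combination / the total number of samples? Is that right?

Tags: r dplyr tidyverse percentage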


【Solution 1】:

Not sure whether I understood you correctly, but maybe this is what you are looking for:

set.seed(1)
df <- data.frame(sampleID = c(rep("sample1",2),
                              rep("sample2",3),
                              rep("sample3",4)),
                 species = c("clover","nettle",
                             "clover","nettle","vine",
                             "clover","clover","nettle","vine"),
                 type = c("vegetation","seed",
                          "vegetation","vegetation","vegetation",
                          "seed","vegetation","seed","vegetation"),
                 mass = sample(1:9))

library(dplyr)

df %>%
  # Add total mass
  add_count(wt = mass, name = "sum_mass") %>%
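  # Add total number of samples: add_count() groups by the new nsamples column
  # and also adds a count column n, which summarize() drops below
  add_count(nsamples = n_distinct(sampleID)) %>%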
  # Add sum_mass and nsamples to group_by
  group_by(species, type, sum_mass, nsamples) %>%
  summarize(nsample = n_distinct(sampleID), 
            totmass = sum(mass), .groups = "drop")  %>%
  mutate(percmass = totmass / sum_mass,
         percfreq = nsample / nsamples)
#> # A tibble: 5 x 8
#>   species type       sum_mass nsamples nsample totmass percmass percfreq
#>   <chr>   <chr>         <int>    <int>   <int>   <int>    <dbl>    <dbl>
#> 1 clover  seed             45        3       1       6   0.133     0.333
#> 2 clover  vegetation       45        3       3      19   0.422     1    
#> 3 nettle  seed             45        3       2      12   0.267     0.667
#> 4 nettle  vegetation       45        3       1       1   0.0222    0.333
#> 5 vine    vegetation       45        3       2       7   0.156     0.667
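
A note on why the attempt in the question reports 100% for vine/vegetation: with two grouping variables, summarize() by default drops only the last one (type), so the result is still grouped by species and sum(totmass) in the following mutate() is taken within each species. Vine occurs with a single type, so its share comes out as 1. The sketch below contrasts that behaviour with dropping all groups first. It hardcodes the mass values printed in the question, so it does not depend on the random number generator, and the object names within_species and overall are only illustrative:

```r
library(dplyr)

# Same data as in the question, with the printed mass values hardcoded
df <- data.frame(
  sampleID = rep(c("sample1", "sample2", "sample3"), times = c(2, 3, 4)),
  species  = c("clover", "nettle",
               "clover", "nettle", "vine",
               "clover", "clover", "nettle", "vine"),
  type     = c("vegetation", "seed",
               "vegetation", "vegetation", "vegetation",
               "seed", "vegetation", "seed", "vegetation"),
  mass     = c(9, 4, 7, 1, 2, 6, 3, 8, 5)
)

# Result stays grouped by species, so the share is computed within each species
within_species <- df %>%
  group_by(species, type) %>%
  summarize(totmass = sum(mass)) %>%
  mutate(percmass = totmass / sum(totmass))

# Dropping all groups makes the denominator the grand total of mass
overall <- df %>%
  group_by(species, type) %>%
  summarize(totmass = sum(mass), .groups = "drop") %>%
  mutate(percmass = totmass / sum(totmass))
```

Calling ungroup() before the mutate() step should have the same effect as .groups = "drop".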

【Comments】:

  • Thanks, that works. How would you include the standard error in the tibble?
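
The comment does not say which standard error is meant. One possible reading, offered only as a sketch, is the standard error of the mean mass within each species/type combination, computed as sd(mass) / sqrt(n). The column names n_obs, mean_mass and se_mass are made up for illustration, and the mass values are hardcoded from the question's printout rather than drawn with sample():

```r
library(dplyr)

# Data from the question, with the printed mass values hardcoded
df <- data.frame(
  sampleID = rep(c("sample1", "sample2", "sample3"), times = c(2, 3, 4)),
  species  = c("clover", "nettle",
               "clover", "nettle", "vine",
               "clover", "clover", "nettle", "vine"),
  type     = c("vegetation", "seed",
               "vegetation", "vegetation", "vegetation",
               "seed", "vegetation", "seed", "vegetation"),
  mass     = c(9, 4, 7, 1, 2, 6, 3, 8, 5)
)

se_tbl <- df %>%
  group_by(species, type) %>%
  summarize(n_obs = n(),
            mean_mass = mean(mass),
            # sd() of a single value is NA, so one-row groups get NA here
            se_mass = sd(mass) / sqrt(n_obs),
            .groups = "drop")
```

Combinations that occur only once (clover/seed and nettle/vegetation in this data) have no defined standard deviation, so their se_mass is NA.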