Count the values of a column based on the values of another column in R: answers

【Question title】: Count the values of a column based on the values of another column in R
【Posted】: 2022-01-23 06:30:29
【Question description】:

I want to create a new data frame containing Zip, Name, and a column named Count that holds the number of times each Name occurs within each Zip.

Zip <- c("123245", "12345", "123245", "123456", "123456", "12345")
Name <- c("Bob", "Bob", "Bob", "Jack", "Jack", "Mary")
df <- data.frame(Zip, Name)

library(dplyr)
df %>%
  group_by(Zip) %>%
  mutate(Name = cumsum(Name))

Expected output:

     Zip Name Count
1 123245  Bob     2
2  12345  Bob     1
3  12345 Mary     1
4 123456 Jack     2

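【Question discussion】:

Can you show the expected output? It is unclear whether you need an aggregate count, a running count, or a unique count.
I added the expected output.
Why does Bob have a count of 2 in one row for zip 123245, and then a count of 1 in another row with the same zip? It is still hard to understand what you want.
I fixed the typo.

Tags: r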

【Solution 1】:

We can use the name argument of count.

count is essentially shorthand for group_by followed by summarise:

library(dplyr)
df %>% 
  count(Zip, Name, name= "Count")

     Zip Name Count
1 123245  Bob     2
2  12345  Bob     1
3  12345 Mary     1
4 123456 Jack     2

【Discussion】:

【Solution 2】:

Does this solve your problem?

Zip<-c("123245","12345","123245","123456","123456","12345")
Name<-c("Bob","Bob","Bob","Jack","Jack","Mary")
df<-data.frame(Zip,Name)

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df %>%
  group_by(Zip, Name) %>%
  summarise(Count = n())
#> `summarise()` has grouped output by 'Zip'. You can override using the `.groups` argument.
#> # A tibble: 4 × 3
#> # Groups:   Zip [3]
#>   Zip    Name  Count
#>   <chr>  <chr> <int>
#> 1 123245 Bob       2
#> 2 12345  Bob       1
#> 3 12345  Mary      1
#> 4 123456 Jack      2

Created on 2021-12-22 by the reprex package (v2.0.1)
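The `summarise()` message in the output above points to the `.groups` argument. A minimal sketch, assuming a dplyr version that supports `.groups` (1.0.0 or later), of how the same counts can be returned as an ungrouped tibble without that message:

```r
library(dplyr)

Zip  <- c("123245", "12345", "123245", "123456", "123456", "12345")
Name <- c("Bob", "Bob", "Bob", "Jack", "Jack", "Mary")
df   <- data.frame(Zip, Name)

# .groups = "drop" removes all grouping from the result,
# so summarise() has no leftover grouping to report
res <- df %>%
  group_by(Zip, Name) %>%
  summarise(Count = n(), .groups = "drop")

res
```

Dropping the grouping is mainly useful when the result is piped into further steps that should not operate per Zip.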

A quick speed benchmark:

library(tidyverse)
library(microbenchmark)

Zip<-c("123245","12345","123245","123456","123456","12345")
Name<-c("Bob","Bob","Bob","Jack","Jack","Mary")
df<-data.frame(Zip,Name)

JM <- function(df){
  df %>%
  group_by(Zip, Name) %>%
  summarise(Count = n())
}
JM(df)
#> `summarise()` has grouped output by 'Zip'. You can override using the `.groups` argument.
#> # A tibble: 4 × 3
#> # Groups:   Zip [3]
#>   Zip    Name  Count
#>   <chr>  <chr> <int>
#> 1 123245 Bob       2
#> 2 12345  Bob       1
#> 3 12345  Mary      1
#> 4 123456 Jack      2

TarJae <- function(df){
  df %>% 
    count(Zip, Name, name= "Count")
}

TIC <- function(df){
  aggregate(cbind(Count = Zip) ~ Zip + Name, df, length)
}
TIC(df)
#>      Zip Name Count
#> 1 123245  Bob     2
#> 2  12345  Bob     1
#> 3 123456 Jack     2
#> 4  12345 Mary     1

res <- microbenchmark(JM(df), TIC(df), TarJae(df))
autoplot(res)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

Created on 2021-12-22 by the reprex package (v2.0.1)
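The six-row example is very small, so a benchmark on it likely reflects fixed per-call overhead more than how each approach scales. A hedged sketch of how the comparison could be repeated on a larger synthetic data frame; the row count, the seed, and the names `big`, `n_rows`, `count_dplyr`, and `count_base` are assumptions made for illustration, and no timings are claimed here:

```r
library(dplyr)
library(microbenchmark)

set.seed(1)
n_rows <- 1e5  # assumed size, adjust as needed

# Synthetic data reusing the Zip and Name values from the example
big <- data.frame(
  Zip  = sample(c("123245", "12345", "123456"), n_rows, replace = TRUE),
  Name = sample(c("Bob", "Jack", "Mary"), n_rows, replace = TRUE)
)

count_dplyr <- function(d) count(d, Zip, Name, name = "Count")
count_base  <- function(d) aggregate(cbind(Count = Zip) ~ Zip + Name, d, length)

# Sanity check before timing: both approaches account for every row
stopifnot(sum(count_dplyr(big)$Count) == n_rows,
          sum(count_base(big)$Count) == n_rows)

res_big <- microbenchmark(count_dplyr(big), count_base(big), times = 10)
print(res_big)
```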

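【Discussion】:

Isn't this different from the expected output?
It looks like the expected output in the question has a typo; if so, and the bottom row should be removed, there are many ways to solve the problem, e.g. your df %>% count(Zip, Name, name= "Count") (clear and simple @TarJae, +1)
I fixed the typo.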

【Solution 3】:

A base R option using aggregate

> aggregate(cbind(Count = Zip) ~ Zip + Name, df, length)
     Zip Name Count
1 123245  Bob     2
2  12345  Bob     1
3 123456 Jack     2
4  12345 Mary     1
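In the call above, `cbind(Count = Zip)` on the left-hand side of the formula serves only to name the result column, and `length` counts the rows in each Zip/Name group. A sketch of an alternative base R spelling that may be easier to read; it is not part of the original answer, and the helper column of ones and the name `df_ones` are assumptions made for illustration:

```r
Zip  <- c("123245", "12345", "123245", "123456", "123456", "12345")
Name <- c("Bob", "Bob", "Bob", "Jack", "Jack", "Mary")
df   <- data.frame(Zip, Name)

# Add a helper column of ones, then sum it within each Zip/Name group
df_ones <- transform(df, Count = 1L)
res <- aggregate(Count ~ Zip + Name, data = df_ones, FUN = sum)

res
```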

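【Discussion】:

Nice solution! It will be much faster than the tidyverse approaches
@jared_mamrot Thanks. I didn't test the speed, but I hope it turns out as you predict :)
Did a quick speed test - noticeably faster with the example dataset :)
@jared_mamrot Interesting benchmark! Thanks for your efforts.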