Compute frequency list in `dplyr` [duplicate]: answers

【Question title】: Compute frequency list in `dplyr` [duplicate]
【Posted】: 2021-07-24 09:37:00
【Question】:

This is probably an easy one for the dplyr folks: I want to compute a frequency list for character data in a data frame.

Toy data:

df <- data.frame(
  id = sample(1:5, 100, replace = TRUE),
  v1 = sample(c(NA, rnorm(10)), 100, replace = TRUE),
  v2 = sample(LETTERS, 100, replace = TRUE)
)

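My attempt so far:

Suppose df first needs to be filtered on several variables. Once that is done I can compute the frequency list, but the output does not show the corresponding character values, so I cannot tell which value has which frequency: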

library(dplyr)
df %>%
  filter(!is.na(v1) & !id == lag(id)) %>%
  summarise(freq = sort(prop.table(table(v2)), decreasing = TRUE)*100)
       freq
1  7.692308
2  6.410256
3  5.128205
4  5.128205
5  5.128205
6  5.128205
7  5.128205
8  5.128205
9  5.128205
10 5.128205
output clipped ...

So I need a second column showing which value (A, B, C, etc.) each frequency belongs to. How can this be done?

EDIT:

Oops, I think I've got it:

df %>%
  filter(!is.na(v1) & !id == lag(id)) %>%
  summarise(freq = sort(prop.table(table(v2)), decreasing = TRUE)*100,
            value = names(sort(prop.table(table(v2)), decreasing = TRUE)))
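
The edit above computes the sorted table twice, once for the frequencies and once for the names. As a minimal sketch of an alternative (not the asker's method), the table can be built once and converted with base R's `as.data.frame()`, which keeps the labels in their own column. The `set.seed(1)` call and the names `filtered`, `tab` and `freq_df` are assumptions added only for illustration.

```r
library(dplyr)

set.seed(1)  # assumed seed, only so the sketch is reproducible
df <- data.frame(
  id = sample(1:5, 100, replace = TRUE),
  v1 = sample(c(NA, rnorm(10)), 100, replace = TRUE),
  v2 = sample(LETTERS, 100, replace = TRUE)
)

# Same filter as in the question
filtered <- df %>%
  filter(!is.na(v1) & id != lag(id))

# Build the percentage table once, sorted from most to least frequent
tab <- sort(prop.table(table(v2 = filtered$v2)), decreasing = TRUE) * 100

# A one-way table converts to a data frame with a label column
# and a value column (named via responseName)
freq_df <- as.data.frame(tab, responseName = "freq")
head(freq_df)
```

The result has one row per letter, so the value and its percentage stay side by side without repeating the `sort(prop.table(table(...)))` expression.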

Tags: r dplyr

【Solution 1】:

A more dplyr-style way would be:

library(dplyr)

df %>%
  filter(!is.na(v1) & id != lag(id)) %>%
  count(v2, name = 'freq', sort = TRUE) %>%
  mutate(freq = prop.table(freq) * 100)

#   v2     freq
#1   M 9.090909
#2   Q 7.792208
#3   K 6.493506
#4   R 6.493506
#5   T 6.493506
#6   B 5.194805
#7   C 5.194805
#8   F 5.194805
#9   I 5.194805
#10  U 5.194805
#11  G 3.896104
#12  J 3.896104
#13  S 3.896104
#14  V 3.896104
#15  W 3.896104
#16  A 2.597403
#17  N 2.597403
#18  X 2.597403
#19  D 1.298701
#20  E 1.298701
#21  H 1.298701
#22  L 1.298701
#23  O 1.298701
#24  P 1.298701
#25  Y 1.298701
#26  Z 1.298701
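
Solution 1 writes the filter condition as `id != lag(id)`, while the question uses `!id == lag(id)`. The two are equivalent because in R the unary `!` has lower precedence than `==`. The sketch below checks this on a small hypothetical vector (the values in `id` and the name `kept` are made up for illustration) and also shows that the first row is always dropped, since `lag()` returns `NA` there.

```r
library(dplyr)

id <- c(1, 1, 2, 2, 3)  # hypothetical toy vector, not from the question

# `!` binds more loosely than `==`, so this parses as !(id == lag(id))
a <- !id == lag(id)
b <- id != lag(id)
identical(a, b)  # TRUE

# lag() pads the first position with NA, so the first comparison is NA,
# and filter() drops NA rows along with the FALSE ones
kept <- data.frame(id = id) %>%
  filter(id != lag(id))
kept$id  # 2 3
```

In other words, the filter keeps only rows whose `id` differs from the row directly before it.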

【Solution 2】:

library(dplyr)

df %>%
  filter(!is.na(v1) & !id == lag(id)) %>%
  mutate(n_total = n()) %>%                          # total rows after filtering
  group_by(v2) %>%
  summarise(freq = n(), n_total = max(n_total)) %>%  # count per letter
  mutate(freq = 100*freq/n_total) %>%                # convert count to percent
  select(-n_total) %>%
  arrange(-freq)

【Solution 3】:

More dplyr, plus a little janitor:

library(dplyr)
library(janitor)

df %>%
  filter(!is.na(v1) & !id == lag(id)) %>%
  tabyl(v2) %>%
  rename(freq = percent) %>%
  mutate(freq = freq * 100) %>%
  select(-n) %>%
  arrange(desc(freq))


 v2     freq
  M 8.641975
  W 7.407407
  A 6.172840
  K 6.172840
  N 6.172840
  U 6.172840
  G 4.938272
  S 4.938272
  T 4.938272
  Y 4.938272
  D 3.703704
  F 3.703704
  H 3.703704
  J 3.703704
  P 3.703704
  V 3.703704
  C 2.469136
  L 2.469136
  O 2.469136
  Q 2.469136
  X 2.469136
  E 1.234568
  I 1.234568
  R 1.234568
  Z 1.234568

【Comments】:

You could also have used janitor::adorn_pct_formatting(digits = 2) to save 1-2 steps.
@AnilGoyal Yes, thanks, that sounds great.
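
As a rough sketch of what the comment's suggestion might look like, assuming the janitor package is installed: `adorn_pct_formatting()` is applied directly to the `tabyl()` result, which replaces the manual multiplication by 100. The `set.seed(1)` call and the name `res` are assumptions for illustration. Note that the `percent` column becomes a character column such as "5.13%", and the rows here stay in `v2` order rather than being sorted by frequency.

```r
library(dplyr)
library(janitor)

set.seed(1)  # assumed seed, only so the sketch is reproducible
df <- data.frame(
  id = sample(1:5, 100, replace = TRUE),
  v1 = sample(c(NA, rnorm(10)), 100, replace = TRUE),
  v2 = sample(LETTERS, 100, replace = TRUE)
)

res <- df %>%
  filter(!is.na(v1) & id != lag(id)) %>%
  tabyl(v2) %>%                       # columns: v2, n, percent (a proportion)
  adorn_pct_formatting(digits = 2)    # percent formatted as text with a % sign

head(res)
```

Because the formatted column is text, any sorting by frequency would need to use the `n` column rather than `percent`.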