【Posted】: 2021-05-06 11:25:34
【Question】:
I have a large data frame containing utterances of variable sizes:
df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3),
                     w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
                     w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
                     w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
                     w4 = c(NA, NA, NA, NA, "on", "can", NA, NA)),
                row.names = c(NA, -8L), class = "data.frame")
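I would like to compare the utterance-initial words in w1 with the utterance-final words, which can sit in any of the w columns, by means of frequency lists with counts and proportions. I can compute the frequency list for the utterance-initial words: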
library(dplyr)
df %>%
  group_by(w1) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop))
# A tibble: 5 x 3
w1 n prop
<chr> <int> <dbl>
1 well 3 0.375
2 er 2 0.25
3 come 1 0.125
4 she 1 0.125
5 why 1 0.125
But how can I compute the same list for the utterance-final words, given that they are in different w columns?
Expected:
# A tibble: 5 x 3
w_last n prop
<chr> <int> <dbl>
1 can 3 0.375
2 on 2 0.25
3 cool 1 0.125
4 that 1 0.125
5 today 1 0.125
I eventually found another solution:
df %>%
  mutate(w_last = c(apply(., 1, function(x) tail(na.omit(x), 1)))) %>%
  group_by(w_last) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop))
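As a possible alternative to the row-wise apply() above, here is a minimal sketch that uses dplyr's coalesce(). It assumes the word columns are exactly w1 to w4, as in the example data, and that every row has at least one non-NA word. The name last_words is only illustrative.

```r
library(dplyr)

# Example data from the question, rebuilt with data.frame()
df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

last_words <- df %>%
  # coalesce() returns the first non-NA value per row, so listing the
  # columns from right to left picks the last word of each utterance
  mutate(w_last = coalesce(w4, w3, w2, w1)) %>%
  # count() is shorthand for group_by() + summarise(n = n());
  # sort = TRUE orders the result by descending n
  count(w_last, sort = TRUE) %>%
  mutate(prop = n / sum(n))

last_words
```

Because coalesce() is vectorised over whole columns, this avoids converting each row to a character vector, which may matter on a large data frame. The trade-off is that the columns are listed by hand, so a data frame with more w columns would need a longer list.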
【Comments】:
- I can find "can" only once. There are also "lot" and "does" as last words.
- Yes, you're right. I've updated the example!