Compute frequency list of final words in utterances of variable length: answers

【Question title】: Compute frequency list of final words in utterances of variable length
【Posted】: 2021-05-06 11:25:34
【Question description】:

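I have a large data frame containing utterances of variable size (the number of words per utterance is stored in the column size):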

df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3), 
                     w1 = c("come", "why", "er", "well", "she", "well", "er", "well"), 
                     w2 = c("on","that", "i", "not", "'s", "thanks", "super", "she"), 
                     w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"), 
                     w4 = c(NA,NA, NA, NA, "on", "can", NA, NA)), 
                row.names = c(NA, -8L), class = "data.frame")

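I want to compare the utterance-initial words in w1 with the utterance-final words, which are spread over the other w columns, using frequency lists with counts and proportions. I can compute the frequency list for the utterance-initial words: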

library(dplyr)
df %>%
  group_by(w1) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop))
# A tibble: 5 x 3
  w1        n  prop
  <chr> <int> <dbl>
1 well      3 0.375
2 er        2 0.25 
3 come      1 0.125
4 she       1 0.125
5 why       1 0.125

But how can I compute the corresponding list for the utterance-final words, given that they sit in different w columns?

Expected:

# A tibble: 5 x 3
  w_last    n  prop
  <chr> <int> <dbl>
1 can       3 0.375
2 on        2 0.25 
3 cool      1 0.125
4 that      1 0.125
5 today     1 0.125

EDIT: finally, here is another solution:

df %>%
  mutate(w_last = c(apply(., 1, function(x) tail(na.omit(x), 1)))) %>%
  group_by(w_last) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop))

【Comments】:

I only find can once. There are also lot and does as final words.
Yes, you are right. I have updated the example!

Tags: r dplyr

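【Solution 1】:

Three approaches in tidyverse style

1. Extract the final word into a separate column, final_word, and build a proportion table on it. (The extraction needs only dplyr; the table itself comes from janitor::tabyl().)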

df %>% rowwise() %>%
  mutate(final_word = get(paste0('w', size))) %>%
  janitor::tabyl(final_word)

final_word n percent
        can 3   0.375
       cool 1   0.125
         on 2   0.250
       that 1   0.125
      today 1   0.125
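If the result should look like the expected output in the question (a w_last column with counts and proportions, sorted by frequency), a vectorised variant is possible. The following is only a sketch, not part of the original answer: it swaps the rowwise get() lookup for dplyr::coalesce(), and it assumes the columns are named w1 to w4 and that an utterance has no NA before its final word.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA)
)

res <- df %>%
  # coalesce() returns the first non-missing value per row, so listing the
  # columns from last to first yields the utterance-final word
  mutate(w_last = coalesce(w4, w3, w2, w1)) %>%
  count(w_last, sort = TRUE) %>%      # counts, most frequent first
  mutate(prop = n / sum(n))           # proportions

res
```

With more word columns, the list passed to coalesce() has to be extended by hand, which is why the size-based lookup above scales better to wide data.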

2. Reshape the data a little.

Pivot to long format.
Keep only the rows where size matches the word number.
Use janitor::tabyl() to generate your prop.table (janitor can format it further in useful ways).

df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3), 
                     w1 = c("come", "why", "er", "well", "she", "well", "er", "well"), 
                     w2 = c("on","that", "i", "not", "'s", "thanks", "super", "she"), 
                     w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"), 
                     w4 = c(NA,NA, NA, NA, "on", "can", NA, NA)), 
                row.names = c(NA, -8L), class = "data.frame")


df
#>   size   w1     w2    w3   w4
#> 1    2 come     on  <NA> <NA>
#> 2    2  why   that  <NA> <NA>
#> 3    3   er      i   can <NA>
#> 4    3 well    not today <NA>
#> 5    4  she     's going   on
#> 6    4 well thanks  they  can
#> 7    3   er  super  cool <NA>
#> 8    3 well    she   can <NA>
library(tidyverse)
library(janitor)

df %>% pivot_longer(!size, values_drop_na = TRUE) %>%
  filter(as.numeric(substr(name, 2, nchar(name))) == size) %>%
  janitor::tabyl(value)
#>  value n percent
#>    can 3   0.375
#>   cool 1   0.125
#>     on 2   0.250
#>   that 1   0.125
#>  today 1   0.125

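Created on 2021-05-06 by the reprex package (v2.0.0)

3. Incidentally, you can also right-align the words, so that counting starts from the last column, using unite and separate from tidyr. After this step the final word of every utterance sits in the last column: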

df %>% unite('W', starts_with('w'), sep = '=', na.rm = TRUE, remove = TRUE) %>%
  separate(W, into = paste0('w', seq_len(1 + max(str_count(.$W, '=')))), fill = 'left', sep = '=')

  size   w1     w2    w3    w4
1    2 <NA>   <NA>  come    on
2    2 <NA>   <NA>   why  that
3    3 <NA>     er     i   can
4    3 <NA>   well   not today
5    4  she     's going    on
6    4 well thanks  they   can
7    3 <NA>     er super  cool
8    3 <NA>   well   she   can
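The answer stops at the right-aligned table. One way to finish the job might look like the sketch below, which counts the last column. The intermediate names (united, n_words, aligned, last_col) are made up for illustration, and the approach assumes that no word contains the separator character "=".

```r
library(dplyr)
library(tidyr)
library(stringr)

# Example data from the question
df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA)
)

# Glue the words of each utterance together, dropping the NAs
united <- df %>%
  unite("W", starts_with("w"), sep = "=", na.rm = TRUE)

# Length of the longest utterance, in words
n_words <- 1 + max(str_count(united$W, "="))

# Split again, padding shorter utterances with NA on the left
aligned <- united %>%
  separate(W, into = paste0("w", seq_len(n_words)), sep = "=", fill = "left")

# The final word is now always in the last column
last_col <- paste0("w", n_words)

res <- aligned %>%
  count(w_last = .data[[last_col]], sort = TRUE) %>%
  mutate(prop = n / sum(n))

res
```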

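【Comments】:

【Solution 2】:

You can subset df with a two-column matrix holding the row numbers (seq_len(nrow(df))) and the values in df$size as column numbers, build a table from the result, and compute the proportions.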

tt <- table(df[-1][cbind(seq_len(nrow(df)), df$size)])
cbind(tt, proportions(tt))
#      tt      
#can    3 0.375
#cool   1 0.125
#on     2 0.250
#that   1 0.125
#today  1 0.125
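The one-liner relies on matrix indexing: when a matrix is indexed by a two-column numeric matrix, each row of the index is read as one (row, column) pair and a single element is returned per pair. A step-by-step sketch of the same idea in base R follows; the names words, idx, final_words and res are illustrative only and do not come from the answer.

```r
# Example data from the question
df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA)
)

# Character matrix of the word columns only (size dropped)
words <- as.matrix(df[-1])

# One (row, column) pair per utterance: row i, column size[i]
idx <- cbind(seq_len(nrow(df)), df$size)

# Matrix indexing picks exactly one cell per pair
final_words <- words[idx]

# Counts and proportions, most frequent first
tt <- table(final_words)
res <- data.frame(
  w_last = names(tt),
  n = as.vector(tt),
  prop = as.vector(tt) / sum(tt)
)
res <- res[order(-res$n), ]
res
```

This works because size equals the position of the last word among w1 to w4; if the word columns were not contiguous or not in order, the column numbers would have to be computed differently.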

【Comments】:

【Solution 3】:

A base R option

out <- rev(
  stack(
    prop.table(
      table(apply(df, 1, function(x) tail(na.omit(x), 1)))
    )
  )
)

which gives

    ind values
1   can  0.375
2  cool  0.125
3    on  0.250
4  that  0.125
5 today  0.125

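If you want to sort the rows in descending order of proportion, you can do this (the column is called values, so it is spelled out in full rather than relying on partial matching by $):

> out[order(-out$values), ]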
    ind values
1   can  0.375
3    on  0.250
2  cool  0.125
4  that  0.125
5 today  0.125

【Comments】: