【Question title】: Compute frequency list of final words in utterances of variable length
【Posted】: 2021-05-06 11:25:34
【Question】:

I have a large data frame containing utterances of variable sizes:

df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3), 
                     w1 = c("come", "why", "er", "well", "she", "well", "er", "well"), 
                     w2 = c("on","that", "i", "not", "'s", "thanks", "super", "she"), 
                     w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"), 
                     w4 = c(NA,NA, NA, NA, "on", "can", NA, NA)), 
                row.names = c(NA, -8L), class = "data.frame")

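I want to compare the utterance-initial words in w1 with the utterance-final words, which can sit in any of the other w columns, using frequency lists with counts and proportions. I can compute the frequency list of the utterance-initial words: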

library(dplyr)
df %>%
  group_by(w1) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop))
# A tibble: 5 x 3
  w1        n  prop
  <chr> <int> <dbl>
1 well      3 0.375
2 er        2 0.25 
3 come      1 0.125
4 she       1 0.125
5 why       1 0.125

But how can I compute the frequency list of the utterance-final words when they are in different w columns?

Expected:

# A tibble: 5 x 3
  w_last    n  prop
  <chr> <int> <dbl>
1 can       3 0.375
2 on        2 0.25 
3 cool      1 0.125
4 that      1 0.125
5 today     1 0.125

I eventually found another solution:

df %>%
  mutate(w_last = c(apply(., 1, function(x) tail(na.omit(x), 1)))) %>%
  group_by(w_last) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop))

【Comments】:

  • I find "can" only once. Also, "lot" and "does" appear as last words.
  • Yes, you're right. I've updated the example!

Tags: r dplyr


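【Solution 1】:

Three approaches in tidyverse syntax

1. You can extract final_word into a separate column and build a prop.table on it (dplyr only for the extraction; janitor::tabyl() builds the table).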

df %>% rowwise() %>%
  mutate(final_word = get(paste0('w', size))) %>%
  janitor::tabyl(final_word)

final_word n percent
        can 3   0.375
       cool 1   0.125
         on 2   0.250
       that 1   0.125
      today 1   0.125
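If you would rather get the n / prop layout from the question without janitor, the same extraction can be finished with dplyr alone. This is a minimal sketch, not part of the original answer: the names final_freq and w_last are chosen here for illustration, and it assumes size always points at an existing w column.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

final_freq <- df %>%
  rowwise() %>%
  # look up the column named "w<size>" for the current row
  mutate(w_last = get(paste0("w", size))) %>%
  ungroup() %>%
  # count each final word, most frequent first
  count(w_last, sort = TRUE) %>%
  mutate(prop = n / sum(n))

final_freq
```

The result has the columns w_last, n and prop, so it can be compared directly with the frequency list of the initial words.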

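2. Reshape the data a little.

  • Pivot to long format.
  • Keep only the rows where size matches the word number.
  • Use janitor::tabyl() to generate your prop.table (which can be formatted further in useful ways within janitor).
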
df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3), 
                     w1 = c("come", "why", "er", "well", "she", "well", "er", "well"), 
                     w2 = c("on","that", "i", "not", "'s", "thanks", "super", "she"), 
                     w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"), 
                     w4 = c(NA,NA, NA, NA, "on", "can", NA, NA)), 
                row.names = c(NA, -8L), class = "data.frame")


df
#>   size   w1     w2    w3   w4
#> 1    2 come     on  <NA> <NA>
#> 2    2  why   that  <NA> <NA>
#> 3    3   er      i   can <NA>
#> 4    3 well    not today <NA>
#> 5    4  she     's going   on
#> 6    4 well thanks  they  can
#> 7    3   er  super  cool <NA>
#> 8    3 well    she   can <NA>
library(tidyverse)
library(janitor)

df %>% pivot_longer(!size, values_drop_na = T) %>%
  filter(as.numeric(substr(name, 2, nchar(name))) == size) %>%
  janitor::tabyl(value)
#>  value n percent
#>    can 3   0.375
#>   cool 1   0.125
#>     on 2   0.250
#>   that 1   0.125
#>  today 1   0.125

Created on 2021-05-06 by the reprex package (v2.0.0)
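The substr() step above can probably be avoided by letting pivot_longer() parse the word position itself. The following is a sketch under that assumption, not the answerer's code: pos, w_last and long_freq are names introduced here, and names_transform needs a reasonably recent tidyr (1.1.0 or later, as far as I know).

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

long_freq <- df %>%
  pivot_longer(
    !size,
    names_to = "pos",                          # word position within the utterance
    names_prefix = "w",                        # strip the leading "w" from w1, w2, ...
    names_transform = list(pos = as.integer),  # "3" -> 3L
    values_to = "w_last",
    values_drop_na = TRUE
  ) %>%
  # keep only the row whose position equals the utterance length
  filter(pos == size) %>%
  count(w_last, sort = TRUE) %>%
  mutate(prop = n / sum(n))

long_freq
```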


3. By the way, you can also reverse the alignment so that words are counted from the last column, using unite and separate from tidyr.

df %>% unite('W', starts_with('w'), sep = '=', na.rm = T, remove = T) %>%
  separate(W, into = paste0('w', seq_len(1 + max(str_count(.$W, '=')))), fill = 'left', sep = '=')

  size   w1     w2    w3    w4
1    2 <NA>   <NA>  come    on
2    2 <NA>   <NA>   why  that
3    3 <NA>     er     i   can
4    3 <NA>   well   not today
5    4  she     's going    on
6    4 well thanks  they   can
7    3 <NA>     er super  cool
8    3 <NA>   well   she   can
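Once the words are right-aligned like this, every final word sits in the last column, so the frequency list is one count away. A sketch of that last step, assuming that "=" never occurs inside a word; aligned, last_col_name and right_freq are illustrative names, not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Sample data from the question
df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

# Right-align the words: paste them together, then split with left padding
aligned <- df %>%
  unite("W", starts_with("w"), sep = "=", na.rm = TRUE) %>%
  separate(W,
           into = paste0("w", seq_len(1 + max(str_count(.$W, "=")))),
           sep = "=", fill = "left")

# The final word of every utterance is now in the last column
last_col_name <- names(aligned)[ncol(aligned)]

right_freq <- aligned %>%
  count(w_last = .data[[last_col_name]], sort = TRUE) %>%
  mutate(prop = n / sum(n))

right_freq
```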

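【Comments】:

    【Solution 2】:

    You can subset df with a two-column matrix made of the row numbers (seq_len(nrow(df))) and the values in df$size, build a table from the result, and compute proportions.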

    tt <- table(df[-1][cbind(seq_len(nrow(df)), df$size)])
    cbind(tt, proportions(tt))
    #      tt      
    #can    3 0.375
    #cool   1 0.125
    #on     2 0.250
    #that   1 0.125
    #today  1 0.125
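The indexing trick works because a two-column matrix used as an index selects one cell per row: the first column is the row number, the second the column number. Below is a sketch of how the result could be turned into a sorted data frame in the n / prop layout the question asks for. The names idx, w_last and freq are introduced here for illustration, and prop.table() is used instead of proportions() only so that it also runs on older R versions.

```r
# Sample data from the question
df <- data.frame(
  size = c(2, 2, 3, 3, 4, 4, 3, 3),
  w1 = c("come", "why", "er", "well", "she", "well", "er", "well"),
  w2 = c("on", "that", "i", "not", "'s", "thanks", "super", "she"),
  w3 = c(NA, NA, "can", "today", "going", "they", "cool", "can"),
  w4 = c(NA, NA, NA, NA, "on", "can", NA, NA),
  stringsAsFactors = FALSE
)

# One (row, column) pair per utterance; the column is the utterance length
idx <- cbind(seq_len(nrow(df)), df$size)

# Drop the size column so that column k is the k-th word, then pick the cells
w_last <- as.matrix(df[-1])[idx]

# Count, sort by frequency, and attach proportions
tt <- sort(table(w_last), decreasing = TRUE)
freq <- data.frame(
  w_last = names(tt),
  n = as.integer(tt),
  prop = as.numeric(prop.table(tt)),
  stringsAsFactors = FALSE
)

freq
```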
    

    【Comments】:

      【Solution 3】:

      A base R option

      out <- rev(
        stack(
          prop.table(
            table(apply(df, 1, function(x) tail(na.omit(x), 1)))
          )
        )
      )
      

      which gives

          ind values
      1   can  0.375
      2  cool  0.125
      3    on  0.250
      4  that  0.125
      5 today  0.125
      

      If you want to sort the rows in descending order, you can do

      > out[order(-out$values), ]
          ind values
      1   can  0.375
      3    on  0.250
      2  cool  0.125
      4  that  0.125
      5 today  0.125
      

      【Comments】:
