【Question title】: How do I generate ngrams in a dataframe so that each ngram creates a new row?
【Posted】: 2020-12-09 21:20:38
【Question】:

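I am trying to use ngram_asweka to identify the ngrams in a character vector row by row, while keeping the accompanying data such as item number and participant/control. I have tried tapply and sapply without success. My data frame has more columns, but the basic format looks like this: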

Item	Phrase
1.	Cats and dogs
2.	birds and bees

I need it to output:

Item	Phrase	Ngram
1.	Cats and dogs	cats and dogs
1.	Cats and dogs	cats and
1.	Cats and dogs	and dogs
2.	birds and bees	birds and bees
2.	birds and bees	birds and

Here is my ngram function:

myngram <- function(x) {
  x <- ngram_asweka(x, min = 2, max = 5, sep = " ") %>% data.frame()
  return(x)
}

This is the code I tried, which does not work:

x<-tapply(df$phrase, df$ID, myngram) %>% data.frame()

The error message reads: "Error in ngram_asweka(x, min = 2, max = 5, sep = " ") : attempt to set index 2/2 in SET_STRING_ELT"

Thank you for your help.

【Question comments】:

Tags: r multiple-columns sapply

【Solution 1】:

In your test example you probably want max = 3 for ngram_asweka, because each string is only 3 words long, so the largest n-gram it can contain is a 3-gram.
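To check this before calling ngram_asweka, the word count of each phrase can be computed in base R. The following is a minimal sketch that assumes words are separated by whitespace. The variable names are illustrative, and the limit of 5 is the max value from the question.

```r
# Base R only; the phrases are copied from the question's example data.
phrases <- c("Cats and dogs", "birds and bees")

# Count whitespace-separated words in each phrase.
n_words <- lengths(strsplit(trimws(phrases), "\\s+"))

# Cap the requested maximum n-gram size (5 in the question) at the
# shortest phrase, so that no phrase is asked for more words than it has.
max_n <- min(5L, n_words)

# A per-phrase cap, for data where phrase lengths differ.
max_per_phrase <- pmin(5L, n_words)

n_words
max_n
```

With phrases of differing length, the per-phrase cap keeps the longer n-grams for the longer phrases, whereas a single global cap would drop them.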

Here is one option using the tidyverse. You can use group_by to compute the result for each item, and group_modify to build the result rows, which contain the phrase and its n-grams.

library(tidyverse)
library(ngram)

df %>%
  group_by(item) %>%
  group_modify(function(x, y) 
     tibble(phrase = x$phrase,
            ngram = ngram_asweka(x$phrase, min = 2, max = 3, sep = " ")))

If you have a larger dataset and want to keep its other columns in the output, you can use this alternative:

df %>%
  group_by(item) %>%
  group_modify(~ bind_cols(select(.x, everything()),
                           ngram = ngram_asweka(.x$phrase, min = 2, max = 3, sep = " ")))

Output

   item phrase         ngram         
  <dbl> <chr>          <chr>         
1     1 Cats and dogs  Cats and dogs 
2     1 Cats and dogs  Cats and      
3     1 Cats and dogs  and dogs      
4     2 birds and bees birds and bees
5     2 birds and bees birds and     
6     2 birds and bees and bees

Data

df <- structure(list(item = c(1, 2), phrase = c("Cats and dogs", "birds and bees"
)), class = "data.frame", row.names = c(NA, -2L))
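If the ngram or tidyverse packages are not available, the same one-row-per-n-gram reshaping can be sketched in base R. This is an illustration under stated assumptions rather than a replacement for the answer above. word_ngrams() is a hypothetical helper written for this example, not part of the ngram package. It splits on whitespace, and it returns n-grams shortest first, so the row order differs from the ngram_asweka output shown earlier.

```r
# Example data, same values as in the answer.
df <- data.frame(item = c(1, 2),
                 phrase = c("Cats and dogs", "birds and bees"),
                 stringsAsFactors = FALSE)

# Hypothetical helper (not from the ngram package): all word n-grams of one
# string for n in min..max, skipping sizes longer than the string itself.
word_ngrams <- function(s, min = 2, max = 3) {
  words <- strsplit(trimws(s), "\\s+")[[1]]
  sizes <- seq_len(length(words))
  sizes <- sizes[sizes >= min & sizes <= max]
  grams <- lapply(sizes, function(n) {
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  as.character(unlist(grams))
}

# One list element per input row; each element repeats that row once per
# n-gram, so every other column of df is carried along unchanged.
rows <- lapply(seq_len(nrow(df)), function(i) {
  grams <- word_ngrams(df$phrase[i])
  out <- df[rep(i, length(grams)), , drop = FALSE]
  out$ngram <- grams
  out
})

# Stack the per-row pieces and reset the row names.
result <- do.call(rbind, rows)
rownames(result) <- NULL
result
```

Because each input row is repeated with all of its columns, no column names need to be listed by hand.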

【Comments】: