Tidytext - set expressions as a single token

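【Question title】: Tidytext - set expressions as a single token
【Posted】: 2021-12-04 13:59:22
【Question】:

I am trying to split my text data into tokens with the unnest_tokens function from the tidytext package. The problem is that some expressions appear several times, and I would like to keep each of them as a single token instead of several tokens.

Normal result: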

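library(dplyr)
library(tidytext)

df <- data.frame(
  Id = c(1, 2),
  Text = c('A first nice text', 'A second nice text')
)

# Split the Text column into one lowercase word per row, stored in Word
df %>% 
  unnest_tokens(Word, Text)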

  Id   Word
1  1      a
2  1  first
3  1   nice
4  1   text
5  2      a
6  2 second
7  2   nice
8  2   text

What I want (expression = "nice text"):

df <- data.frame(
  Id = c(1, 2),
  Text = c('A first nice text', 'A second nice text')
)

df %>% 
  unnest_tokens(Word, Text)

  Id   Word
1  1      a
2  1  first
3  1   nice text
4  2      a
5  2 second
6  2   nice text

【Comments on the question】:

If one of the answers solved your problem, please consider accepting it.

Tags: r tidytext

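【Solution 1】:

Here is a concise solution based on negative lookahead (?!...). It stops separate_rows from splitting at a whitespace \\s when the word text follows it, which keeps nice text together in this data (\\b is a word-boundary anchor, so that, say, "nice texts" is still split). Note that the leading (?!\\bnice\\b) is evaluated at the whitespace itself and therefore never fails; it does not test whether nice stands to the left of \\s.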

library(tidyr)
df %>%
  separate_rows(Text, sep = "(?!\\bnice\\b)\\s(?!\\btext\\b)")
# A tibble: 6 × 2
     Id Text     
  <dbl> <chr>    
1     1 A        
2     1 first    
3     1 nice text
4     2 A        
5     2 second   
6     2 nice text
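
To see which part of the pattern does the work, the split can be reproduced with base R's strsplit() and perl = TRUE. This is a minimal sketch that assumes a PCRE-style engine (which engine separate_rows uses may depend on the tidyr version). The example sentence and the stricter pattern `strict` are illustrative additions, not part of the answer above.

```r
# Illustrative input with "text" both without and with "nice" in front of it
x <- "A first text and a nice text"

# Pattern from the answer: never split at a whitespace followed by the word "text"
loose <- "(?!\\bnice\\b)\\s(?!\\btext\\b)"
loose_tokens <- strsplit(x, loose, perl = TRUE)[[1]]
print(loose_tokens)
# "A" "first text" "and" "a" "nice text"

# Assumed stricter variant: refuse to split only when the "text" after the
# whitespace is itself preceded by "nice" plus one whitespace character
strict <- "\\s(?!(?<=\\bnice\\s)text\\b)"
strict_tokens <- strsplit(x, strict, perl = TRUE)[[1]]
print(strict_tokens)
# "A" "first" "text" "and" "a" "nice text"
```

With the original pattern "first text" also stays in one piece, which does not matter for the two example sentences but may matter on other data.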

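A more advanced regex option is (*SKIP)(*F): the protected expression is matched first and then discarded, so only the whitespace outside it is left to act as a separator: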

df %>%
  separate_rows(Text, sep = "(\\bnice text\\b)(*SKIP)(*F)|\\s")

For more information, see: How do (*SKIP) or (*F) work on regex?
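
The same idea extends to several protected expressions by placing them in one alternation in front of (*SKIP)(*F). The following is a sketch using base R's strsplit() with perl = TRUE, since these verbs are PCRE features. The helper name split_keep_phrases and the second phrase "long run" are hypothetical, and the phrases are assumed to contain no regex metacharacters.

```r
# Hypothetical helper: split on whitespace but keep the given phrases intact.
# Assumes the phrases consist of plain words separated by single spaces.
split_keep_phrases <- function(x, phrases) {
  protected <- paste0("\\b(?:", paste(phrases, collapse = "|"), ")\\b")
  # A protected phrase is matched, then (*SKIP)(*F) rejects that match and
  # resumes the search after it, so only whitespace outside phrases splits
  pattern <- paste0(protected, "(*SKIP)(*F)|\\s")
  strsplit(x, pattern, perl = TRUE)
}

tokens <- split_keep_phrases(
  c("A first nice text", "A second nice text in a long run"),
  phrases = c("nice text", "long run")
)
print(tokens)
```

The result is a list with one character vector per input string, so it would still need to be unnested to get one token per row.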

【Comments】:

That is what I had in mind: a regex-based solution that excludes "nice text" from the split.

【Solution 2】:

A bit verbose, and there may be an option to exclude certain phrases in unnest_tokens, but it does the trick:

library(tidyverse)
library(tidytext)

# Tokenize first: one lowercase word per row in the Word column
df <- data.frame(Id = c(1, 2),
                 Text = c('A first nice text', 'A second nice text')) %>%
  unnest_tokens(Word, Text)

df %>%
  group_by(Id) %>%
  # Relabel a 'text' that directly follows 'nice' as the full expression
  mutate(Word = if_else(lag(Word, default = '') == 'nice' & Word == 'text', 'nice text', Word)) %>%
  # Drop the now redundant 'nice' row that precedes the merged token
  filter(!(Word == 'nice' & lead(Word, default = '') == 'nice text')) %>%
  ungroup()

This gives:

# A tibble: 6 x 2
     Id Word     
  <dbl> <chr>    
1     1 a        
2     1 first    
3     1 nice text
4     2 a        
5     2 second   
6     2 nice text
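
The merge step above is tied to one two-word expression. As a sketch of how the same tokenize-then-merge idea could be written for an expression of any length, the base R function below scans a token vector and collapses each run that matches the expression. merge_phrase is a hypothetical helper, not a tidytext function, and it works on the tokens of a single document.

```r
# Hypothetical helper: collapse every occurrence of `phrase` (given as one
# space-separated string) inside a character vector of tokens.
merge_phrase <- function(tokens, phrase) {
  parts <- strsplit(phrase, " ", fixed = TRUE)[[1]]
  n <- length(parts)
  out <- character(0)
  i <- 1
  while (i <= length(tokens)) {
    last <- i + n - 1
    if (last <= length(tokens) && identical(tokens[i:last], parts)) {
      out <- c(out, phrase)  # the whole expression becomes one token
      i <- last + 1
    } else {
      out <- c(out, tokens[i])  # ordinary token, copied unchanged
      i <- i + 1
    }
  }
  out
}

merged <- merge_phrase(c("a", "first", "nice", "text"), "nice text")
print(merged)
# "a" "first" "nice text"
```

With dplyr 1.1 or later, one way to apply it per document would be group_by(Id) followed by reframe(Word = merge_phrase(Word, "nice text")).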

【Comments】: