Tokenizing issue: answers

【Question title】: Tokenizing issue
【Posted】: 2023-03-08 11:35:02
【Question】:

I am trying to tokenize a sentence as follows.

Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section)

When I tokenize it with tidytext using the code below,

library(tidyverse)  # provides %>%, mutate(), str_*(), map(), unnest()

AA <- df %>%
  mutate(tokens = str_extract_all(Section, "([^\\s]+)"),
         locations = str_locate_all(Section, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-Section) %>%
  unnest(c(tokens, locations))

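I get a result set (shown as an image in the original post).

How do I get the comma and the period as independent tokens, rather than as part of "occurs," and "infusion." respectively, using tidytext? My tokens should be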

If
an
infusion
reaction
occurs
,
interrupt
the
infusion
.

【Comments】:

Tags: r regex tokenize tidytext

【Solution 1】:

Replace the punctuation marks beforehand: put a space in front of each one, then split the sentence at the spaces.

include = c(".", ",") #The symbols that should be included

mystr = Section  # copy data
for (mypattern in include){
    mystr = gsub(pattern = mypattern,
                 replacement = paste0(" ", mypattern),
                 x = mystr, fixed = TRUE)
}
lapply(strsplit(mystr, " "), function(V) data.frame(Tokens = V))
#[[1]]
#      Tokens
#1         If
#2         an
#3   infusion
#4   reaction
#5     occurs
#6          ,
#7  interrupt
#8        the
#9   infusion
#10         .
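
The same idea can be wrapped in a small function so it applies to several sentences at once. This is only a sketch: the name `tokenize_keep_punct` is made up for illustration and is not part of the original answer, and splitting on runs of spaces (plus `trimws()`) is an added assumption so that text which already has a space before a comma does not produce empty tokens.

```r
# Hypothetical helper; the name and the " +" split are illustrative assumptions.
tokenize_keep_punct <- function(x, include = c(".", ",")) {
  for (sym in include) {
    # fixed = TRUE treats "." literally instead of as a regex wildcard
    x <- gsub(sym, paste0(" ", sym), x, fixed = TRUE)
  }
  # Split on one or more spaces; trimws() avoids an empty leading token
  strsplit(trimws(x), " +")
}

tokens <- tokenize_keep_punct(c("Stop, now.", "Wait , please."))
print(tokens)
```

Each element of the returned list holds the tokens of one input sentence, so it can be passed to `lapply()` in the same way as above.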

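【Comments】:

【Solution 2】:

Note that this ends up increasing the length of the string, so the positions refer to the modified string: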

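df %>%
  mutate(Section = gsub("([,.])", " \\1", Section),  # put a space before each , and .
         start = gregexpr("\\S+", Section),          # token starts in the padded string
         end = list(attr(start[[1]], "match.length") + unlist(start)),  # start + length
         Section = strsplit(Section, "\\s+")) %>%
  unnest(cols = c(Section, start, end))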

     Section start end
1         If     1   3
2         an     4   6
3   infusion     7  15
4   reaction    16  24
5     occurs    25  31
6          ,    32  33
7  interrupt    34  43
8        the    44  47
9   infusion    48  56
10         .    57  58
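
To make the positions above easier to read, here is a base R sketch of the same computation. It shows two things that follow from the code: the padded string is two characters longer than the original (one space per punctuation mark), and `end` as computed there is `start + match.length`, that is, one past the last character of each token. The variable names are illustrative.

```r
Section <- "If an infusion reaction occurs, interrupt the infusion."

# Same padding step as above: insert a space before each comma and period
padded <- gsub("([,.])", " \\1", Section)
extra_chars <- nchar(padded) - nchar(Section)

# Locate every run of non-space characters in the padded string
m <- gregexpr("\\S+", padded)
tokens <- regmatches(padded, m)[[1]]
start <- as.integer(m[[1]])
end_exclusive <- start + attr(m[[1]], "match.length")
end_inclusive <- end_exclusive - 1L

print(data.frame(tokens, start, end_exclusive, end_inclusive))
```

If positions in the original, unpadded string are needed, matching directly on the original text avoids the shift.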

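【Comments】:

【Solution 3】:

Here is a way to do it without replacing anything first. The trick is to use the [[:punct:]] character class, which matches any of the following: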

!"#$%&'()*+,\-./:;<=>?@[\]^_`{|}~

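The pattern is simply \\w+|[[:punct:]], meaning: match a run of consecutive word characters or a single punctuation character. str_extract_all takes care of the rest, pulling each match out separately. If you only want to split off specific punctuation marks, you can use \\w+|[,.] or similar instead.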

library(tidyverse)  # provides %>%, mutate(), str_*(), map(), unnest()

AA <- df %>%
  mutate(tokens = str_extract_all(Section, "\\w+|[[:punct:]]"),
         locations = str_locate_all(Section, "\\w+|[[:punct:]]"),
         locations = map(locations, as.data.frame)) %>%
  select(-Section) %>%
  unnest(c(tokens, locations))

      tokens start end
1         If     1   2
2         an     4   5
3   infusion     7  14
4   reaction    16  23
5     occurs    25  30
6          ,    31  31
7  interrupt    33  41
8        the    43  45
9   infusion    47  54
10         .    55  55
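
A small sketch of how the two patterns differ on a sentence that contains other punctuation. The example string is made up for illustration. With `\\w+|[,.]`, any punctuation not listed in the brackets is simply dropped rather than returned as a token.

```r
library(stringr)

x <- "Stop (now), please."  # illustrative input, not from the question

# Every punctuation character becomes its own token
all_punct <- str_extract_all(x, "\\w+|[[:punct:]]")[[1]]

# Only commas and periods are kept as tokens; the parentheses are dropped
only_some <- str_extract_all(x, "\\w+|[,.]")[[1]]

print(all_punct)
print(only_some)
```

One caveat, as far as I know: stringr uses the ICU regex engine, where `[[:punct:]]` covers Unicode punctuation, so symbols such as `$` or `+` from the list above may not match. It is worth checking the pattern against your own data.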

【Comments】:

【Solution 4】:

The unnest_tokens() function accepts a strip_punct argument, which it passes on to tokenizers such as the word tokenizer.

library(tidyverse)
library(tidytext)

df %>%
  unnest_tokens(word, Section, strip_punct = FALSE)
#> # A tibble: 10 x 1
#>    word     
#>    <chr>    
#>  1 if       
#>  2 an       
#>  3 infusion 
#>  4 reaction 
#>  5 occurs   
#>  6 ,        
#>  7 interrupt
#>  8 the      
#>  9 infusion 
#> 10 .

Created on 2018-08-15 by the reprex package (v0.2.0).
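
Note that the output above is lowercased ("if"), while the question asks for "If". The sketch below assumes that `unnest_tokens()` also takes a `to_lower` argument, as documented in current tidytext, and sets it to `FALSE` to keep the original case. Behaviour may differ in older versions.

```r
library(tidytext)

Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section, stringsAsFactors = FALSE)

# strip_punct = FALSE keeps "," and "." as tokens;
# to_lower = FALSE (assumed available in your tidytext version) keeps "If" capitalized
res <- unnest_tokens(df, word, Section, strip_punct = FALSE, to_lower = FALSE)
print(res$word)
```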

【Comments】: