Detecting parts of text in foreign languages (Rstudio): answers

【Question title】: Detecting parts of text in foreign languages (Rstudio)
【Posted】: 2020-04-30 13:43:44
【Question description】:

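My dataset contains a lot of texts. Texts written entirely in a foreign language have already been removed. All of the remaining texts are written in English, but some include a translation: for example, a bilingual person wrote the English text and then added a non-English translation of it underneath. I want to filter out those translated parts of the texts.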

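The texts are all in one variable. I tried to unnest the texts (using tidytext's unnest_tokens function) and to detect the language of each unnested word with the textcat package, but this gives me very inconsistent languages, anything from French to Slovenian, even though the words in question are English.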

The code I used for unnesting and detection is as follows (for performance reasons, I drew a sample):

library(dplyr)  # provides %>%
text_unnesting_tokens <- MyDF %>% tidytext::unnest_tokens(word, text)
# Draw 5,000 random word rows to keep language detection fast
sample <- text_unnesting_tokens[sample(nrow(text_unnesting_tokens), 5000), ]
sample$language <- textcat::textcat(sample$word, p = textcat::TC_char_profiles)

【Question discussion】:

Tags: text filtering tidytext

【Solution 1】:

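If you want to use textcat::textcat(), you should call it before tokenizing, because it works on whole pieces of text rather than on individual tokens. First identify the language with textcat() and then tokenize: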

library(tidyverse)
library(tidytext)
library(textcat)
library(hcandersenr)

fir_tree <- hca_fairytales() %>%
  filter(book == "The fir tree") 

## how many lines per language?
fir_tree %>%
  count(language)
#> # A tibble: 5 x 2
#>   language     n
#>   <chr>    <int>
#> 1 Danish     227
#> 2 English    253
#> 3 French     227
#> 4 German     262
#> 5 Spanish    261

## how many lines per detected language?
fir_tree %>%
  mutate(detected_lang = textcat(text)) %>%
  count(detected_lang, sort = TRUE)
#> # A tibble: 30 x 2
#>    detected_lang      n
#>    <chr>          <int>
#>  1 german           257
#>  2 spanish          238
#>  3 french           215
#>  4 english          181
#>  5 danish           138
#>  6 norwegian         80
#>  7 scots             60
#>  8 portuguese         7
#>  9 swedish            6
#> 10 middle_frisian     5
#> # … with 20 more rows

## now detect language + tokenize
fir_tree %>%
  mutate(detected_lang = textcat(text)) %>%
  unnest_tokens(word, text)
#> # A tibble: 14,850 x 4
#>    book         language detected_lang word    
#>    <chr>        <chr>    <chr>         <chr>   
#>  1 The fir tree Danish   danish        ude     
#>  2 The fir tree Danish   danish        i       
#>  3 The fir tree Danish   danish        skoven  
#>  4 The fir tree Danish   danish        stod    
#>  5 The fir tree Danish   danish        der     
#>  6 The fir tree Danish   danish        sådant  
#>  7 The fir tree Danish   danish        et      
#>  8 The fir tree Danish   danish        nydeligt
#>  9 The fir tree Danish   danish        grantræ 
#> 10 The fir tree Danish   danish        det     
#> # … with 14,840 more rows

Created on 2020-04-30 by the reprex package (v0.3.0)

【Discussion】:

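Thanks for your reply! However, this gives me the language of the majority of the text within a cell, right? For example, if 60% is English and 40% is, say, Spanish, textcat will decide that the text is written in English. I need to remove that 40% from the dataset.
Wow, your documents are a mix of several languages? You might consider some n-gram tokenization, followed by language detection. I would advise against trying to detect the language of single words.
That worked! FYI, I unnested the texts into sentences, determined the language, removed the cases (sentences) with non-English text, and recombined them by id. For some reason, whitespace messed up the recombination of the sentences.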
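
For anyone landing here later, the sentence-level workflow from the last comment could look roughly like the sketch below. This is not the commenter's actual code. The data frame docs, its columns id and text, and the helper name keep_english_sentences are hypothetical and only there for illustration. The sketch assumes each document has a unique id and that sentence-level language detection is accurate enough for your data.

```r
library(dplyr)
library(tidytext)
library(textcat)

# Hypothetical helper: keep only the sentences that textcat labels as English
# and glue them back together per document id.
keep_english_sentences <- function(df) {
  df %>%
    # One sentence per row; to_lower = FALSE keeps the original casing
    unnest_tokens(sentence, text, token = "sentences", to_lower = FALSE) %>%
    mutate(
      # Trim stray whitespace so the sentences recombine cleanly
      sentence = trimws(sentence),
      detected_lang = textcat(sentence)
    ) %>%
    # Rows where no language could be assigned (NA) are dropped here as well
    filter(detected_lang == "english") %>%
    group_by(id) %>%
    summarise(text = paste(sentence, collapse = " "), .groups = "drop")
}

# Hypothetical example data: document 1 carries a German translation
docs <- tibble(
  id = c(1, 2),
  text = c(
    "The little fir tree stood out in the forest. Der kleine Tannenbaum stand draußen im Wald und wollte schnell groß werden.",
    "The sun shone on the tree and the fresh air was all around it."
  )
)

english_only <- keep_english_sentences(docs)
```

Keep in mind that short English sentences can be mislabelled (the answer's output above shows lines detected as scots, for instance), and a strict filter on "english" drops those too. Documents with no sentence detected as English disappear from the result entirely, so it is worth checking which ids went missing.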