How to use the hunspell package to suggest correct words in a column in R? (Answers)

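【Question title】: How to use hunspell package to suggest correct words in a column in R?
【Posted】: 2019-08-28 20:39:40
【Question description】:

I am currently working with a large data frame that contains a lot of text in each row, and I would like to use the hunspell package to efficiently identify and replace the misspelled words in each sentence. I can identify the misspelled words, but I don't know how to run hunspell_suggest on a list.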

Here is a sample of the data frame:

df1 <- data.frame("Index" = 1:7, "Text" = c("A complec sentence joins an independet",
                                            "Mary and Samantha arived at the bus staton before noon",
                                            "I did not see thm at the station in the mrning",
                                            "The participnts read 60 sentences in radom order",
                                            "how to fix mispelled words in R languge",
                                            "today is Tuesday",
                                            "bing sports quiz"))

I converted the Text column to character and used hunspell to identify the misspelled words in each row.

library(hunspell)
df1$Text <- as.character(df1$Text)
df1$word_check <- hunspell(df1$Text)

I tried

df1$suggest <- hunspell_suggest(df1$word_check)

but it always gives this error:

Error in hunspell_suggest(df1$word_check) : 
  is.character(words) is not TRUE

I'm new to this, so I'm not sure what a suggestion column produced by hunspell_suggest would look like. Any help would be appreciated.

【Question comments】:

Tags: r spell-checking hunspell

【Solution 1】:

Check your intermediate step. The output of df1$word_check looks like this (first five rows shown):

List of 5
 $ : chr [1:2] "complec" "independet"
 $ : chr [1:2] "arived" "staton"
 $ : chr [1:2] "thm" "mrning"
 $ : chr [1:2] "participnts" "radom"
 $ : chr [1:2] "mispelled" "languge"

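It is a list, not a character vector, which is why hunspell_suggest(df1$word_check) fails the is.character check. If you run lapply(df1$word_check, hunspell_suggest), you get the suggestions for each row.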
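To make the cause of the error concrete, here is a minimal sketch. The small `word_check` list is a hypothetical stand-in for df1$word_check, and the hunspell call is guarded so the snippet still runs where the package is not installed.

```r
# Hypothetical stand-in for df1$word_check: one character vector per row
word_check <- list(c("complec", "independet"), character(0), "bing")

is.character(word_check)       # FALSE: the check that fails inside hunspell_suggest()
is.character(word_check[[1]])  # TRUE: each element on its own is acceptable

# Run the hunspell part only when the package is available
if (requireNamespace("hunspell", quietly = TRUE)) {
  # Apply hunspell_suggest() to each row's words separately;
  # the result has one list of suggestion vectors per row
  suggestions <- lapply(word_check, hunspell::hunspell_suggest)
  str(suggestions)
}
```

Because the result has one element per row, it should be possible to store it as a list column, for example df1$suggest <- lapply(df1$word_check, hunspell_suggest), in the same way that df1$word_check was stored.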

Edit

I decided to look at this in more detail because I have not seen any simple alternative. This is what I came up with:

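cleantext = function(x){
  # Correct each element of x; sapply is used only for its side effect on x
  sapply(seq_along(x), function(y){
    bad = hunspell(x[y])[[1]]   # misspelled words found in x[y]
    # First suggestion for each misspelled word
    # (errors if hunspell returns no suggestions for a word)
    good = unlist(lapply(hunspell_suggest(bad), `[[`, 1))

    if (length(bad)){
      for (i in seq_along(bad)){
        # <<- updates x in the enclosing cleantext() scope
        x[y] <<- gsub(bad[i], good[i], x[y])
      }}})
  x
}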

There may be a more elegant way, but this function returns a vector of corrected strings:

> df1$Text
[1] "A complec sentence joins an independet"                
[2] "Mary and Samantha arived at the bus staton before noon"
[3] "I did not see thm at the station in the mrning"        
[4] "The participnts read 60 sentences in radom order"      
[5] "how to fix mispelled words in R languge"               
[6] "today is Tuesday"                                      
[7] "bing sports quiz" 

> cleantext(df1$Text)
[1] "A complex sentence joins an independent"               
[2] "Mary and Samantha rived at the bus station before noon"
[3] "I did not see them at the station in the morning"      
[4] "The participants read 60 sentences in radon order"     
[5] "how to fix misspelled words in R language"             
[6] "today is Tuesday"                                      
[7] "bung sports quiz"

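Be careful: this uses the first suggestion hunspell returns, which may or may not be the right word. In the output above, for example, "arived" became "rived" rather than "arrived", and "bing" became "bung".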
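Since the first suggestion is not always the intended word, one option is to review a few candidates per flagged word before replacing anything. The sketch below assumes a hypothetical helper, `review_table`, which is not part of hunspell; it only formats the list that hunspell_suggest returns.

```r
# Hypothetical helper (not part of hunspell): one row per flagged word with
# its top-n candidates, so they can be reviewed before anything is replaced
review_table <- function(words, suggestions, n = 3) {
  top <- vapply(suggestions,
                function(s) paste(utils::head(s, n), collapse = ", "),
                character(1))
  data.frame(word = words, candidates = top, stringsAsFactors = FALSE)
}

# Run the hunspell part only when the package is available
if (requireNamespace("hunspell", quietly = TRUE)) {
  bad <- c("arived", "radom", "bing")  # words flagged in the example data
  print(review_table(bad, hunspell::hunspell_suggest(bad)))
}
```

The candidates depend on the installed dictionary, so the table is meant for inspection rather than as a fixed result.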

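【Comments】:

I see, this works for the suggestion part. Thank you. Would you happen to know how to replace the misspelled words with the first suggestion for each word?
This line returns an error if the suggestion list is empty: good = unlist(lapply(hunspell_suggest(bad), `[[`, 1))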
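One possible way to handle the empty-suggestion case raised in the last comment is to fall back to the original word when hunspell offers nothing. This is a sketch under that assumption, not the answer author's code; `first_or_self` is a hypothetical helper name.

```r
# Hypothetical helper (not part of hunspell): take the first suggestion for
# each word, or keep the word unchanged when its suggestion vector is empty
first_or_self <- function(words, suggestions) {
  vapply(seq_along(words), function(i) {
    s <- suggestions[[i]]
    if (length(s) > 0) s[1] else words[i]
  }, character(1))
}

# Run the hunspell part only when the package is available
if (requireNamespace("hunspell", quietly = TRUE)) {
  bad  <- hunspell::hunspell("I did not see thm at the station in the mrning")[[1]]
  good <- first_or_self(bad, hunspell::hunspell_suggest(bad))
  print(data.frame(bad = bad, good = good, stringsAsFactors = FALSE))
}
```

Inside cleantext, the line that builds good could then be written as good = first_or_self(bad, hunspell_suggest(bad)), which leaves a word untouched instead of stopping with an error.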