PyEnchant 'correcting' words in dictionary to words not in dictionary

【Question title】: PyEnchant 'correcting' words in dictionary to words not in dictionary
【Posted】: 2014-02-05 08:44:18
【Question】:

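I am trying to take large amounts of natural language from a web forum and correct the spelling with PyEnchant. The text is often informal and about medical issues, so I have created a text file "test.pwl" containing relevant medical words, chat abbreviations, and so on. In some cases, little bits of HTML, URLs, etc. unfortunately remain in the text.

My script is designed to use both the en_US dictionary and the PWL to find all misspelled words and, fully automatically, correct each one to the first suggestion from d.suggest. It prints a list of the misspelled words, then the words that had no suggestions, and writes the corrected text to "spellfixed.txt":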

import enchant
import codecs

def spellcheckfile(filepath):
    d = enchant.DictWithPWL("en_US","test.pwl")
    try:
        f = codecs.open(filepath, "r", "utf-8")
    except IOError:
        print "Error reading the file, right filepath?"
        return
    textdata = f.read()
    mispelled = []
    words = textdata.split()
    for word in words:
        # if spell check failed and the word is also not in
        # mis-spelled list already, then add the word
        if d.check(word) == False and word not in mispelled:
            mispelled.append(word)
    print mispelled
    for mspellword in mispelled:
        #get suggestions
        suggestions=d.suggest(mspellword)
        #make sure we actually got some
        if len(suggestions) > 0:
            # pick the first one
            picksuggestion=suggestions[0]
        else: print mspellword
        #replace every occurence of the bad word with the suggestion
        #this is almost certainly a bad idea :)
        textdata = textdata.replace(mspellword,picksuggestion)
    try:
        fo=open("spellfixed.txt","w")
    except IOError:
        print "Error writing spellfixed.txt to current directory. Who knows why."
        return 
    fo.write(textdata.encode("UTF-8"))
    fo.close()
    return

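The problem is that the output often contains "corrections" to words that are in the dictionary or the PWL. For instance, when the first portion of the input was:

My new doctor feels that I am now bi-polar. This, after 9 years of being considered majorly depressed by everyone else

I got this: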

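My new dotor feels that I am now bipolar. This, aftER 9 years of being considERed majorly depressed by evERyone else

I could handle the change in case, but doctor --> dotor is no good at all. When the input is much shorter (for example, when the quotation above is the entire input), the result is what I want: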

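My new doctor feels that I am now bipolar. This, after 9 years of being considered majorly depressed by everyone else

Can anybody explain to me why? In very simple terms, please, as I am very new to programming and even newer to Python. A step-by-step solution would be greatly appreciated.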

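【Comments】:

Tags: python dictionary spelling pyenchant enchant

【Solution 1】:

I think your problem is that you are replacing letter sequences inside words. "ER" might be a valid spelling correction for "er", but that does not mean you should change "considered" to "considERed".

You can use regular expressions instead of simple text replacement to make sure you replace only whole words. "\b" in a regex means "word boundary":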

>>> "considered at the er".replace( "er", "ER" )
'considERed at the ER'
>>> import re
>>> re.sub( "\\b" + "er" + "\\b", "ER", "considered at the er" )
'considered at the ER'

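【Comments】:

Thanks man. I do know about regex, but being so new to programming and Python, I have no idea how to implement word-boundary delimiters in my code. ...Clues?
Am I getting somewhere with this: textdata = textdata.replace("\\b" + mspellword + "\\b","\\b" + picksuggestion + "\\b")
@user2437842 Not quite. You need to use a regular expression function like re.sub, not the string replace method. See the code in my answer as well as the documentation. You can construct the regular expression as "\\b" + re.escape( mspellword ) + "\\b". The text you want to insert as the replacement (picksuggestion) should not be turned into a regex.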
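
For readers who want to see that last comment spelled out, here is a minimal sketch of a whole-word replacement helper. The function name replace_whole_word is made up for illustration and is not part of PyEnchant. It is written for Python 3, whereas the question's script is Python 2.

```python
import re

def replace_whole_word(text, bad_word, suggestion):
    # re.escape keeps characters such as "." or "+" in bad_word literal.
    pattern = r"\b" + re.escape(bad_word) + r"\b"
    # Passing a function as the replacement inserts the suggestion verbatim,
    # so backslashes in it are not interpreted as regex escapes.
    return re.sub(pattern, lambda match: suggestion, text)

print(replace_whole_word("considered at the er", "er", "ER"))
# prints: considered at the ER
```

In the question's loop this would stand in for the textdata.replace(...) line. One caveat: "\b" only matches between a word character and a non-word character, so a token that starts or ends with punctuation (for example "doctor." as produced by split()) may not match the way you expect.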

【Solution 2】:

    #replace every occurence of the bad word with the suggestion
    #this is almost certainly a bad idea :)

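You are right, that is a bad idea. It is what causes "considered" to be replaced by "considERed". Also, you are doing the replacement even when you did not find a suggestion. Move the replacement into the if len(suggestions) > 0 block.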
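
A rough sketch of what that restructured loop might look like, in Python 3. The names apply_first_suggestions and fake_suggest are hypothetical. The suggest argument stands in for d.suggest so the example runs without PyEnchant installed, and the whole-word pattern is borrowed from the first answer.

```python
import re

def apply_first_suggestions(text, misspelled, suggest):
    unfixable = []
    for word in misspelled:
        suggestions = suggest(word)
        if len(suggestions) > 0:
            # Replace only when a suggestion exists, and only whole words.
            pick = suggestions[0]
            pattern = r"\b" + re.escape(word) + r"\b"
            text = re.sub(pattern, lambda match: pick, text)
        else:
            # No suggestion: leave the text alone and report the word.
            unfixable.append(word)
    return text, unfixable

# Hypothetical stand-in for d.suggest with a tiny hard-coded table.
def fake_suggest(word):
    return {"teh": ["the"]}.get(word, [])

fixed, skipped = apply_first_suggestions("teh cat xqzv", ["teh", "xqzv"], fake_suggest)
print(fixed)    # prints: the cat xqzv
print(skipped)  # prints: ['xqzv']
```

With the real library you would pass d.suggest as the third argument, assuming d is the DictWithPWL object from the question.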

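As for replacing every instance of the word: what you want to do is save the position of each misspelled word along with its text (or maybe just the position, since you can look the word up in the text later when you go looking for suggestions), allow duplicate misspelled words, and replace only that individual word with its suggestion.

I will leave the implementation details and optimizations to you, though. A step-by-step solution will not help you learn much.

【Comments】:

I appreciate your wisdom and spirit. Thank you. I wish my dad were more like you sometimes. That said, I am afraid what you have described is totally beyond me. Saving word positions and allowing duplicates are things I never dreamed of. I am happy to "work for it", but I think I need a push. Be a pal?
@user2437842 Heh, fair enough. Look into tuples. A tuple of (word, position) should work.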
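
One possible reading of the (word, position) hint, sketched in Python 3. Everything here is an assumption for illustration: find_misspelled and replace_at_positions are made-up names, is_correct stands in for d.check, and a small set of known words plays the role of the dictionary.

```python
import re

def find_misspelled(text, is_correct):
    # Collect (word, position) tuples. Duplicates are kept on purpose,
    # because each occurrence has its own position.
    found = []
    for match in re.finditer(r"\S+", text):  # same tokens as text.split()
        word = match.group()
        if not is_correct(word):
            found.append((word, match.start()))
    return found

def replace_at_positions(text, corrections):
    # corrections is a list of (word, position, suggestion) tuples.
    # Work from the end of the text so earlier positions stay valid
    # even when a suggestion is longer or shorter than the word.
    for word, position, suggestion in sorted(
            corrections, key=lambda c: c[1], reverse=True):
        text = text[:position] + suggestion + text[position + len(word):]
    return text

known = {"considered", "at", "the"}  # stand-in for the real dictionary
text = "considered at the er er"
misspelled = find_misspelled(text, lambda w: w in known)
print(misspelled)  # prints: [('er', 18), ('er', 21)]

corrections = [(w, pos, "ER") for w, pos in misspelled]
result = replace_at_positions(text, corrections)
print(result)      # prints: considered at the ER ER
```

Because each replacement is cut in at a recorded position, the "er" inside "considered" is never touched, and no regular expression is needed for the replacement step.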