【Question title】: Word suggestion Python
【Posted】: 2019-02-18 10:40:07
【Question description】:

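I am looking for help writing a word suggestion system in Python. Given an arbitrary sequence of characters as input, I want to search a word list and return some suggested words.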

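The closest thing I have found is a spelling-correction system (https://norvig.com/spell-correct.html). Its `edits1` function does produce some results, but it is based on a single edit (for example, inserting one 'a' into the input string).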

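What I want is to insert several letters, whether vowels or consonants. For example, given the letters "prt", the dictionary search should suggest "part", "apart", and so on.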

filler.py (adapted from https://norvig.com/spell-correct.html)

            import re
            from collections import Counter

            def words(text): return re.findall(r'\w+', text.lower())

            # Word list containing numerous words, e.g. 'prut', 'prot', 'port', 'part', 'prat', 'pert', 'pret', 'apart'.
            WORDS = Counter(words(open('E:\\new\\words.txt').read()))

            def candidates(word): 
                "Generate possible spelling corrections for word."
                return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

            def known(words): 
                "The subset of `words` that appear in the dictionary of WORDS."
                return set(w for w in words if w in WORDS)

            def edits1(word): 
                "All edits that are one edit away from `word`."
                letters    = 'aeiouxyz'
                splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
                inserts    = [L + c + R               for L, R in splits for c in letters]
                return set(inserts)

            def edits2(word): 
                "All edits that are two edits away from `word`."
                return (e2 for e1 in edits1(word) for e2 in edits1(e1))

Input string.py

            import filler

            h = ['prt']
            for x in h:
                # Use the loop variable, and avoid shadowing the built-in input().
                suggestions = filler.candidates(x)
                print(suggestions)
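
As a side note on why the code above never suggests "apart" for "prt": `candidates` chains its alternatives with `or`, which returns the first non-empty set and ignores the rest. The following is a minimal sketch with a hypothetical three-word vocabulary standing in for words.txt; it assumes insert-only edits with the letters 'aeiouxyz', as in filler.py.

```python
# Hypothetical stand-in for the word list loaded from words.txt.
WORDS = {'part', 'port', 'apart'}
LETTERS = 'aeiouxyz'

def edits1(word):
    """All strings made by inserting one letter from LETTERS into `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {left + c + right for left, right in splits for c in LETTERS}

def edits2(word):
    """All strings made by two successive insertions."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

def known(words):
    """The subset of `words` found in WORDS."""
    return {w for w in words if w in WORDS}

one_edit = known(edits1('prt'))
two_edits = known(edits2('prt'))

# `or` yields its first non-empty operand, so two-edit words are dropped.
short_circuit = one_edit or two_edits
# A set union keeps the matches from every edit distance.
combined = one_edit | two_edits

print(sorted(short_circuit))  # ['part', 'port']
print(sorted(combined))       # ['apart', 'part', 'port']
```

Replacing `or` with a union is one way to get words from every edit distance at once; it costs more work because `edits2` is always evaluated.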

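【Comments】:

  • Well, filler.py is based on two or fewer edits; a loop makes it easy to allow more. The question is how many edits you want, and whether consonants should be inserted as well. For example, given 'bt', the words 'but', 'bat', and 'beat' should be suggested, but should 'belt', 'breakfast', etc. be suggested too?
  • Up to four edits would meet my needs. Only the vowels and consonants specified in the `letters` variable should be used.

Tags: python-3.x dictionary search autocorrect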


【Solution 1】:

Well, I modified your code. The Suggestor class takes two parameters, max_times and letters, so you can change them whenever you need to.

import re

from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

class Suggestor:
    def __init__(self,max_times,letters):
        self.max_times = max_times
        self.letters = letters

    def candidates(self,word):
        return self.known(self.edited_word(word))

    def known(self,words):
        return set(w for w in words if w in WORDS)

    def edit(self,word):
        letters = self.letters
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        inserts = [L + c + R for L, R in splits for c in letters]
        return list(set(inserts))

    def edited_word(self,raw_word):
        words = [[raw_word]]
        for i in range(self.max_times):
            i_times_words = []
            for word in words[-1]:
                i_times_words += self.edit(word)
            words.append(list(set(i_times_words)))
        return [w for word in words for w in word]

if __name__ == '__main__':
    word = 'prt'
    suggestor = Suggestor(max_times=4,letters='aeiouxyz')
    print(suggestor.candidates(word))

The output of the test above is:

{'partie', 'parity', 'purity', 'part', 'port', 'proto', 'porto', 'party', 'apart', 'parait', 'export', 'operate', 'expert', 'pirate'}
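
Generating every insertion grows quickly with max_times, so an alternative is to scan the word list and test each word directly. The sketch below is not part of the answer's code; `suggest`, `is_subsequence`, and the small `vocabulary` list are hypothetical names and data. It assumes the same rule as above (insert-only edits, letters taken from 'aeiouxyz', at most four insertions), under which a word qualifies when the input is a subsequence of it and every leftover letter is allowed.

```python
from collections import Counter

def is_subsequence(pattern, word):
    """True if the characters of `pattern` appear in `word` in order."""
    it = iter(word)
    # `ch in it` advances the iterator, so matches must occur left to right.
    return all(ch in it for ch in pattern)

def suggest(pattern, vocabulary, letters='aeiouxyz', max_inserts=4):
    """Words reachable from `pattern` by inserting at most `max_inserts`
    characters, all drawn from `letters`."""
    allowed = set(letters)
    result = set()
    for word in vocabulary:
        extra = len(word) - len(pattern)
        if not 0 <= extra <= max_inserts:
            continue
        if not is_subsequence(pattern, word):
            continue
        # Letters left over once the pattern's letters are removed.
        leftover = Counter(word) - Counter(pattern)
        if set(leftover) <= allowed:
            result.add(word)
    return result

# Hypothetical vocabulary for illustration only.
vocabulary = ['part', 'apart', 'export', 'print', 'pat', 'operate', 'prt']
print(sorted(suggest('prt', vocabulary)))
# ['apart', 'export', 'operate', 'part', 'prt']
```

Here 'print' is rejected because 'n' is not in the allowed letters, and 'pat' is rejected because it has no 'r'. The cost is one pass over the vocabulary regardless of how many insertions are allowed.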

In addition, I suggest checking the probability of each word; you can use Bayes' theorem to filter some of them out.
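
One simple way to act on this is to rank the candidates by their relative frequency in the corpus, which is the prior P(word) used in Norvig's article. The sketch below is an assumption-laden illustration: the counts are made up to stand in for the `WORDS` counter built from big.txt, and `probability` and `rank` are hypothetical helpers. A full Bayesian ranking would also multiply by an error model P(input | word), which is not specified here.

```python
from collections import Counter

# Made-up counts standing in for Counter(words(open('big.txt').read())).
WORDS = Counter({'part': 120, 'party': 80, 'port': 45, 'apart': 30, 'purity': 5})

def probability(word, counts=WORDS):
    """Relative frequency of `word` in the corpus, used as the prior P(word)."""
    return counts[word] / sum(counts.values())

def rank(candidates, top_n=3):
    """The `top_n` most probable candidates, most frequent first."""
    return sorted(candidates, key=probability, reverse=True)[:top_n]

candidates = {'part', 'port', 'party', 'apart', 'purity'}
print(rank(candidates))  # ['part', 'party', 'port']
```

Passing the set returned by `Suggestor.candidates` to `rank` would keep only the most common suggestions.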

【Comments】:

  • Thank you very much, this is exactly what I wanted! :)