Python: Finding and counting exact and approximate matches of words in a txt file

【Question title】: Python: Finding and counting exact and approximate matches of words in txt file
【Posted】: 2021-04-10 06:10:31
【Question】:

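My program is close to doing what I want, but I have one problem: many of the keywords I'm trying to find may have symbols in the middle or may be misspelled. I therefore want to count misspelled words as keyword matches, as if they were spelled correctly. For example, say my text is: "settlement settl#7*nt se##tl#ment ann&&ity annuity."

I want to count how many times the keywords "settlement" and "annuity" appear in a .txt file, counting words that start with "sett" and end with "nt" as "settlement", and words that start with "ann" and end with "y" as "annuity".

I have been able to count the exact words, which is very close to what I want. But now I want to do approximate matching. I'm not even sure it's possible. Thanks.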

import glob
import os
import sys

out1 = open("seen.txt", "w")
out2 = open("missing.txt", "w")

def count_words_in_dir(dirpath, words, action=None):
    # Scan every .txt file in the directory passed as dirpath.
    for filepath in glob.iglob(os.path.join(dirpath, '*.txt')):
        with open(filepath) as f:
            data = f.read()
            for key in words:
                # Exact substring count of the keyword in this file.
                words[key] = data.count(key)
            if action:
                action(filepath, words)

def print_summary(filepath, words):
    for key, val in sorted(words.items()):
        whichout = out1 if val > 0 else out2
        print(filepath, file=whichout)
        print('{0}: {1}'.format(key, val), file=whichout)

filepath = sys.argv[1]
keys = ["annuity", "settlement"]
words = dict.fromkeys(keys, 0)

count_words_in_dir(filepath, words, action=print_summary)

out1.close()
out2.close()

【Comments】:

Have a look at docs.python.org/3/library/stdtypes.html#str.startswith and docs.python.org/3/library/stdtypes.html#str.endswith
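
The comment's suggestion could be sketched as below. This is only an illustration, assuming whitespace-separated tokens; the function name `count_by_affixes` and the `rules` mapping are hypothetical and not part of the question's code.

```python
def count_by_affixes(text, rules):
    """Count tokens that start with a given prefix and end with a given suffix.

    `rules` maps each keyword to a (prefix, suffix) pair (hypothetical layout).
    """
    counts = dict.fromkeys(rules, 0)
    for token in text.split():
        # Drop surrounding punctuation so "annuity." still ends with "y".
        token = token.strip('.,;:!?"').lower()
        for keyword, (prefix, suffix) in rules.items():
            if token.startswith(prefix) and token.endswith(suffix):
                counts[keyword] += 1
                break  # count each token for at most one keyword
    return counts


rules = {"settlement": ("sett", "nt"), "annuity": ("ann", "y")}
text = "settlement settl#7*nt se##tl#ment ann&&ity annuity."
print(count_by_affixes(text, rules))  # {'settlement': 2, 'annuity': 2}
```

Note that "se##tl#ment" is not counted because it does not start with "sett", which is the kind of case that edit-distance matching is meant to handle.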

Tags: python match counter

【Solution 1】:

Fuzzy matching can be done with the regex module, which you install once with the command pip install regex.

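With the regex module you can use any expression, and with the {e<=2} suffix you can specify how many errors a word may contain and still match the expression (one error is the substitution, insertion, or deletion of one symbol). This count is also known as the edit distance or Levenshtein distance.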
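
To make the notion of "errors" concrete, here is a small dependency-free sketch of the edit distance. It is not how the regex module works internally; `edit_distance` is a hypothetical helper written only for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with the classic two-row dynamic programme."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


for word in ["settlement", "settl#7*nt", "se##tl#ment"]:
    print(word, edit_distance("settlement", word))  # 0, 3, 3
print("ann&&ity", edit_distance("annuity", "ann&&ity"))  # 2
```

Every misspelling in the question's sample text is within 3 edits of its keyword, which is why a limit of 3 errors is enough for that sample.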

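As an example, I wrote my own function that counts words in a given string. It has a num_errors parameter that specifies how many errors are tolerated when matching a given word. I set num_errors = 3; you can raise it, but don't set it very high, otherwise any word in the text will match any reference word.

I split the sentence into words with re.split().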

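import regex as re  # third-party "regex" module, not the standard-library re

def count_words(text, words, *, num_errors = 3):
    # One fuzzy pattern per keyword: (keyword){e<=N} allows up to N errors.
    we = ['(' + re.escape(e) + f'){{e<={num_errors}}}' for e in words]
    cnt = {e : 0 for e in words}
    # Split on commas, periods and whitespace, then test each token.
    for wt in re.split(r'[,.\s]+', text):
        for wre, wrt in zip(we, words):
            if re.fullmatch(wre, wt):
                cnt[wrt] += 1
                break  # count each token for at most one keyword
    return cnt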

text = 'settlement settl#7*nt se##tl#ment ann&&ity annuity hello world.'
print(count_words(text, ['settlement', 'annuity']))

Output:

{'settlement': 3, 'annuity': 2}
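
The warning about setting num_errors too high can be checked with a small sketch that needs no third-party module. It uses a plain edit-distance helper instead of the regex module; `edit_distance` and `tokens_matching` are hypothetical names used only here.

```python
def edit_distance(a, b):
    """Levenshtein distance, keeping one row of the DP table at a time."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        new_row = [i]
        for j, cb in enumerate(b, start=1):
            new_row.append(min(row[j] + 1,                # deletion
                               new_row[j - 1] + 1,        # insertion
                               row[j - 1] + (ca != cb)))  # substitution
        row = new_row
    return row[-1]


def tokens_matching(text, keyword, num_errors):
    """Return the tokens within num_errors edits of keyword."""
    return [t for t in text.split() if edit_distance(keyword, t) <= num_errors]


text = "settlement settl#7*nt hello world"
for num_errors in (3, 10):
    print(num_errors, tokens_matching(text, "settlement", num_errors))
```

With a limit of 3 only "settlement" and "settl#7*nt" match. With a limit of 10, which is the length of the keyword, every token of up to 10 characters matches, because the distance between two strings never exceeds the length of the longer one.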

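As a faster alternative to the regex module, you can use the Levenshtein module, which you install once with the command pip install python-Levenshtein.

This module implements only the edit distance (mentioned above) and should run faster than the regex module.

The same code as above, implemented with the Levenshtein module: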

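import re
import Levenshtein  # third-party, from the python-Levenshtein package

def count_words(text, words, *, num_errors = 3):
    cnt = {e : 0 for e in words}
    for wt in re.split(r'[,.\s]+', text):
        for wr in words:
            # Accept the token if it is within num_errors edits of the keyword.
            if Levenshtein.distance(wr, wt) <= num_errors:
                cnt[wr] += 1
                break  # count each token for at most one keyword
    return cnt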

text = 'settlement settl#7*nt se##tl#ment ann&&ity annuity hello world.'
print(count_words(text, ['settlement', 'annuity']))

Output:

{'settlement': 3, 'annuity': 2}

At the OP's request, I am adding a third algorithm that does not use re.split() to split the text into words and uses re.finditer() instead.

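import regex as re  # third-party "regex" module, not the standard-library re

def count_words(text, words, *, num_errors = 3):
    we = ['(' + re.escape(e) + f'){{e<={num_errors}}}' for e in words]
    cnt = {e : 0 for e in words}
    for wre, wrt in zip(we, words):
        # Count the non-overlapping fuzzy matches of the keyword in the text.
        cnt[wrt] += len(list(re.finditer(wre, text)))
    return cnt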

text = 'settlement settl#7*nt se##tl#ment ann&&ity annuity hello world.'
print(count_words(text, ['settlement', 'annuity']))

Output:

{'settlement': 3, 'annuity': 2}

【Discussion】:

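This is great. Thank you. But could you explain a bit more? Ideally I need to incorporate it into my code so that it counts across a directory of txt files and then writes the file names and word counts to two new txt files. Like this: /Users/seen.txt settlement: 2 /Users/seen.txt annuity: 1
@JohnD'Attoma If you merge my code into the code you gave in the question, the result might look like this; in that code count_words() is my function, and action is whatever you define yourself. If you have some code ready, you can send it to me and I will merge my function into it.
Thanks again for the quick reply. I'll try to figure it out based on your code. If I hit a dead end, I may send you some code.
@JohnD'Attoma Yes, that's right, I use re.split() to split the whole text into words. You should replace this re.split with your own word-splitting algorithm. If you don't know how to split the text into the words you need, I will try to implement another algorithm that needs no splitting.
@JohnD'Attoma I have just implemented the third algorithm you need, without re.split(). Please see the end of my answer, which I have just updated; it uses re.finditer() instead. Also don't forget the num_errors parameter: the value 3 may not be enough for your case, so experiment. If you set this value too high you will get false positives, i.e. it will detect wrong words that should not match. So start with 3, and if not all words are matched, increase it to 4 and measure again. As a reminder, this value is the number of errors.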
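
The merge discussed in these comments might be sketched roughly as follows. This is not the code the answerer linked to. It assumes a counter with the same call shape as the answer's count_words(text, words), and `exact_count_words` is a hypothetical stand-in that only counts exact tokens so the sketch runs without third-party modules.

```python
import glob
import os
import tempfile


def exact_count_words(text, words):
    """Hypothetical stand-in: exact token counts. Swap in a fuzzy count_words()."""
    tokens = text.split()
    return {word: tokens.count(word) for word in words}


def count_words_in_dir(dirpath, words, count_words, seen_path, missing_path):
    """Count words in each .txt file of dirpath; log hits and misses separately.

    Keep seen_path and missing_path outside dirpath, otherwise the glob below
    would pick the log files up as input.
    """
    with open(seen_path, "w") as seen, open(missing_path, "w") as missing:
        for filepath in sorted(glob.glob(os.path.join(dirpath, "*.txt"))):
            with open(filepath) as f:
                counts = count_words(f.read(), words)
            for key, val in sorted(counts.items()):
                out = seen if val > 0 else missing
                print(filepath, file=out)
                print(f"{key}: {val}", file=out)


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, "data")
    os.mkdir(data_dir)
    with open(os.path.join(data_dir, "a.txt"), "w") as f:
        f.write("settlement annuity settlement")
    seen_log = os.path.join(tmp, "seen.txt")
    missing_log = os.path.join(tmp, "missing.txt")
    count_words_in_dir(data_dir, ["annuity", "settlement", "hello"],
                       exact_count_words, seen_log, missing_log)
    with open(seen_log) as f:
        print(f.read())
```

Passing the counter in as an argument keeps the directory handling separate from the matching strategy, so any of the three count_words() variants from the answer could be dropped in without changing the file-writing code.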