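Making multiple search and replace more precise in Python for lemmatizer: answers

【Question title】: Making multiple search and replace more precise in Python for lemmatizer
【Posted】: 2016-05-05 19:23:06
【Question description】:

I am trying to build my own lemmatizer for Spanish in Python 2.7 using a lemmatization dictionary.

I want to replace every word in a given text with its lemma form. This is the code I have been working on so far.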

def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text


my_text = 'Flojo y cargantes. Decepcionantes. Decenté decentó'
my_text_lower= my_text.lower()

lemmatize_list = 'ExampleDictionary'
lemmatize_word_dict = {}
with open(lemmatize_list) as f:
    for line in f:
        depurated_line = line.rstrip()
        (val, key) = depurated_line.split("\t")
        lemmatize_word_dict[key] = val

txt = replace_all(my_text_lower, lemmatize_word_dict)
print txt

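Here is a sample dictionary file containing the lemmatized forms used to replace the words in the input, i.e. my_text_lower. The sample dictionary is a tab-separated, two-column file in which column 1 holds the values (the lemmas) and column 2 holds the keys to match (the inflected forms).

ExampleDictionary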

flojo   floja
flojo   flojas
flojo   flojos
cargamento  cargamentos
cargante    cargantes
decepción   decepciones
decepcionante   decepcionantes
decentar    decenté
decentar    decentéis
decentar    decentemos
decentar    decentó

My desired output is as follows:

flojo y cargante. decepcionante. decentar decentar

With these inputs (and the sample phrase listed in my_text in the code), my actual output is currently:

felitrojo y cargramarramarrartserargramarramarrunirdo. decepáginacionarrtícolitroargramarramarrunirdo. decentar decentar

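At the moment I can't work out what is wrong with the code.

It seems to be replacing letters or chunks of each word, rather than recognizing the whole word, looking it up in the lemma dictionary, and then replacing it.

The output above is what I get when I use the whole dictionary (more than 50,000 entries). The problem does not occur with my small sample dictionary, only with the full one, which makes me think it may be doing a double "replace" at some point?

Is there a Python technique I am missing that I could incorporate into this code to make my search and replace more precise, so that it identifies whole words to replace rather than chunks, and/or does not do any double replacement?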

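【Question discussion】:

When I run your example, I get your desired output. However, problems can arise if one of the words contains another word from the dictionary.
That is exactly the problem. Some words will definitely contain parts of smaller words that are also in the dictionary. That is why I need the search/replace to work on whole words rather than on fragments of words...

Tags: python regex search dictionary replace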

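【Solution 1】:

Because you use text.replace, you can still match substrings, and text that has already been replaced gets processed again. It is better to process the input one word at a time and build the output string word by word.

I swapped your key-value pairs around (because you want to look up the word on the right and get the word on the left), and I mainly changed replace_all: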

import re

def replace_all(text, dic):
    result = ""
    input = re.findall(r"[\w']+|[.,!?;]", text)
    for word in input:
        changed = dic.get(word,word)
        result = result + " " + changed
    return result

my_text = 'Flojo y cargantes. Decepcionantes. Decenté decentó'
my_text_lower= my_text.lower()

lemmatize_list = 'ExampleDictionary'
lemmatize_word_dict = {}
with open(lemmatize_list) as f:
    for line in f:
        kv = line.split()
        lemmatize_word_dict[kv[1]] = kv[0]

txt = replace_all(my_text_lower, lemmatize_word_dict)
print txt

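【Discussion】:

I have just tried your example, but with Python 2.7 I still have problems with words that contain special characters. With this solution my result is ` flojo y cargante . decepcionante . decent decent` instead of `flojo y cargante . decepcionante . decentar decentar`.
ETA: I have tried this solution with the accented letters hardcoded in the regex (i.e. input = re.findall(r"[\wáÁéÉíÍóÓúÚüÜñÑçÇ']+|[.,!?;]", text)), and it seems to work. I just don't know whether this is the best way to achieve that result.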
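Rather than listing every accented letter by hand, one alternative under Python 2.7 is to work on unicode strings and pass re.UNICODE, so that \w follows the Unicode definition of a word character. The following is only a sketch: it assumes the text is UTF-8 and already decoded, and the `tokenize` helper and `TOKEN_RE` names are made up for illustration. On Python 3, str is already Unicode and this behaviour is the default for str patterns.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re

# Compiled once. With re.UNICODE, \w matches accented letters such as
# é and ó when the input is a unicode string (default on Python 3 str).
TOKEN_RE = re.compile(r"[\w']+|[.,!?;]", re.UNICODE)

def tokenize(text):
    # Hypothetical helper: words and punctuation marks become separate tokens.
    return TOKEN_RE.findall(text)

tokens = tokenize("decenté decentó.")
print(tokens)
```

Each accented word now comes back as a single token, so a plain dic.get(word, word) lookup can find it.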

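【Solution 2】:

I see two problems with your code:

It also replaces a word when it occurs as part of a larger word
By replacing words one after another, you may replace (parts of) words that have already been replaced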
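Both problems can be reproduced with a couple of entries. This is a minimal sketch: the "es" and "carga" pairs are hypothetical entries chosen for illustration (they are not taken from the asker's full dictionary), and a list of pairs is used instead of a dict so that the replacement order is explicit.

```python
def chained_replace(text, pairs):
    # Same logic as the question's replace_all, but with an explicit order.
    for form, lemma in pairs:
        text = text.replace(form, lemma)
    return text

# Problem 1: a short entry matches inside a longer word.
# The hypothetical entry "es" -> "ser" fires on the tail of the word,
# so the full-word entry that follows never gets a chance to match.
substring_result = chained_replace(
    "decepcionantes",
    [("es", "ser"), ("decepcionantes", "decepcionante")],
)

# Problem 2: the output of one replacement is replaced again.
# The hypothetical entry "carga" -> "cargar" hits the already lemmatized word.
double_result = chained_replace(
    "cargantes",
    [("cargantes", "cargante"), ("carga", "cargar")],
)

print(substring_result)  # decepcionantser
print(double_result)     # cargarnte
```

With tens of thousands of entries, very short forms are almost certain to appear inside longer words, which would explain why the small sample dictionary behaves and the full one does not.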

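I suggest using re.sub with word boundaries (\b) instead of that loop, to make sure only complete words are replaced. This way you can also pass a callable as the replacement.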

import re
def replace_all(text, dic):
    return re.sub(r"\b\w+\b", lambda m: dic.get(m.group(), m.group()), text)

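【Discussion】:

This solution seems to work for words with regular characters, but not for words with special characters. For example, when testing, the result I get is flojo y cargante. decepcionante. decenté decentó, instead of the desired flojo y cargante. decepcionante. decentar decentar
@owwoow14 Hmm, it works with Python 3... maybe it is an encoding issue?
Oops. I am using Python 2.7; I will edit my question to specify that.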
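The difference between the two interpreters is consistent with how \w is defined: on Python 2.7 byte strings it covers only ASCII letters, digits, and the underscore by default, so "decenté" is never matched as one word. Below is a sketch of the same re.sub approach applied to decoded text. It assumes the data is UTF-8; `sample_lines` stands in for the lines of the dictionary file, and `load_lemmas` and `WORD_RE` are hypothetical names. With a real file, io.open(path, encoding="utf-8") yields unicode lines on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re

# re.UNICODE makes \w and \b Unicode-aware on Python 2.7 unicode strings.
WORD_RE = re.compile(r"\b\w+\b", re.UNICODE)

def load_lemmas(lines):
    # Each line is "lemma<TAB>form"; map every inflected form to its lemma.
    lemmas = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        lemma, form = line.split("\t")
        lemmas[form] = lemma
    return lemmas

def replace_all(text, dic):
    # Replace each whole word by its lemma; unknown words stay unchanged.
    return WORD_RE.sub(lambda m: dic.get(m.group(), m.group()), text)

# Stand-in for the lines of the tab-separated dictionary file.
sample_lines = [
    "flojo\tfloja",
    "cargante\tcargantes",
    "decepcionante\tdecepcionantes",
    "decentar\tdecenté",
    "decentar\tdecentó",
]

my_text = "Flojo y cargantes. Decepcionantes. Decenté decentó"
txt = replace_all(my_text.lower(), load_lemmas(sample_lines))
print(txt)  # flojo y cargante. decepcionante. decentar decentar
```

Because each word is matched once and looked up once, no replacement can be applied to the output of another, and the original spacing and punctuation are preserved.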