【Question title】:Replacing strings with placeholders and replacing them back after a function
【Posted】:2018-08-22 16:49:30
【Question】:

Given a string and a list of substrings that should be replaced with placeholders, e.g.

import re
from copy import copy 

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

The first goal is to replace the substrings from phrases found in original_text with indexed placeholders, e.g.

text = copy(original_text)
backplacement = {}
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

[out]:

Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen

Then there'll be some functions that manipulate the text with the placeholders, e.g.

cleaned_text = func('Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen')
print(cleaned_text)

[out]:

MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2

The last step is to do the replacement in the backward direction and put the original phrases back, i.e.

' '.join([backplacement[tok] if tok in backplacement else tok for tok in cleaned_text.split()])

[out]:

"'s_morgen ik 's-Hertogenbosch depository_financial_institution"

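The problems are:

  1. If the list of substrings in phrases is huge, the first and the last replacement take very long.

Is there a way to do the replacement/back-placement with a regex?

  2. Using re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text) for the substitution isn't very helpful, especially if a substring in phrases matches only part of a word,

e.g.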

phrases = ["org", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

We get an awkward output:

Something, 's mMWEPHRASE0en, ik MWEPHRASE1 im das MWEPHRASE2 gehen

I've tried using '\b{}\b'.format(phrase), but that doesn't work for phrases with punctuation, i.e.

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"\b{}\b".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

[out]:

Something, 's morgen, ik 's-Hertogenbosch im das MWEPHRASE2 gehen

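Is there some way to denote a word boundary for the phrases in the re.sub regex pattern?

【Question comments】:

  • In your desired output, all strings that do not appear in phrases are removed, except for ik. Why is that?
  • You're making this hard on yourself. "Then there'll be some functions to manipulate the text with the placeholders": so you process the text after the placeholders are added, and that function has to split on whitespace or something similar. So now you have an array in which you manipulate every element except the placeholders, then you want to join the array back into a string and replace the placeholders with the real words. Is that right?
  • For a single pass, I'd use a regex to match all the words and put them into a 2-dimensional array (or list). Dimension 0 is the string part, dimension 1 is a flag. The flag is 0 when a non-phrase string part is matched and 1 when it is a phrase word. You can then iterate over the array and ignore the parts flagged 1. Add, delete, and rearrange elements as needed, then join them back together. The regex is simply ((?:(?!phrase1|phrase2|phrase3)[\S\s])+)|(phrase1|phrase2|phrase3), where capture group 1 is the non-phrase string part and capture group 2 is a phrase.
  • This looks like another option: github.com/vi3k6i5/flashtext
  • As for the word boundaries, you definitely want r"(?<!\w){}(?!\w)".format(phrase). Since some of your keywords start with a non-word character, you can't use \b. Could you provide more of the logic you need to implement? It looks like you may need to pass a callback/lambda as the second argument to re.sub so that every match is replaced in one go.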
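
The lookaround boundaries from the last comment can be combined with a single compiled alternation and a callback, so the text is scanned once regardless of how many phrases there are. A minimal sketch, assuming the question's phrases plus the troublesome "org"; the helper names to_placeholders and from_placeholders are made up for illustration, and the phrases are sorted longest first so that a longer phrase wins over a shorter one it contains:

```python
import re

phrases = ["org", "'s morgen", "'s-Hertogenbosch", "depository financial institution"]
text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

# Forward and backward lookup tables (names are illustrative).
placeholder_of = {p: "MWEPHRASE{}".format(i) for i, p in enumerate(phrases)}
phrase_of = {v: k.replace(" ", "_") for k, v in placeholder_of.items()}

# One alternation, longest phrase first; re.escape keeps metacharacters literal.
alternation = "|".join(re.escape(p) for p in sorted(phrases, key=len, reverse=True))
# (?<!\w) and (?!\w) act as word boundaries that also work next to punctuation.
forward = re.compile(r"(?<!\w)(?:{})(?!\w)".format(alternation))
backward = re.compile(r"MWEPHRASE\d+")

def to_placeholders(s):
    """Replace every phrase with its placeholder in a single pass."""
    return forward.sub(lambda m: placeholder_of[m.group(0)], s)

def from_placeholders(s):
    """Put the phrases back; unknown placeholders are left untouched."""
    return backward.sub(lambda m: phrase_of.get(m.group(0), m.group(0)), s)

tokenized = to_placeholders(text)
restored = from_placeholders("MWEPHRASE1 ik MWEPHRASE2 MWEPHRASE3")
print(tokenized)
print(restored)
```

Because "org" inside "morgen" is preceded by a word character, the lookbehind rejects it, while "'s morgen" still matches even though it starts with an apostrophe.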

Tags: python regex string replace placeholder


【Solution 1】:

Instead of using re.sub, you can split the text!

def do_something_with_str(string):
    # do something with string here.
    # for example let's wrap the string with "@" symbol if it's not empty
    return f"@{string}" if string else string


def get_replaced_list(string, words):
    result = [(string, True), ]

    # we take each word we want to replace
    for w in words:

        new_result = []

        # Getting each word in old result
        for r in result:

            # Now we split every string in results using our word.
            split_list = list((x, True) for x in r[0].split(w)) if r[1] else list([r, ])

            # If we replace successfully - add all the strings
            if len(split_list) > 1:

                # This one would be for [text, replaced, text, replaced...]
                sub_result = []
                ws = [(w, False), ] * (len(split_list) - 1)
                for x, replaced in zip(split_list, ws):
                    sub_result.append(x)
                    sub_result.append(replaced)
                sub_result.append(split_list[-1])

                # Add to new result
                new_result.extend(sub_result)

            # If not - just add it to results
            else:
                new_result.extend(split_list)
        result = new_result
    return result


if __name__ == '__main__':
    initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
    words_to_replace = ('a', 'c')
    replaced_list = get_replaced_list(initial_string, words_to_replace)
    modified_list = [(do_something_with_str(x[0]), True) if x[1] else x for x in replaced_list]
    final_string = ''.join([x[0] for x in modified_list])

Here are the variable values for the example above:

initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
words_to_replace = ('a', 'c')
replaced_list = [('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True)]
modified_list = [('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True)]
final_string = 'ac@bbc@bbcac@bbc@bbcac@bbc@bbca'

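As you can see, the list contains tuples. Each holds two values, some string and a boolean, where the boolean indicates whether the string is text or a replaced value (True for text). Once you have the replaced list you can modify it as in the example, checking whether each item is a text value (if x[1] == True). Hope this helps!

P.S. String formatting like f"some string here {some_variable_here}" requires Python 3.6 or later.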
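
The same text/replaced segmentation can also be produced in one pass with re.split, which keeps the separators in the result when the pattern contains a capturing group. A sketch under that assumption; the segment helper is hypothetical and not part of the answer above, and it expects a non-empty collection of words:

```python
import re

def segment(string, words):
    """Split string into (chunk, is_text) tuples; is_text is False for matched words."""
    ordered = sorted(words, key=len, reverse=True)  # longest first
    pattern = "(" + "|".join(re.escape(w) for w in ordered) + ")"
    parts = re.split(pattern, string)
    # With one capturing group, re.split alternates: text, match, text, ..., text
    return [(part, i % 2 == 0) for i, part in enumerate(parts)]

segments = segment("acbbcbbca", ("a", "c"))
# Wrap only non-empty text chunks with "@", leave the matched words alone.
modified = ["@" + s if is_text and s else s for s, is_text in segments]
final_string = "".join(modified)
print(final_string)
```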

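【Comments】:

    【Solution 2】:

    I think there are two keys to using regex for this task:

    1. Use custom boundaries, capture them, and substitute them back along with the phrase.

    2. Use a function to process the substitution matches in both directions.

    Below is an implementation that uses this approach. I've tweaked your text slightly to repeat one of the phrases.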

    import re
    from copy import copy 
    
    original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen 's morgen"
    text = copy(original_text)
    
    #
    # The phrases of interest
    #
    phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
    
    #
    # Create the mapping dictionaries
    #
    phrase_to_mwe = {}
    mwe_to_phrase = {}
    
    #
    # Build the mappings
    #
    for i, phrase in enumerate(phrases):
    
        mwephrase                = "MWEPHRASE{}".format(i)
        mwe_to_phrase[mwephrase] = phrase.replace(' ', '_')
        phrase_to_mwe[phrase]    = mwephrase
    
    #
    # Regex match handlers
    #
    def handle_forward(match):
    
        b1     = match.group(1)
        phrase = match.group(2)
        b2     = match.group(3)
    
        return b1 + phrase_to_mwe[phrase] + b2
    
    
    def handle_backward(match):
    
        return mwe_to_phrase[match.group(1)]
    
    #
    # The forward regex is equivalent to:
    #
    #    (^|[ ])('s morgen|'s-Hertogenbosch|depository financial institution)([, ]|$)
    # 
    # which captures three components:
    #
    #    (1) Front boundary
    #    (2) Phrase
    #    (3) Back boundary
    #
    # Anchors allow matching at the beginning and end of the text. Additional boundary characters can be
    # added as necessary, e.g. to allow semicolons after a phrase, we could update the back boundary to:
    #
    #    ([,; ]|$)
    #
    # re.escape keeps any regex metacharacters in a phrase literal.
    #
    regex_forward  = re.compile(r'(^|[ ])(' + '|'.join(map(re.escape, phrases)) + r')([, ]|$)')
    regex_backward = re.compile(r'(MWEPHRASE\d+)')
    
    #
    # Pretend we cleaned the text in the middle
    #
    cleaned = 'MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2 MWEPHRASE0'
    
    #
    # Do the translations
    #
    text1 = regex_forward .sub(handle_forward,  text)
    text2 = regex_backward.sub(handle_backward, cleaned)
    
    print('original: {}'.format(original_text))
    print('text1   : {}'.format(text1))
    print('text2   : {}'.format(text2))
    

    Running this produces:

    original: Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen 's morgen
    text1   : Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen MWEPHRASE0
    text2   : 's_morgen ik 's-Hertogenbosch depository_financial_institution 's_morgen
    

    【Comments】:

      【Solution 3】:

      You could use the following strategy:

      phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
      original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
      
      # need this module for the reduce function
      import functools as fn
      
      #convert phrases into a dictionary of numbered placeholders (tokens)
      tokens = { kw:"MWEPHRASE%s"%i for i,kw in enumerate(phrases) }
      
      #replace embedded phrases with their respective token
      tokenized = fn.reduce(lambda s,kw: tokens[kw].join(s.split(kw)), phrases, original_text)
      
      #Apply text cleaning logic on the tokenized text 
      #This assumes the placeholders are left untouched, 
      #although it's ok to move them around)
      cleaned_text = cleanUpfunction(tokenized)
      
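      #reverse the token dictionary (to map numbered placeholders back to original phrases)
      unTokens = {v:k for k,v in tokens.items() }
      
      #rebuild phrases with original text associated to each token (placeholder)
      #iterate over the placeholders (not the phrases), longest first so MWEPHRASE10 is restored before MWEPHRASE1
      final_text = fn.reduce(lambda s,tok: unTokens[tok].join(s.split(tok)), sorted(unTokens, key=len, reverse=True), cleaned_text)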
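
Restoring by repeated split/join is sensitive to replacement order once there are more than ten phrases, because the token MWEPHRASE1 is a prefix of MWEPHRASE10. A sketch that avoids the ordering question by restoring all tokens in one regex pass; the restore helper and the twelve single-letter dummy phrases are assumptions for illustration only:

```python
import re

phrases = list("abcdefghijkl")  # dummy data: phrases[1] == "b", phrases[10] == "k"
token_re = re.compile(r"MWEPHRASE(\d+)")

def restore(cleaned_text, phrases):
    """Replace each MWEPHRASE<n> token with phrases[n].

    The digit group is greedy, so MWEPHRASE10 is read as token 10,
    never as token 1 followed by a literal 0.
    """
    def lookup(match):
        index = int(match.group(1))
        # Leave tokens with an unknown index untouched.
        return phrases[index] if index < len(phrases) else match.group(0)
    return token_re.sub(lookup, cleaned_text)

# Naive string replacement of the shorter token corrupts the longer one.
naive = "MWEPHRASE10 MWEPHRASE1".replace("MWEPHRASE1", phrases[1])
safe = restore("MWEPHRASE10 MWEPHRASE1", phrases)
print(naive)
print(safe)
```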
      

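      【Comments】:

        【Solution 4】:

        What you're looking for is called "multi-string search" or "multi-pattern search". The most common solutions are the Aho-Corasick and Rabin-Karp algorithms. If you want to implement it yourself, go for Rabin-Karp, as it is easier to grasp. Otherwise, you'll find some libraries. Here is a solution using the library https://pypi.python.org/pypi/py_aho_corasick.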
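
For a feel of what implementing it yourself involves, here is a minimal Rabin-Karp sketch for several patterns. The function name rabin_karp_multi and the base and modulus values are arbitrary choices for illustration, and the sketch only reports match positions (the replacement step is left out):

```python
def rabin_karp_multi(text, patterns, base=256, mod=1_000_000_007):
    """Return sorted (position, pattern) pairs for every occurrence of any pattern."""
    def poly_hash(s):
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    # Group patterns by length, because a rolling hash works on a fixed-size window.
    by_length = {}
    for p in patterns:
        if p:
            by_length.setdefault(len(p), {}).setdefault(poly_hash(p), set()).add(p)

    found = []
    for m, table in by_length.items():
        if m > len(text):
            continue
        high = pow(base, m - 1, mod)  # weight of the character leaving the window
        h = poly_hash(text[:m])
        for i in range(len(text) - m + 1):
            if h in table:
                window = text[i:i + m]
                if window in table[h]:  # verify, to rule out hash collisions
                    found.append((i, window))
            if i + m < len(text):
                # Slide the window: drop text[i], append text[i + m].
                h = ((h - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return sorted(found)

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
hits = rabin_karp_multi(original_text, phrases)
print(hits)
```

The text is scanned once per distinct pattern length, so this pays off mainly when many phrases share a few lengths.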

        phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
        original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
        

        And, for testing purposes:

        def clean(text):
            """A simple stub"""
            assert text == 'Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen'
            return "MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2"
        

        Now, you have to define two automatons, one for the outward trip and one for the return. An automaton is defined by a list of (key, value) pairs:

        fore_automaton = py_aho_corasick.Automaton([(phrase,"MWEPHRASE{}".format(i)) for i, phrase in enumerate(phrases)])
        back_automaton = py_aho_corasick.Automaton([("MWEPHRASE{}".format(i), phrase.replace(' ','_')) for i, phrase in enumerate(phrases)])
        

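        The automaton scans the text and returns a list of matches. Matches are triples (position, key, value). With a little work on the matches, you can replace the keys with the values: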

        def process(automaton, text):
            """Returns a new text, with keys of the automaton replaced by values"""
            matches = automaton.get_keywords_found(text.lower()) # text.lower() because the automaton of py_aho_corasick uses lowercase for keys
            bk_value_eks = [(i,v,i+len(k)) for i,k,v in matches] # (begin of key, value, end of key)
            chunks = [bk_value_ek1[1]+text[bk_value_ek1[2]:bk_value_ek2[0]] for bk_value_ek1,bk_value_ek2 in zip([(-1,"",0)]+bk_value_eks, bk_value_eks+[(len(text),"",-1)]) if bk_value_ek1[2] <= bk_value_ek2[0]] # see below
            return "".join(chunks)
        

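        A brief explanation of chunks = [bk_value_ek1[1]+text[bk_value_ek1[2]:bk_value_ek2[0]] for bk_value_ek1,bk_value_ek2 in zip([(-1,"",0)]+bk_value_eks, bk_value_eks+[(len(text),"",-1)]) if bk_value_ek1[2] <= bk_value_ek2[0]]. I zip the matches with themselves, almost in the usual way: zip(arr, arr[1:]) yields (arr[0], arr[1]), (arr[1], arr[2]), ... so that each match is considered together with its successor. Here I added two sentinels to handle the beginning and the end of the matches.

        • In the normal case, I just output the value (= bk_value_ek1[1]) followed by the text between the end of the key and the beginning of the next key (text[bk_value_ek1[2]:bk_value_ek2[0]]).
        • The begin sentinel has an empty value and its key ends at position 0, so the first chunk is "" + text[0:begin of key1], i.e. the text before the first key.
        • Similarly, the end sentinel has an empty value and its key begins at the end of the text, so the last chunk is: value of the last match + text[end of the last key:len(text)].

        What happens when keys overlap? Take an example: text="abcdef" and phrases={"bcd":"1", "cde":"2"}. You have two matches: (1, "bcd", "1") and (2, "cde", "2"). Here we go: bk_value_eks = [(1, "1", 4), (2, "2", 5)]. Thus, without the if bk_value_ek1[2] <= bk_value_ek2[0], the text would be replaced by text[:1]+"1"+text[4:2]+"2"+text[5:], that is "a"+"1"+""+"2"+"f" = "a12f", instead of "a1ef" (ignoring the second match)...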
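
A loop-based sketch of the same replacement that resolves overlaps by keeping the earlier match and skipping any match that starts before the previous one ended. replace_matches is a hypothetical helper that works on plain (position, key, value) triples, so it does not depend on py_aho_corasick:

```python
def replace_matches(text, matches):
    """Replace each (position, key, value) match, skipping overlapping ones."""
    chunks = []
    cursor = 0  # index just past the last replaced key
    for start, key, value in sorted(matches):
        if start < cursor:
            continue  # overlaps the previously accepted match: ignore it
        chunks.append(text[cursor:start])  # untouched text before the key
        chunks.append(value)               # the replacement itself
        cursor = start + len(key)
    chunks.append(text[cursor:])           # untouched text after the last key
    return "".join(chunks)

result = replace_matches("abcdef", [(1, "bcd", "1"), (2, "cde", "2")])
print(result)
```

Keeping an explicit cursor makes the overlap rule visible in one place: a match is accepted only if it begins at or after the end of the last accepted match.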

        Now, look at the result:

        print(process(back_automaton, clean(process(fore_automaton, original_text))))
        # "'s_morgen ik 's-Hertogenbosch depository_financial_institution"
        

        You don't have to define a new process function for the return trip; just give it back_automaton and it will do the job.

        【Comments】:
