Highlight certain words that appear in sequence: answers

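【Question title】: Highlight certain words that appear in sequence
【Posted】: 2017-10-23 11:34:01
【Question】:

I am trying to print a text while highlighting certain words and word bigrams. This would be fairly straightforward if I did not also have to print the other tokens, such as punctuation.

I have a list of words to highlight and another list of word bigrams to highlight.

Highlighting individual words is fairly easy, for example: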

import re
import string

regex_pattern = re.compile("([%s \n])" % string.punctuation)

def highlighter(content, terms_to_hightlight):
    tokens = regex_pattern.split(content)
    for token in tokens:
        if token.lower() in terms_to_hightlight:
            print('\x1b[6;30;42m' + token + '\x1b[0m', end="")
        else:
            print(token, end="")

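Highlighting only words that appear in sequence is more complicated. I have been playing around with iterators, but I have not been able to come up with anything that is not overly complicated.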

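【Question comments】:

Could you provide an example of where your highlighter function works as expected and one where it does not? Hint: what does "words that appear in sequence" look like to you?
You could first split the text into a list, as you already do, and then iterate over that list, checking whether the current element and the next element form a valid bigram. If they do, push the word "highlighted" onto a separate list; otherwise push it "unhighlighted". Make sure you always check whether the current item (of the new list) was already highlighted by the previous bigram.
@not_a_robot He is probably looking for word bigrams, meaning two consecutive words. He is trying to highlight a pair of words if it appears in the bigram list. This leads to an overlap problem.
Exactly! I am trying to highlight words that appear in the list of word bigrams. This means the words should only be highlighted if they actually appear in sequence, with no other word between them!
@Mountain_sheep Welcome to Stack Overflow. As you are new here, I just wanted to say that there is no need to add the code language to your question; that is all handled by the tags. By the power of tags, and so on! :)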
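
The two-pass idea suggested in the comments above (mark first, print afterwards) could look roughly like the sketch below. It is only an illustration under a few assumptions: the names `mark_bigrams` and `TOKEN_SPLIT` and the `[` `]` markers are made up here, punctuation between two words is ignored when pairing them, and every word of a matching bigram is wrapped on its own, so overlapping bigrams are all marked instead of only the first one.

```python
import re
import string

# Same tokenizer idea as in the question; re.escape is used here so that
# characters such as "]" or "\" cannot break the character class.
TOKEN_SPLIT = re.compile("([%s \n])" % re.escape(string.punctuation))

def mark_bigrams(content, bigrams, left="[", right="]"):
    """Wrap every word that belongs to a listed bigram in left/right markers."""
    tokens = TOKEN_SPLIT.split(content)
    # Indices of real words (skip delimiters and the empty strings split() emits).
    word_idx = [i for i, t in enumerate(tokens)
                if t and not TOKEN_SPLIT.fullmatch(t)]
    bigram_set = {(a.lower(), b.lower()) for a, b in bigrams}
    marked = set()
    # Pass 1: compare each word with the following word; remember both on a hit.
    for i, j in zip(word_idx, word_idx[1:]):
        if (tokens[i].lower(), tokens[j].lower()) in bigram_set:
            marked.update((i, j))
    # Pass 2: rebuild the text, wrapping only the marked words.
    return "".join(left + t + right if i in marked else t
                   for i, t in enumerate(tokens))

print(mark_bigrams("A yellow banana boat sank.",
                   [("yellow", "banana"), ("banana", "boat")]))
```

Because a word is stored in the `marked` set only once, a word shared by two bigrams is not wrapped twice, which is the check the comment asks for.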

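Tags: python regex string iterator

【Solution 1】:

If I understand the question correctly, one solution is to look ahead to the next word token and check whether the resulting bigram is in the list.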

import re
import string

regex_pattern = re.compile("([%s \n])" % string.punctuation)

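def find_next_word(tokens, idx):
    # Return the next word token after position idx together with its index,
    # or (None, -1) if only punctuation/whitespace remains.
    nonword = string.punctuation + " \n"
    for i in range(idx+1, len(tokens)):
        # Substring test: single delimiter characters and the empty strings
        # that split() emits between adjacent delimiters both count as non-words.
        if tokens[i] not in nonword:
            return (tokens[i], i)
    return (None, -1)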

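def highlighter(content, terms, bigrams):
    tokens = regex_pattern.split(content)
    idx = 0
    while idx < len(tokens):
        token = tokens[idx]
        # Look ahead to the next word, skipping punctuation and whitespace.
        (next_word, nw_idx) = find_next_word(tokens, idx)
        if token.lower() in terms:
            # Single terms are checked first and marked with *...*
            print('*' + token + '*', end="")
            idx += 1
        elif next_word and (token.lower(), next_word.lower()) in bigrams:
            # Mark the whole span, including the tokens between the two words, with -...-
            concat = "".join(tokens[idx:nw_idx+1])
            print('-' + concat + '-', end="")
            # Continue after the second word of the bigram.
            idx = nw_idx + 1
        else:
            print(token, end="")
            idx += 1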

terms = ['man', 'the']
bigrams = [('once', 'upon'), ('i','was')] 
text = 'Once upon a time, as I was walking to the city, I met a man. As I was tired, I did not look once... upon this man.'
highlighter(text, terms, bigrams)

When called, this gives:

-Once upon- a time, as -I was- walking to *the* city, I met a *man*. As -I was- tired, I did not look -once... upon- this *man*.

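Please note:

This is a greedy algorithm: it matches the first bigram it finds. For example, if you check for yellow banana and banana boat, then yellow banana boat is always highlighted as -yellow banana- boat. If you want different behaviour, you should update the test logic.
You may also want to update the logic to handle the case where a word appears both in terms and as the first part of a bigram. As written, the terms check runs first, so such a word is highlighted on its own and the bigram is never tested.
I have not tested all the edge cases; some things may be broken, and there may be fencepost (off-by-one) errors.
If necessary, you can optimize performance by:
- building a list of the first words of the bigrams, and checking whether a word is in it before doing the lookahead for the next word
- and/or using the result of the lookahead to handle all the non-word tokens between two words in one step (implementing this should be enough to ensure linear performance)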
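
The two optimizations listed above might be combined along these lines. This is a sketch, not a tested replacement: the names `highlight`, `is_word`, `TOKEN_SPLIT` and `NONWORD` are invented for the example, the function returns a string instead of printing, and it assumes that the entries of terms and bigrams are words rather than punctuation. The terms check still runs first and matching is still greedy, as in the code above.

```python
import re
import string

# Tokenizer as above, with the punctuation escaped for safety.
TOKEN_SPLIT = re.compile("([%s \n])" % re.escape(string.punctuation))
NONWORD = set(string.punctuation + " \n")

def is_word(token):
    # Empty strings (emitted between adjacent delimiters) are not words.
    return bool(token) and token not in NONWORD

def highlight(content, terms, bigrams):
    tokens = TOKEN_SPLIT.split(content)
    terms = {t.lower() for t in terms}
    bigrams = {(a.lower(), b.lower()) for a, b in bigrams}
    first_words = {a for a, _ in bigrams}  # words that can start a bigram
    out = []
    idx = 0
    while idx < len(tokens):
        token = tokens[idx]
        low = token.lower()
        if low in terms:
            out.append("*" + token + "*")
            idx += 1
        elif low in first_words:
            # Look ahead only when the token can actually start a bigram.
            nxt = idx + 1
            while nxt < len(tokens) and not is_word(tokens[nxt]):
                nxt += 1
            if nxt < len(tokens) and (low, tokens[nxt].lower()) in bigrams:
                out.append("-" + "".join(tokens[idx:nxt + 1]) + "-")
                idx = nxt + 1
            else:
                # No match: emit the word and the non-word tokens already
                # scanned in one step (assumes terms contains only words).
                out.extend(tokens[idx:nxt])
                idx = nxt
        else:
            out.append(token)
            idx += 1
    return "".join(out)

print(highlight("yellow banana boat", [],
                [("yellow", "banana"), ("banana", "boat")]))
```

Each token is scanned by at most one lookahead and emitted once, so the running time should grow linearly with the number of tokens; set lookups replace the list membership tests of the original.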

Hope this helps.

【Comments】: