Sentence comparison: how to highlight differences

【Question title】: Sentence comparison: how to highlight differences
【Posted】: 2021-04-10 11:04:25
【Question】:

I have the following sequence of strings in a pandas column:

SEQ
An empty world
So the word is
So word is
No word is

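I can check similarity using fuzzywuzzy or cosine distance. However, I would like to know how to get information about the words that change position from one row to another. For example, the similarity between the first and second rows is 0, but rows 2 and 3 are similar: they contain almost the same words in the same positions. If possible, I would like to visualize this change (the missing word). The same goes for rows 3 and 4. How can I see the changes between two rows/texts?

【Discussion】:

Are you interested in comparing consecutive rows, as in your example, or all possible combinations?

Tags: python pandas cosine-similarity fuzzywuzzy sentence-similarity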

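【Solution 1】:

Assuming you are using jupyter / ipython and you are only interested in comparing each row with the one before it, I would do something like this.

The general idea is:

Find the tokens shared by the two strings (by splitting on ' ' and taking the intersection of the two sets).
Apply some html formatting to the tokens shared by the two strings.
Apply this to every row.
Output the resulting dataframe as html and render it in ipython.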

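import pandas as pd
from IPython.display import HTML

data = ['An empty world',
        'So the word is',
        'So word is',
        'No word is']

df = pd.DataFrame(data, columns=['phrase'])

def bold(tok):
    # Wrap a single token in html bold tags.
    return f'<b>{tok}</b>'

def highlight_shared(string1, string2, format_func):
    # Tokens that occur in both strings; position is ignored.
    shared_toks = set(string1.split(' ')) & set(string2.split(' '))
    # Rebuild string1, formatting only the shared tokens.
    return ' '.join(format_func(tok) if tok in shared_toks else tok
                    for tok in string1.split(' '))

# Quick check on a toy pair:
# returns '<b>the</b> <b>cat</b> sat on <b>the</b> mat'
highlight_shared('the cat sat on the mat', 'the cat is fat', bold)

# Pair every phrase with the one in the row above (empty string for the first row).
df['previous_phrase'] = df.phrase.shift(1, fill_value='')
df['tokens_shared_with_previous'] = df.apply(
    lambda x: highlight_shared(x.phrase, x.previous_phrase, bold), axis=1)

# escape=False keeps the <b> tags as markup instead of escaping them.
HTML(df.loc[:, ['phrase', 'tokens_shared_with_previous']].to_html(escape=False))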
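
The code above bolds the tokens two rows have in common, and because it works on sets it ignores word order. If the goal is to see what changed between rows (the missing or replaced word), a word-level diff may be closer to what the question asks. Here is a minimal sketch, not part of the original answer, using the standard-library difflib.SequenceMatcher. The helper name highlight_diff and the choice of <del>/<ins> tags are my own assumptions for illustration.

```python
from difflib import SequenceMatcher
from html import escape


def highlight_diff(previous, current):
    """Return an html string marking word-level changes from `previous` to `current`.

    Hypothetical helper: words only in `previous` are wrapped in <del>,
    words only in `current` in <ins>, unchanged words are left as-is.
    """
    prev_toks = previous.split()
    curr_toks = current.split()
    matcher = SequenceMatcher(a=prev_toks, b=curr_toks, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ('delete', 'replace'):
            # Words dropped from the previous phrase.
            parts += [f'<del>{escape(tok)}</del>' for tok in prev_toks[i1:i2]]
        if tag in ('insert', 'replace'):
            # Words that are new in the current phrase.
            parts += [f'<ins>{escape(tok)}</ins>' for tok in curr_toks[j1:j2]]
        if tag == 'equal':
            # Words present in both phrases at matching positions.
            parts += [escape(tok) for tok in curr_toks[j1:j2]]
    return ' '.join(parts)


print(highlight_diff('So the word is', 'So word is'))
# So <del>the</del> word is
print(highlight_diff('So word is', 'No word is'))
# <del>So</del> <ins>No</ins> word is
```

It should slot into the same pipeline as the answer's code, e.g. df.apply(lambda x: highlight_diff(x.previous_phrase, x.phrase), axis=1), followed by to_html(escape=False) as before. Note that the first row is compared against an empty string, so all of its words come out as inserted.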

【Discussion】:

Hi, thanks for your answer. I get this error: NameError: name 'apply_formats' is not defined. Do you know how to fix it?
Yes, I have edited the code and it runs now :) I replaced a function called 'apply_formats' with 'highlight_shared'.