【Question title】:Python difflib: compare two CSV files and highlight word-level differences in the HTML output
【Posted】:2017-06-01 21:38:09
【Question】:

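I'm no Python expert. I did my best to find an answer and couldn't. If this is a duplicate question, please forgive me and point me in the right direction.

I'm trying to use Python's difflib to compare two CSV files and produce the diff as an HTML page. The diff script from the difflib documentation has a built-in -m option that generates side-by-side HTML output of the two CSV files with the differences highlighted.

difflib uses difflib.SequenceMatcher to find the differences and difflib.HtmlDiff.make_file to create the HTML file. However, the output it produces is not what I want.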
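To make the difference concrete, here is a minimal sketch of what SequenceMatcher reports for one of the sample rows, first when it compares the raw strings and then when it compares lists of comma-separated fields. The helper name `changed_opcodes` is made up for this illustration, and splitting on a bare comma is an assumption that only holds for CSV data without quoted fields.

```python
import difflib

old = "1004,General Knowledge,Sinson,2007,465"
new = "1004,General Knowledge,Simpson,2007,465"

def changed_opcodes(a, b):
    """Return the non-equal opcodes SequenceMatcher finds between a and b."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]

# Character level: the sequences are the strings themselves, so only the
# characters inside "Sinson" / "Simpson" that differ are reported.
char_ops = changed_opcodes(old, new)

# Field level: the sequences are lists of fields, so the whole third field
# is reported as replaced.
field_ops = changed_opcodes(old.split(","), new.split(","))

print(char_ops)
print(field_ops)
```

Each opcode is a tuple (tag, i1, i2, j1, j2) whose indices refer to positions in the two sequences, so the same call yields character offsets in the first case and field numbers in the second.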

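The output I currently get from difflib is the default difflib HTML output.

What I want instead is word-level highlighting, rather than changes highlighted at the character or sequence level. If anything changes between the old file and the new file, I want the WHOLE WORD highlighted.

The change I want highlighted is: a word-level highlight of the text.

Please help me with this: can it really be done with difflib, or do I have to use some other tool or module? I tried vimdiff and other plugins but got nowhere. I'm open to anything here.

The code I'm using comes from the Python difflib documentation page.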

import sys, os, time, difflib, optparse

def main():
    ..
    ..
    ..
    n = options.lines  # I set n to zero here.
    fromfile, tofile = args  # as specified in the usage string

    # we're passing these as arguments to the diff function
    fromdate = time.ctime(os.stat(fromfile).st_mtime)
    todate = time.ctime(os.stat(tofile).st_mtime)
    # text mode already translates newlines in Python 3; the old 'U' flag is not needed
    with open(fromfile) as f:
        fromlines = f.readlines()
    with open(tofile) as f:
        tolines = f.readlines()

    diff = difflib.HtmlDiff().make_file(fromlines, tolines, fromfile,
                                        tofile, context=True,
                                        numlines=0)

    # make_file returns the whole HTML page as one string
    sys.stdout.write(diff)

old.csv

refno,title,author,year,price
1001,CPP,MILTON,2008,456
1002,JAVA,Gilson,2002,456
1003,Adobe Flex,2010,566
1004,General Knowledge,Sinson,2007,465
1005,Actionscript,Gilto,2008,480

new.csv

refno,title,author,year,price
1001,CPP,MILTON,2010,456,2008
1002,JAVA,Gilson,2002
1003,Adobe Flexi,Johnson,2010,566
1004,General Knowledge,Simpson,2007,465
105,Action script,Gilto,2008,480
2000,Drama,DayoNe,,2020,560

I have also added the default HTML diff output and the expected HTML diff output below.

Default HTML diff output from difflib:

<html>

<head>
<meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1" />
<title></title>
<style type="text/css">
table.diff {font-family:Courier; border:medium;}
.diff_header {background-color:#e0e0e0}
td.diff_header {text-align:right}
.diff_next {background-color:#c0c0c0}
.diff_add {background-color:#aaffaa}
.diff_chg {background-color:#ffff77}
.diff_sub {background-color:#ffaaaa}
</style>
</head>

<body>

<table class="diff" id="difflib_chg_to0__top"
cellspacing="0" cellpadding="0" rules="groups" >
<colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
<colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
<thead><tr><th class="diff_next"><br /></th><th colspan="2" class="diff_header">old.csv</th><th class="diff_next"><br /></th><th colspan="2" class="diff_header">new.csv</th></tr></thead>
<tbody>
<tr><td class="diff_next" id="difflib_chg_to0__0"><a href="#difflib_chg_to0__top">t</a></td><td class="diff_header" id="from0_2">2</td><td nowrap="nowrap">1001,CPP,MILTON,200<span class="diff_sub">8</span>,456</td><td class="diff_next"><a href="#difflib_chg_to0__top">t</a></td><td class="diff_header" id="to0_2">2</td><td nowrap="nowrap">1001,CPP,MILTON,20<span class="diff_add">1</span>0,456<span class="diff_add">,2008</span></td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_3">3</td><td nowrap="nowrap">1002,JAVA,Gilson,2002<span class="diff_sub">,456</span></td><td class="diff_next"></td><td class="diff_header" id="to0_3">3</td><td nowrap="nowrap">1002,JAVA,Gilson,2002</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_4">4</td><td nowrap="nowrap">1003,Adobe&nbsp;Flex,2010,566</td><td class="diff_next"></td><td class="diff_header" id="to0_4">4</td><td nowrap="nowrap">1003,Adobe&nbsp;Flex<span class="diff_add">i,Johnson</span>,2010,566</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_5">5</td><td nowrap="nowrap">1004,General&nbsp;Knowledge,Si<span class="diff_chg">n</span>son,2007,465</td><td class="diff_next"></td><td class="diff_header" id="to0_5">5</td><td nowrap="nowrap">1004,General&nbsp;Knowledge,Si<span class="diff_chg">mp</span>son,2007,465</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_6">6</td><td nowrap="nowrap">1<span class="diff_sub">0</span>05,Actionscript,Gilto,2008,480</td><td class="diff_next"></td><td class="diff_header" id="to0_6">6</td><td nowrap="nowrap">105,Action<span class="diff_add">&nbsp;</span>script,Gilto,2008,480</td></tr>
<tr><td class="diff_next"></td><td class="diff_header"></td><td nowrap="nowrap"></td><td class="diff_next"></td><td class="diff_header" id="to0_7">7</td><td nowrap="nowrap"><span class="diff_add">2000,Drama,DayoNe,,2020,560</span></td></tr>
</tbody>
</table>

</body>

</html>

Expected HTML diff output from difflib:

<html>

<head>
<meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1" />
<title></title>
<style type="text/css">
table.diff {font-family:Courier; border:medium;}
.diff_header {background-color:#e0e0e0}
td.diff_header {text-align:right}
.diff_next {background-color:#c0c0c0}
.diff_add {background-color:#aaffaa}
.diff_chg {background-color:#ffff77}
.diff_sub {background-color:#ffaaaa}
</style>
</head>

<body>

<table class="diff" id="difflib_chg_to0__top"
cellspacing="0" cellpadding="0" rules="groups" >
<colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
<colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
<thead><tr><th class="diff_next"><br /></th><th colspan="2" class="diff_header">old.csv</th><th class="diff_next"><br /></th><th colspan="2" class="diff_header">new.csv</th></tr></thead>
<tbody>
<tr><td class="diff_next" id="difflib_chg_to0__0"><a href="#difflib_chg_to0__top">t</a></td><td class="diff_header" id="from0_2">2</td><td nowrap="nowrap">1001,CPP,MILTON,<span class="diff_sub">2008</span>,456</td><td class="diff_next"><a href="#difflib_chg_to0__top">t</a></td><td class="diff_header" id="to0_2">2</td><td nowrap="nowrap">1001,CPP,MILTON,<span class="diff_add">2010</span>,456<span class="diff_add">,2008</span></td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_3">3</td><td nowrap="nowrap">1002,JAVA,Gilson,2002<span class="diff_sub">,456</span></td><td class="diff_next"></td><td class="diff_header" id="to0_3">3</td><td nowrap="nowrap">1002,JAVA,Gilson,2002</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_4">4</td><td nowrap="nowrap">1003,<span class="diff_sub">Adobe&nbsp;Flex</span>,2010,566</td><td class="diff_next"></td><td class="diff_header" id="to0_4">4</td><td nowrap="nowrap">1003,<span class="diff_add">Adobe&nbsp;Flexi</span>,<span class="diff_add">Johnson</span>,2010,566</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_5">5</td><td nowrap="nowrap">1004,General&nbsp;Knowledge,<span class="diff_sub">Sinson</span>,2007,465</td><td class="diff_next"></td><td class="diff_header" id="to0_5">5</td><td nowrap="nowrap">1004,General&nbsp;Knowledge,<span class="diff_add">Simpson</span>,2007,465</td></tr>
<tr><td class="diff_next"></td><td class="diff_header" id="from0_6">6</td><td nowrap="nowrap"><span class="diff_sub">1005</span>,<span class="diff_sub">Actionscript</span>,Gilto,2008,480</td><td class="diff_next"></td><td class="diff_header" id="to0_6">6</td><td nowrap="nowrap"><span class="diff_add">105</span>,<span class="diff_add">Action&nbsp;script</span>,Gilto,2008,480</td></tr>
<tr><td class="diff_next"></td><td class="diff_header"></td><td nowrap="nowrap"></td><td class="diff_next"></td><td class="diff_header" id="to0_7">7</td><td nowrap="nowrap"><span class="diff_add">2000,Drama,DayoNe,,2020,560</span></td></tr>
</tbody>
</table>

</body>

</html>

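【Comments】:

  • I would try to find the marked fragments and extend the marks to the word boundaries.
  • Another way is to copy the difflib code and modify it to tag words.
  • I tried to do that. However, I'm not a great Python expert. I looked at difflib, and it uses SequenceMatcher and opcodes for the tagging. I couldn't find a way to make it tag whole words. Can you tell me where to look for this?
  • It is related, but it doesn't solve my problem. I guess nobody is interested in difflib.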
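The first comment's idea (find the marked fragments, then widen them to word boundaries) could look roughly like the sketch below. It works on one pair of lines, takes the character-level opcodes from SequenceMatcher and widens every changed span to the surrounding commas. The function names `expand_to_fields` and `changed_field_spans` are hypothetical, and treating a bare comma as the word boundary is an assumption that fits the sample data but not quoted CSV fields.

```python
import difflib

def expand_to_fields(line, start, end, sep=","):
    """Widen the non-empty span [start, end) to whole sep-delimited fields."""
    if line[start] == sep:                  # change begins on a separator: skip it
        start += 1
    left = line.rfind(sep, 0, start) + 1    # rfind gives -1 (so 0) in the first field
    right = line.find(sep, end)
    if right == -1:                         # change reaches into the last field
        right = len(line)
    return left, right

def changed_field_spans(old, new, sep=","):
    """Return the changed text of `old` and `new`, widened to whole fields."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    old_spans, new_spans = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        for line, lo, hi, spans in ((old, i1, i2, old_spans),
                                    (new, j1, j2, new_spans)):
            if hi > lo:                     # skip the empty side of an insert/delete
                span = expand_to_fields(line, lo, hi, sep)
                if span not in spans:       # several edits may fall in one field
                    spans.append(span)
    return ([old[a:b] for a, b in old_spans],
            [new[a:b] for a, b in new_spans])

print(changed_field_spans("1004,General Knowledge,Sinson,2007,465",
                          "1004,General Knowledge,Simpson,2007,465"))
```

This is only a sketch: a change that ends exactly on a comma is widened into the following field as well, and a pure insertion marks nothing on the old side, so it will not reproduce the expected output for every sample row.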

Tags: python difflib


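【Solution 1】:

Question: I'm looking for word-level highlighting

Implement a class Comma_HtmlDiff that extends the highlighting to the comma boundaries.
To do this you have to override difflib.ndiff.

Note: only the expansion of the first highlighted segment is implemented.
If difflib.ndiff highlights a comma, that is not corrected.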

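import difflib

class Comma_HtmlDiff(difflib.HtmlDiff):
    def __init__(self, tabsize=8, wrapcolumn=None, linejunk=None,
                 charjunk=difflib.IS_CHARACTER_JUNK):
        # HtmlDiff looks up the module-level difflib.ndiff, so replace it with
        # self.ndiff. Save the original only once; otherwise a second instance
        # would store the patched version as _ndiff and recurse forever.
        if not hasattr(difflib, '_ndiff'):
            setattr(difflib, '_ndiff', difflib.ndiff)
        setattr(difflib, 'ndiff', self.ndiff)
        super().__init__(tabsize, wrapcolumn, linejunk, charjunk)

    def ndiff(self, a, b, linejunk=None, charjunk=difflib.IS_CHARACTER_JUNK):
        _line = ''  # the most recent '-' or '+' line
        for line in difflib._ndiff(a, b, linejunk, charjunk):
            if line.startswith('-'):
                _d = '-'
                _line = line
            elif line.startswith('+'):
                _d = '+'
                _line = line

            # '?' lines carry the character markers for the line before them
            if line.startswith('?'):
                # position of the first marker
                dp = line.find(_d)
                if dp == -1:
                    _d = '+'
                    dp = line.find('^')
                # left comma boundary of the field that contains the marker
                dpl = _line.rfind(',', 0, dp)
                if dpl == -1:
                    dpl = 2  # first field starts after the 2-character prefix
                else:
                    dpl += 1
                # right comma boundary
                dpr = _line.find(',', dp)
                if dpr == dp:
                    # the marker sits on a comma: blank it out
                    _d = ' '
                    dpl = dp
                    dpr = dp+1

                # fill the whole field with the marker character
                dpw = dpr - dpl
                line = line[:dpl] + _d*dpw + line[dpr:]

            yield line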

# Usage
diff = Comma_HtmlDiff().make_file(fromlines, tolines, fromfile,
                                    tofile, context=True,
                                    numlines=0)

Tested with Python 3.4.2
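As a separate sketch that is not part of the answer above and does not patch difflib, the rows can also be compared field by field and the changed fields wrapped in the same `diff_sub` / `diff_add` classes that appear in the question's HTML. The function name `highlight_fields` is hypothetical. The sketch assumes the two lines are already paired up, that fields contain no quoted commas, and it does not convert spaces to `&nbsp;` the way HtmlDiff does.

```python
import difflib
from html import escape

def highlight_fields(old_line, new_line, sep=","):
    """Return (old_html, new_html) with changed fields wrapped in <span> tags."""
    old_fields = old_line.rstrip("\n").split(sep)
    new_fields = new_line.rstrip("\n").split(sep)
    # Compare lists of fields, so every opcode covers whole fields.
    matcher = difflib.SequenceMatcher(None, old_fields, new_fields,
                                      autojunk=False)
    old_out, new_out = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            old_out += [escape(f) for f in old_fields[i1:i2]]
            new_out += [escape(f) for f in new_fields[j1:j2]]
        else:
            # replace, delete and insert: mark every field in the changed range
            old_out += ['<span class="diff_sub">%s</span>' % escape(f)
                        for f in old_fields[i1:i2]]
            new_out += ['<span class="diff_add">%s</span>' % escape(f)
                        for f in new_fields[j1:j2]]
    return sep.join(old_out), sep.join(new_out)

old_html, new_html = highlight_fields(
    "1004,General Knowledge,Sinson,2007,465\n",
    "1004,General Knowledge,Simpson,2007,465\n")
print(old_html)
print(new_html)
```

The two returned strings would still have to be placed into table cells by hand. The trade-off against the subclass approach is that nothing inside difflib is replaced, at the cost of writing the surrounding table yourself.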

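【Comments】:

  • Thanks for the code and the reply. If I follow this approach I run into the problem you mentioned: the whole line is split into individual words and each word ends up on a new line. I don't want the line split into multiple lines. I want the whole line left intact, with only the differing words highlighted. Is that possible?
  • Thank you very much, sir. It really helped me solve my problem. As you mention in your note, I will try to implement the comma part. I tried to upvote, but because of my low reputation it went to the moderators!
  • I'd like to ask another question (I hope I'm not repeating myself here, since nobody but you has answered my question): is it possible to use difflib the way we use Unix diff? I know difflib uses SequenceMatcher, but its output is awkward because it doesn't fit a regular report file.
  • @John: I checked man diff (GNU diffutils) 3.3 and didn't see anything about word level. Edit your question and show an example command line that does this. I can't imagine how SequenceMatcher relates to word-level output.
  • The main idea behind using difflib is that I'm trying to produce the same output as the DiffChar plugin for vimdiff on Unix (github.com/rickhowe/diffchar.vim). However, difflib's results are completely different. I know it uses a different algorithm to detect changes, but that was my whole idea. I will edit the question and try to post the input needed for this.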