Pandas - How to compare 2 CSV files and output changes

【Question title】: Pandas - How to compare 2 CSV files and output changes
【Posted】: 2019-04-10 09:55:19
【Question description】:

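Situation: I have 2 CSVs of 10k rows x 140 columns that are largely identical, and I need to identify the differences. The headers are exactly the same and the rows are almost the same (maybe 100 of the 10K have changed).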

Example

File1.csv

ID,FirstName,LastName,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444444444,3333333333
2,Jim,Hill,2222222222,1111111111,0000000000

File2.csv

ID,FirstName,LastName,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444455444,3333333333
2,Jim,Hill,2222222222,1155111111,0005500000
3,Kim,Grant,2173659851,3214569874,3698521471

OutputFile.csv

ID,FirstName,LastName,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444444444,3333333333
2,Jim,Hill,2222222222,1111111111,0005500000
3,Kim,Grant,2173659851,3214569874,3698521471

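I think I would like the output to be File2.csv, with the changes from File1.csv highlighted in some way. I'm new to Python and pandas and can't figure out where to start. I did my best to google something similar that I could adapt to my needs, but the scripts I found all seemed specific to their own situation.

If anyone knows an easier or different way, I'm all ears. I don't care how this happens, as long as I don't have to check the records one by one.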

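【Question discussion】:

Are the rows compared by position or by the ID column? Are the columns of file1 and file2 guaranteed to be the same?
Thanks for the reply! Rows are compared by the ID column, and the columns will be 100% identical.
I have posted a general answer. Could you upload the files so I can be more specific?
Try this: pypi.org/project/csvdiff

Tags: python pandas csv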

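【Solution 1】:

CSV generally does not support different fonts, but here is a solution that prints to the console using bold text and colors (note: I have only tested it on a Mac). If you are using Python 3.7+ (where dictionaries preserve insertion order), the ordered dictionary and the list of columns should not be necessary.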

from collections import OrderedDict
from csv import DictReader

class Color(object):
    GREEN = '\033[92m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    END = '\033[0m'

def load_csv(file):
    # Index by ID in order, and keep track of the original column order
    with open(file, 'r') as fp:
        reader = DictReader(fp, delimiter=',')
        rows = OrderedDict((r['ID'], r) for r in reader)
        return rows, reader.fieldnames

def print_row(row, cols, color, prefix):
    print(Color.BOLD + color + prefix + ','.join(row[c] for c in cols) + Color.END)

def print_diff(row1, row2, cols):
    row = []
    for col in cols:
        value1 = row1[col]

        if row2[col] != value1:
            row.append(Color.BOLD + Color.GREEN + value1 + Color.END)
        else:
            row.append(value1)

    print(','.join(row))

def diff_csv(file1, file2):

    rows1, cols = load_csv(file1)
    rows2, _ = load_csv(file2)

    for row_id, row1 in rows1.items():

        # Pop the matching ID row
        row2 = rows2.pop(row_id, None)

        # If not in file2, then it was added
        if not row2:
            print_row(row1, cols, Color.GREEN, '+')

        # In both files, print the diff
        else:
            print_diff(row1, row2, cols)

    # Anything remaining from file2 was removed in file1
    for row in rows2.values():
        print_row(row, cols, Color.RED, '-')
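
The remark about Python 3.7+ can be made concrete. The following is only a sketch and not part of the original answer: it assumes the key column is called `ID`, as in the question, and the helper name `load_csv_plain` and the sample data are made up for illustration.

```python
import csv
import os
import tempfile

def load_csv_plain(path, key="ID"):
    # Plain dicts keep insertion order on Python 3.7+, so no OrderedDict is needed.
    with open(path, newline="") as fp:
        reader = csv.DictReader(fp)
        rows = {row[key]: row for row in reader}
        return rows, reader.fieldnames

# Illustrative data only: write a tiny CSV to a temporary file and load it back.
sample = "ID,FirstName,LastName\n2,Jim,Hill\n1,Bob,Jones\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write(sample)
rows, cols = load_csv_plain(tmp.name)
os.remove(tmp.name)

print(list(rows))  # IDs in file order
print(cols)        # column names in header order
```

The rows come back keyed by ID in file order, and `reader.fieldnames` still gives the header order, which is what the printing helpers above rely on.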

【Discussion】:

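【Solution 2】:

This can be done simply with Python's built-in csv library. If you also care about the order of the entries, you can use an OrderedDict to preserve the original file order.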

import csv

merged = []  # rows from file1, followed by rows that exist only in file2

with open('file1.csv', newline='') as f1, open('file2.csv', newline='') as f2:
    reader1 = csv.reader(f1, delimiter=',')
    reader2 = csv.reader(f2, delimiter=',')
    for line in reader1:
        merged.append(line)                  # For the first file, add them all
    for line in reader2:
        # For the second file, only add a row if no entry with the same
        # first column (the ID) is already present
        if not any(e[0] == line[0] for e in merged):
            merged.append(line)
            continue
        for e in merged:
            if e[0] == line[0]:
                # Positions of the columns whose values differ between the files
                changed_indexes = [k for k, (i, j) in enumerate(zip(e, line)) if i != j]
                for k in changed_indexes:
                    e[k] = e[k] + 'c'        # Mark a changed value with a trailing 'c'

# Write the merged rows into another CSV
with open('results.csv', 'w', newline='') as f3:
    c3 = csv.writer(f3, quoting=csv.QUOTE_ALL)
    for line in merged:
        c3.writerow(line)

As for bold text, there is no way to do that in a CSV, because a CSV file contains data and no formatting information.
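
Because a CSV cannot carry formatting, one possible workaround is to record the changes as data. The sketch below is an illustration rather than the answerer's method: it keeps the rows of the second file and appends a hypothetical `Changed` column naming the fields that differ from the first file. The function name `annotate_changes`, the `Changed` column, and the `NEW` marker are assumptions made for this example; the sample rows are the ones from the question.

```python
import csv
import io

def annotate_changes(old_text, new_text, key="ID"):
    """Return the rows of new_text plus a 'Changed' column naming the fields that differ."""
    old_rows = {r[key]: r for r in csv.DictReader(io.StringIO(old_text))}
    reader = csv.DictReader(io.StringIO(new_text))
    cols = reader.fieldnames
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=cols + ["Changed"], lineterminator="\n")
    writer.writeheader()
    for row in reader:
        old = old_rows.get(row[key])
        if old is None:
            row["Changed"] = "NEW"  # this ID does not exist in the old file
        else:
            # Names of the fields whose values differ, joined with ';' to fit one cell
            row["Changed"] = ";".join(c for c in cols if row[c] != old.get(c))
        writer.writerow(row)
    return out.getvalue()

file1 = (
    "ID,FirstName,LastName,Phone1,Phone2,Phone3\n"
    "1,Bob,Jones,5555555555,4444444444,3333333333\n"
    "2,Jim,Hill,2222222222,1111111111,0000000000\n"
)
file2 = (
    "ID,FirstName,LastName,Phone1,Phone2,Phone3\n"
    "1,Bob,Jones,5555555555,4444455444,3333333333\n"
    "2,Jim,Hill,2222222222,1155111111,0005500000\n"
    "3,Kim,Grant,2173659851,3214569874,3698521471\n"
)

print(annotate_changes(file1, file2))
```

Filtering the result on a non-empty `Changed` cell then gives only the rows that need attention, which avoids checking records one by one.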

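【Discussion】:

Does this compare the files row by row? My rows can differ, so I want to output the differences between them. Thanks for the reply!
Yes, each entry in the t1 and t2 OrderedDicts is an array
OK. Honestly, I don't know how to find the differences. I know I should do some of the work myself, so if you can point me in the right direction I'll happily look into it. Before commenting I googled but could not find how to compare arrays and return the differences. What I found was line-by-line comparison, which won't work.
OK, so I looked at what you want and rewrote my code to do it. I added comments so you can understand it better @ChadBelerique
Thanks @abhishek, I got it working, but I can't tell which rows changed. Is there an indicator or something?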

【Solution 3】:

A second way:

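import numpy as np
import pandas as pd

# df1 and df2 are assumed to be identically labeled DataFrames, e.g. loaded with pd.read_csv(path, index_col='ID')
# get indices of differences
difference_locations = np.where(df1 != df2)
# define reference
changed_from = df1.values[difference_locations]
changed_to = df2.values[difference_locations]
# `changed` was undefined in the snippet; assumed here to be the stacked mask of changed cells, indexed by (row label, column)
changed = (df1 != df2).stack()
changed = changed[changed]

df_differences = pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)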
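
Note that `df1 != df2` only works when both frames have identical row and column labels, while File2.csv in the question contains an extra row. A hedged sketch of one way around this, assuming pandas 1.1 or later for `DataFrame.compare`: index both frames by `ID`, compare only the IDs they share, and report the remaining IDs separately. The variable names and the in-memory sample data (copied from the question) are for illustration only.

```python
import io
import pandas as pd

file1 = io.StringIO(
    "ID,FirstName,LastName,Phone1,Phone2,Phone3\n"
    "1,Bob,Jones,5555555555,4444444444,3333333333\n"
    "2,Jim,Hill,2222222222,1111111111,0000000000\n"
)
file2 = io.StringIO(
    "ID,FirstName,LastName,Phone1,Phone2,Phone3\n"
    "1,Bob,Jones,5555555555,4444455444,3333333333\n"
    "2,Jim,Hill,2222222222,1155111111,0005500000\n"
    "3,Kim,Grant,2173659851,3214569874,3698521471\n"
)

# dtype=str keeps values such as 0000000000 from being parsed as the number 0
df1 = pd.read_csv(file1, dtype=str).set_index("ID")
df2 = pd.read_csv(file2, dtype=str).set_index("ID")

common = df1.index.intersection(df2.index)   # IDs present in both files
added = df2.index.difference(df1.index)      # IDs only in the second file
removed = df1.index.difference(df2.index)    # IDs only in the first file

# compare() requires identical labels, hence the restriction to the shared IDs
changes = df1.loc[common].compare(df2.loc[common])

print(changes)
print("added:", list(added))
print("removed:", list(removed))
```

In the result, `self` holds the value from the first file and `other` the value from the second; cells that did not change are shown as missing values.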

【Discussion】: