Comparing 2 Huge csv Files in Python: Answers

【Question title】: Comparing 2 Huge csv Files in Python
【Posted】: 2021-05-05 04:02:33
【Question description】:

I have 2 csv files.

File 1:

EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vinoth,12,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"

File 2:

EmployeeName,Age,Salary,Address
Karthick,10,10.245,"140,North Street,India"
Vivek,20,2000,"USA"
Vinoth,12,2548.245,"140,North Street,India"

I want to compare the two files and report the differences in another csv file. I used the Python code below (version 2.7):

#!/usr/bin/env python
import difflib
import csv

with open('./Input/file1', 'r' ) as t1:
    fileone = t1.readlines()
with open('./Input/file2', 'r' ) as t2:
    filetwo = t2.readlines()

with open('update.csv', 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)

    for line in fileone:
        if line not in filetwo:
            outFile.write(line)

When I run it, this is the output I get:

Actual output

Vivek,20,2000,"USA"

But my expected output is the following, because the "Vinoth" record appears twice in file1 but only once in file2.

Expected output

Vinoth,12,2548.245,"140,North Street,India"
Vivek,20,2000,"USA"

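Question

Please tell me how to get the expected output.
Also, how can I write the file name and line number of each differing record to the output file?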

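【Question discussion】:

A few questions: 1) Are the files larger than the available memory? 2) How many GB of data does each file hold?
I don't understand your criterion. If Karthick is not in your new file, why should Vinoth be? Can you explain?
@JavierLópezTomás Karthick appears once in each file, whereas Vinoth appears once in file2 but twice in file1. He also wants to take into account how many times a row occurs.
@FredrikHedman Yes, the files are huge. About 3.5 GB.

Tags: python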

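【Solution 1】:

The problem you are running into is that the in keyword only checks whether an item is present, not whether it is present twice. If you are willing to use an external package, you can do this quickly with pandas.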

import pandas as pd

df1 = pd.read_csv('Input/file1.csv')
df2 = pd.read_csv('Input/file2.csv')

# create a new column with the count of how many times the row exists
df1['count'] = 0
df2['count'] = 0
df1['count'] = df1.groupby(df1.columns.to_list()[:-1]).cumcount() + 1
df2['count'] = df2.groupby(df2.columns.to_list()[:-1]).cumcount() + 1

# merge the two data frames with an outer join, add an indicator variable
# to show where each row (including the count) exists.
df_all = df1.merge(df2, on=df1.columns.to_list(), how='outer', indicator='exists')
print(df_all)
# prints:
#   EmployeeName  Age    Salary                 Address  count      exists
# 0       Vinoth   12  2548.245  140,North Street,India      1        both
# 1       Vinoth   12  2548.245  140,North Street,India      2   left_only
# 2     Karthick   10    10.245  140,North Street,India      1        both
# 3        Vivek   20  2000.000                     USA      1  right_only

# clean up the exists column and export the rows that do not exist in both frames
df_all['exists'] = (df_all.exists.str.replace('left_only', 'file1')
                                 .str.replace('right_only', 'file2'))
df_all.query('exists != "both"').to_csv('update.csv', index=False)
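
As a quick illustration of the membership-versus-count point made at the top of this answer, here is a minimal sketch that uses only the standard library and the sample rows from the question (the header is left out). The variable names are made up for this example, and `Counter` subtraction is one way to express the idea, not necessarily how you would process the real files.

```python
from collections import Counter

# Sample rows from the question, header omitted (names here are illustrative).
file1 = [
    'Vinoth,12,2548.245,"140,North Street,India"',
    'Vinoth,12,2548.245,"140,North Street,India"',
    'Karthick,10,10.245,"140,North Street,India"',
]
file2 = [
    'Karthick,10,10.245,"140,North Street,India"',
    'Vivek,20,2000,"USA"',
    'Vinoth,12,2548.245,"140,North Street,India"',
]

# Membership test: every Vinoth row in file1 is "found" in file2,
# so the extra copy is never reported.
by_membership = [row for row in file1 if row not in file2]

# Multiset difference: Counter subtraction keeps only positive surplus counts.
only_in_1 = Counter(file1) - Counter(file2)
only_in_2 = Counter(file2) - Counter(file1)

print(by_membership)
print(list(only_in_1.elements()))
print(list(only_in_2.elements()))
```

The membership list comes back empty, while the two `Counter` differences hold the surplus Vinoth row and the Vivek row respectively.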

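Edit: non-pandas version

You can use each row as a key and its count as the value, then compare the counts of identical rows between the two files.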

from collections import defaultdict

c1 = defaultdict(int)
c2 = defaultdict(int)

with open('./Input/file1', 'r' ) as t1:
    for line in t1:
        c1[line.strip()] += 1

with open('./Input/file2', 'r' ) as t2:
    for line in t2:
        c2[line.strip()] += 1

# create a set of all rows
all_keys = set()
all_keys.update(c1)
all_keys.update(c2)

# find the difference in the number of instances of the row
out = []
for k in all_keys:
    diff = c1[k] - c2[k]
    if diff == 0:
        continue
    if diff > 0:
        out.extend([k + ',file1'] * diff) # add which file it came from
    if diff < 0:
        out.extend([k + ',file2'] * abs(diff)) # add which file it came from

with open('update.csv', 'w') as outFile:
    outFile.write('\n'.join(out))
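
The question also asks for line numbers, which the count-based version above does not track. Below is a rough sketch of one way to add them, assuming Python 3 and files with a single header row. The function names `row_positions` and `diff_with_positions` are hypothetical, and treating the first matching occurrences as "paired" (so the later ones are reported) is an arbitrary choice made for this example. Note that it stores every line number, so it needs more memory than the plain counts.

```python
import csv
from collections import defaultdict

def row_positions(path):
    """Map each parsed row (as a tuple) to the line numbers where it occurs."""
    positions = defaultdict(list)
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        for row in reader:
            # reader.line_num is the number of the last physical line read
            positions[tuple(row)].append(reader.line_num)
    return positions

def diff_with_positions(path1, path2, out_path):
    """Write surplus rows, followed by their source file and line number."""
    p1, p2 = row_positions(path1), row_positions(path2)
    with open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        for name, mine, other in ((path1, p1, p2), (path2, p2, p1)):
            for row, lines in mine.items():
                # Assumption: the first len(other[row]) occurrences are matched,
                # so only the remaining occurrences are reported.
                matched = len(other.get(row, ()))
                for line_no in lines[matched:]:
                    writer.writerow(list(row) + [name, line_no])
```

With the two sample files from the question, calling `diff_with_positions('./Input/file1', './Input/file2', 'update.csv')` would report the second Vinoth row (line 3 of file1) and the Vivek row (line 3 of file2). Using `csv.reader` rather than raw lines also means quoted fields that contain commas are parsed as single values.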

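【Discussion】:

We don't have the pandas module. Is there a way to do this without an external package?
Sure, see the updated answer. The collections module is part of the standard library.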
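
For files of the size mentioned in the question's comments (about 3.5 GB), a dictionary keyed by every distinct line may not fit in memory. One alternative, sketched here under the assumption that both files have already been sorted line by line in the same order Python uses for comparison (for example with an external byte-wise sort), is to walk the two files in parallel like a merge step. The function name `diff_sorted` and the `'file1'`/`'file2'` labels are made up for this example, and records that span several physical lines are not handled.

```python
def diff_sorted(lines1, lines2):
    """Yield (line, source) for lines that occur more often in one sorted
    input than in the other. Both inputs must be sorted in the same order."""
    it1, it2 = iter(lines1), iter(lines2)
    a, b = next(it1, None), next(it2, None)
    while a is not None and b is not None:
        if a == b:
            # Same line on both sides: one occurrence cancels one occurrence.
            a, b = next(it1, None), next(it2, None)
        elif a < b:
            yield a, 'file1'
            a = next(it1, None)
        else:
            yield b, 'file2'
            b = next(it2, None)
    # Whatever is left on either side has no counterpart.
    while a is not None:
        yield a, 'file1'
        a = next(it1, None)
    while b is not None:
        yield b, 'file2'
        b = next(it2, None)
```

Because it only ever holds one line from each input, the generator can be fed two open file objects directly, for example `diff_sorted(open(sorted1, 'rb'), open(sorted2, 'rb'))`, where `sorted1` and `sorted2` are placeholder paths to the pre-sorted copies. Identical header lines cancel out like any other matching pair.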

【Solution 2】:

Use the pandas compare method

import pandas as pd

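# file names must be passed as strings
f1 = pd.read_csv('file_1.csv')
f2 = pd.read_csv('file_2.csv')

# compare() needs both frames to have identical row and column labels
changed = f1.compare(f2)
change = f1[f1.index.isin(changed.index)]
print(change)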
