Removing duplicate rows from a csv file using a python script: answers

【Question title】: Removing duplicate rows from a csv file using a python script
【Posted】: 2026-01-23 10:00:02
【Question description】:

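Goal

I downloaded a CSV file from Hotmail, but it has a lot of duplicates in it. The duplicates are complete copies, and I don't know why my phone created them.

I want to get rid of the duplicates.

Approach

Write a Python script to remove the duplicates.

Technical specification

Windows XP SP 3
Python 2.7
CSV file with 400 contacts

【Question discussion】:

Tags: python file-io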

【Solution 1】:

Update: 2016

If you are happy to use the helpful more_itertools external library:

from more_itertools import unique_everseen
with open('1.csv', 'r') as f, open('2.csv', 'w') as out_file:
    out_file.writelines(unique_everseen(f))

A more efficient version of @IcyFlame's solution:

with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue # skip duplicate

        seen.add(line)
        out_file.write(line)

To edit the same file in place you could use this (old Python 2 code):

import fileinput
seen = set() # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line in seen: continue # skip duplicate

    seen.add(line)
    print line, # standard output is now redirected to the file
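
For Python 3, the in-place snippet above could be written roughly as follows. This is a minimal sketch rather than part of the original answer: the function name dedupe_in_place and the temporary demo file are made up for illustration. As in the Python 2 version, fileinput redirects standard output into the file while the loop runs.

```python
import fileinput
import os
import tempfile


def dedupe_in_place(path):
    """Rewrite the file at `path`, keeping only the first copy of each line."""
    seen = set()  # set for fast O(1) amortized lookup
    with fileinput.input(path, inplace=True) as f:
        for line in f:
            if line in seen:
                continue  # skip duplicate
            seen.add(line)
            # `line` already ends with a newline, so suppress print's own
            print(line, end='')


# Demo on a throwaway file (hypothetical contents)
fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'w') as tmp:
    tmp.write('a,1\nb,2\na,1\n')

dedupe_in_place(path)

with open(path) as check:
    result = check.read()
os.remove(path)
print(repr(result))
```

Using end='' matters here: a plain print(line) would add a second newline after every row and leave blank lines between them.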

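【Discussion】:

Hi @jamylak, in the option that edits the same file (the third option), how do I remove the blank lines between the rows that have values?
@BrondbyIF The third snippet is for Python 2, where print line, (with the trailing comma) prints without adding a newline at the end. In Python 3 you can use print(line, end='')
Thanks a lot, friend!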

【Solution 2】:

You can do the deduplication efficiently with Pandas. Install pandas with:

pip install pandas

Code

import pandas as pd
file_name = "my_file_with_dupes.csv"
file_name_output = "my_file_without_dupes.csv"

df = pd.read_csv(file_name, sep=",")  # pass sep="\t" instead for a tab-separated file

# Notes:
# - the `subset=None` means that every column is used 
#    to determine if two rows are different; to change that specify
#    the columns as an array
# - the `inplace=True` means that the data structure is changed and
#   the duplicate rows are gone  
df.drop_duplicates(subset=None, inplace=True)

# Write the results to a different file
df.to_csv(file_name_output, index=False)

【Discussion】:

I get UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 28: invalid start byte when trying to open the file
@ykombinator You can pass the `encoding` parameter to the `read_csv` function; see docs.python.org/3/library/codecs.html#standard-encodings
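
To illustrate the UnicodeDecodeError mentioned in the comments: byte 0x96 is not valid as the start of a UTF-8 sequence, but it is an en dash in Windows-1252 (cp1252). The sketch below assumes the file was saved in cp1252. That is only a guess, since the thread does not say which encoding the file really uses, and the sample bytes are made up. If the guess is right, the same name can be passed to pandas as pd.read_csv(file_name, encoding='cp1252').

```python
# Hypothetical row containing the byte from the error message
raw = b'Smith \x96 Jones,555-0100\n'

try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    # Same failure that read_csv reports with its default encoding
    print('utf-8 failed:', exc.reason)

# Assumption: the file is Windows-1252, where 0x96 is an en dash (U+2013)
text = raw.decode('cp1252')
print(ascii(text))
```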

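【Solution 3】:

You can use the following script:

Preconditions:

1.csv is the file that contains the duplicates
2.csv is the output file, which will be free of duplicates once this script has run.

Code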



inFile = open('1.csv','r')
outFile = open('2.csv','w')

# every line already written, in order of first appearance
listLines = []

for line in inFile:
    if line in listLines:
        # duplicate: skip it
        continue
    else:
        outFile.write(line)
        listLines.append(line)

outFile.close()
inFile.close()

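Explanation of the algorithm

Here is what I am doing:

Open the file that has the duplicates in read mode.
Then, in a loop that runs until the end of the file, check whether the line has already been encountered.
If it has been encountered, don't write it to the output file.
If not, write it to the output file and add it to the list of lines already encountered.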
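
One edge case the steps above do not cover: if the last line of the file has no trailing newline, it will not compare equal to an earlier, otherwise identical line that does have one. A minimal sketch of one way to handle that, assuming it is acceptable to normalize every line ending to '\n' (the helper name dedupe_lines and the sample rows are made up for illustration, and a set is used instead of the list above):

```python
def dedupe_lines(lines):
    """Yield each distinct line once, in first-seen order."""
    seen = set()
    for line in lines:
        # Compare the content only, ignoring the line terminator
        key = line.rstrip('\r\n')
        if key in seen:
            continue
        seen.add(key)
        yield key + '\n'


# The last row has no trailing newline but is still treated as a duplicate
rows = ['a,1\n', 'b,2\n', 'a,1']
print(list(dedupe_lines(rows)))
```

It can be used with files as out_file.writelines(dedupe_lines(in_file)).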

【Discussion】:

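【Solution 4】:

I know this was settled long ago, but I had a closely related problem: I needed to remove duplicates based on one column. The input csv file was too large to open on my PC with MS Excel/Libre Office Calc/Google Sheets: 147MB with about 2.5 million records. Since I didn't want to install a whole external library for such a simple task, I wrote the python script below, which did the job in less than 5 minutes. I didn't focus on optimization, but I believe it could be optimized to run faster and more efficiently on even bigger files. The algorithm is similar to @IcyFlame's above, except that I remove duplicates based on a column ('CCC') instead of the whole row/line.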

import csv

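# output opened with 'w' rather than 'a', so a rerun does not append a second header and a second copy of the rows
with open('results.csv', 'r') as infile, open('unique_ccc.csv', 'w') as outfile: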
    # this list will hold unique ccc numbers,
    ccc_numbers = []
    # read input file into a dictionary, there were some null bytes in the infile
    results = csv.DictReader(infile)
    writer = csv.writer(outfile)

    # write column headers to output file
    writer.writerow(
        ['ID', 'CCC', 'MFLCode', 'DateCollected', 'DateTested', 'Result', 'Justification']
    )
    for result in results:
        ccc_number = result.get('CCC')
        # if the value is already in the list, skip writing its whole row to the output file
        if ccc_number in ccc_numbers:
            continue
        writer.writerow([
            result.get('ID'),
            ccc_number,
            result.get('MFLCode'),
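            # lowercase key, unlike the 'DateCollected' header written above;
            # it must match the input file's header exactly, otherwise get() returns None
            result.get('datecollected'),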
            result.get('DateTested'),
            result.get('Result'),
            result.get('Justification')
        ])

        # add the value to the list so that later rows with it are skipped
        ccc_numbers.append(ccc_number)
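
The same idea can be sketched without hard-coding the column names, and with a set instead of a list so that each lookup stays fast on millions of rows. This is an illustrative variant, not the author's script: the function name dedupe_by_column and the sample data are made up, and it assumes every row has the same columns as the header.

```python
import csv
import io


def dedupe_by_column(infile, outfile, key):
    """Copy rows from infile to outfile, keeping the first row for each value of `key`."""
    reader = csv.DictReader(infile)
    # Reuse the input header so no column names need to be listed by hand
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    seen = set()  # values of `key` that have already been written
    for row in reader:
        value = row[key]
        if value in seen:
            continue
        seen.add(value)
        writer.writerow(row)


# Made-up sample: rows 1 and 3 share the same CCC value
src = io.StringIO("ID,CCC,Result\n1,A,pos\n2,B,neg\n3,A,neg\n")
dst = io.StringIO()
dedupe_by_column(src, dst, 'CCC')
print(dst.getvalue())
```

With real files in Python 3, the csv module documentation recommends opening both files with newline='' before passing them in.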

【Discussion】:

【Solution 5】:

A more efficient version of @jamylak's solution (one less instruction):

with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line not in seen: 
            seen.add(line)
            out_file.write(line)

To edit the same file in place you could use this (Python 2 syntax; in Python 3 write print(line, end='') instead):

import fileinput
seen = set() # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line not in seen:
        seen.add(line)
        print line, # standard output is now redirected to the file

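【Discussion】:

Frankly, I'm not sure why you would use fileinput to open the file and read its contents back into a frozenset. fileinput is meant for editing the file you have opened, and your example does not do that.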

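【Solution 6】:

You can use the pandas library in a Jupyter notebook or a similar IDE. I import pandas into the notebook and read the csv file.

Then sort the values by the columns that contain the duplicates. Since I specified two attributes, it sorts first by time and then by latitude.

Then drop the duplicates found in the time column, or in whichever column is relevant to you.

Then I store the sorted, deduplicated data as gps_sorted.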

import pandas as pd
stock=pd.read_csv("C:/Users/Donuts/GPS Trajectory/go_track_trackspoints.csv")
stock2=stock.sort_values(["time","latitude"],ascending=True)
# drop_duplicates returns a new DataFrame, so the result has to be assigned
stock2=stock2.drop_duplicates(subset=['time'])
stock2.to_csv("C:/Users/Donuts/gps_sorted.csv")

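Hope this helps.

【Discussion】:

I wouldn't use Pandas for this kind of thing, because depending on the file size the I/O operations can cause performance problems.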