C# - Remove duplicate lines within a text file

【Question title】: C# - Remove duplicate lines within a text file
【Posted】: 2011-09-17 06:39:10
【Question】:

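Could someone demonstrate how to check a file for duplicate lines and then remove any duplicates, either by overwriting the existing file or by creating a new file with the duplicate lines removed?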

【Comments】:

@Felice Pollano No mate, unless I'm a 28-year-old student :D
OK, but either way you're asking for the work to be done for you...

【Solution 1】:

If you're using .NET 4, you can use a combination of File.ReadLines and File.WriteAllLines:

var previousLines = new HashSet<string>();

File.WriteAllLines(destinationPath, File.ReadLines(sourcePath)
                                        .Where(line => previousLines.Add(line)));

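This works in much the same way as LINQ's Distinct method, with one important difference: the documentation does not guarantee that Distinct returns its output in the same order as the input sequence. Filtering with an explicit HashSet<T> does provide that guarantee, because Where emits lines in the order they are read and the set is used only to test whether a line has been seen before.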
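
To make the ordering argument concrete, here is a minimal sketch that applies the same idea to an in-memory sequence instead of a file. The class name DedupDemo and the method DistinctInOrder are made up for illustration and are not part of the answer. It relies on HashSet<string>.Add returning false when the item is already present, so each line is passed through only the first time it appears. The set itself is never enumerated, so its internal ordering does not matter.

```csharp
using System;
using System.Collections.Generic;

public static class DedupDemo
{
    // Hypothetical helper: yields each line the first time it is seen,
    // in the order the input supplies them.
    public static IEnumerable<string> DistinctInOrder(IEnumerable<string> lines)
    {
        var seen = new HashSet<string>();
        foreach (var line in lines)
        {
            // Add returns true only when the line was not already in the set.
            if (seen.Add(line))
                yield return line;
        }
    }

    public static void Main()
    {
        var input = new[] { "b", "a", "b", "c", "a" };
        Console.WriteLine(string.Join(",", DistinctInOrder(input)));  // prints b,a,c
    }
}
```

Because the helper is an iterator method, every enumeration starts with a fresh set, which avoids the pitfall of reusing one captured HashSet across several passes.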

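【Discussion】:

HashSet doesn't preserve insertion order. I mean, in some cases it looks like it does, but it isn't guaranteed. You can read about it here: docs.microsoft.com/it-it/dotnet/api/…

【Solution 2】: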

File.WriteAllLines(topath, File.ReadAllLines(frompath).Distinct().ToArray());

EDIT: modified to work in .NET 3.5

【Discussion】:

【Solution 3】:

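Pseudocode, written out as C# (path is a placeholder for the file's location):

var list = new List<string>();

// Read the file, keeping only the first occurrence of each line.
foreach (var line in File.ReadLines(path))
{
    if (!list.Contains(line))
        list.Add(line);
}

// The reader is closed once the loop ends, so the same file can be rewritten.
File.WriteAllLines(path, list);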

【Discussion】:

【Solution 4】:

// Requires .NET 3.5
private void RemoveDuplicate(string sourceFilePath, string destinationFilePath)
{
    var readLines = File.ReadAllLines(sourceFilePath, Encoding.Default);

    File.WriteAllLines(destinationFilePath, readLines.Distinct().ToArray(), Encoding.Default);
}

【Discussion】:

【Solution 5】:

How big a file are we talking about?

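One strategy is to read one line at a time and load the lines into a data structure that makes it easy to check for an existing item, such as a HashSet<int>. Each line of the file can be hashed with GetHashCode(), and only the known hashes need to be checked. Equal strings always produce equal hash codes, but two different lines can also share a hash code, so comparing hashes alone can occasionally drop a line that is not a duplicate; storing the lines themselves in a HashSet<string> avoids that at the cost of more memory. So, something like: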

var known = new HashSet<int>();
using (var dupe_free = new StreamWriter(@"c:\path\to\dupe_free.txt"))
{
    foreach (var line in File.ReadLines(@"c:\path\to\has_dupes.txt"))
    {
        // Only the hash code is stored, so colliding lines are treated as duplicates.
        var hash = line.GetHashCode();
        if (known.Add(hash))   // Add returns false if the hash is already known
        {
            dupe_free.WriteLine(line);
        }
    }
}
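
The loop above compares hash codes only and writes to a second file. As a hedged alternative, the sketch below stores the lines themselves in a HashSet<string>, so two different lines are never mistaken for duplicates, and it handles the "overwrite the existing file" case from the question by writing to a temporary file first. The class InPlaceDedup, the method RemoveDuplicateLines, and the ".tmp" suffix are assumptions made for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class InPlaceDedup
{
    // Hypothetical helper: rewrites the file at `path` without duplicate lines,
    // keeping the first occurrence of each line in its original position.
    public static void RemoveDuplicateLines(string path)
    {
        // Assumption: a temporary sibling file next to the original is acceptable.
        string tempPath = path + ".tmp";
        var seen = new HashSet<string>();

        using (var writer = new StreamWriter(tempPath))
        {
            foreach (var line in File.ReadLines(path))
            {
                // The set compares the strings themselves, not just their hash codes.
                if (seen.Add(line))
                    writer.WriteLine(line);
            }
        }

        // Both files are closed at this point, so the original can be replaced.
        File.Delete(path);
        File.Move(tempPath, path);
    }
}
```

Calling InPlaceDedup.RemoveDuplicateLines with the path of the file to clean is all that is needed. The trade-off is memory: the set holds every distinct line, which is usually fine but worth keeping in mind for very large files.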

Alternatively, you can use LINQ's Distinct() method and do it in one line, as Blindy suggested:

File.WriteAllLines(@"c:\path\to\dupe_free.txt", File.ReadAllLines(@"c:\path\to\has_dupes.txt").Distinct().ToArray());

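【Discussion】:

@LukeH Right, which is why my primary answer reads and writes them in a hand-written loop; the HashSet is a cheap way of doing the lookups, and using GetHashCode guarantees correct order and uniqueness.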