Efficient way to read a large tab-delimited txt file? Answers

【Question title】: Efficient way to read large tab delimited txt file?
【Posted】: 2011-05-18 21:32:13
【Question】:

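I have a tab-delimited txt file with 500K records. I'm using the code below to read the data into a DataSet. With 50K records it works fine, but with 500K it fails with "Exception of type 'System.OutOfMemoryException' was thrown."

What is a more efficient way to read large tab-delimited data? Or how can I resolve this issue? Please give an example.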

public DataSet DataToDataSet(string fullpath, string file)
{
    string sql = "SELECT * FROM " + file; // Read all the data
    OleDbConnection connection = new OleDbConnection // Connection
                  ("Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + fullpath + ";"
                   + "Extended Properties=\"text;HDR=YES;FMT=Delimited\"");
    OleDbDataAdapter ole = new OleDbDataAdapter(sql, connection); // Load the data into the adapter
    DataSet dataset = new DataSet(); // To hold the data
    ole.Fill(dataset); // Fill the dataset with the data from the adapter
    connection.Close(); // Close the connection
    connection.Dispose(); // Dispose of the connection
    ole.Dispose(); // Get rid of the adapter
    return dataset;
}

【Question comments】:

Tags: c# file-io csv

【Solution 1】:

Use a streaming approach with TextFieldParser - that way you won't load the whole file into memory in one go.
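
The answer gives no code, so here is a minimal sketch of what streaming with TextFieldParser might look like. The class and method names (`TabFileReader`, `ReadRecords`) are hypothetical, and it assumes the file uses a tab as its only delimiter and may contain quoted values. On .NET Framework the project needs a reference to Microsoft.VisualBasic.dll.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class TabFileReader
{
    // Lazily yields one parsed record at a time, so only the current
    // row is held in memory rather than the whole file.
    public static IEnumerable<string[]> ReadRecords(string path)
    {
        using (var parser = new TextFieldParser(path))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters("\t");
            // Lets a tab appear inside a double-quoted value.
            parser.HasFieldsEnclosedInQuotes = true;

            while (!parser.EndOfData)
            {
                // ReadFields parses the next line into its fields.
                yield return parser.ReadFields();
            }
        }
    }
}
```

A caller would iterate with `foreach (string[] fields in TabFileReader.ReadRecords(path))` and handle each row as it arrives. Note that the header line, if the file has one, comes back as the first record, and that collecting every row into a DataSet or list would bring the original memory problem back.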

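【Comments】:

From your link: "Consequently, TextFieldParser is presented as a cumbersome and slow solution that is best left in its hard-to-find namespace."
@hemp - I linked it to show the usage in C#. It is indeed "a cumbersome and slow solution" when compared to string.Split, but that is hardly a fair comparison. The article does not offer any other parser as a comparison.
@hemp: It may be "cumbersome and slow." But it works, and it avoids many of the problems you'll run into with a hand-rolled delimited text file parser. I'll take "cumbersome and slow, but works" over "fast and wrong" any day.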

【Solution 2】:

You really want to enumerate the source file and process one line at a time. I use the following:

    public static IEnumerable<string> EnumerateLines(this FileInfo file)
    {
        using (var stream = File.Open(file.FullName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (var reader = new StreamReader(stream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }

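Then, for each line, you can split it on the tab character and process it before moving on to the next one. This keeps the memory needed for parsing very low: memory is used only when the application actually needs it.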
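
The per-line split step could be sketched as follows. This is an illustration, not the answerer's code: `TabLineProcessor`, `SplitLines` and `CountRows` are hypothetical names, and File.ReadLines (available from .NET 4) stands in for the EnumerateLines helper so the block is self-contained. It assumes a plain file with no quoted values, since string.Split does not understand quoting.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class TabLineProcessor
{
    // Streams the file line by line and yields each line's
    // tab-separated fields; no more than one line is buffered.
    public static IEnumerable<string[]> SplitLines(string path, bool skipHeader)
    {
        bool first = true;
        foreach (string line in File.ReadLines(path))
        {
            if (first && skipHeader)
            {
                first = false;
                continue; // drop the header row
            }
            first = false;
            yield return line.Split('\t');
        }
    }

    // Example consumer: counts data rows without holding the file in memory.
    public static long CountRows(string path)
    {
        long count = 0;
        foreach (string[] fields in SplitLines(path, skipHeader: true))
        {
            count++; // replace with real per-row work
        }
        return count;
    }
}
```

The point of the design is that the consumer decides what to keep. Writing each row to a database or aggregating it on the fly keeps memory flat, whereas adding every row to a collection does not.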

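【Comments】:

It is much easier to write foreach (var line in File.ReadLines("filename")). That does the same thing as your EnumerateLines method.
But File.ReadLines returns a string array. The version above returns one line at a time and does not buffer the file in memory.
No, File.ReadLines returns an enumerator that reads the file one line at a time. File.ReadAllLines is the one that returns a string array.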

【Solution 3】:

Have you tried TextReader?

  using (TextReader tr = File.OpenText(YourFile))
  {
      string strLine;
      while ((strLine = tr.ReadLine()) != null)
      {
           // Split the current line into columns on the tab character
           string[] arrColumns = strLine.Split('\t');
           // Start filling your DataSet, or do whatever you want with your data
      }
      // No explicit tr.Close() is needed: the using block disposes the reader
  }

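【Comments】:

How would this reduce memory consumption?
Also, this solution does not handle quoted values, which are what allow tab characters to appear inside a value.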

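【Solution 4】:

I found FileHelpers:

FileHelpers is a free and easy-to-use .NET library for importing/exporting data from fixed-length or delimited records in files, strings or streams.

Maybe it can help.

【Comments】: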