【Question Title】:Looping through File1.txt and File2.txt is really slow. Both files are 280MB
【Posted】:2012-03-14 20:18:17
【Question】:

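I have 2 large text files, each with 400,000 lines of text. For the current line of File1.txt, I need to find the line in File2.txt that contains the same userId. Once I have found the right line in File2.txt, I do some calculations and write the line to a new text file.

The code I wrote for this runs extremely slowly. I have tried rewriting it in various ways, but it just keeps running and never finishes. How can I make this fast?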

private void btnExecute_Click(object sender, EventArgs e) {        
    string line1 = "";
    string line2 = "";

    //the new text file we are creating. Located in IVR_Text_Update\bin\Debug
    StreamWriter sw = new StreamWriter("NewFile.txt");

    //the new text file which contains the registrants which need removing
    StreamWriter sw_removeRegs = new StreamWriter("RemoveRegistrants.txt");

    //address has changed so we write the line to the address file
    StreamWriter sw_addressChange = new StreamWriter("AddressChanged.txt");

    List<string> lines_secondFile = new List<string>();

    using (StreamReader sr = new StreamReader(openFileDialog2.FileName)) {
        string line;
        while ((line = sr.ReadLine()) != null) {
            lines_secondFile.Add(line);
        }
    }

    //reader for the frozen file; its declaration is missing from the post,
    //so openFileDialog1 is an assumed name
    StreamReader sr1 = new StreamReader(openFileDialog1.FileName);

    //loop through the frozen file one line at a time
    while ((line1 = sr1.ReadLine()) != null) {
        //get the line from the update file, assign it to line2
        //function accepts (userId, List)
        line2 = getLine(line1.Substring(3, 8), lines_secondFile);

        //if line2 is null then userId was not found therefore we write
        //the line to Remove Registrants file
        if (line2 == null) {
            sw_removeRegs.Write(line1 + Environment.NewLine);
        }

        //address between the two lines was found to be different so we still write
        //them to the new text file but don't update codes
        else if (line1.Substring(93, 53) != line2.Substring(93, 53)) {
            sw_addressChange.Write(line1 + Environment.NewLine);
            sw.Write(line1 + Environment.NewLine);
        }

        //test for null then write the new line in our new text file
        else if ((line1 != null) && (line2 != null)) {
            sw.Write(line1.Substring(0, 608) +                    
                     line2.Substring(608, 9) +
                     line2.Substring(617, 9) +
                     line2.Substring(626, 9) +
                     line2.Substring(635, 9) +
                     line2.Substring(644, 9) +
                     line2.Substring(653, 9) +
                     line2.Substring(662, 9) +
                     line2.Substring(671, 9) +
                     line2.Substring(680, 9) +

                     line1.Substring(680, 19) + 
                     Environment.NewLine);
        }
    }

    textBox1.Text = "Finished.";
    sr1.Close();
    sw.Close();
    sw_removeRegs.Close();
    sw_addressChange.Close();
}

//returns the line from the update file which has the corresponding userId
//from the frozen file
string getLine(string userId, List<string> lines_secondFile) {

    foreach (string currentLine in lines_secondFile) {
        if (currentLine.Contains(userId)) {
            return currentLine;
        }
    }

    return null;
}

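【Comments】:

  • Disk reads take a long time. You could always write something to the console periodically so you know your application is still doing something.
  • You might want to use some self-documenting variable names; your current code is quite cryptic ;-)

Tags: c# performance optimization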


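【Solution 1】:

Disk access speed aside, your current algorithm is O(n^2): for every line in the first file, you scan the list for the user ID. You could add a cache to avoid looking up the same user ID more than once. I am assuming you have fewer than 400K users, so repeated IDs should be the common case: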

private Dictionary<string, string> userMap = new Dictionary<string, string>();

string getLine(string userId, List<string> lines_secondFile)
{
    //return the cached line if this userId was looked up before
    string cachedLine;
    if (userMap.TryGetValue(userId, out cachedLine))
        return cachedLine;

    //otherwise scan the list once and remember the result
    foreach (string currentLine in lines_secondFile)
    {
        if (currentLine.Contains(userId))
        {
            userMap.Add(userId, currentLine);
            return currentLine;
        }
    }
    return null;
}
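
The cache still scans the list the first time each ID is seen. A further step in the same direction is to build the dictionary once, up front, so that every lookup is a hash probe. The sketch below is not from the question: `UserIndex`, `Build` and `Find` are illustrative names, and it assumes the user ID occupies the same fixed columns (offset 3, length 8) in File2.txt as in File1.txt, which the original `Contains` search does not require.

```csharp
using System;
using System.Collections.Generic;

static class UserIndex
{
    // Assumption: the ID sits at the same fixed position in both files.
    const int IdStart = 3;
    const int IdLength = 8;

    // Builds a userId -> line map in one pass over the second file's lines.
    public static Dictionary<string, string> Build(IEnumerable<string> lines)
    {
        var map = new Dictionary<string, string>(StringComparer.Ordinal);
        foreach (string line in lines)
        {
            if (line.Length < IdStart + IdLength)
                continue; // too short to hold an ID

            string userId = line.Substring(IdStart, IdLength);
            // Keep the first line per ID, like getLine's "first match wins".
            if (!map.ContainsKey(userId))
                map.Add(userId, line);
        }
        return map;
    }

    // Returns the matching line, or null when the ID is unknown.
    public static string Find(Dictionary<string, string> map, string userId)
    {
        string line;
        return map.TryGetValue(userId, out line) ? line : null;
    }
}
```

With this in place, the map would be built once before the loop, and the call `getLine(line1.Substring(3, 8), lines_secondFile)` would become `UserIndex.Find(map, line1.Substring(3, 8))`.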

【Comments】:

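【Solution 2】:

Instead of reading line by line, try reading the whole file at once. That is much faster than issuing many read requests against the file, because file access is far slower than memory access. Try File.ReadAllText.

That said, you should profile the code to see exactly where the bottleneck is.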
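
Short of a full profiler, a rough way to locate the bottleneck is to time each phase separately. This is only a minimal sketch; `PhaseTimer` and `Measure` are hypothetical helper names, not part of the question's code.

```csharp
using System;
using System.Diagnostics;

static class PhaseTimer
{
    // Runs the action and returns how long it took.
    public static TimeSpan Measure(Action action)
    {
        Stopwatch watch = Stopwatch.StartNew();
        action();
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Wrapping the block that loads File2.txt in one `PhaseTimer.Measure(() => { ... })` call and the matching loop in another shows which of the two dominates the run time.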

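【Comments】:

• How do I read a text file all at once? After stripping out most of the code, I narrowed the bottleneck down to the loop inside the loop.
• Using ReadAllText causes an OutOfMemoryException because the file is too large, 280MB.
【Solution 3】: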

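If you have the resources, you can load the whole file into memory, which should improve speed. Before .NET 4 you had to use the Win32 API to memory-map a file, but .NET 4 added System.IO.MemoryMappedFiles.MemoryMappedFile.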
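
As a rough illustration of the memory-mapped route, the sketch below reads a file's lines through a read-only mapping. `MappedLines` and `Read` are hypothetical names, and the code assumes the file's encoding is one that `StreamReader` detects by default (UTF-8 unless a byte order mark says otherwise).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedLines
{
    // Lazily yields the lines of a file through a read-only memory mapping.
    public static IEnumerable<string> Read(string path)
    {
        long length = new FileInfo(path).Length;
        if (length == 0)
            yield break; // an empty file cannot be mapped

        // Capacity 0 maps the file at its current size on disk.
        using (MemoryMappedFile mapped = MemoryMappedFile.CreateFromFile(
                   path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        // Size the view to the file length; a default-sized view is rounded up
        // to a page boundary and would expose trailing zero bytes.
        using (MemoryMappedViewStream view = mapped.CreateViewStream(
                   0, length, MemoryMappedFileAccess.Read))
        using (StreamReader reader = new StreamReader(view))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }
}
```

Mapping by itself does not remove the nested loop from the question; it only changes how the bytes reach the process, so it is worth measuring before and after.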

A multithreaded approach that processes parts of the file in parallel is also possible, but it adds extra complexity.
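
One low-ceremony way to try this is PLINQ. The sketch below is an assumption-laden illustration, not the answer's own code: `ParallelMatcher` and `Match` are hypothetical names, `index` is assumed to be a dictionary keyed by user ID that was built from File2.txt beforehand, and every line of File1.txt is assumed to be at least 11 characters long, as the `Substring(3, 8)` call in the question already requires.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ParallelMatcher
{
    // Pairs each line of the first file with its match (or null) from the index.
    // AsOrdered keeps results in input order so they can be written sequentially.
    public static List<KeyValuePair<string, string>> Match(
        IEnumerable<string> firstFileLines,
        Dictionary<string, string> index)
    {
        return firstFileLines
            .AsParallel()
            .AsOrdered()
            .Select(line1 =>
            {
                string line2;
                // Concurrent reads of a Dictionary are safe while nothing writes to it.
                index.TryGetValue(line1.Substring(3, 8), out line2);
                return new KeyValuePair<string, string>(line1, line2);
            })
            .ToList();
    }
}
```

The result list holds every pair in memory at once, and the writes to the output files stay on a single thread afterwards. If the job turns out to be dominated by disk I/O, the gain from parallelism may be small.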

【Comments】:
