How to improve performance when parsing a large text file - StreamReader + Regex

【Question title】: How to improve performance when parsing large text file - StreamReader + Regex
【Posted】: 2019-02-19 08:37:39
【Question description】:

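I am developing a Windows Forms application that takes a robot program generated by another piece of software and modifies it. The modification process is as follows:

1. StreamReader.ReadLine() is used to parse the file line by line.
2. Regex is used to search for specific keywords in the file. If there is a match, the matched string is copied to another string and replaced with new lines of robot code.
3. The modified code is accumulated in a string and finally written to a new file.
4. All matched strings found by the Regex are also collected in a string and finally written to a new file.

I have been able to do this successfully with the following code: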

    private void Form1_Load(object sender, EventArgs e)
    {
        string NextLine = null;
        string CurrLine = null;
        string MoveL_Pos_Data = null;
        string MoveL_Ref_Data = null;
        string MoveLFull = null;
        string ModCode = null;
        string TAB = "\t";
        string NewLine = "\r\n";
        string SavePath = null;
        string ExtCode_1 = null;
        string ExtCode_2 = null;
        string ExtCallMod = null;

        int MatchCount = 0;
        int NumRoutines = 0;

        try
        {
            // Ask user location of the source file
            // Display an OpenFileDialog so the user can select a .mod file.
            OpenFileDialog openFileDialog1 = new OpenFileDialog
            {
                Filter = "MOD Files|*.mod",
                Title = "Select an ABB RAPID MOD File"
            };

            // Show the Dialog.  
            // If the user clicked OK in the dialog and  
            // a .MOD file was selected, open it.  
            if (openFileDialog1.ShowDialog() == System.Windows.Forms.DialogResult.OK)
            {
                // Assign the cursor in the Stream to the Form's Cursor property.  
                //this.Cursor = new Cursor(openFileDialog1.OpenFile());
                using (StreamReader sr = new StreamReader(openFileDialog1.FileName))
                {
                    // define a regular expression to search for extr calls 
                    Regex Extr_Ex = new Regex(@"\bExtr\(-?\d*.\d*\);", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Multiline);
                    Regex MoveL_Ex = new Regex(@"\bMoveL\s+(.*)(z\d.*)", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Multiline);

                    Match MoveLString = null;

                    while (sr.Peek() >= 0)
                    {
                        CurrLine = sr.ReadLine();
                        //Console.WriteLine(sr.ReadLine());

                        // check if the line is a match 
                        if (Extr_Ex.IsMatch(CurrLine))
                        {
                            // Keep a count for total matches
                            MatchCount++;

                            // Save extr calls in a string
                            ExtCode_1 += NewLine + TAB + TAB + Extr_Ex.Match(CurrLine).ToString();


                            // Read next line (always a MoveL) to get Pos data for TriggL
                            NextLine = sr.ReadLine();
                            //Console.WriteLine(NextLine);

                            if (MoveL_Ex.IsMatch(NextLine))
                            {
                                // Next Line contains MoveL
                                // get matched string 
                                MoveLString = MoveL_Ex.Match(NextLine);
                                GroupCollection group = MoveLString.Groups;
                                MoveL_Pos_Data = group[1].Value.ToString();
                                MoveL_Ref_Data = group[2].Value.ToString();
                                MoveLFull = MoveL_Pos_Data + MoveL_Ref_Data;                                

                            }

                            // replace Extr with the following commands
                            ModCode += NewLine + TAB + TAB + "TriggL " + MoveL_Pos_Data + "extr," + MoveL_Ref_Data;
                            ModCode += NewLine + TAB + TAB + "WaitDI DI1_1,1;";
                            ModCode += NewLine + TAB + TAB + "MoveL " + MoveLFull;
                            ModCode += NewLine + TAB + TAB + "Reset DO1_1;";
                            //break;

                        }
                        else
                        {
                            // No extr Match
                            ModCode += "\r\n" + CurrLine;
                        }                     

                    }

                    Console.WriteLine($"Total Matches: {MatchCount}");
                }


            }

            // Write modified code into a new output file
            string SaveDirectoryPath = Path.GetDirectoryName(openFileDialog1.FileName);
            string ModName = Path.GetFileNameWithoutExtension(openFileDialog1.FileName);
            SavePath = SaveDirectoryPath + @"\" + ModName + "_rev.mod";
            File.WriteAllText(SavePath, ModCode);

            //Write Extr matches into new output file 
            //Prepare module
            ExtCallMod = "MODULE ExtruderCalls";

            // All extr calls in one routine
            //Prepare routines
            ExtCallMod += NewLine + NewLine + TAB + "PROC Prg_ExtCall"; // + 1;
                ExtCallMod += ExtCode_1;
                ExtCallMod += NewLine + NewLine + TAB + "ENDPROC";
                ExtCallMod += NewLine + NewLine;

            //}

            ExtCallMod += "ENDMODULE";

            // Write to file
            string ExtCallSavePath = SaveDirectoryPath + @"\ExtrCalls.mod";                
            File.WriteAllText(ExtCallSavePath, ExtCallMod);                

        }

        catch (Exception ex)
        {
            Console.WriteLine(ex.ToString());                
        }

    }                    
}

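Although this does what I want, the process is very slow. Since I am new to C# programming, I suspect the slowness comes from copying the original file contents into a string rather than replacing the contents in place (I am not sure whether the contents of the original file can be replaced directly). For a 20,000-line input file, the whole process takes a little over 5 minutes.

I used to get the following error: Message=Managed Debugging Assistant 'ContextSwitchDeadlock' : 'The CLR has been unable to transition from COM context 0xb27138 to COM context 0xb27080 for 60 seconds. The thread that owns the destination context/apartment is most likely either doing a non pumping wait or processing a very long running operation without pumping Windows messages. This situation generally has a negative performance impact and may even lead to the application becoming non responsive or memory usage accumulating continually over time. To avoid this problem, all single threaded apartment (STA) threads should use pumping wait primitives (such as CoWaitForMultipleHandles) and routinely pump messages during long running operations.'

I was able to get past it by disabling the 'ContextSwitchDeadlock' setting in the debugger settings. This is probably not best practice.

Can anyone help me improve the performance of my code?

EDIT: I found that the robot controller has a limit on the number of lines in a MOD file (the output file). The maximum number of lines allowed is 32768. I came up with the following logic to split the contents of the string builder into separate output files: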

// Split modCodeBuilder into separate strings based on final size
        const int maxSize = 32500;
        string result = modCodeBuilder.ToString();
        string[] splitResult = result.Split(new string[] { "\r\n" }, StringSplitOptions.None);
        string[] splitModCode = new string[maxSize]; 

        // Setup destination directory to be same as source directory
        string destDir = Path.GetDirectoryName(fileNames[0]);

        for (int count = 0; ; count++)
        {
            // Get the next batch of text by skipping the amount
            // we've taken so far and then taking the maxSize.
            string modName = $"PrgMOD_{count + 1}";
            string procName = $"Prg_{count + 1}()";

            // Use Array Copy to extract first 32500 lines from modCode[]
            int src_start_index = count * maxSize;
            int srcUpperLimit = splitResult.GetUpperBound(0);
            int dataLength = maxSize;

            if (src_start_index > srcUpperLimit) break; // Exit loop when there's no text left to take

            if (src_start_index > 1)
            {
                // Make sure calculate right length so that src index is not exceeded
                dataLength = srcUpperLimit - maxSize;
            }                

            Array.Copy(splitResult, src_start_index, splitModCode, 0, dataLength);
            string finalModCode = String.Join("\r\n", splitModCode);

            string batch = String.Concat("MODULE ", modName, "\r\n\r\n\tPROC ", procName, "\r\n", finalModCode, "\r\n\r\n\tENDPROC\r\n\r\nENDMODULE");

            //if (batch.Length == 0) break; 

            // Generate file name based on count
            string fileName = $"ABB_R3DP_{count + 1}.mod";

            // Write our file text
            File.WriteAllText(Path.Combine(destDir, fileName), batch);

            // Write status to output textbox
            TxtOutput.AppendText("\r\n");
            TxtOutput.AppendText("\r\n");
            TxtOutput.AppendText($"Modified MOD File: {fileName} is generated successfully! It is saved to location: {Path.Combine(destDir, fileName)}");
        }

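【Comments】:

@Gauravsa Can you explain why these lines are the bottleneck and how to improve them? As it stands, your answer does not answer my question.
Strings are immutable. Every time you change a string, you are actually creating a new string, allocating memory, and copying the data from the existing string into the new one.
Here is a good read: jonskeet.uk/csharp/stringbuilder.html
Personally, I would use two threads, one to write and one to read, so that the file can be written while it is being read. Secondly, you can find out which step is the bottleneck by printing the number of ticks each series of steps took. Also focus on the regex matching.
Avoid IsMatch when you plan to use the result. Use Matches directly so that the regex is not executed twice. Profile your specific regex with and without Compiled: extremely simple expressions can actually run faster without it. Use StringBuilder.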
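
The advice in the last comment (run the regex once and reuse the result) can be sketched roughly as follows. This is a minimal illustration rather than code from the thread: the helper name `TryParseMoveL` and the sample RAPID line are assumptions made up for the example, and single-match `Regex.Match` is used here in place of `Matches` because each line is expected to contain at most one MoveL instruction.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchOnceDemo
{
    // Same pattern as MoveL_Ex in the question.
    private static readonly Regex MoveLRegex =
        new Regex(@"\bMoveL\s+(.*)(z\d.*)", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns true and the two captured groups
    // when the line contains a MoveL instruction.
    public static bool TryParseMoveL(string line, out string posData, out string refData)
    {
        Match match = MoveLRegex.Match(line); // the regex runs exactly once
        if (!match.Success)
        {
            posData = string.Empty;
            refData = string.Empty;
            return false;
        }

        posData = match.Groups[1].Value;
        refData = match.Groups[2].Value;
        return true;
    }

    public static void Main()
    {
        // Made-up sample line, for illustration only.
        const string sample = "MoveL p10,v100,z10,tool0;";

        if (TryParseMoveL(sample, out string pos, out string reference))
        {
            Console.WriteLine($"pos = '{pos}', ref = '{reference}'");
        }
    }
}
```

Checking `match.Success` replaces the separate `IsMatch` call, so each line is scanned once instead of twice.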

Tags: c# regex streamreader

【Solution 1】:

String concatenation can take a long time. Using a StringBuilder instead will likely improve performance:

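// Requires: using System; using System.IO; using System.Text; using System.Text.RegularExpressions;
private static void GenerateNewFile(string sourceFullPath)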
{
    string posData = null;
    string refData = null;
    string fullData = null;

    var modCodeBuilder = new StringBuilder();
    var extCodeBuilder = new StringBuilder();

    var extrRegex = new Regex(@"\bExtr\(-?\d*.\d*\);", RegexOptions.Compiled | 
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    var moveLRegex = new Regex(@"\bMoveL\s+(.*)(z\d.*)", RegexOptions.Compiled | 
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    int matchCount = 0;
    bool appendModCodeNext = false;

    foreach (var line in File.ReadLines(sourceFullPath))
    {
        if (appendModCodeNext)
        {
            // Run the regex once and reuse the result instead of IsMatch + Match.
            Match moveLMatch = moveLRegex.Match(line);

            if (moveLMatch.Success)
            {
                posData = moveLMatch.Groups[1].Value;
                refData = moveLMatch.Groups[2].Value;
                fullData = posData + refData;
            }

            modCodeBuilder.Append("\t\tTriggL ").Append(posData).Append("extr,")
                .Append(refData).Append("\r\n\t\tWaitDI DI1_1,1;\r\n\t\tMoveL ")
                .Append(fullData).AppendLine("\r\n\t\tReset DO1_1;");

            appendModCodeNext = false;
        }
        else if (extrRegex.IsMatch(line))
        {
            matchCount++;
            extCodeBuilder.Append("\t\t").AppendLine(extrRegex.Match(line).ToString());
            appendModCodeNext = true;
        }
        else
        {
            modCodeBuilder.AppendLine(line);
        }
    }

    Console.WriteLine($"Total Matches: {matchCount}");

    string destDir = Path.GetDirectoryName(sourceFullPath);
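    // Append the suffix to the file name; passing "_rev.mod" as a separate
    // Path.Combine segment would point into a subdirectory instead.
    var savePath = Path.Combine(destDir, 
        Path.GetFileNameWithoutExtension(sourceFullPath) + "_rev.mod");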

    File.WriteAllText(savePath, modCodeBuilder.ToString());

    // The line break after the PROC name keeps the first Extr call on its own line.
    var extCallMod = string.Concat("MODULE ExtruderCalls\r\n\r\n\tPROC Prg_ExtCall\r\n",
        extCodeBuilder.ToString(), "\r\n\tENDPROC\r\n\r\nENDMODULE");

    File.WriteAllText(Path.Combine(destDir, "ExtrCalls.mod"), extCallMod);
}

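You mentioned in the comments that you want to take the text in batches and write them to separate files. One way to do this is to treat the string as a sequence of characters and use the System.Linq extension methods Skip and Take. Skip skips a given number of characters in the string, and Take then takes a given number of characters and returns them as an IEnumerable<char>. We can then use string.Concat to turn that back into a string and write it to a file.

If we have a constant representing our maximum size and a counter that starts at 0, we can use a for loop that increments the counter, skips counter * max characters, and then takes max characters from the string. We can also use the counter variable to build the file name, since it increments on every iteration: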

const int maxSize = 32500;
string result = modCodeBuilder.ToString();

for (int count = 0;; count++)
{
    // Get the next batch of text by skipping the amount
    // we've taken so far and then taking the maxSize.
    string batch = string.Concat(result.Skip(count * maxSize).Take(maxSize));

    if (batch.Length == 0) break; // Exit loop when there's no text left to take

    // Generate file name based on count
    string fileName = $"filename_{count + 1}.mod";

    // Write our file text
    File.WriteAllText(Path.Combine(destDir, fileName), batch);
}

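Another approach, which is probably faster, is to use string.Substring with count * maxSize as the start index of the substring to take. Then we only need to make sure our length does not run past the end of the string, and write the substring to a file: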

for (int count = 0;; count++)
{
    // Get the bounds for the substring (startIndex and length)
    var startIndex = count * maxSize;
    var length = Math.Min(result.Length - startIndex, maxSize);

    if (length < 1) break; // Exit loop when there's no text left to take

    // Get the substring and file name
    var batch = result.Substring(startIndex, length);
    string fileName = $"filename_{count + 1}.mod";

    // Write our file text  
    File.WriteAllText(Path.Combine(destDir, fileName), batch);
}

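Note that this splits the text into chunks of exactly 32500 characters (except for the last chunk). If you only want to take whole lines, that requires a bit more work, but it is still not difficult.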
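
One possible way to batch by whole lines rather than by characters is sketched below. It is an illustration under stated assumptions, not part of the original answer: the helper `SplitIntoLineBatches`, the temp-folder output location, and the tiny sample input are all made up here, and the text is assumed to use "\r\n" line endings as in the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LineBatcher
{
    // Hypothetical helper: splits text into batches of at most maxLines whole lines.
    public static List<string[]> SplitIntoLineBatches(string text, int maxLines)
    {
        if (maxLines < 1) throw new ArgumentOutOfRangeException(nameof(maxLines));

        // Assumes "\r\n" line endings, as used throughout the question.
        string[] lines = text.Split(new[] { "\r\n" }, StringSplitOptions.None);
        var batches = new List<string[]>();

        for (int start = 0; start < lines.Length; start += maxLines)
        {
            // The last batch may be shorter than maxLines.
            int count = Math.Min(maxLines, lines.Length - start);
            var batch = new string[count];
            Array.Copy(lines, start, batch, 0, count);
            batches.Add(batch);
        }

        return batches;
    }

    public static void Main()
    {
        // Made-up input: 5 lines, at most 2 lines per output file.
        string text = string.Join("\r\n", "a", "b", "c", "d", "e");
        List<string[]> batches = SplitIntoLineBatches(text, 2);

        // Hypothetical output folder, used only for this demo.
        string destDir = Path.Combine(Path.GetTempPath(), "line_batches_demo");
        Directory.CreateDirectory(destDir);

        for (int i = 0; i < batches.Count; i++)
        {
            string fileName = $"filename_{i + 1}.mod";
            File.WriteAllText(Path.Combine(destDir, fileName),
                string.Join("\r\n", batches[i]));
        }

        Console.WriteLine($"Wrote {batches.Count} files to {destDir}");
    }
}
```

Sizing each batch array to the number of lines actually copied avoids carrying stale or null entries into the last file. If every file is also wrapped in MODULE/PROC header and footer lines, as in the question's edit, maxLines should leave some headroom below the controller's 32768-line limit.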

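【Discussion】:

+1!!! Wow! The performance improvement is huge. The whole file is now generated in under 300 ms (down from 5 minutes!). modCodeBuilder.AppendLine(string.Concat("\t\tTriggL ", posData, "extr,", refData, "\r\n\t\tWaitDI DI1_1,1;\r\n\t\tMoveL ", fullData, "\r\n\t\tReset DO1_1;")); made a big difference. In an earlier attempt I tried using a StringBuilder like this: StringBuilder modCodeBuilder = new StringBuilder() and used the modCodeBuilder.Append() method in place of +=. That made performance worse. Upvoted.
Is there a performance difference between StringBuilder.Append and StringBuilder.AppendLine? And is there any difference between += and String.Concat()?
AppendLine just makes two calls to Append: one for the string and one to append the newline. Otherwise they are the same. You can see the source code here.
+= and + on strings are compiled into string.Concat calls, so there should be no difference. See the answer here. For concatenating more than about 7 strings (I think? something like that), StringBuilder will be more efficient. Every time one string is added to another, their lengths are checked, memory is allocated, and a new string is created. StringBuilder instead allocates a block of memory up front as a buffer and then uses it to store the strings.
If you have other code that uses StringBuilder and performs worse, then it must not have been written correctly.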
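
For anyone who wants to measure the difference discussed above on their own machine, a rough timing harness could look like the following. It is only a sketch: the class and method names are invented for this example, the generated lines are placeholders rather than real RAPID code, and the timings will vary with line length and hardware.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class ConcatVsBuilder
{
    public static string BuildWithConcat(int lineCount)
    {
        string result = string.Empty;
        for (int i = 0; i < lineCount; i++)
        {
            // Each += allocates a new string and copies everything built so far.
            result += "line " + i + "\r\n";
        }
        return result;
    }

    public static string BuildWithBuilder(int lineCount)
    {
        var builder = new StringBuilder();
        for (int i = 0; i < lineCount; i++)
        {
            // Appends into the builder's internal buffer; no full copy per line.
            builder.Append("line ").Append(i).Append("\r\n");
        }
        return builder.ToString();
    }

    public static void Main()
    {
        const int lineCount = 20000; // same order of magnitude as the question's input

        var watch = Stopwatch.StartNew();
        string viaConcat = BuildWithConcat(lineCount);
        watch.Stop();
        Console.WriteLine($"+= concatenation: {watch.ElapsedMilliseconds} ms");

        watch.Restart();
        string viaBuilder = BuildWithBuilder(lineCount);
        watch.Stop();
        Console.WriteLine($"StringBuilder:    {watch.ElapsedMilliseconds} ms");

        Console.WriteLine($"Same output: {viaConcat == viaBuilder}");
    }
}
```

Both methods produce the same text; only the allocation pattern differs, which is why the gap widens as the accumulated string grows.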