Reading text files with 64bit process very slow (answers)

【Question title】: Reading text files with 64bit process very slow
【Posted】: 2015-09-30 08:33:00
【Question description】:

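I am merging the text files (.itf) located in a folder, applying some logic along the way. When I compile the program as 32-bit (console application, .NET 4.6) everything works fine, except that I get an OutOfMemoryException when the folder contains a lot of data. Compiling it as 64-bit solves that problem, but it then runs extremely slowly compared to the 32-bit process (more than 15 times slower).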

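I tried it with BufferedStream and with ReadAllLines, but both perform very poorly. The profiler tells me that 99% of the time is spent in these methods. I don't know whether that is the actual problem...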

Here is the code:

private static void readData(Dictionary<string, Topic> topics)
{
    foreach (string file in Directory.EnumerateFiles(Path, "*.itf"))
    {
        Topic currentTopic = null;
        Table currentTable = null;
        Object currentObject = null;
        using (var fs = File.Open(file, FileMode.Open))
        {
            using (var bs = new BufferedStream(fs))
            {
                using (var sr = new StreamReader(bs, Encoding.Default))
                {
                    string line;
                    while ((line = sr.ReadLine()) != null)
                    {
                        if (line.IndexOf("ETOP") > -1)
                        {
                            currentTopic = null;
                        }
                        else if (line.IndexOf("ETAB") > -1)
                        {
                            currentTable = null;
                        }
                        else if (line.IndexOf("ELIN") > -1)
                        {
                            currentObject = null;
                        }
                        else if (line.IndexOf("MTID") > -1)
                        {
                            MTID = line.Replace("MTID ", "");
                        }
                        else if (line.IndexOf("MODL") > -1)
                        {
                            MODL = line.Replace("MODL ", "");
                        }
                        else if (line.IndexOf("TOPI") > -1)
                        {
                            var name = line.Replace("TOPI ", "");
                            if (topics.ContainsKey(name))
                            {
                                currentTopic = topics[name];
                            }
                            else
                            {
                                var topic = new Topic(name);
                                currentTopic = topic;
                                topics.Add(name, topic);
                            }
                        }
                        else if (line.IndexOf("TABL") > -1)
                        {
                            var name = line.Replace("TABL ", "");
                            if (currentTopic.Tables.ContainsKey(name))
                            {
                                currentTable = currentTopic.Tables[name];
                            }
                            else
                            {
                                var table = new Table(name);
                                currentTable = table;
                                currentTopic.Tables.Add(name, table);
                            }
                        }
                        else if (line.IndexOf("OBJE") > -1)
                        {
                            if (currentTable.Name != "Metadata" || currentTable.Objects.Count == 0)
                            {
                                var shortLine = line.Replace("OBJE ", "");
                                var obje = new Object(shortLine.Substring(shortLine.IndexOf(" ")));
                                currentObject = obje;
                                currentTable.Objects.Add(obje);
                            }
                        }
                        else if (currentTopic != null && currentTable != null && currentObject != null)
                        {
                            currentObject.Data.Add(line);
                        }
                    }
                }
            }
        }
    }
}

【Question comments】:

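So where is the ReadAllLines that the profiler says is slowing things down? Also, your bottleneck is probably string.IndexOf. Hint: invest in writing a proper lexer/parser.
I wonder whether the number of string allocations (every one of those .Replace calls creates a new string) is the culprit. A real profiler would tell, but I wonder whether a mechanism that takes the whole file as a stream and reads it character by character, without re-parsing or manipulating each line, would be a better solution here.
The code sample shows the BufferedStream version. I also have a ReadAllLines version. In 32-bit, the profiler does say that the Replace and IndexOf methods consume a lot of time. However, I would like to know why the 64-bit version is so much slower.
I am not sure why the 64-bit version is slower than the 32-bit one. Anyway... every time you call line.IndexOf, it scans the line from the beginning. That is very time-consuming. I suggest you implement your own method for finding the index.
@M.kazemAkhgary I am not sure you can do better than Microsoft's version of IndexOf...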

Tags: c# .net

【Solution 1】:

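The biggest problem with your program is that, when you let it run in 64-bit mode, it can read a lot more files. Which is nice: a 64-bit process has well over a thousand times the address space of a 32-bit process, so running out of it is extremely unlikely.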

But you do not get a thousand times more RAM.

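This is the universal principle of "there is no free lunch" at work. Having enough RAM matters a great deal in a program like this. First and foremost, it is used by the file system cache, the magical operating system feature that makes it look as if reading files from a disk is very cheap. It is not cheap at all, it is one of the slowest things you can do in a program, but the cache is very good at hiding that. You invoke it whenever you run your program more than once: the second and subsequent times, you do not read from the disk at all. That is a pretty dangerous feature and very hard to avoid while you test your program, so you end up with very unrealistic assumptions about how efficient it is.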

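The problem with a 64-bit process is that it can easily make the file system cache ineffective. Since you can read a lot more files, you overwhelm the cache and get old file data evicted. Now the second run of your program is not fast anymore: the files you read are no longer in the cache and must be read from the disk. You now see the real performance of your program, the way it will behave in production. Which is a good thing, even if you do not like it very much :)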

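The secondary problem with RAM is the lesser one: if you allocate a lot of memory to store the file data, you force the operating system to find the RAM to back it. That can cause a lot of hard page faults, incurred when it must unmap memory used by another process, or by yours, to free up the RAM you need. This generic problem is called "thrashing". Page faults are something you can see in Task Manager; use View > Select Columns to add the column.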
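
To put numbers on that from inside the program, the process can also report its own footprint. This is only a minimal sketch: the `MemorySnapshot` class and its method names are made up for illustration, and the working set is a rough proxy for memory pressure, not a page-fault count.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;

// Hypothetical helper, not part of the original program.
static class MemorySnapshot
{
    // Formats a byte count as mebibytes with one decimal place.
    public static string ToMiB(long bytes)
    {
        return (bytes / (1024.0 * 1024.0)).ToString("F1", CultureInfo.InvariantCulture) + " MiB";
    }

    // Prints the bitness of the current process and its memory footprint.
    public static void Print(string label)
    {
        using (Process self = Process.GetCurrentProcess())
        {
            Console.WriteLine("[{0}] 64-bit process: {1}", label, Environment.Is64BitProcess);
            Console.WriteLine("  working set:      {0}", ToMiB(self.WorkingSet64));
            Console.WriteLine("  peak working set: {0}", ToMiB(self.PeakWorkingSet64));
            Console.WriteLine("  private bytes:    {0}", ToMiB(self.PrivateMemorySize64));
            Console.WriteLine("  managed heap:     {0}", ToMiB(GC.GetTotalMemory(false)));
        }
    }
}
```

Calling `MemorySnapshot.Print("before")` ahead of `readData` and `MemorySnapshot.Print("after")` behind it shows how much memory the loaded file data occupies. If that approaches the physical RAM of the machine, paging becomes a plausible suspect.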

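Given that the file system cache is the most likely source of the slowdown, a simple test you can do is to reboot your machine, which ensures that the cache cannot contain any of the file data, and then run the 32-bit version. The prediction is that it will also be slow, and that BufferedStream and ReadAllLines are the bottlenecks. As they should be.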
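
A complementary check is to time the raw reading on its own, with no parsing, twice in a row. The following is only a sketch under assumptions: the `ReadTiming` class is hypothetical, and the folder and pattern are whatever the real program uses.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original program.
static class ReadTiming
{
    // Reads every line of every matching file and discards it, so the
    // elapsed time covers file I/O and decoding only, not the parsing.
    public static long CountLines(string folder, string pattern)
    {
        long lines = 0;
        foreach (string file in Directory.EnumerateFiles(folder, pattern))
        {
            using (var sr = new StreamReader(file, Encoding.Default))
            {
                while (sr.ReadLine() != null)
                {
                    lines++;
                }
            }
        }
        return lines;
    }

    // Runs the read pass twice and prints the elapsed time of each pass.
    public static void Compare(string folder, string pattern)
    {
        for (int pass = 1; pass <= 2; pass++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            long lines = CountLines(folder, pattern);
            sw.Stop();
            Console.WriteLine("pass {0}: {1} lines in {2} ms", pass, lines, sw.ElapsedMilliseconds);
        }
    }
}
```

If the first pass after a reboot is far slower than the second, the disk is the bottleneck and the cache was hiding it. If both passes are fast while the full program is still slow, the time is going into parsing or memory pressure instead.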

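One final note: even though your program does not match the pattern, you cannot make strong assumptions about .NET 4.6 performance problems yet. Not until this very nasty bug gets fixed.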

【Comments】:

【Solution 2】:

A few tips:

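Why do you use File.Open, then BufferedStream, then StreamReader, when a single buffered StreamReader can do the job on its own?
You should reorder your conditions so that the most frequently occurring ones come first.
Consider reading all lines and then using Parallel.ForEach.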
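
The first two tips could be sketched roughly as follows. The `ItfLines` class is hypothetical, the tag order is a guess rather than a measurement, and the 64 KiB buffer size is an arbitrary choice. The sketch also passes StringComparison.Ordinal, because string.IndexOf(string) without that argument does a culture-sensitive search, which is noticeably more expensive.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original program.
static class ItfLines
{
    // Tags are checked in this order; put the most frequent first.
    // The order below is an assumption, not measured on real data.
    static readonly string[] Tags =
        { "OBJE", "TABL", "TOPI", "ETOP", "ETAB", "ELIN", "MTID", "MODL" };

    // Returns the first tag (in Tags order) found anywhere in the line,
    // or null when the line contains none of them (a data line).
    public static string Classify(string line)
    {
        foreach (string tag in Tags)
        {
            if (line.IndexOf(tag, StringComparison.Ordinal) > -1)
            {
                return tag;
            }
        }
        return null;
    }

    // Streams the lines of one file through a single StreamReader with
    // its own 64 KiB buffer; no File.Open or BufferedStream is involved.
    public static IEnumerable<string> ReadLines(string file)
    {
        using (var sr = new StreamReader(file, Encoding.Default, false, 1 << 16))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }
}
```

A loop such as `foreach (string line in ItfLines.ReadLines(file)) { switch (ItfLines.Classify(line)) { ... } }` would then replace the if/else chain. Note that reordering the checks changes which branch wins for a line that contains more than one tag, so the order has to be validated against real files.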

【Comments】:

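Thanks for the tips, I implemented them. Parallelism does not work in my case, though: because of the model of the content, I have to parse the lines sequentially.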

【Solution 3】:

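I was able to solve it. There seems to be a bug in the .NET compiler. Unchecking the "Optimize code" checkbox in VS2015 led to a huge performance gain. The program now runs about as fast as the 32-bit version. My final version contains a few optimizations: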

private static void readData(ref Dictionary<string, Topic> topics)
    {
        Regex rgxOBJE = new Regex("OBJE [0-9]+ ", RegexOptions.IgnoreCase | RegexOptions.Compiled);
        Regex rgxTABL = new Regex("TABL ", RegexOptions.IgnoreCase | RegexOptions.Compiled);
        Regex rgxTOPI = new Regex("TOPI ", RegexOptions.IgnoreCase | RegexOptions.Compiled);
        Regex rgxMTID = new Regex("MTID ", RegexOptions.IgnoreCase | RegexOptions.Compiled);
        Regex rgxMODL = new Regex("MODL ", RegexOptions.IgnoreCase | RegexOptions.Compiled);
        foreach (string file in Directory.EnumerateFiles(Path, "*.itf"))
        {
            if (file.IndexOf("itf_merger_result") == -1)
            {
                Topic currentTopic = null;
                Table currentTable = null;
                Object currentObject = null;
                using (var sr = new StreamReader(file, Encoding.Default))
                {
                    Stopwatch sw = new Stopwatch();
                    sw.Start();
                    Console.WriteLine(file + " read, parsing ...");
                    string line;
                    while ((line = sr.ReadLine()) != null)
                    {
                        if (line.IndexOf("OBJE") > -1)
                        {
                            if (currentTable.Name != "Metadata" || currentTable.Objects.Count == 0)
                            {
                                var obje = new Object(rgxOBJE.Replace(line, ""));
                                currentObject = obje;
                                currentTable.Objects.Add(obje);
                            }
                        }
                        else if (line.IndexOf("TABL") > -1)
                        {
                            var name = rgxTABL.Replace(line, "");
                            if (currentTopic.Tables.ContainsKey(name))
                            {
                                currentTable = currentTopic.Tables[name];
                            }
                            else
                            {
                                var table = new Table(name);
                                currentTable = table;
                                currentTopic.Tables.Add(name, table);
                            }
                        }
                        else if (line.IndexOf("TOPI") > -1)
                        {
                            var name = rgxTOPI.Replace(line, "");
                            if (topics.ContainsKey(name))
                            {
                                currentTopic = topics[name];
                            }
                            else
                            {
                                var topic = new Topic(name);
                                currentTopic = topic;
                                topics.Add(name, topic);
                            }
                        }
                        else if (line.IndexOf("ETOP") > -1)
                        {
                            currentTopic = null;
                        }
                        else if (line.IndexOf("ETAB") > -1)
                        {
                            currentTable = null;
                        }
                        else if (line.IndexOf("ELIN") > -1)
                        {
                            currentObject = null;
                        }
                        else if (currentTopic != null && currentTable != null && currentObject != null)
                        {
                            currentObject.Data.Add(line);
                        }
                        else if (line.IndexOf("MTID") > -1)
                        {
                            MTID = rgxMTID.Replace(line, "");
                        }
                        else if (line.IndexOf("MODL") > -1)
                        {
                            MODL = rgxMODL.Replace(line, "");
                        }
                    }
                    sw.Stop();
                    Console.WriteLine(file + " parsed in {0}s", sw.ElapsedMilliseconds / 1000.0);
                }
            }
        }
    }

【Comments】:

I am on VS2019 and code optimization is already disabled. Same result.

【Solution 4】:

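Unchecking the "Optimize code" checkbox normally causes a slowdown, not a speedup. There may be a problem in the VS 2015 product. Please provide a standalone repro case, including an input set for your program that demonstrates the performance problem, and report it at: http://connect.microsoft.com/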

【Comments】: