Optimize string search in .txt file: answers

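【Question title】: Optimize string search in .txt file
【Posted】: 2016-01-23 10:38:25
【Question】:

This might be a very stupid question, but how can I optimize this code to make it more efficient (faster, less memory consumption)? I wrote it to help me sort some text files. It reads each string from the first file, searches the second file until it finds the matching string, and then writes the corresponding matched string to a third file. Here is the code: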

#include <fstream>
#include <iostream>
#include <string>

using namespace std;
ifstream h("SecondFile.txt");
ifstream h2("FirstFile.txt");
ifstream uh("MatchedStrings.txt");
ofstream g("sorted.txt");

int main()
{
    string x, y, z;
    cout << "Sorting..." << endl;
    while (h >> x) {                    // for each search string
        bool found = false;
        while (h2 >> y && uh >> z) {    // scan the two parallel files
            if (y == x) {
                g << z << endl;         // write the paired string
                found = true;
                break;
            }
        }
        if (!found) {
            g << "none" << endl;
        }
        // rewind both files before the next search string
        h2.clear();
        h2.seekg(0);
        uh.clear();
        uh.seekg(0);
    }
    cout << "Finished!";
}

I have changed the code to:

#include <iostream>
#include <fstream>
#include <string>

using namespace std;
ifstream h("SecondFile.txt");
ifstream h2("FirstFile.txt");
ifstream uh("MatchedStrings.txt");
ofstream g("sorted.txt");

int main()
{
    string x;
    bool write_none = true;
    int i = 0,l=0;
    string check[] = {""};
    string unhashed_checked[] = { "" };
    string sorted_array[] = { "" };
    cout << "Sorting..." << endl;
    //Get to memory
    while (!h2.eof())
    {
        h2 >> check[i];
        uh >> unhashed_checked[i];
        i++;
    }

    while (!h.eof()){
        h >> x;
        write_none = true;
        for (int t = 0; t <= i;t++)
        {
            if (x == check[t])
            {
                break;
                write_none = false;
                sorted_array[l] = unhashed_checked[i];
                l++;
            }
        }
        if (write_none)
        {
            sorted_array[l] = "none";
            l++;
        }
    }
    for (int k = 0; k <= l; k++)
    {
        g << sorted_array[k]<<endl;
    }
    cout << "Finished!";
}

But I get this exception when running the program:

Unhandled exception at 0x01068FF6 in ConsoleApplication1.exe: 0xC0000005: Access violation writing location 0xCCCCCCCC

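【Comments】:

It would probably be faster to collect all the search strings from the first file in memory and use them in an outer loop to search the second file.
I'll try it and come back with the results.
@πάνταῥεῖ I couldn't get it to work :/ I tried reading it into memory but got an unhandled exception 0xCCCCCCCC
As @Ilya says, use std::vector<std::string>.
Why are all your streams global? You only have one function anyway! Also, without the files it is impossible to reproduce the problem. Besides, is input from files even necessary to reproduce it? Try to extract a minimal but complete example first!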

Tags: c++ sorting search optimization text

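【Solution 1】:

Load h into a vector of strings, then loop over h2 once, comparing each string with the contents of the vector.

Since your test is symmetric, you can choose h to be the smaller of the two files. That saves memory and time, especially when one file is much larger than the other. If the comparisons take a lot of time, using a set (std::set) instead of a vector will also help.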

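【Comments】:

【Solution 2】:

Let n and m be the number of strings in the two files, respectively.

The way you're doing it now, the complexity is Θ(n m). Moreover, the constant factor is that of file operations, which are very slow.

Instead, you should read one of the files into a std::unordered_* container and then compare keys between the containers. This should reduce the running time to an expected Θ(n + m).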

As a side note, you might want to look at more modern ways to read strings into containers (e.g., using std::istream_iterator).

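【Comments】:

The question also has a "memory consumption" part. If one of the files is very large, you may not want to load it into an in-memory container.
@Ilya Thanks, corrected. I now also see that you already addressed this in your answer (+1). I would still suggest using the unordered_* containers here, though.
Ah, I didn't know about std::unordered_set. Learned something, thanks.
Thanks for your help. I got it working using the example you gave me.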