Reading a text file by scanning for keywords

【Question title】: Reading text file by scanning for keywords
【Posted】: 2014-06-11 06:45:22
【Question description】:

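As part of a larger application, I am developing a class that reads input from a text file to be used in the initialisation of the program. I am fairly new to programming myself, and I only started learning C++ in December, so I would be very grateful for some hints and ideas on how to get started! I apologise in advance for a rather long wall of text.

The text file format is "keyword-driven" in the following way:

There is a rather small number of main/section keywords (currently 8) that need to be written in a given order. Some of them are optional, but if they are included they should adhere to the given ordering.

Example:

Suppose there are 3 potential keywords, ordered as follows: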

"KEY1" (required)
"KEY2" (optional)
"KEY3" (required)

If the input file only includes the required ones, the ordering should be:

"KEY1"
"KEY3"

Otherwise it should be:

"KEY1"
"KEY2"
"KEY3"

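If all required keywords are present and the overall ordering is OK, the program should read each section in the order given by the ordering.
Each section will contain a (possibly large) number of sub-keywords, some of which are optional and some of which are not, but here the order does not matter.
Lines starting with the character '*' or with '--' signify comment lines and should be ignored (as should empty lines).
A line containing a keyword should (preferably) contain nothing but the keyword. At the very least, the keyword must be the first word on the line.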
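
The line rules above can be sketched as two small helpers. This is only an illustration: the names `isSkippable` and `firstWord` are hypothetical, and the sketch assumes comment markers sit in the very first column and that a line consisting only of whitespace does not count as empty.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: true for lines the format says to ignore,
// i.e. empty lines and lines starting with '*' or "--".
bool isSkippable(const std::string& line) {
    if (line.empty()) return true;
    if (line[0] == '*') return true;
    return line.size() >= 2 && line[0] == '-' && line[1] == '-';
}

// Hypothetical helper: first whitespace-delimited word of a line,
// or an empty string if the line holds no word at all.
std::string firstWord(const std::string& line) {
    std::istringstream iss(line);
    std::string word;
    iss >> word;  // leaves `word` empty when extraction fails
    return word;
}
```

Note that a data line such as `-5.0 1.0` is not treated as a comment, because only a double dash opens one.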

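I have already implemented part of the framework, but I feel my approach has been rather ad hoc so far. Currently I have manually created one method per section/main keyword, and the program's first task is to scan the file to locate these keywords and pass the necessary information on to the methods.

I first scan through the file using an std::ifstream object, removing empty and/or comment lines, and store the remaining lines in an object of type std::vector<std::string>.

Do you think this is a good approach?

Moreover, I store the indices (in two integer arrays) where each keyword starts and stops in this vector. This is the input to the methods mentioned above, which look like this: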

bool readMAINKEY(int start, int stop);

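This is how I have done it for now; although I do not find it very elegant, I suppose I can keep it for the time being.

However, I feel I need a better approach for handling the reading inside each section, and my main question is how I should store the keywords here. Should they be stored as arrays in a local namespace in the input class, or as static variables in the class? Or should they be defined locally inside the relevant functions? Should I use enums? So many questions!

Right now I have started by defining the sub-keywords locally inside each readMAINKEY() method, but I find this less than optimal. Ideally I want to reuse as much code as possible inside each method, calling a common readSECTION() method, and my current approach seems to lead to a lot of code duplication and potential for programming errors. I guess the smartest thing would be to remove all the (currently 8) different readMAINKEY() methods and use the same function to handle all kinds of keywords. There is also the possibility of sub-sub-keywords and so on (i.e. a more generic nested approach), so I think this may be the way to go, but I am unsure how best to implement it.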
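
One possible shape for such a shared reader is a dispatch table that maps each sub-keyword to a handler, so that a single function can serve every section. The sketch below is not the code from this post: `Handler` and `readSection` are hypothetical names, the range is half-open (`stop` is one past the last line, unlike the inclusive stop index described above), and lines whose first word is not a known sub-keyword are merely counted.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// A handler receives all lines plus the index of the line right after
// its keyword, consumes what it needs, and returns the index of the
// first line it did not consume.
using Handler =
    std::function<std::size_t(const std::vector<std::string>&, std::size_t)>;

// Generic section reader: walks lines [start, stop) and dispatches on
// the first word of each line. Returns the number of lines whose first
// word was not a registered sub-keyword.
std::size_t readSection(const std::vector<std::string>& lines,
                        std::size_t start, std::size_t stop,
                        const std::map<std::string, Handler>& handlers) {
    std::size_t unknown = 0;
    std::size_t i = start;
    while (i < stop) {
        std::istringstream iss(lines[i]);
        std::string word;
        iss >> word;
        const auto it = handlers.find(word);
        if (it == handlers.end()) {
            ++unknown;  // not a known sub-keyword in this section
            ++i;
            continue;
        }
        const std::size_t next = it->second(lines, i + 1);
        i = (next > i) ? next : i + 1;  // always make progress
    }
    return unknown;
}
```

With this layout, each section differs only in the map passed in, and the eight near-identical methods could collapse into eight tables.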

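Once I have processed a keyword at the "bottom level", the program will expect a particular format for the following lines depending on the actual keyword. In principle each keyword will be handled differently, but here too there is potential for code reuse by defining different "types" of keywords, depending on what the program expects to do after triggering the read. Common tasks include, for example, parsing an array of integers or doubles, but in principle it could be anything!
If a keyword for some reason cannot be processed correctly, the program should as far as possible try to use default values instead of terminating (if reasonable), but an error message should be written to the log file. For optional keywords, default values will of course also be used.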
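
As an illustration of the "parse an array of doubles, fall back to defaults, and log the problem" idea, here is a minimal sketch. The function name `parseDoublesOrDefault` and the rule that a line must hold exactly `n` values are assumptions made for the example; the real format may allow something else.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Tries to read exactly n doubles from `line`. If the line holds too
// few values, a non-numeric token, or trailing text, a warning is
// written to `log` and n copies of `fallback` are returned instead.
std::vector<double> parseDoublesOrDefault(const std::string& line,
                                          std::size_t n, double fallback,
                                          std::ostream& log) {
    std::istringstream iss(line);
    std::vector<double> values;
    double v = 0.0;
    while (values.size() < n && iss >> v) {
        values.push_back(v);
    }
    std::string rest;
    const bool trailing = static_cast<bool>(iss >> rest);
    if (values.size() != n || trailing) {
        log << "WARNING: expected " << n << " numeric values in line '"
            << line << "'; using default " << fallback << '\n';
        return std::vector<double>(n, fallback);
    }
    return values;
}
```

Passing the log as a `std::ostream&` lets the same function write to a log file in the program and to a string stream in a test.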

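So, to summarise, my main questions are the following:

1. Do you think my approach of storing the relevant lines in a std::vector<std::string> is reasonable?

This of course requires me to do a lot of "index work" to keep track of where in the vector the different keywords are located. Or should I work more "directly" with the original std::ifstream object? Or something else?

2. Given such a vector storing the lines of the text file, how can I best detect the keywords and start reading the information that follows them?

Here I need to take into account the possible ordering and whether or not a keyword is required. I also need to check that the lines following each "bottom level" keyword have the format expected in each case.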
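
The ordering and required/optional rules can be checked in a single pass over the keywords in the order they were found. The following is a sketch under the rules of the KEY1/KEY2/KEY3 example; `SectionSpec` and `orderIsValid` are hypothetical names, and unknown or repeated keywords are simply rejected.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One entry per main keyword, listed in the order the format demands.
struct SectionSpec {
    std::string name;
    bool required;
};

// Returns true if `found` (keywords in file order) respects the order
// of `spec`, contains no unknown or repeated keyword, and includes
// every required keyword.
bool orderIsValid(const std::vector<SectionSpec>& spec,
                  const std::vector<std::string>& found) {
    std::size_t next = 0;  // first spec entry not yet passed
    for (const std::string& key : found) {
        // Step over optional entries; skipping a required one is an error.
        while (next < spec.size() && spec[next].name != key) {
            if (spec[next].required) return false;
            ++next;
        }
        if (next == spec.size()) return false;  // unknown, repeated, or misplaced
        ++next;
    }
    // Anything left over must be optional.
    for (; next < spec.size(); ++next) {
        if (spec[next].required) return false;
    }
    return true;
}
```

Keeping the rules in one table means a new section only needs a new `SectionSpec` entry rather than changes to the checking logic.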

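One idea I have is to store the keywords in different containers depending on whether or not they are optional (or perhaps use an object of type std::map<std::string,bool>), and then remove them from the container once they have been processed correctly, but I am not sure exactly how I should go about it.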
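
A minimal sketch of the std::map<std::string,bool> idea, where the bool means "required" and an entry is erased once its keyword has been processed. The class name `KeywordChecklist` and its member functions are hypothetical and only show one way the bookkeeping could look.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Tracks which sub-keywords of a section are still outstanding.
class KeywordChecklist {
public:
    // spec maps keyword -> true if required, false if optional.
    explicit KeywordChecklist(std::map<std::string, bool> spec)
        : pending_(std::move(spec)) {}

    // Call after a keyword has been processed successfully. Returns
    // false if the keyword is unknown or was already processed.
    bool markProcessed(const std::string& key) {
        return pending_.erase(key) == 1;
    }

    // Required keywords that were never processed (in sorted order,
    // because std::map iterates by key).
    std::vector<std::string> missingRequired() const {
        std::vector<std::string> missing;
        for (const auto& entry : pending_) {
            if (entry.second) missing.push_back(entry.first);
        }
        return missing;
    }

private:
    std::map<std::string, bool> pending_;
};
```

At the end of a section, whatever `missingRequired()` returns can be reported in the log, while optional keywords still in the map fall back to their defaults.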

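I guess there are really a thousand different ways to answer these questions, but I would be grateful if someone more experienced could share some ideas on how to proceed. Is there, for example, a "standard" way of doing such things? Of course, a lot of the details will also depend on the concrete application, but I think the general format indicated here can be used in many different applications without much modification, if it is programmed well!

UPDATE

OK, let me try to be more concrete. My current application is supposed to be a reservoir simulator, so as part of the input I need information about the grid/mesh, about rock and fluid properties, about wells and boundary conditions throughout the simulation, and so on. At the moment I have been thinking about using (almost) the same setup as the commercial Eclipse simulator when it comes to input; for details see http://petrofaq.org/wiki/Eclipse_Input_Data.

However, I will probably change things a bit, so nothing is set in stone. Also, I am interested in making a more general "KeywordReader" class that, with slight modifications, can be adapted for use in other applications as well, at least if that can be done in a reasonable amount of time.

As an example, I can post the current code that does the initial scan of the text file and locates the positions of the main keywords. As I said, I do not really like my solution very much, but it seems to work for what it needs to do.

At the top of the .cpp file I have the following namespace: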

//Keywords used for reading input:
namespace KEYWORDS{

    /*
    * Main keywords and corresponding boolean values to signify whether or not they are required as input.
    */
    enum MKEY{RUNSPEC = 0, GRID = 1, EDIT = 2, PROPS = 3, REGIONS = 4, SOLUTION = 5, SUMMARY =6, SCHEDULE = 7};
    std::string mainKeywords[] = {std::string("RUNSPEC"), std::string("GRID"), std::string("EDIT"), std::string("PROPS"),
        std::string("REGIONS"), std::string("SOLUTION"), std::string("SUMMARY"), std::string("SCHEDULE")};
    bool required[] = {true,true,false,true,false,true,false,true};
    const int n_key = 8;

}//end KEYWORDS namespace

Then further down I have the following function. I am not sure how understandable it is, though..

bool InputReader::scanForMainKeywords(){

    logfile << "Opening file.." << std::endl;

    std::ifstream infile(filename);

    //Test if file was opened. If not, write error message:
    if(!infile.is_open()){
        logfile << "ERROR: Could not open file! Unable to proceed!" << std::endl;
        std::cout << "ERROR: Could not open file! Unable to proceed!" << std::endl;
        return false;
    }

    else{

        logfile << "Scanning for main keywords..." << std::endl;

        int nkey = KEYWORDS::n_key;

        //Initially no keywords have been found:
        startIndex = std::vector<int>(nkey, -1);
        stopIndex = std::vector<int>(nkey, -1);

        //Variable used to control that the keywords are written in the correct order:
        int foundIndex = -1;

        //STATISTICS:
        int lineCount = 0;//number of non-comment lines in text file
        int commentCount = 0;//number of commented lines in text file
        int emptyCount = 0;//number of empty lines in text file

        //Create lines vector:
        lines = std::vector<std::string>();

        //Remove comments and empty lines from text file and store the result in the variable lines:
        std::string str;
        while(std::getline(infile,str)){
            if(str.size()>=1 && str.at(0)=='*'){
                commentCount++;
            }
            else if(str.size()>=2 && str.at(0)=='-' && str.at(1)=='-'){
                commentCount++;
            }
            else if(str.size()==0){
                emptyCount++;
            }
            else{
                //Found a non-empty, non-comment line.
                lines.push_back(str);//store in std::vector
                //Start by checking if the first word of the line is one of the main keywords. If so, store the location of the keyword:
                std::string fw = IO::getFirstWord(str);

                for(int i=0;i<nkey;i++){
                    if(fw.compare(KEYWORDS::mainKeywords[i])==0){
                        if(i > foundIndex){
                            //Found a valid keyword!
                            foundIndex = i;
                            startIndex[i] = lineCount;//store where the keyword was found!
                            //logfile << "Keyword " << fw << " found at line " << lineCount << " in lines array!" << std::endl;
                            //std::cout << "Keyword " << fw << " found at line " << lineCount << " in lines array!" << std::endl;
                            break;//fw cannot equal several different keywords at the same time!
                        }
                        else{
                            //we have found a keyword, but in the wrong order... Terminate program:
                            std::cout << "ERROR: Keywords have been entered in the wrong order or been repeated! Cannot continue initialisation!" << std::endl;
                            logfile << "ERROR: Keywords have been entered in the wrong order or been repeated! Cannot continue initialisation!" << std::endl;
                            return false;
                        }
                    }
                }//end for loop

                lineCount++;
            }//end else (found non-comment, non-empty line)
        }//end while (reading ifstream)

        logfile <<  "\n";
        logfile << "FILE STATISTICS:" << std::endl;
        logfile << "Number of commented lines: " << commentCount << std::endl;
        logfile << "Number of non-commented lines: " << lineCount << std::endl;
        logfile << "Number of empty lines: " << emptyCount << std::endl;
        logfile << "\n";


        /*
        Print lines vector to screen:
        for(int i=0;i<lines.size();i++){
            std:: cout << "Line nr. " << i << " : " << lines[i] << std::endl;
        }*/

        /*
        * So far, no keywords have been entered in the wrong order, but have all the necessary ones been found?
        * Otherwise return false.
        */

        for(int i=0;i<nkey;i++){
            if(KEYWORDS::required[i] && startIndex[i] == -1){
                logfile << "ERROR: Incorrect input of required keywords! At least " << KEYWORDS::mainKeywords[i] << " is missing!" << std::endl;
                logfile << "Cannot proceed with initialisation!" << std::endl;
                std::cout << "ERROR: Incorrect input of required keywords! At least " << KEYWORDS::mainKeywords[i] << " is missing!" << std::endl;
                std::cout << "Cannot proceed with initialisation!" << std::endl;
                return false;
            }
        }

        //If everything is in order, we also initialise the stopIndex array correctly:

        int counter = 0;

        //Find first existing keyword:
        while(counter < nkey && startIndex[counter] == -1){
            //Keyword doesn't exist. Leave stopindex at -1!
            counter++;
        }

        //Store stop index of each keyword:
        while(counter<nkey){

            int offset = 1;

            //Find next existing keyword:
            while(counter+offset < nkey && startIndex[counter+offset] == -1){
                offset++;
            }


            if(counter+offset < nkey){
                stopIndex[counter] = startIndex[counter+offset]-1;
            }
            else{
                //reached the end of array!
                stopIndex[counter] = lines.size()-1;
            }

            counter += offset;
        }//end while

        /*
        //Print out start/stop-index arrays to screen:
        for(int i=0;i<nkey;i++){
            std::cout << "Start index of " << KEYWORDS::mainKeywords[i] << " is : " << startIndex[i] << std::endl;
            std::cout << "Stop index of " << KEYWORDS::mainKeywords[i] << " is : " << stopIndex[i] << std::endl;
        }
        */

        return true;

    }//end else (file opened properly)
}//end scanForMainKeywords()

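【Question comments】:

I'm not sure I followed everything. How you store the keywords depends on whether you need to add more keywords later and on whether the number of keywords is large. If it is small and not growing, an enum should be fine. Otherwise, a vector is OK. Personally, if the input has the form I expect, I would check each line directly while parsing it. I don't think you really need to store the lines unless you want to reuse them afterwards. Also, you should post some relevant code so that we can understand your problem better (giving just a prototype doesn't help).
Your description is rather long and laborious. Despite skimming through it, I'm still not sure what you are trying to do. What are these "main sections or keywords" you want to parse? Can you provide a small sample file? Why do you need to "do a lot of index work"?
Yes, I realise that what I meant may not have been 100% clear, so I've added some extra information plus sample code. Of course, that makes my post even longer and more laborious! ;)

Tags: c++ string file-io io ifstream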

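【Solution 1】:

You say your purpose is to read initialisation data from a text file. It seems you need to parse (syntax-analyse) this file and store the data under the right keys.

If the syntax is fixed and every construct starts with a keyword, you could write a recursive descent (LL(1)) parser that builds a tree (each node holding an STL vector of sub-branches) to store your data.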
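
A rough sketch of that idea, not a definitive implementation: the grammar here is an assumption, namely that every line either starts with a keyword belonging to a known nesting depth or is a data line belonging to the most recent keyword. `Node`, `SectionParser` and the `levels` table are hypothetical names. The parser looks at one line ahead and recurses one level per nesting depth; a `std::vector<Node>` member inside `Node` relies on C++17 allowing vectors of incomplete types.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Tree node: a keyword, its raw data lines, and its sub-branches.
struct Node {
    std::string keyword;
    std::vector<std::string> data;
    std::vector<Node> children;
};

class SectionParser {
public:
    // levels[d] holds the keywords allowed at nesting depth d.
    SectionParser(std::vector<std::string> lines,
                  std::vector<std::set<std::string>> levels)
        : lines_(std::move(lines)), levels_(std::move(levels)) {}

    Node parse() {
        Node root;
        root.keyword = "ROOT";
        parseBody(root, 0);
        return root;
    }

private:
    static std::string firstWord(const std::string& line) {
        std::istringstream iss(line);
        std::string word;
        iss >> word;
        return word;
    }

    // Depth whose keyword set contains the first word of `line`,
    // or levels_.size() if the line is a plain data line.
    std::size_t depthOf(const std::string& line) const {
        const std::string word = firstWord(line);
        for (std::size_t d = 0; d < levels_.size(); ++d) {
            if (levels_[d].count(word) != 0) return d;
        }
        return levels_.size();
    }

    // Parses everything that belongs under `parent`, whose children
    // are keywords of depth `depth`.
    void parseBody(Node& parent, std::size_t depth) {
        while (pos_ < lines_.size()) {
            const std::string& line = lines_[pos_];
            const std::size_t d = depthOf(line);  // one-line lookahead
            if (d == levels_.size()) {            // data line
                parent.data.push_back(line);
                ++pos_;
            } else if (d == depth) {              // child keyword
                Node child;
                child.keyword = firstWord(line);
                ++pos_;
                parseBody(child, depth + 1);
                parent.children.push_back(std::move(child));
            } else if (d < depth) {
                return;  // keyword belongs to an enclosing level
            } else {
                throw std::runtime_error("keyword '" + firstWord(line) +
                                         "' appears outside its parent section");
            }
        }
    }

    std::vector<std::string> lines_;
    std::vector<std::set<std::string>> levels_;
    std::size_t pos_ = 0;
};
```

The input is expected to have comments and empty lines stripped already. Ordering and required-keyword checks could then run on the finished tree instead of on line indices.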

If you are free to choose the syntax, you could pick JSON or XML and use an existing parsing library.

【Comments】: