How to count HTML tags in C++? Answers

【Question title】: How to count HTML tags in C++?
【Posted】: 2020-10-06 03:45:06
【Question description】:

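I'm new to C++, and for a class I've been tasked with writing a parser for HTML files in C++. The program takes a file name as input and outputs the contents of that file along with the number of lines, characters, tags, links, and comments, plus the percentage of characters that are inside tags.

I've finished most of the program and am stuck on just one part: how to count the number of tags in the HTML file. Below is what I have so far. My problem is specifically with lines 106-109, the part that begins with "if(fileChar == TAG)".

Other questions on this topic either went unanswered or use libraries that I'm not allowed to use.

Since this is for a class, I'm looking for an approach that uses no libraries other than the ones listed in the headers. Any help would be greatly appreciated, because I'm currently banging my head against the wall :)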

#include <iostream>
#include <fstream>
#include <string>

using namespace std;

int main ()
{
        const char TAG = '<', //marks the beginning of a tag
        LINK = 'a',     //marks the beginning of a link
        COMMENT = '!';  //marks the beginning of a comment

        char fileChar;  //individual characters from the file
        int charNum=0, //total characters in the file
        tagNum=0,       //total tags in the file
        linkNum=0,      //total links in the file
        commentNum=0,   //total comments in the file
        tagChars=0, //number of chars in tags
        lineNum=0, //number of lines in file
        charPercent=0; //percent of chars in tags


        int count = 0; //for counting

        string fileName; //name of file

        ifstream inFile;

//take in user input

cout << "========================================" << endl;
cout << "   Welcome to the HTML File Analyzer!" << endl;
cout << "========================================" << endl << endl;

  cout << "Please enter a valid file name (with no spaces): " << endl;
  cin >> fileName;

  inFile.open(fileName.c_str());         //opens the file

if(inFile)                //tests if file is open
 cout << "file IS open" << endl;
else
 cout << "file NOT open" << endl;

  while (!inFile) //error checking to ensure file exists
{
    inFile.clear(); //clear false file
    cout << endl << "Re-enter a valid filename: " << endl;
    cin >> fileName;
    inFile.open (fileName.c_str());
}

//display contents of file

cout << "========================================" << endl;
cout << "         Contents of the File           " << endl;
cout << "========================================" << endl << endl;

std:string line;

while(inFile)   //print out contents of the file
{
        getline(inFile, line);
        cout << line <<  endl;
        lineNum++; //add to line counter

        const int size=line.length();
        charNum = charNum + size;
        cout << "The total number of characters entered is: " << charNum << endl;

}

inFile.open(fileName.c_str()); //reopen file

while(inFile)
{
        if (fileChar == TAG)
        {
        tagNum++;
        }

}

cout << "========================================" << endl;
cout << "        End of Contents of File         " << endl;
cout << "========================================" << endl << endl;

inFile.open(fileName.c_str());

while(inFile) //count chars
{
         charNum = charNum + 1;
}
cout << "========================================" << endl;
cout << "            Content Analysis            " << endl;
cout << "========================================" << endl << endl;

cout << "Number of Lines: " << lineNum << endl;
cout << "Number of Tags: " << tagNum << endl;
cout << "Number of Comments: " << commentNum << endl;
cout << "Number of Links: " << linkNum << endl;
cout << "Number of Chars in File: " << charNum << endl;
cout << "Number of Chars in Tags: " << tagChars << endl;
cout << "Percent of Chars in Tags: " << charPercent << endl;
inFile.close ();

return (0);

}

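【Comments】:

Note that you cannot open a file stream that is already open. If you try, the stream is put into a failed state. In this case you can leave the file open, clear the EOF flag, and seek back to the beginning.
Also, don't write the whole program at once. Write a little, compile and test, then move on to the next bit. If you write it all in one go, you are likely not only to repeat some mistakes, but also to have the bugs pile up on one another. Finding and fixing two bugs at once is much harder than fixing one bug and then fixing another a few minutes later.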
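The rewind described in the first comment above can be sketched as follows. This is a minimal illustration, not the asker's code: `rewind_stream`, `count_lines`, and `count_chars` are hypothetical helper names, and the functions take a `std::istream&` so that a `std::istringstream` can stand in for the file while testing.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: make an already-open stream readable from the start again.
void rewind_stream(std::istream& in) {
    in.clear();                  // reset eofbit/failbit left by the previous pass
    in.seekg(0, std::ios::beg);  // move the read position back to the first byte
}

// Hypothetical helper: count the lines remaining in the stream.
int count_lines(std::istream& in) {
    int lines = 0;
    std::string line;
    while (std::getline(in, line)) ++lines;  // loop ends when a read fails
    return lines;
}

// Hypothetical helper: count the characters remaining in the stream.
int count_chars(std::istream& in) {
    int chars = 0;
    char c;
    while (in.get(c)) ++chars;  // get() does not skip whitespace
    return chars;
}
```

With the asker's `ifstream inFile`, calling `rewind_stream(inFile)` between passes would replace the second and third `inFile.open(...)` calls, which fail because the stream is still open.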
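Can you be more precise about the subset of HTML you are supposed to parse? Do you have to support comments inside tags? (<a  >) Do you have to support arbitrary whitespace in tags? (< a href = "some text" >) Do you have to support omitted closing tags (<a><b>foo</a>) or self-closing tags? (<hr />)
Either way, the answer will involve a state machine (in a tag, in a comment, in an attribute name (with its own sub-states), inside a tag, at the top level outside any tag) that transitions between states based on the input characters. I believe I told you that on your previous question, which you deleted.
Of course, if all you are supposed to do is count the number of opening tags, you only need to read until you see <, then check that the next character is not ! (comment) or / (closing tag). You have already shown that you know how to read characters from a file, so what is the actual problem?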
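The state-machine idea from the comments above might look roughly like the sketch below. It is deliberately small and rests on assumptions that are not in the thread: the input is well formed, attribute values contain no '>', and only "<!--" starts a comment. `Counts` and `count_with_states` are hypothetical names.

```cpp
#include <istream>
#include <sstream>

// Hypothetical result type for this sketch.
struct Counts {
    int openTags = 0;
    int comments = 0;
};

// Walks the input one character at a time and switches state on each character.
Counts count_with_states(std::istream& in) {
    enum class State { Text, AfterLt, AfterBang, AfterBangDash, InTag, InComment };
    State state = State::Text;
    Counts counts;
    int dashes = 0;  // consecutive '-' characters seen inside a comment
    char c;
    while (in.get(c)) {
        switch (state) {
        case State::Text:
            if (c == '<') state = State::AfterLt;
            break;
        case State::AfterLt:
            if (c == '!') {
                state = State::AfterBang;   // comment, DOCTYPE or CDATA
            } else if (c == '/') {
                state = State::InTag;       // closing tag: skipped, not counted
            } else {
                ++counts.openTags;          // opening tag
                state = State::InTag;
            }
            break;
        case State::AfterBang:
            state = (c == '-') ? State::AfterBangDash : State::InTag;
            break;
        case State::AfterBangDash:
            if (c == '-') { state = State::InComment; dashes = 0; }
            else state = State::InTag;
            break;
        case State::InTag:
            if (c == '>') state = State::Text;
            break;
        case State::InComment:
            if (c == '-') {
                ++dashes;
            } else if (c == '>' && dashes >= 2) {
                ++counts.comments;          // saw "-->"
                state = State::Text;
            } else {
                dashes = 0;
            }
            break;
        }
    }
    return counts;
}
```

Because every decision is keyed on the current state, markup that appears inside a comment (such as a `<b>` within `<!-- ... -->`) is never mistaken for a tag.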

Tags: html c++ parsing

【Solution 1】:

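Assuming you are dealing with valid HTML5, when you see a < character outside a comment, we can distinguish five cases:

Either it is the start of a comment, followed by !--, or
it is the start of a DOCTYPE, followed by !DOCTYPE, or
it is the start of a CDATA section, followed by ![CDATA[, or
it is a closing tag, followed by /, or
it is an opening tag, followed by a tag name.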
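The five cases above can be expressed as a small classification function. This is only a sketch: `Markup` and `classify_after_lt` are hypothetical names, the function assumes the text following the '<' is already available as a string, and it treats DOCTYPE case-insensitively.

```cpp
#include <cctype>
#include <string>

enum class Markup { Comment, Doctype, CData, ClosingTag, OpeningTag, Unknown };

// Hypothetical helper: `rest` is the text immediately following a '<'.
Markup classify_after_lt(const std::string& rest) {
    auto starts_with = [&rest](const std::string& prefix) {
        return rest.compare(0, prefix.size(), prefix) == 0;
    };
    if (starts_with("!--")) return Markup::Comment;
    if (starts_with("![CDATA[")) return Markup::CData;

    // Upper-case the first eight characters so "!doctype" also matches.
    std::string head = rest.substr(0, 8);
    for (char& ch : head) {
        ch = static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));
    }
    if (head == "!DOCTYPE") return Markup::Doctype;

    if (!rest.empty() && rest[0] == '/') return Markup::ClosingTag;
    if (!rest.empty() && std::isalpha(static_cast<unsigned char>(rest[0]))) {
        return Markup::OpeningTag;
    }
    return Markup::Unknown;  // e.g. a stray '<' in malformed input
}
```

The `Unknown` case is an addition for malformed input; with valid HTML5, as the answer assumes, it should not occur.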

// Requires #include <cctype> for std::isalnum.
// get() is used instead of >> because >> skips whitespace, which would
// merge a tag name with its first attribute (e.g. "<a href" would read as "ahref").
while (inFile.get(fileChar)) {
  if (fileChar != TAG) continue; // We are only interested in potential tag or comment starts.

  if (!inFile.get(fileChar)) break; // '<' was the last character in the file.
  if (fileChar == '!') {
    char after1, after2;
    if (!inFile.get(after1) || !inFile.get(after2)) break;
    if (after1 == '-' && after2 == '-') {
       // This is the start of a comment.
       // We start eating chars until we see '-->' pass by.
       std::string history = "  ";
       while (inFile.get(fileChar)) {
         if (history == "--" && fileChar == '>') {
            // End of comment, stop this inner loop.
            commentNum++;
            break;
         }

         // Shift history and copy current character to recent history.
         history[0] = history[1];
         history[1] = fileChar;
       }
    }
  } else if (fileChar == '/') {
     // This is a closing tag. Do nothing.
  } else {
     // This is the start of a tag. fileChar already holds the first character
     // of the name; keep reading until the first non-letter, non-digit.
     std::string tagName(1, fileChar);
     while (inFile.get(fileChar) && std::isalnum(static_cast<unsigned char>(fileChar))) {
       tagName.append(1, fileChar);
     }
     tagNum++;
     if (tagName == "a") linkNum++;
  }
}

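Note that this is a very naive implementation that implements only part of the spec. If you feed it malformed HTML, it may break. It definitely does not handle CDATA blocks (it treats their contents as HTML rather than as unparsed character data). I'm not sure what you mean by "percentage of characters in tags", but that is probably something you can track in the last else branch.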
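The question does not define "percentage of characters in tags" precisely. One possible reading, assumed here, is that every character from a '<' through the next '>' (inclusive) counts as a tag character, comments included. `TagShare`, `measure_tag_share`, and `percent_in_tags` are hypothetical names for this sketch. Note that the asker declares `charPercent` as an `int`, so the percentage would be truncated; a `double` is used below.

```cpp
#include <istream>
#include <sstream>

// Hypothetical result type: raw counts from one pass over the input.
struct TagShare {
    int totalChars = 0;
    int tagChars = 0;
};

// Assumption: a "tag character" is anything from '<' through the next '>', inclusive.
TagShare measure_tag_share(std::istream& in) {
    TagShare share;
    bool inTag = false;
    char c;
    while (in.get(c)) {
        ++share.totalChars;
        if (c == '<') inTag = true;     // the '<' itself is counted below
        if (inTag) ++share.tagChars;
        if (c == '>') inTag = false;    // the '>' was counted above
    }
    return share;
}

// Floating-point percentage; returns 0 for empty input to avoid dividing by zero.
double percent_in_tags(const TagShare& share) {
    if (share.totalChars == 0) return 0.0;
    return 100.0 * share.tagChars / share.totalChars;
}
```

For the input `<p>hi</p>`, seven of the nine characters fall inside tags under this definition, which gives roughly 77.8 percent.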

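Finally, note that I wrote it as one single block. You are of course encouraged to break it up into smaller functions (e.g. read_comment or read_tag_name) to improve readability.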
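As a rough illustration of that refactoring, the two helpers named above could look like this. The signatures are assumptions, not something the answer specifies: `read_comment` is meant to be called right after "<!--" has been consumed, and `read_tag_name` right after the '<' of an opening tag.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Consumes input up to and including the closing "-->".
// Returns true if the terminator was found before the end of input.
bool read_comment(std::istream& in) {
    std::string history = "  ";  // the two most recently read characters
    char c;
    while (in.get(c)) {
        if (history == "--" && c == '>') return true;
        history[0] = history[1];
        history[1] = c;
    }
    return false;  // unterminated comment
}

// Reads letters and digits as the tag name and stops before the first
// other character, which is left in the stream for the caller.
std::string read_tag_name(std::istream& in) {
    std::string name;
    // peek() returns EOF at end of input, for which isalnum yields 0.
    while (std::isalnum(in.peek())) {
        name += static_cast<char>(in.get());
    }
    return name;
}
```

Using `peek()` in `read_tag_name` leaves the delimiter (a space, '>' or '/') unread, so the caller can decide how to skip the rest of the tag.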

【Comments】:

This solution should be perfect. Thank you so much for taking the time to help. You're invited to my wedding.