C++ getline - Extracting a substring using regex

【Question title】: C++ getline - Extracting a substring using regex
【Posted】: 2023-04-06 07:42:01
【Question】:

I have a file with contents like this:

Random text
+-------------------+------+-------+-----------+-------+
|     Data          |   A  |   B   |     C     |   D   |
+-------------------+------+-------+-----------+-------+
|   Data 1          | 1403 |     0 |      2520 | 55.67 |
|   Data 2          | 1365 |     2 |      2520 | 54.17 |
|   Data 3          |    1 |     3 |      1234 | 43.12 |
Some more random text

I want to extract the value in column D of the row Data 1, i.e. the value 55.67 in the example above. I am parsing this file line by line using getline:

while(getline(inputFile1,line)) {
    if(line.find("|  Data 1") != string::npos) {
        // subString = ... extract the desired value here
    }
}

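How can I extract the desired substring from this line? Is there a way to extract it using boost::regex?

【Comments】:

I would just filter on |. The value you want comes after the fifth | in the row.
First of all, .find() works if you search for just "Data 1". You don't have to put in all the spaces.
Once you have the row, you already know how far the last '|' is before 55.67, so just use .substr(position of that '|'); it runs from there to the end of the line. After that, simply throw away all the spaces! (Look up substr in the docs.)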
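
The comments above suggest skipping regex and working from the | separators. A minimal sketch of that idea follows, assuming every data row starts with | and has the label in the first cell. The helper names trimCopy and cellOf are hypothetical, not something from the thread.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: return a copy of s without leading/trailing blanks.
std::string trimCopy(const std::string& s) {
    const std::size_t first = s.find_first_not_of(" \t\r");
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(" \t\r");
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: if `line` is a table row whose first cell equals
// `label`, store the trimmed text of cell `index` in `out` and return true.
// Cell 0 is the label, so column D of the question's table is cell 4.
bool cellOf(const std::string& line, const std::string& label,
            std::size_t index, std::string& out) {
    if (line.empty() || line.front() != '|') return false;

    std::vector<std::string> cells;
    std::istringstream row(line.substr(1));  // skip the leading '|'
    for (std::string cell; std::getline(row, cell, '|');)
        cells.push_back(trimCopy(cell));

    if (index >= cells.size() || cells.front() != label) return false;
    out = cells[index];
    return true;
}
```

Inside the getline loop from the question, a call such as cellOf(line, "Data 1", 4, subString) leaves 55.67 in subString for the Data 1 row; std::stod can then convert it to a double if a number is needed.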

Tags: c++ regex boost getline

【Solution 1】:

While regex has its uses, it is probably overkill here.

Bring in a trim function and:

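char delim;              // receives each '|' separator
std::string line, data;  // current line and its label column
int a, b, c;             // columns A, B and C
double d;                // column D

while(std::getline(inputFile1, line)) {
    std::istringstream is(line);
    // Read the leading '|', the label up to the next '|', then the four
    // numeric columns. Lines that are not data rows fail here and are skipped.
    if( std::getline(is >> delim, data, '|') >>
        a >> delim >> b >> delim >> c >> delim >> d >> delim)
    {
        trim(data);  // strip the padding around the label (trim is not defined here)

        if(data == "Data 1") {
            std::cout << a << ' ' << b << ' ' << c << ' ' << d << '\n';
        }
    }
}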

Demo
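
The snippet above relies on a trim function that it does not define. A self-contained sketch is given below, assuming a simple in-place trim that strips spaces and tabs. Both this trim and the wrapper findColumnD are illustrative assumptions, not part of the original answer.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Assumed trim helper: strips leading and trailing spaces/tabs in place.
void trim(std::string& s) {
    const std::size_t first = s.find_first_not_of(" \t");
    if (first == std::string::npos) { s.clear(); return; }
    const std::size_t last = s.find_last_not_of(" \t");
    s = s.substr(first, last - first + 1);
}

// Hypothetical wrapper around the answer's loop: scans `in` for the row
// labelled `label` and stores its column D value in `d`.
bool findColumnD(std::istream& in, const std::string& label, double& d) {
    for (std::string line; std::getline(in, line);) {
        std::istringstream is(line);
        char delim;
        std::string data;
        int a, b, c;
        double value;
        // Same extraction chain as in the answer; non-data lines fail here.
        if (std::getline(is >> delim, data, '|') >>
            a >> delim >> b >> delim >> c >> delim >> value >> delim) {
            trim(data);
            if (data == label) {
                d = value;
                return true;
            }
        }
    }
    return false;
}
```

Because the function takes a std::istream&, it can be fed either a std::ifstream opened on the real file or a std::istringstream holding test data.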

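【Comments】:

Thanks. This is a really simple solution. Regex would have made it quite complicated.

【Solution 2】:

Yes, extracting your substring with a regex is easy. There is no need for Boost; you can use the regex library that is already part of standard C++ (the <regex> header).

The resulting program is very simple.

We read all lines of the source file in a simple for loop. We then use std::regex_match to match the line just read against our regex. If there is a match, the result is in the std::smatch sm, in group 1.

And because we design the regex to find a floating-point value, we get exactly what we need, without any surrounding whitespace.

We can convert it to a double and show the result on the screen. And because the regex was defined to find a double, we can be sure that std::stod will work.

The resulting program is fairly simple and easy to understand: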

#include <iostream>
#include <string>
#include <sstream>
#include <regex>

// Please note. For std::getline, it does not matter, if we read from a
// std::istringstream or a std::ifstream. Both are std::istream's. And because
// we do not have files here on SO, we will use an istringstream as data source.
// If you want to read from a file later, simply create an std::ifstream inputFile1

// Source File with all data
std::istringstream inputFile1{ R"(
Random text
+-------------------+------+-------+-----------+-------+
|     Data          |   A  |   B   |     C     |   D   |
+-------------------+------+-------+-----------+-------+
|   Data 1          | 1403 |     0 |      2520 | 55.67 |
|   Data 2          | 1365 |     2 |      2520 | 54.17 |
|   Data 3          |    1 |     3 |      1234 | 43.12 |
Some more random text)" 
};

// Regex for finding the desired data
const std::regex re(R"(\|\s+Data 1\s+\|.*?\|.*?\|.*?\|\s*([-+]?[0-9]*\.?[0-9]+)\s*\|)");

int main() {

    // The result will be in here
    std::smatch sm;

    // Read all lines of the source file
    for (std::string line{}; std::getline(inputFile1, line);) {

        // If we found our matching string
        if (std::regex_match(line, sm, re)) {

            // Then extract the column D info
            double data1D = std::stod(sm[1]);

            // And show it to the user.
            std::cout << data1D << "\n";
        }
    }
}

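For most people, the tricky part is how to define the regex. There are pages like Online regex tester and debugger that help with this. They also give a breakdown of the regex with an understandable explanation.

For our regex

\|\s+Data 1\s+\|.*?\|.*?\|.*?\|\s*([-+]?[0-9]*\.?[0-9]+)\s*\|

we get the following explanation: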

\|
    matches the character | literally (case sensitive)
\s+
    matches any whitespace character (equal to [\r\n\t\f\v ])
    + Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
Data 1
    matches the characters Data 1 literally (case sensitive)
\s+
    matches any whitespace character (equal to [\r\n\t\f\v ])
    + Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\|
    matches the character | literally (case sensitive)
.*?
    matches any character (except for line terminators)
    *? Quantifier: matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
\|
    matches the character | literally (case sensitive)
.*?
    matches any character (except for line terminators)
    *? Quantifier: matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
\|
    matches the character | literally (case sensitive)
.*?
    matches any character (except for line terminators)
    *? Quantifier: matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
\|
    matches the character | literally (case sensitive)
\s*
    matches any whitespace character (equal to [\r\n\t\f\v ])
    * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)

1st Capturing Group ([-+]?[0-9]*\.?[0-9]+)
    [-+]? matches an optional sign
    [0-9]* matches zero or more digits (greedy)
    \.? matches an optional literal .
    [0-9]+ matches one or more digits (greedy)

\s*
    matches any whitespace character (equal to [\r\n\t\f\v ])
    * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\|
    matches the character | literally (case sensitive)

By the way, a safer regex (one that matches more strictly) would be:

\|\s+Data 1\s+\|\s*?\d+\s*?\|\s*?\d+\s*?\|\s*?\d+\s*?\|\s*([-+]?[0-9]*\.?[0-9]+)\s*\|
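
To illustrate what the stricter pattern buys, here is a small sketch that wraps it in a function. The name matchData1D is hypothetical; the regex is the one shown above, used with std::regex_match as in the program. Unlike the first pattern, where .*? accepts anything between the separators, this one rejects a Data 1 row whose A, B or C cell is not an unsigned integer.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true and stores column D in `d` if `line`
// is a "Data 1" row whose A, B and C cells are unsigned integers.
bool matchData1D(const std::string& line, double& d) {
    static const std::regex re(
        R"(\|\s+Data 1\s+\|\s*?\d+\s*?\|\s*?\d+\s*?\|\s*?\d+\s*?\|\s*([-+]?[0-9]*\.?[0-9]+)\s*\|)");

    std::smatch sm;
    if (!std::regex_match(line, sm, re)) return false;  // whole line must match

    d = std::stod(sm[1].str());  // group 1 holds the column D text
    return true;
}
```

Note that std::regex_match requires the entire line to match, so a line with trailing whitespace or a stray carriage return after the final | would be rejected; std::regex_search could be used instead if that is a concern.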

【Comments】: