Including delimiters in output tokens with std::sregex_token_iterator in C++ (no Boost)

【Question title】: Including delimiters in output tokens with std::sregex_token_iterator C++ (no Boost)
【Posted】: 2016-08-25 15:34:15
【Question】:

I am trying to tokenize a scripting language in C++ and am currently struggling to include more of the delimiters as tokens.

#ifndef TOKENIZER_H
#define TOKENIZER_H

#include <regex>
#include <vector>
#include <string>
#include <iostream>
#include <fstream>
#include <cctype>
using namespace std;

regex re("[\\s]+");

vector<string> deconstructDelimit(const string &input) {
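    // An all-whitespace input has nothing to split; substr(npos) would throw std::out_of_range.
    const auto first = input.find_first_not_of(" \t\f\v\n\r");
    if (first == string::npos) return {};
    string trimmed = input.substr(first);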

    vector<string> decons;
    sregex_token_iterator it(trimmed.begin(), trimmed.end(), re, -1);
    sregex_token_iterator reg_end;
    for (; it != reg_end; ++it) {
        decons.push_back(it->str());
    }
    return decons;
}

vector<string> tokenize(const string &input) {
    vector<string> whitespace;

    string currToken;
    for (auto it = input.begin(); it != input.end(); ++it) {
        if (*it == '\'') {
            if (currToken.length()) {
                vector<string> decons = deconstructDelimit(currToken);
                whitespace.insert(whitespace.end(), decons.begin(), decons.end());
                currToken.clear();
            }

            whitespace.push_back("\'");
            ++it;

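            // Check for the end of input before dereferencing the iterator.
            while (it != input.end() && *it != '\'') {
                currToken += *it;
                ++it;
            }

            if (currToken.length()) whitespace.push_back(currToken);
            whitespace.push_back("\'");
            currToken.clear();
            // Unterminated literal: stop so the for loop does not increment past end().
            if (it == input.end()) break;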
        } else if (*it == '\"') {
            if (currToken.length()) {
                vector<string> decons = deconstructDelimit(currToken);
                whitespace.insert(whitespace.end(), decons.begin(), decons.end());
                currToken.clear();
            }

            whitespace.push_back("\"");
            ++it;

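            // Check for the end of input before dereferencing the iterator.
            while (it != input.end() && *it != '\"') {
                currToken += *it;
                ++it;
            }

            if (currToken.length()) whitespace.push_back(currToken);
            whitespace.push_back("\"");
            currToken.clear();
            // Unterminated literal: stop so the for loop does not increment past end().
            if (it == input.end()) break;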
        } else {
            currToken += *it;
        }
    }

    if (currToken.length()) {
        vector<string> decons = deconstructDelimit(currToken);
        whitespace.insert(whitespace.end(), decons.begin(), decons.end());
    }

    return whitespace;
}


#endif

So far, it can convert this code:

i = 1
while(i <= 10) {
    print i + " " then i++
}

into these tokens:

i
=
1
while(i
<=
10)
{
print
i
+
"

"
then
i++
}

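However, I then want to split this vector of strings by other delimiters, such as operators (++, =,

Edit:

For example, further tokenization would give: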

i
=
1
while(i   ->   while, (, i
<=
10)   ->   10, )
{
print
i
+
"

"
then
i++   ->   i, ++
}

which, once expanded, would be:

i
=
1
while
(
i
<=
10
)
{
print
i
+
"

"
then
i
++
}

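【Comments】:

Not sure regex is the way to go here. You may need a real parser. You could look at boost::spirit::qi
The problem with string splitting is that you need a way to decide where to split. You would probably have to walk through the contents of each string, detect the operators you mentioned, and use their positions to cut the string into smaller pieces.
@JoelTrauger That was my fear. Making this lexing step O(n * m) looks like a terrible hit to time complexity, where n is the number of tokens after whitespace splitting and m is the number of symbols to split on
Normal string splitting runs in O(n) time, where n is the length of the string. Since every character of the string is compared against the delimiters anyway, you could overload the split function, or extend it, so that it also breaks correctly on your operators. Just a thought.
I find it odd that you are rewriting strtok(), the string tokenization function that ships with C (you can pull it into C++ through its header). Take a look at this xstrtok function, a modification of the original, and see how it could help with your string splitting. I have never tokenized a whole file before, so it may need adjusting.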
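
The single-pass idea from the comments above can be sketched as follows. This is only an illustration, not code from the thread: splitOperators and its operator list are hypothetical, and keywords such as while or then come out as ordinary words that a later step would have to classify.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits one whitespace-free chunk such as "while(i" or "i++"
// into words and operator tokens in a single left-to-right pass.
std::vector<std::string> splitOperators(const std::string& chunk) {
    // Assumed operator set; two-character operators come first so "<=" wins over "<".
    static const std::vector<std::string> ops = {
        "++", "--", "<=", ">=", "==", "!=",
        "+", "-", "*", "/", "=", "<", ">", "(", ")", "{", "}"
    };

    std::vector<std::string> tokens;
    std::string word;  // characters collected since the last operator
    std::size_t i = 0;
    while (i < chunk.size()) {
        bool matched = false;
        for (const std::string& op : ops) {
            // compare() returns 0 only if the text at position i is exactly op.
            if (chunk.compare(i, op.size(), op) == 0) {
                if (!word.empty()) { tokens.push_back(word); word.clear(); }
                tokens.push_back(op);
                i += op.size();
                matched = true;
                break;
            }
        }
        if (!matched) word += chunk[i++];
    }
    if (!word.empty()) tokens.push_back(word);
    return tokens;
}
```

Each character is checked against a fixed operator list, so the cost grows with the length of the chunk times the (small, constant) number of operators. Running every string returned by tokenize through this helper would turn "while(i" into while, (, i.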

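Tags: c++ token delimiter interpreter lexer

【Solution 1】:

I ran into exactly the same problem when I tried to split the terms of a mathematical expression with a regex. I managed to find an approach that works: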

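#include <regex>
#include <string>
#include <vector>

// Splits s on the regex rg_str and returns both the pieces between delimiters and the
// delimiters themselves (whitespace-only pieces are dropped).
std::vector<std::string> resplit(const std::string& s, std::string rg_str = "\\s+"){
    std::cmatch cm;
    // Group 1 lazily captures the text before a delimiter, group 2 the delimiter.
    // The regex is compiled once here instead of on every loop iteration.
    const std::regex reg(std::string("(.*?)(")+rg_str+std::string(")"));
    // The appended space terminates the last token; this assumes rg_str matches a space.
    std::string str = s+std::string(" ");
    std::size_t a = 0;
    std::size_t b = 1;
    std::string subs = str.substr(a, b-a);
    std::vector<std::string> elements;
    while(b <= str.length()){
        // Grow the window [a, b) until it is exactly "text + delimiter".
        subs = str.substr(a, b-a);
        if(std::regex_match(subs.c_str(), cm, reg, std::regex_constants::match_default)){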
            for(unsigned i=1; i<cm.size(); i++){
                std::string cmi(cm[i]);

                // The following part can be adapted if you want to include whitespaces or empty strings
                if(!std::regex_match(cmi.c_str(), std::regex("\\s*"))){
                    elements.push_back(std::string(cm[i]));
                }
            }
            a = b;
            b = b+1;
        } else {
            b++;
        }
    }
    return elements;
}

When I call it as resplit("sin(x^2) + 1", "[^0-9a-zPI.]|[ \\(\\)]");, I get: ["sin", "(", "x", "^", "2", ")", "+", "1"].

Don't forget to change:

 if(!std::regex_match(cmi.c_str(), std::regex("\\s*"))){
     elements.push_back(std::string(cm[i]));
 }

to:

 if(!std::regex_match(cmi.c_str(), std::regex(""))){
     elements.push_back(std::string(cm[i]));
 }

if you want to keep whitespace tokens (this still drops empty strings, which is preferable). I hope this is useful to someone. Have a nice day.
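
For comparison, the same splitting can be sketched with std::sregex_iterator, which walks the input once instead of re-matching a growing window. This is a swapped-in technique, not the answer's method; resplitLinear is a hypothetical name, and the sketch should give the same tokens for the example above without needing the appended trailing space.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical alternative to resplit: one regex search pass over the input.
std::vector<std::string> resplitLinear(const std::string& s, const std::string& rg_str = "\\s+") {
    const std::regex delim(rg_str);
    const std::regex blank("\\s*");
    std::vector<std::string> elements;

    // Keep a piece unless it is empty or whitespace-only.
    auto keep = [&](const std::string& piece) {
        if (!std::regex_match(piece, blank)) elements.push_back(piece);
    };

    auto last = s.begin();  // end of the previous delimiter match
    for (std::sregex_iterator it(s.begin(), s.end(), delim), end; it != end; ++it) {
        keep(std::string(last, (*it)[0].first));  // text before the delimiter
        keep(it->str());                          // the delimiter itself
        last = (*it)[0].second;
    }
    keep(std::string(last, s.end()));  // text after the final delimiter
    return elements;
}
```

Replacing the blank pattern with an empty regex keeps whitespace tokens, mirroring the change described above.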

【Comments】:

【Solution 2】:

I ran into the same problem. Here is my complete solution, which includes a few helper functions:

#include <regex>
#include <string>
#include <vector>
#include <iostream>
#include <algorithm>
#include <cctype>

void ltrim(std::string& str) {
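    // std::isspace requires a value representable as unsigned char, hence the parameter type.
    str.erase(str.begin(), std::find_if(str.begin(), str.end(), [](unsigned char character) {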
        return !std::isspace(character);
    }));
}

void rtrim(std::string& str) {
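    // std::isspace requires a value representable as unsigned char, hence the parameter type.
    str.erase(std::find_if(str.rbegin(), str.rend(), [](unsigned char character) {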
        return !std::isspace(character);
    }).base(), str.end());
}

void trim(std::string& str) {
    ltrim(str);
    rtrim(str);
}

bool is_empty(std::string const& str) {
    return str.empty() || str.find_first_not_of(' ') == std::string::npos;
}

std::vector<std::string> split(std::string const& str, std::string const& pattern) {
    std::regex regex(pattern);

    std::vector<std::string> result(
        std::sregex_token_iterator(str.begin(), str.end(), regex, {-1, 0}),
        std::sregex_token_iterator()
    );

    for (auto& token : result) {
        trim(token);
    }

    result.erase(
        std::remove_if(
            result.begin(),
            result.end(),
            [](std::string const& str) { return is_empty(str); }
        ),
        result.end()
    );

    return result;
}

int main() {
    for (auto &s: split("sin(x^2) + 1", "[^0-9a-zPI.]|[ \\(\\)]")) {
        std::cout << s << '\n';
    }

    return 0;
}

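The key piece is std::sregex_token_iterator. As the last argument to its constructor I pass {-1, 0}, where -1 selects the parts that did not match (the text between delimiters) and 0 selects the whole match (the delimiter itself).

The output of the code snippet above is: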

sin
(
x
^
2
)
+
1
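
Applied to the question's script, the same {-1, 0} trick could look like the sketch below. The function name splitKeepingDelimiters and the operator set in the pattern are assumptions made for illustration; neither appears in the thread.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits a chunk on operators and brackets, keeping the delimiters.
std::vector<std::string> splitKeepingDelimiters(const std::string& chunk) {
    // Assumed operator set. ECMAScript alternation takes the first alternative that
    // matches, so multi-character operators are listed before their one-character prefixes.
    static const std::regex ops(R"(\+\+|--|<=|>=|==|!=|[-+*/=<>(){}])");

    std::vector<std::string> tokens;
    // -1 selects the text between matches, 0 selects the matched delimiter itself.
    std::sregex_token_iterator it(chunk.begin(), chunk.end(), ops, {-1, 0});
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        // Adjacent delimiters yield empty "between" pieces; skip those.
        if (it->length() > 0) tokens.push_back(it->str());
    }
    return tokens;
}
```

Calling this on each string produced by the question's tokenize function (except quoted literals, which should be passed through unchanged) would split "i++" into i and ++, and "10)" into 10 and ).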

【Comments】: