C++: split a string using a list of words as separators (answers)

【Question title】: C++ split string using a list of words as separators
【Posted】: 2014-07-24 00:32:24
【Question description】:

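I want to split a string like this

"this1245is@g$0,therhsuidthing345"

using the list of words below

{"this", "is", "the", "thing"}

into this list

{"this", "1245", "is", "@g$0,", "the", "rhsuid", "thing", "345"}
// ^--------------^--------------^----------------^-- these were the delimiters

A delimiter may occur more than once in the string to be split, and this could be done with regular expressions.

Priority follows the order in which the delimiters appear in the array.

The platform I am developing for does not support the Boost libraries.

Update

This is what I have so far: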

#include <iostream>
#include <string>
#include <regex>

int main ()
{
    std::string s ("this1245is@g$0,therhsuidthing345");
    std::string delimiters[] = {"this", "is", "the", "thing"};

    for (int i=0; i<4; i++) {
        std::string delimiter =  "(" + delimiters[i] + ")(.*)";
        std::regex e (delimiter);   // matches words beginning by the i-th delimiter

        // default constructor = end-of-sequence:
        std::sregex_token_iterator rend;

        std::cout << "1st and 2nd submatches:";
        int submatches[] = { 1, 2 };
        std::sregex_token_iterator c ( s.begin(), s.end(), e, submatches );
        while (c!=rend) std::cout << " [" << *c++ << "]";
        std::cout << std::endl;
    }

    return 0;
}

Output:

1st and 2nd submatches: [this] [1245is@g$0,therhsuidthing345]
1st and 2nd submatches: [is] [1245is@g$0,therhsuidthing345]
1st and 2nd submatches: [the] [rhsuidthing345]
1st and 2nd submatches: [thing] [345]

I think I need something recursive that is called on each iteration.

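【Discussion】:

What have you tried? What went wrong? Where are you stuck? What did you search for, and why did it not work for you?
Do you split on the first word found, on the longest run of letters that forms a word, or something else?
@ThomasMatthews I don't quite follow, but could you look at the updated question? It may have what you are asking for.
Matching "is" takes precedence over matching "this". What behavior do you actually want?
@jxh Yes, priority is the order in which the delimiters appear in the delimiter array.

Tags: c++ regex string std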

【Solution 1】:

Build the expression (re) for the matches only, then pass {-1, 0} to your std::sregex_token_iterator to get back all non-matches (-1) and matches (0).

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string s("this1245is@g$0,therhsuidthing345");
    std::regex re("(this|is|the|thing)");

    // -1 selects the text between matches, 0 selects the match itself.
    std::sregex_token_iterator iter(s.begin(), s.end(), re, { -1, 0 });
    std::sregex_token_iterator end;

    while (iter != end) {
        // Works in vc13, clang requires you increment separately,
        // haven't gone into implementation to see if/how ssub_match is affected.
        // Workaround: increment separately.
        // std::cout << "[" << *iter++ << "] ";
        std::cout << "[" << *iter << "] ";
        ++iter;
    }
}
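
If the word list is only known at run time, the same {-1, 0} technique can be applied to a pattern assembled from the list. The following is a minimal sketch, not part of the original answer: escape_regex, build_pattern and split_on_words are hypothetical helper names, and the escaping step assumes the default ECMAScript grammar of std::regex.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: escape regex metacharacters so a word is matched literally.
std::string escape_regex(const std::string& word) {
    static const std::regex special(R"([.^$|()\[\]{}*+?\\])");
    // "$&" is the matched character; the leading backslash is emitted literally.
    return std::regex_replace(word, special, R"(\$&)");
}

// Hypothetical helper: join the words into "(w1|w2|...)", keeping array order.
std::string build_pattern(const std::vector<std::string>& words) {
    std::string pattern = "(";
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (i != 0) pattern += '|';
        pattern += escape_regex(words[i]);
    }
    pattern += ')';
    return pattern;
}

// Split s into alternating non-matches (-1) and matches (0).
// Assumes words is non-empty and contains no empty strings.
std::vector<std::string> split_on_words(const std::string& s,
                                        const std::vector<std::string>& words) {
    const std::regex re(build_pattern(words));
    std::sregex_token_iterator iter(s.begin(), s.end(), re, {-1, 0});
    const std::sregex_token_iterator end;
    std::vector<std::string> tokens;
    for (; iter != end; ++iter) {
        tokens.push_back(iter->str());
    }
    return tokens;
}
```

Called with the string and the four words from the question, split_on_words returns the empty leading token followed by "this", "1245", "is", "@g$0,", "the", "rhsuid", "thing", "345".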

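【Discussion】:

This is the output I get with your code: [] [is] [1245] [the] [@g$0,] [thing] [rhsuid] [] [345], but that is not the correct output. Has the order of the tokens changed?
It should return an iterator that alternates between non-matches and matches (the order in which you pass the submatches {-1, 0} matters). Splitting your string on the words, in non-match then match order, should give [] [this] [1245] [is] [@g$0,] [the] [rhsuid] [thing] [345].
Is there a way to remove the empty tokens? Those []
If you only care about the match and non-match pairs, you can always advance the iterator once and iterate until the end, e.g.: std::vector<std::string> tokens(std::next(iter), end); (std::sregex_token_iterator is a forward iterator, so iter+1 does not compile).
That's neat. I used to do this with the old C-style regex, where you could work with string widths, e.g. the width passed to snprintf can be dynamic. This is much better. Unfortunately this feature is not implemented in GCC 4.7.2 (or am I missing something?)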
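
Skipping only the first token removes the leading [] but not an empty token produced when two delimiters are adjacent (for example "thisis"). A small sketch of filtering every empty token instead is given here; split_skip_empty is a hypothetical helper name, not something from the answer.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: tokenise with {-1, 0} and keep only non-empty tokens.
std::vector<std::string> split_skip_empty(const std::string& s, const std::regex& re) {
    std::vector<std::string> tokens;
    std::sregex_token_iterator iter(s.begin(), s.end(), re, {-1, 0});
    const std::sregex_token_iterator end;
    for (; iter != end; ++iter) {
        // A zero-length token is an empty prefix or the gap between adjacent delimiters.
        if (iter->length() != 0) {
            tokens.push_back(iter->str());
        }
    }
    return tokens;
}
```

With the regex (this|is|the|thing) and the string from the question, this yields the eight tokens of the requested list with no empty entries.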

【Solution 2】:

I don't know how to enforce the priority requirement. This seems to work for the given input:

#include <regex>
#include <string>
#include <vector>

std::vector<std::string> parse (std::string s)
{
    std::vector<std::string> out;

    // Group 1 is the delimiter; ".*" lets regex_match accept the rest of the string.
    std::regex re("(this|is|the|thing).*");
    std::string word;   // non-delimiter text collected so far

    auto i = s.begin();
    while (i != s.end()) {
        std::match_results<std::string::iterator> m;
        // Does a delimiter start at the current position?
        if (std::regex_match(i, s.end(), m, re)) {
            if (!word.empty()) {
                out.push_back(word);
                word.clear();
            }
            out.push_back(std::string(m[1].first, m[1].second));
            i += out.back().size();
        } else {
            word += *i++;
        }
    }
    if (!word.empty()) {
        out.push_back(word);
    }

    return out;
}
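
One way to make the priority explicit is to drop the regex and test the words in array order at each position. This is a sketch under an assumption: "priority" is read as a tie-break between words that match at the same position, which may not be exactly what the question intends. split_by_priority is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split s on the given words without <regex>.
// At each position the words are tried in array order, so an earlier
// entry wins when several words match at the same position.
std::vector<std::string> split_by_priority(const std::string& s,
                                           const std::vector<std::string>& words) {
    std::vector<std::string> out;
    std::string pending;                 // text collected between delimiters
    std::size_t pos = 0;
    while (pos < s.size()) {
        const std::string* hit = nullptr;
        for (const std::string& w : words) {
            if (!w.empty() && s.compare(pos, w.size(), w) == 0) {
                hit = &w;
                break;                   // first matching word in the list wins
            }
        }
        if (hit != nullptr) {
            if (!pending.empty()) {
                out.push_back(pending);
                pending.clear();
            }
            out.push_back(*hit);
            pos += hit->size();
        } else {
            pending += s[pos++];
        }
    }
    if (!pending.empty()) {
        out.push_back(pending);
    }
    return out;
}
```

For the question's input and word list this returns "this", "1245", "is", "@g$0,", "the", "rhsuid", "thing", "345". With the words {"the", "there"} the input "there" splits into "the" and "re", while {"there", "the"} keeps "there" whole.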

【Discussion】:

It works like a charm; "this" takes precedence over "is" because it comes first in the regex
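
A note on the priority claim, offered as a sketch rather than a definitive statement: under the default ECMAScript grammar of std::regex, the leftmost start position wins first, and the order of alternatives only decides between words that match at the same position. The helper name first_match below is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: return the first substring of s matched by pattern,
// or an empty string when nothing matches.
std::string first_match(const std::string& s, const std::string& pattern) {
    const std::regex re(pattern);   // ECMAScript grammar by default
    std::smatch m;
    return std::regex_search(s, m, re) ? m.str(0) : std::string();
}
```

first_match("this", "(is|this)") gives "this", because "this" starts at position 0 and "is" only at position 2, whatever the order of alternatives. The order matters for overlapping prefixes: first_match("there", "(the|there)") gives "the", while first_match("there", "(there|the)") gives "there".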

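【Solution 3】:

#include <boost/algorithm/string.hpp>  // boost::split, boost::is_space

std::vector<std::string> strs;
boost::split(strs, line, boost::is_space());  // line is the std::string to split; splits on whitespace only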

【Discussion】:

Boost cannot be used. Sorry I did not mention that constraint in the original post.