Tokenize a sentence into words, considering special characters: answers

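【Question title】: Tokenize sentence into words, considering special characters
【Posted】: 2015-08-29 19:20:39
【Question】:

I have a function that takes a sentence and splits it into words on the space character " ". Now I want to improve it so that it also strips certain special characters, for example: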

I am a boy.   => {I, am, a, boy}, no period after "boy"
I said :"are you ok?"  => {I, said, are, you, ok}, no question and quotation mark

Here is the original function. How can I improve it?

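#include <string>
#include <vector>

using std::string;
using std::vector;

void Tokenize(const string& str, vector<string>& tokens, const string& delimiters = " ")
{
    // Skip any delimiters at the start of the string
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find the first delimiter after the first token
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos)
    {
        // Found a token: add it to the vector
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip the delimiters that follow it
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find the end of the next token
        pos = str.find_first_of(delimiters, lastPos);
    }
}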

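【Comments】:

Copy str into str2, removing the special characters as you go. Then do everything you were doing on str2 instead.
If you want to improve the function, you first need to define what you consider better. Then I suggest you write tests for the cases that already work and for the cases that do not work the way you want. Then try to improve the function, and if you have a specific question about that, ask it here. As it stands, it looks like you are just looking for someone to write it for you.
For all string manipulation I would suggest Boost.Spirit, to get away from manual operations and index calculations... but it is something of a sledgehammer and hard to learn. [link] boost-spirit.com/home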
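
The first comment's suggestion (copy the string while dropping the special characters, then tokenize the copy) could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name `TokenizeWithoutPunctuation` is made up here, and it assumes that "special characters" means whatever `std::ispunct` reports as punctuation in the current locale.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): copy `str` without
// punctuation, then split the copy on whitespace.
std::vector<std::string> TokenizeWithoutPunctuation(const std::string& str)
{
    // Step 1: build the cleaned copy ("str2" in the comment above).
    std::string cleaned;
    cleaned.reserve(str.size());
    std::remove_copy_if(str.begin(), str.end(), std::back_inserter(cleaned),
                        [](unsigned char c) { return std::ispunct(c) != 0; });

    // Step 2: split the cleaned copy on whitespace.
    std::vector<std::string> tokens;
    std::istringstream stream(cleaned);
    std::string word;
    while (stream >> word)
        tokens.push_back(word);
    return tokens;
}
```

Calling `TokenizeWithoutPunctuation("I am a boy.")` should give the four tokens I, am, a, boy. One caveat of deleting characters outright: input such as "a,b" collapses into the single token "ab". Replacing each punctuation character with a space instead would keep such words apart.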

Tags: c++ string special-characters tokenize

【Solution 1】:

You can use std::regex. With it you can search for whatever pattern you want and put the matches into a vector. It is quite simple.

See:

#include <algorithm>
#include <iostream>
#include <iterator>   // std::ostream_iterator
#include <regex>
#include <string>
#include <vector>

// Test data as a raw string literal, so the embedded quotes need no escaping
std::string testData(R"#(I said :"are you ok?")#");

// Matches one word; note that the optional ",?" keeps a trailing comma attached to the word
std::regex re(R"#((\b\w+\b,?))#");

int main()
{
    // Fill the vector through its range constructor: the token iterator yields capture group 1 of every match
    std::vector<std::string> id{ std::sregex_token_iterator(testData.begin(), testData.end(), re, 1), std::sregex_token_iterator() };

    // Debug output: print the whole vector to std::cout, separated by spaces
    std::copy(id.begin(), id.end(), std::ostream_iterator<std::string>(std::cout, " "));

    return 0;
}
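
If the regex approach is to replace the original function, it could be wrapped along these lines. This is a sketch under assumptions, not part of the answer: the name `TokenizeWords` is hypothetical, and the pattern is simplified to `\w+`, which drops the optional comma from the answer's pattern so that no punctuation ends up in a token.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical wrapper (the name is an assumption, not from the answer):
// collects every run of word characters [A-Za-z0-9_] found in `str`.
std::vector<std::string> TokenizeWords(const std::string& str)
{
    // Compiled once; building a std::regex is comparatively expensive.
    static const std::regex word(R"(\w+)");

    std::vector<std::string> tokens;
    // A default-constructed sregex_iterator serves as the end marker.
    for (std::sregex_iterator it(str.begin(), str.end(), word), end; it != end; ++it)
        tokens.push_back(it->str());
    return tokens;
}
```

For the two sentences in the question this should yield {I, am, a, boy} and {I, said, are, you, ok}. Since an apostrophe is not a word character, a contraction such as "don't" is split into "don" and "t"; whether that is acceptable depends on the text being processed.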

【Comments】: