Splitting a string with multiple delimiters, allowing quoted values

【Question title】: Splitting string with multiple delimiters, allowing quoted values
【Posted】: 2017-09-25 20:13:58
【Question】:

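The docs for boost::escaped_list_separator explain the second parameter c as follows:

Any character in the string c is considered to be a separator.

So, I need to split a string on multiple delimiters, allowing quoted values that may themselves contain those delimiters: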

#include <iostream>
#include <string>

#include <boost/tokenizer.hpp>

int main() {
    std::wstring str = L"2   , 14   33  50   \"AAA BBB\"";

    std::wstring escSep(L"\\"); //escape character
    std::wstring delim(L" \t\r\n,"); //split on spaces, tabs, new lines, commas
    std::wstring quotes(L"\""); //allow double-quoted values with delimiters within

    boost::escaped_list_separator<wchar_t> separator(escSep, delim, quotes);
    boost::tokenizer<boost::escaped_list_separator<wchar_t>, std::wstring::const_iterator, std::wstring> tok(str, separator);

    for(auto beg=tok.begin(); beg!=tok.end();++beg)
        std::wcout << *beg << std::endl;

    return 0;
}

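The expected result is [2; 14; 33; 50; AAA BBB]. However, this code also produces a bunch of empty tokens.

The regular boost::char_separator takes all the delimiters into account and omits all those empty tokens. boost::escaped_list_separator also seems to take all the specified delimiters into account, but it produces empty values. Is it true that multiple consecutive delimiters always yield empty tokens? Is there any way to avoid this?

If that is always the case, i.e. the only unwanted output is empty tokens, it is easy to test the resulting values and omit them manually. But it can get pretty ugly. For example, imagine each string has 2 actual values, possibly with many tabs and spaces separating them. Specifying the delimiters as L"\t " (i.e. space and tab) will then work, but it produces a lot of empty tokens.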

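【Comments】:

What you want is more like parsing than tokenizing. At the very least you need stateful scanning, which makes it unlike splitting. I always use a parser generator approach here. I have many examples of this on this site (see e.g. stackoverflow.com/questions/10289985/…).

Tags: c++ string boost boost-tokenizer

【Solution 1】: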

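Judging from the Boost Tokenizer documentation, you are correct in assuming that multiple consecutive delimiters produce empty tokens when using boost::escaped_list_separator. Unlike boost::char_separator, boost::escaped_list_separator does not provide any constructor that lets you specify whether to keep or discard the empty tokens it generates.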

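While an option to discard empty tokens would be nice, keeping them makes a lot of sense when you consider the use case presented in the documentation (http://www.boost.org/doc/libs/1_64_0/libs/tokenizer/escaped_list_separator.htm): parsing CSV files. An empty field is still a field.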

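One option is to simply discard the empty tokens after tokenizing. If the generation of empty tokens itself is a concern, an alternative is to remove repeated delimiters before passing the string to the tokenizer, but obviously you need to take care not to remove anything inside quotes.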

【Discussion】: