How to use boost split to split a string and ignore empty values?

【Question title】: How to use boost split to split a string and ignore empty values?
【Posted】: 2013-03-19 09:29:52
【Question description】:

I am using boost::split to parse a data file. The data file contains lines like the following.

data.txt

1:1~15  ASTKGPSVFPLAPSS SVFPLAPSS   -12.6   98.3

The whitespace between items is tab characters. The code I use to split the line above is as follows.

std::string buf;
/*Assign the line from the file to buf*/
std::vector<std::string> dataLine;
boost::split( dataLine, buf , boost::is_any_of("\t "), boost::token_compress_on);       //Split data line
cout << dataLine.size() << endl;

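For the line above I expect 5 to be printed, but I get 6. I have read through the documentation, and this approach looks like it should do what I want, so I am clearly missing something. Thanks!

Edit: running the following for loop over dataLine gives the result below.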

cout << "****" << endl;
for(int i = 0 ; i < dataLine.size() ; i ++) cout << dataLine[i] << endl;
cout << "****" << endl;


****
1:1~15
ASTKGPSVFPLAPSS
SVFPLAPSS
-12.6
98.3

****
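
The blank line just before the closing **** is the sixth token: an empty string. A small helper that prints every token with its index and in quotes can make such tokens easier to spot. This is only a sketch; the name formatTokens is hypothetical and is not part of Boost.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical debugging helper (not part of Boost): renders each token as
// [index] "token" on its own line, so an empty token shows up as "".
std::string formatTokens(const std::vector<std::string>& tokens) {
    std::ostringstream out;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        out << '[' << i << "] \"" << tokens[i] << "\"\n";
    }
    return out.str();
}
```

Writing `std::cout << formatTokens(dataLine);` in place of the loop above should then show the last entry as an explicit pair of empty quotes rather than a blank line.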

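【Question discussion】:

What values are stored in dataLine?
I get 5; your buf must contain something else.
Maybe it did not copy correctly into this page, and you copied it into your test code incorrectly. Let me see how I can make sure it is copied correctly.
If your buf has whitespace at the end, I get the same results.
Wouldn't it be enough to just use one of the boost::algorithm::trim variants?

Tags: c++ parsing boost split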

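【Solution 1】:

Even though "adjacent separators are merged together", trailing delimiters still cause a problem: once merged into one, they are still a delimiter, and the empty string after that delimiter becomes a token.

So your problem cannot be solved with split() alone. Fortunately, the Boost String Algo library has trim() and trim_if(), which strip whitespace or delimiters from the beginning and end of a string. So just call trim() on buf, like this: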

std::string buf = "1:1~15  ASTKGPSVFPLAPSS SVFPLAPSS   -12.6   98.3    ";
std::vector<std::string> dataLine;
boost::trim_if(buf, boost::is_any_of("\t ")); // could also use plain boost::trim
boost::split(dataLine, buf, boost::is_any_of("\t "), boost::token_compress_on);
std::cout << dataLine.size() << std::endl; // prints the number of tokens in dataLine

This question has been asked before: boost::split leaves empty tokens at the beginning and end of string - is this desired behaviour?

【Discussion】:

I searched but could not find the question above. Sorry for the repost.

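【Solution 2】:

I suggest using the C++ String Toolkit Library. In my opinion this library is much faster than Boost. I used to use Boost to split (aka tokenize) a line of text, but found this library a better fit for my requirements.

One of the great things about strtk::parse is that it converts the tokens into their final values and checks the number of elements.

You can use it like this: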

std::vector<std::string> tokens;

// multiple delimiters should be treated as one
if( !strtk::parse( dataLine, "\t", tokens ) )
{
    std::cout << "failed" << std::endl;
}

--- another version

std::string token1;
std::string token2;
std::string token3;
float value1;
float value2;

if( !strtk::parse( dataLine, "\t", token1, token2, token3, value1, value2) )
{
     std::cout << "failed" << std::endl;
     // fails if the number of elements is not what you want
}

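Online documentation for the library: String Tokenizer Documentation. Source code link: C++ String Toolkit Library

【Discussion】:

I may look into switching to STL for my needs in the future, but at the moment I have a lot of code that uses boost.
I also have a staggering amount of code using it, and I use boost tokenizer as well. I switched this particular piece of functionality to strtk because of speed. The added ability to convert tokens to numbers on the fly made it a no-brainer for me.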

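【Solution 3】:

boost::split deliberately leaves leading and trailing whitespace alone, because it cannot know whether it is significant. The solution is to call boost::trim before calling boost::split.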

#include <boost/algorithm/string/trim.hpp>

....

boost::trim(buf);

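【Discussion】:

Before calling it? Normally you split and then trim the tokens, right?
@Nick: That depends. In the original question the user is splitting a tab-delimited file, so trimming beforehand is correct.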