Split string into key-value pairs using C++

【Question title】: Split string into key-value pairs using C++
【Posted】: 2016-12-13 06:13:06
【Question】:

I have a string like this:

"CA: ABCD\nCB: ABFG\nCC: AFBV\nCD: 4567"

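Here ": " separates a key from its value, and "\n" separates the pairs. I want to add the key-value pairs to a map in C++.

Is there an efficient way to do this, with optimization in mind?

【Comments】:

Have you read the manual page for std::string?
Using e.g. std::istringstream and std::getline could be a good start. Note that std::getline can be used with an arbitrary delimiter, not just newline.
Also, don't worry about optimization at this stage. First make sure your program works correctly, then benchmark, measure, and profile to find the bottlenecks and optimize those. Premature optimization will only lead you astray.
You can implement it ad hoc first, then profile your solution and find the slow spots to optimize as needed.
On SO there is a strong correlation between being unable to accomplish a task and asking for the (most) efficient way to accomplish it.

Tags: c++ dictionary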

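【Solution 1】:

I have two methods here. The first is the simple, obvious one that I always use (performance is rarely an issue). The second is probably more efficient: in my tests it was about 3 times faster.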

#include <map>
#include <string>
#include <sstream>
#include <iostream>

std::map<std::string, std::string> mappify1(std::string const& s)
{
    std::map<std::string, std::string> m;

    std::string key, val;
    std::istringstream iss(s);

    while(std::getline(std::getline(iss, key, ':') >> std::ws, val))
        m[key] = val;

    return m;
}

std::map<std::string, std::string> mappify2(std::string const& s)
{
    std::map<std::string, std::string> m;

    std::string::size_type key_pos = 0;
    std::string::size_type key_end;
    std::string::size_type val_pos;
    std::string::size_type val_end;

    while((key_end = s.find(':', key_pos)) != std::string::npos)
    {
        if((val_pos = s.find_first_not_of(": ", key_end)) == std::string::npos)
            break;

        val_end = s.find('\n', val_pos);
        m.emplace(s.substr(key_pos, key_end - key_pos), s.substr(val_pos, val_end - val_pos));

        key_pos = val_end;
        if(key_pos != std::string::npos)
            ++key_pos;
    }

    return m;
}

int main()
{
    std::string s = "CA: ABCD\nCB: ABFG\nCC: AFBV\nCD: 4567";

    std::cout << "mappify1: " << '\n';

    auto m = mappify1(s);
    for(auto const& p: m)
        std::cout << '{' << p.first << " => " << p.second << '}' << '\n';

    std::cout << "mappify2: " << '\n';

    m = mappify2(s);
    for(auto const& p: m)
        std::cout << '{' << p.first << " => " << p.second << '}' << '\n';
}

Output:

mappify1: 
{CA => ABCD}
{CB => ABFG}
{CC => AFBV}
{CD => 4567}
mappify2: 
{CA => ABCD}
{CB => ABFG}
{CC => AFBV}
{CD => 4567}

【Comments】:

Thanks for sharing. I think there is a problem with your solution 2: the separator may end up in the value,
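
The comment above is cut off, so the following is an assumption about what it means: mappify2 skips every ':' and space after the key, so a value that itself begins with one of those characters loses it. A minimal sketch of a variant that splits only on the exact two-character sequence ": " (the name mappify3 is hypothetical and not from the original answer):

```cpp
#include <map>
#include <string>

// Variant of mappify2 that treats the sequence ": " as the separator, so a
// ':' or a leading space inside a value is preserved.
std::map<std::string, std::string> mappify3(std::string const& s)
{
    std::map<std::string, std::string> m;
    std::string const sep = ": ";

    std::string::size_type pos = 0;
    while(pos < s.size())
    {
        // a record runs up to the next newline (or the end of the input)
        std::string::size_type line_end = s.find('\n', pos);
        if(line_end == std::string::npos)
            line_end = s.size();

        // only the first ": " inside the record splits key from value
        std::string::size_type const sep_pos = s.find(sep, pos);
        if(sep_pos != std::string::npos && sep_pos < line_end)
        {
            std::string::size_type const val_pos = sep_pos + sep.size();
            m.emplace(s.substr(pos, sep_pos - pos),
                      s.substr(val_pos, line_end - val_pos));
        }
        // records without a separator are silently skipped

        pos = line_end + 1;
    }
    return m;
}
```

Skipping malformed records is a design choice of this sketch; reporting them as errors would be just as reasonable.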

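【Solution 2】:

This format is called "tag-value".

The most performance-critical place in the industry where this kind of encoding is used is probably the financial FIX protocol ('=' as the key-value separator, '\001' as the entry delimiter). So if you are on x86 hardware, your best bet is to search for "SSE4 FIX protocol parser github" and reuse the open-source findings of HFT shops.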
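
To make the FIX-style layout concrete, here is a minimal scalar sketch (no SSE) that splits a tag-value message on configurable separators. The function parse_tag_value is a hypothetical helper, not part of any FIX library, and the sample message in the note below is illustrative rather than a valid FIX message.

```cpp
#include <string>
#include <utility>
#include <vector>

// Splits "tag<sep>value<dlm>tag<sep>value<dlm>..." into (tag, value) pairs,
// keeping the fields in message order.
std::vector<std::pair<std::string, std::string>>
parse_tag_value(std::string const& msg, char sep = '=', char dlm = '\001')
{
    std::vector<std::pair<std::string, std::string>> fields;

    std::string::size_type pos = 0;
    while(pos < msg.size())
    {
        // a field ends at the next delimiter or at the end of the message
        std::string::size_type end = msg.find(dlm, pos);
        if(end == std::string::npos)
            end = msg.size();

        // fields without a separator are skipped
        std::string::size_type const eq = msg.find(sep, pos);
        if(eq != std::string::npos && eq < end)
            fields.emplace_back(msg.substr(pos, eq - pos),
                                msg.substr(eq + 1, end - eq - 1));

        pos = end + 1;
    }
    return fields;
}
```

Calling parse_tag_value("8=FIX.4.2\0019=5\00135=0\001") yields the pairs (8, FIX.4.2), (9, 5), (35, 0). The octal escape \001 is used in the literal because it consumes at most three digits, so the digit that follows it is not absorbed into the escape.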

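If you would still rather delegate the vectorization to the compiler and can spare a few nanoseconds for readability, the most elegant solution is to store the result in a std::string (the data) plus a boost::container::flat_map<boost::string_ref, boost::string_ref> (the view). How you parse is a matter of taste; a while-loop or strtok is the easiest for the compiler to digest. A Boost.Spirit-based parser is the easiest to read for a human (one familiar with Boost.Spirit).

C++ for-loop based solution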

#include <boost/container/flat_map.hpp>
#include <boost/range/iterator_range.hpp>
#include <boost/range/iterator_range_io.hpp>

#include <algorithm>  // std::find
#include <iostream>
#include <stdexcept>  // std::runtime_error
#include <string>

// g++ -std=c++1z ~/aaa.cc
int main()
{
    using range_t = boost::iterator_range<std::string::const_iterator>;
    using map_t = boost::container::flat_map<range_t, range_t>;

    char const sep = ':';
    char const dlm = '\n';

    // this part can be reused for parsing multiple records
    map_t result;
    result.reserve(1024);

    std::string const input {"hello:world\n bye: world"};

    // this part is per-line/per-record
    result.clear();
    for (auto _beg = begin(input), _end = end(input), it = _beg; it != _end;)
    {
        auto sep_it = std::find(it, _end, sep);
        if (sep_it != _end)
        {
            auto dlm_it = std::find(sep_it + 1, _end, dlm);
            result.emplace(range_t {it, sep_it}, range_t {sep_it + 1, dlm_it});
            it = dlm_it + (dlm_it != _end);
        }
        else throw std::runtime_error("cannot parse");
    }

    for (auto& x: result)
        std::cout << x.first << " => " << x.second << '\n';

    return 0;
}

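【Comments】:

Using a parser generator (especially the boost::spirit monstrosity) to parse a tag-value string is definitely overkill...
@MatteoItalia - Absolutely, a while loop would be the most natural way, and that is also how most of the GitHub FIX protocol parsers I suggested looking at do it.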

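【Solution 3】:

The format is simple enough that parsing it "by hand" is IMO the best option, and it remains quite readable overall.

It should also be reasonably efficient (the key and value strings are always the same objects, merely cleared, so reallocations inside the main loop should stop after a few iterations); ret should also qualify for NRVO. OTOH, if that turns out to be a problem, you can always switch to an output parameter.

Of course std::map may not be the fastest gun in the West, but it is what the question asked for.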

std::map<std::string, std::string> parseKV(const std::string &sz) {
    std::map<std::string, std::string> ret;
    std::string key;
    std::string value;
    const char *s=sz.c_str();
    while(*s) {
        // parse the key: copy characters until the ": " separator
        while(*s && !(*s==':' && s[1]==' ')) {
            key.push_back(*s);
            ++s;
        }
        // if we quit due to the end of the string exit now
        if(!*s) break;
        // skip the ": "
        s+=2;
        // parse the value
        while(*s && *s!='\n') {
            value.push_back(*s);
            ++s;
        }
        ret[key]=value;
        key.clear(); value.clear();
        // skip the newline, unless we already reached the end of the string
        if(*s) ++s;
    }
    return ret;
}

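【Comments】:

【Solution 4】:

If performance is a concern, you should probably reconsider whether the end result needs to be a map at all. It can end up as a lot of separate character buffers in memory. Ideally, just keeping track of a char* and a length for each substring would be faster and smaller.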
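
In C++17 the "char* plus length" idea maps naturally onto std::string_view. The sketch below is one possible reading of this answer, not the author's code; parse_views and kv_view are hypothetical names, and it assumes the ": " and "\n" separators from the question.

```cpp
#include <string_view>
#include <utility>
#include <vector>

using kv_view = std::pair<std::string_view, std::string_view>;

// Returns views into `input`; no characters are copied. The caller must keep
// the underlying buffer alive and unmodified while the views are in use.
std::vector<kv_view> parse_views(std::string_view input)
{
    std::vector<kv_view> out;
    constexpr std::string_view sep = ": ";

    while(!input.empty())
    {
        // cut one line off the front of the remaining input
        auto const nl = input.find('\n');
        std::string_view const line = input.substr(0, nl);
        input.remove_prefix(nl == std::string_view::npos ? input.size() : nl + 1);

        // lines without a ": " separator are skipped
        auto const p = line.find(sep);
        if(p != std::string_view::npos)
            out.emplace_back(line.substr(0, p), line.substr(p + sep.size()));
    }
    return out;
}
```

The result is a flat vector in input order with no per-entry allocations; whether that beats a std::map for a given workload is something to measure rather than assume.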

【Comments】:

【Solution 5】:

Here is a solution that uses strtok to do the splitting. Note that strtok modifies your string: it writes '\0' at each split character.

#include <iostream>
#include <string>
#include <map>
#include <string.h>

using namespace std;



int main (int argc, char *argv[])
{
    char s1[] = "CA: ABCD\nCB: ABFG\nCC: AFBV\nCD: 4567";
    map<string, string> mymap;
    char *token;

    token = strtok(s1, "\n");
    while (token != NULL) {
        string s(token);
        size_t pos = s.find(": ");
        // skip lines without a ": " separator; otherwise split around it
        if (pos != string::npos)
            mymap[s.substr(0, pos)] = s.substr(pos + 2);
        token = strtok(NULL, "\n");
    }

    for (auto keyval : mymap) 
        cout << keyval.first << "/" << keyval.second << endl;

    return 0;
}

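【Comments】:

std::map without a custom allocator is the best tool for slowing down your code (memory allocations) and fragmenting the heap along the way.

【Solution 6】:

I doubt you should worry about optimizing the reading of this string and its conversion into a std::map. If you really want to optimize a map with fixed content like this, change it to a std::vector<std::pair<>> and sort it once.

That said, the most elegant way to create a std::map using standard C++ features is the following: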

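#include <cassert>
#include <map>
#include <string>
#include <string_view>
#include <utility>

std::map<std::string, std::string> deserializeKeyValue(const std::string &sz) {
    // std::string_view (C++17) can be constexpr, unlike a local std::string
    constexpr std::string_view ELEMENT_SEPARATOR = ": ";
    constexpr std::string_view LINE_SEPARATOR = "\n";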

    std::map<std::string, std::string> result;
    std::size_t begin{0};
    std::size_t end{0};
    while (begin < sz.size()) {
        // Search key
        end = sz.find(ELEMENT_SEPARATOR, begin);
        assert(end != std::string::npos); // Replace by error handling
        auto key = sz.substr(begin, /*size=*/ end - begin);
        begin = end + ELEMENT_SEPARATOR.size();

        // Search value
        end = sz.find(LINE_SEPARATOR, begin);
        auto value = sz.substr(begin, end == std::string::npos ? std::string::npos : /*size=*/ end - begin);
        begin = (end == std::string::npos) ? sz.size() : end + LINE_SEPARATOR.size();

        // Store key-value
        [[maybe_unused]] auto emplaceResult = result.emplace(std::move(key), std::move(value));
        assert(emplaceResult.second); // Replace by error handling
    }
    return result;
}

While every C++ programmer can understand this code, its performance may not be ideal.
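
The sorted std::vector<std::pair<>> mentioned at the start of this answer can be sketched as follows. This is an illustration under assumptions, not the author's code: kv_vec, sort_by_key, and lookup are hypothetical names, and the vector is assumed to be filled completely before it is sorted.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using kv_vec = std::vector<std::pair<std::string, std::string>>;

// Sort once by key after all pairs have been collected.
void sort_by_key(kv_vec& v)
{
    std::sort(v.begin(), v.end(),
              [](auto const& a, auto const& b) { return a.first < b.first; });
}

// Binary search for `key`; returns nullptr when the key is absent.
// Precondition: v has been sorted with sort_by_key.
std::string const* lookup(kv_vec const& v, std::string const& key)
{
    auto const it = std::lower_bound(
        v.begin(), v.end(), key,
        [](auto const& elem, std::string const& k) { return elem.first < k; });
    return (it != v.end() && it->first == key) ? &it->second : nullptr;
}
```

Unlike std::map, this keeps duplicate keys if the input contains them, and any insertion after sorting requires re-sorting or a positioned insert.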

【Comments】:

【Solution 7】:

A very simple solution using Boost is shown below; it also works with partial tokens (e.g. a key without a value, or empty pairs).

#include <string>
#include <list>
#include <map>
#include <iostream>

#include <boost/foreach.hpp>
#include <boost/algorithm/string.hpp>

using namespace std;
using namespace boost;

int main() {

    string s = "CA: ABCD\nCB: ABFG\nCC: AFBV\nCD: 4567";

    list<string> tokenList;
    split(tokenList,s,is_any_of("\n"),token_compress_on);
    map<string, string> kvMap;

    BOOST_FOREACH(string token, tokenList) {
        size_t sep_pos = token.find(": ");  // the ": " substring, not either single character
        string key = token.substr(0,sep_pos);
        string value = (sep_pos == string::npos ? "" : token.substr(sep_pos+2,string::npos));
        kvMap[key] = value;

        cout << "[" << key << "] => [" << kvMap[key] << "]" << endl;
    }

    return 0;
}

【Comments】:

【Solution 8】:

void splitString(std::map<std::string, std::string> &mymap, const std::string &text, char sep)
{
    std::string::size_type start = 0, end1 = 0, end2 = 0;  // unsigned, so comparisons with npos are exact
    while ((end1 = text.find(sep, start)) != std::string::npos && (end2 = text.find(sep, end1+1)) != std::string::npos) {
        std::string key = text.substr(start, end1 - start);
        std::string val = text.substr(end1 + 1, end2 - end1 - 1);
        mymap.insert(std::pair<std::string,std::string>(key, val));
        start = end2 + 1;
    }
}

For example:

std::string text = "key1;val1;key2;val2;key3;val3;";
std::map<std::string, std::string> mymap;
splitString(mymap, text, ';');

produces a map of size 3: { key1="val1", key2="val2", key3="val3" }

More examples:

"key1;val1;key2;" => {key1="val1"} (there is no second val, so the second key doesn't count)

"key1;val1;key2;val2" => {key1="val1"} (the second val has no trailing separator, so it doesn't count)

"key1;val1;key2;;" => {key1="val1",key2=""} (key2 holds an empty string)

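【Comments】:

【Solution 9】:

I looked at the accepted answer and tried to extend it a little; this seems to work in more general cases. A test run can be found here. Comments and modifications of any kind are welcome.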

#include <iostream>
#include <string>
#include <sstream>
#include <map>
#include <algorithm>
#include <vector>

size_t find(const std::string& line, std::vector<std::string> vect, int pos=0) {
    int eol1;
    eol1 = 0;
    for (std::vector<std::string>::iterator iter = vect.begin(); iter != vect.end(); ++iter) {
        //std::cout << *iter << std::endl;
        int eol2 = line.find(*iter, pos);
        if (eol1 == 0 && eol2 > 0)
            eol1 = eol2;
        else if (eol2 > 0 && eol2 < eol1)
            eol1 = eol2;
    }
    return eol1;
}

std::map<std::string, std::string> mappify(std::string const& s, char delim='=') {
    std::map<std::string, std::string> m;

    std::string::size_type key_pos = 0, i, j;
    std::string::size_type key_end;
    std::string::size_type val_pos;
    std::string::size_type lim_pos;
    std::string::size_type val_end;

    while ((key_end = s.find(delim, key_pos)) != std::string::npos) {
        if ((val_pos = s.find_first_not_of(delim, key_end + 1)) == std::string::npos)break;
        while (key_end > 1 && (s[key_end - 1] <= 32 || s[key_end - 1] == ';'))
            key_end--;
        while (val_pos < s.size() && (s[val_pos] <= 32 || s[val_pos] == ';'))
            val_pos++;
        val_end = s.find('\n', val_pos);
        i = s.find('\"', val_pos);
        if (i != std::string::npos)
            j = s.find('\"', i + 1);
        else
            j = 0;
        lim_pos = find(s.substr(0, i), { " ",";","\t" }, val_pos + 1);
        //std::cout << "s.substr(j):" << s.substr(j)<<std::endl;
        if (lim_pos == 0 && j != std::string::npos)lim_pos = find(s.substr(j), { " ",";","\t" }) + j;
        if (lim_pos < val_pos)lim_pos = val_pos + 1;
        if (j > 0)val_end = j + 1;
        if (val_end > lim_pos)val_end = lim_pos;
        m.emplace(s.substr(key_pos, key_end - key_pos), s.substr(val_pos, val_end - val_pos));
        key_pos = val_end;
        while (key_pos < s.size() && (s[key_pos] <= 32 || s[key_pos] == ';'))
            ++key_pos;
        if (val_end == 0)break;
    }
    return m;
}

int main() {
    std::string s ="\
File=\"c:\\dir\\ocean\\\nCCS_test.txt\"\n\
iEcho=10000; iHrShift=0 rho_Co2 = 1.15d0;\n\
Liner=01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890";
  auto m = mappify(s);
    for (auto const& p : m)
      std::cout << '{' << p.first << " :=> " << p.second << '}' << '\n';

    return 0;
}

【Comments】: