C++ efficient parsing of a Python array string: answers

【Question title】: C++ efficient parsing of a Python array string
【Posted】: 2021-05-08 15:26:24
【Question】:

I have a file created from a series of Python arrays. I load it from an ifstream. The file is text and contains nothing but the arrays. It has the form:

[[1 22 333 ... 9
  2 2 2    ... 2]
 ...    
 [5 6 2 ... 222
  5 5 5 ... 240]]

[[2 3 444 ... 9]
 ...    
 [5 6 2 ... 222
  5 5 5 ... 240]]

[[ etc...

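Each row of each array starts with [ and ends with ], but may be split over several lines in the file (i.e., there are carriage returns or line feeds between the opening and closing []). The array as a whole also starts and ends with square brackets [].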

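The numbers are always integers. Within a given array, every row has the same number of entries (i.e., columns), but that number may differ between arrays. The number of rows in an array is not known in advance and may vary from array to array. The total number of arrays in a file is also unknown before the file is opened.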

The arrays can be stored in any format. For the sake of this example, let's put them in a vector of vectors, i.e.,

typedef vector<vector<int>> myArray;  //Index [row][col]
typedef vector<myArray> myArrays;

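I would like to parse this efficiently (the file may be very large, and there will likely be many files). My boss is very keen on using std::regex for this, and I am happy with that as long as it is efficient.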

So my question is: how can this be parsed efficiently with a regex? And is there a way to parse it more efficiently without a regex?

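【Comments】:

std::regex is usually the slowest way you could do something like this. Please banish the combination of parsing + regex from your mind. Do you have access to the Python source? The simplest approach would be to change the output on the Python side and use a structured format that C++ supports well.
@dtell - Fair enough. That is why I asked. It was my boss's idea, and that is exactly what I wanted to check. The files, and the way they are output from Python, cannot be changed.
I don't really understand the downvotes on this post. How could I write the question better, or what details could I add?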

Tags: python c++ arrays regex

【Answer 1】:

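std::from_chars() is efficient because it parses part of a string in place and reports exactly where parsing stopped, so you can continue immediately without extracting substrings. In addition, the notes in the documentation say: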

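Unlike other parsing functions in C++ and C libraries, std::from_chars is locale-independent, non-allocating, and non-throwing. Only a small subset of parsing policies used by other libraries (such as std::sscanf) is provided. This is intended to allow the fastest possible implementation that is useful in common high-throughput contexts such as text-based interchange (JSON or XML).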
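To illustrate the "parse in place and continue" behaviour described above, here is a minimal sketch, assuming a C++17 compiler. The helper name `parse_ints` is hypothetical and is not part of the answer's code; it only handles space-separated integers and stops at the first token it cannot parse.

```cpp
#include <charconv>
#include <string_view>
#include <system_error>
#include <vector>

// Hypothetical helper (illustration only): collects every space-separated
// integer in `text`, stopping at the first token that is not a valid int.
std::vector<int> parse_ints(std::string_view text)
{
  std::vector<int> values;
  const char *first = text.data();
  const char *const last = first + text.size();
  while (first != last)
  {
    // std::from_chars does not skip leading whitespace, so do it by hand
    if (*first == ' ')
    {
      ++first;
      continue;
    }
    int value = 0;
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc())
    {
      break; // invalid token or value out of range for int
    }
    values.push_back(value);
    first = ptr; // resume exactly where parsing stopped, no substring needed
  }
  return values;
}
```

Calling `parse_ints("1 22 333")` would yield the three values 1, 22 and 333 without allocating any intermediate strings; only the result vector allocates.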

Here is an attempt at parsing your data.

/**
  g++ -std=c++17 -o prog_cpp prog_cpp.cpp \
      -pedantic -Wall -Wextra -Wconversion -Wno-sign-conversion \
      -g -O0 -UNDEBUG -fsanitize=address,undefined
**/

#include <iostream>
#include <sstream>
#include <charconv>
#include <cctype>
#include <string>
#include <vector>
#include <stdexcept>

using MyRow = std::vector<int>;
using MyArray = std::vector<MyRow>;

std::vector<MyArray>
parse_arrays(std::istream &input_stream)
{
  auto arrays=std::vector<MyArray>{};
  auto line=std::string{};
  for(auto depth=0, line_count=1;
      std::getline(input_stream, line);
      ++line_count)
  {
    for(const auto *first=data(line), *last=first+size(line);
        first!=last;)
    {
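      // try first to consume all well known characters
      // (reading *last is fine: std::string data is null-terminated;
      //  the cast avoids undefined behaviour in isspace for negative chars)
      for(auto c=*first; std::isspace(static_cast<unsigned char>(c))||(c=='[')||(c==']'); c=*(++first))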
      {
        switch(c)
        {
          case '[': // opening a row or an array
          {
            switch(++depth)
            {
              case 1:
              {
                arrays.emplace_back(MyArray{});
                break;
              }
              case 2:
              {
                arrays.back().emplace_back(MyRow{});
                break;
              }
              default:
              {
                const auto pfx="line "+std::to_string(line_count);
                throw std::runtime_error{pfx+": too deep"};
              }
            }
            break;
          }
          case ']': // closing a row or an array
          {
            switch(--depth)
            {
              case 0:
              {
                // nothing more to be done
                break;
              }
              case 1:
              {
                const auto &a=arrays.back();
                const auto sz=size(a);
                if((sz>1)&&(size(a[sz-1])!=size(a[sz-2])))
                {
                  const auto pfx="line "+std::to_string(line_count);
                  throw std::runtime_error{pfx+": row length mismatch"};
                }
                break;
              }
              default:
              {
                const auto pfx="line "+std::to_string(line_count);
                throw std::runtime_error{pfx+": ] mismatch"};
              }
            }
            break;
          }
          default: // a separator
          {
            // nothing more to be done
          }
        }
      }
      // the other characters probably represent an integer
      auto value=int{};
      if(auto [p, ec]=std::from_chars(first, last, value); ec==std::errc())
      {
        if(depth!=2)
        {
          const auto pfx="line "+std::to_string(line_count);
          throw std::runtime_error{pfx+": depth mismatch"};
        }
        arrays.back().back().emplace_back(value);
        first=p;
      }
      else
      {
        if(p!=first)
        {
          const auto pfx="line "+std::to_string(line_count);
          throw std::runtime_error{pfx+": integer out of range"};
        }
        else if(first!=last)
        {
          const auto pfx="line "+std::to_string(line_count);
          throw std::runtime_error{pfx+": unexpected char <"+*first+'>'};
        }
      }
    }
  }
  return arrays;
}

int
main()
{
  auto input=std::istringstream{R"(
[[1 22 333  9
  2 2 2     2]
     
 [5 6 2  222
  5 5 5  240]]

[[2 3 444  9]
     
 [5 6 2  222]]
)"};
  const auto arrays=parse_arrays(input);
  for(const auto &a: arrays)
  {
    for(const auto &r: a)
    {
      for(const auto &c: r)
      {
        std::cout << c << ' ';
      }
      std::cout << '\n';
    }
    std::cout << "~~~~~~~~~~~~~~~~\n";
  }
  return 0;
}

/**
1 22 333 9 2 2 2 2 
5 6 2 222 5 5 5 240 
~~~~~~~~~~~~~~~~
2 3 444 9 
5 6 2 222 
~~~~~~~~~~~~~~~~
**/
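Since the question mentions very large files, one possible variation is to parse a whole in-memory buffer in a single pass instead of going line by line. The following is only a sketch under stated assumptions: the name `parse_buffer` and the aliases `Row` and `Array` are hypothetical, the error messages are simplified (no line numbers, no row-length check), and whether this is actually faster than the getline version would have to be measured.

```cpp
#include <charconv>
#include <stdexcept>
#include <string_view>
#include <system_error>
#include <vector>

using Row = std::vector<int>;
using Array = std::vector<Row>;

// Hypothetical single-pass variant (illustration only): parses every array
// found in `text`, which is assumed to hold the complete file contents.
std::vector<Array> parse_buffer(std::string_view text)
{
  std::vector<Array> arrays;
  int depth = 0;
  const char *first = text.data();
  const char *const last = first + text.size();
  while (first != last)
  {
    const char c = *first;
    if (c == '[') // opening an array (depth 1) or a row (depth 2)
    {
      ++depth;
      if (depth == 1) arrays.emplace_back();
      else if (depth == 2) arrays.back().emplace_back();
      else throw std::runtime_error{"too deep"};
      ++first;
    }
    else if (c == ']') // closing a row or an array
    {
      if (--depth < 0) throw std::runtime_error{"] mismatch"};
      ++first;
    }
    else if (c == ' ' || c == '\n' || c == '\r' || c == '\t')
    {
      ++first; // separator
    }
    else // anything else should be an integer inside a row
    {
      int value = 0;
      const auto [ptr, ec] = std::from_chars(first, last, value);
      if (ec != std::errc() || depth != 2)
      {
        throw std::runtime_error{"unexpected input"};
      }
      arrays.back().back().push_back(value);
      first = ptr;
    }
  }
  if (depth != 0) throw std::runtime_error{"unclosed ["};
  return arrays;
}
```

The buffer could be filled from a file in any convenient way, for example by constructing a std::string from a pair of `std::istreambuf_iterator<char>` over an opened `std::ifstream`; for files too large to hold in memory, the line-by-line version above is the safer choice.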

【Comments】: