CPP: Parsing String Stream is too slow (answers)

【Question title】: CPP : Parsing String Stream is too slow
【Posted】: 2015-11-22 07:17:08
【Question】:

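My cpp code needs to read a 7 MB text file consisting of space-separated float values. Parsing the string values into a float array takes about 6 seconds, which is too much for my use case.

I have been looking around online, and people say it is usually the physical IO that takes the time. To rule that out, I read the file into a stringstream in one go and parse the floats from that. The code is still no faster. Any ideas on how to make it run faster?

Here is my code (for simplicity, the array entries are replaced with dummy_f):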

    #include "stdafx.h"
    #include <iostream>
    #include <fstream>
    #include "time.h"
    #include <sstream>
    using namespace std;

    int main()
    {
      ifstream testfile;
      string filename = "test_file.txt";
      testfile.open(filename.c_str());

      stringstream string_stream;
      string_stream << testfile.rdbuf();

      testfile.close();

      clock_t begin = clock();
      float dummy_f;

      cout<<"started stream at time "<<(double) (clock() - begin) /(double) CLOCKS_PER_SEC<<endl;

      for(int t = 0; t < 6375; t++)
      {

           string_stream >> dummy_f;

           for(int t1 = 0; t1 < 120; t1++)
           {
               string_stream >> dummy_f;
           }
      }

      cout<<"finished stream at time "<<(double) (clock() - begin) /(double) CLOCKS_PER_SEC<<endl;

      string_stream.str("");

      return 0;
     }

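Edit:

Here is a link to the test_cases.txt file: https://drive.google.com/file/d/0BzHKbgLzf282N0NBamZ1VW5QeFE/view?usp=sharing

When running with this file, please change the inner loop bound to 128 (that was a typo).

Edit: Found a way to make it work. Declare dummy_f as a string and read whitespace-delimited words from the stringstream into it, then convert each string to a float with atof. This takes 0.4 seconds, which is good enough for me.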

  string dummy_f;
  vector<float> my_vector;
  for(int t = 0; t < 6375; t++)
  {

       string_stream >> dummy_f;
       my_vector.push_back(atof(dummy_f.c_str()));
       for(int t1 = 0; t1 < 128; t1++)
       {
           string_stream >> dummy_f;
            my_vector.push_back(atof(dummy_f.c_str()));
       }
  }
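
For readers who want to try the workaround without the hard-coded loop counts, here is a minimal sketch of the same idea. The function name `read_floats_via_tokens` is made up for illustration and is not part of the original post; it reads tokens until the stream runs out rather than assuming 6375 x 129 values.

```cpp
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): extracts
// whitespace-separated tokens as strings and converts each one with
// std::atof, mirroring the workaround described above.
// Note: std::atof silently yields 0.0 for a token it cannot parse.
std::vector<float> read_floats_via_tokens(std::istream& in) {
    std::vector<float> values;
    std::string token;
    while (in >> token) {  // stops at end of input
        values.push_back(static_cast<float>(std::atof(token.c_str())));
    }
    return values;
}
```

It can be called with the `string_stream` from the question, or with any other `std::istream` such as an `std::ifstream`.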

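【Question comments】:

Do not measure performance in a debug build.
This one is for integers, but you should be able to adapt it to handle floats: stackoverflow.com/questions/26736742/…
@Dieter The release build takes 6 seconds. Debug mode takes about 10 seconds. That is what is puzzling.
Have you timed the individual sections? Which ones take the longest?
Also, note that when you ask a performance-related question, to get meaningful answers you need to be very precise about your setup and configuration, so that people can make a good guess at where the problem lies and whether the numbers make sense; even better, do proper profiling to pin down the bottleneck so people can help you with the actual problem. Otherwise you get a pile of speculative ideas, as you have here: 10+ comments and no answer, because at the moment it is almost impossible to give a good one.

Tags: c++ string performance ifstream istringstream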

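【Solution 1】:

Pasted below is an alternative implementation using atof that runs about 3 times faster. On my laptop, the original stringstream-based version takes 2.3 seconds to finish, while this one finishes in 0.8 seconds for the same number of floats.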

static char filecontents[10*1024*1024];

int testfun2()
{
  ifstream testfile;
  string filename = "test_file.txt";
  testfile.open(filename.c_str());
  int numfloats=0;
  // Leave one byte free so the '\0' terminator written below always fits.
  testfile.read(filecontents, sizeof(filecontents) - 1);
  size_t numBytesRead = testfile.gcount();
  filecontents[numBytesRead]='\0';
  testfile.close();

  clock_t begin = clock();
  float dummy_f;

  cout<<endl<<"started at time "<<(double) (clock() - begin) /(double) CLOCKS_PER_SEC<<endl;

  char* p= filecontents;
  char* pend = p + numBytesRead;
  while(p<pend)
  {
      while(*p && (*p <= ' '))
      {
         ++p; //skip leading white space ,\r, \n
      }
      char* pvar = p;
      while(*p > ' ')
      {
        ++p; //skip over numbers
      }
      if(*p)
      {
        *p = '\0'; // shorter input makes atof faster.
        ++p;
      }
      if(*pvar)
      {
         dummy_f = atof(pvar);
         ++numfloats;
      }
      //cout << endl << dummy_f;
  }

  cout<<endl<< "finished at time "<<(double) (clock() - begin) /(double) CLOCKS_PER_SEC<<endl;

  cout << endl << "numfloats= " << numfloats;
  return numfloats;
 }

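【Comments】:

Tested this against the second implementation (stream to string to float via atof), and this one seems faster. However, the second implementation in the question takes 0.9 to 1.2 seconds on my machine, which is odd.
I have many files (similar to this one, but with different dimensions), and the code will read one of them depending on user input and so on. So I cannot use a character array of predefined size. But that is fine, I can live with this small difference in speed. :)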
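
The fixed-size buffer is not essential to the approach. As a rough sketch, the file can be read into a `std::string` of whatever size it happens to be and scanned in place with `std::strtof`, which skips leading whitespace and reports where each number ended. The names `read_whole_file` and `parse_floats` are invented for this example, and no timing is claimed for it.

```cpp
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: loads the whole file into a string, so no
// compile-time buffer size is needed.
std::string read_whole_file(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::binary);
    std::ostringstream contents;
    contents << in.rdbuf();
    return contents.str();
}

// Hypothetical helper: walks the buffer with std::strtof. Each call skips
// leading whitespace, parses one number, and sets `end` just past it.
// Parsing stops at the end of the text or at the first non-numeric token.
std::vector<float> parse_floats(const std::string& text) {
    std::vector<float> values;
    const char* p = text.c_str();  // null-terminated, so strtof stops safely
    char* end = nullptr;
    for (;;) {
        const float v = std::strtof(p, &end);
        if (end == p) break;       // nothing was converted
        values.push_back(v);
        p = end;
    }
    return values;
}
```

A call such as `parse_floats(read_whole_file("test_file.txt"))` would then work for files of any size that fits in memory.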

【Solution 2】:

On my Linux machine this takes only about 0.29 seconds (see the `time` output below):

hidden$ cat read-float.cpp 
#include <fstream>
#include <iostream>
#include <vector>
using namespace std;

int main() {
  ifstream fs("/tmp/xx.txt");
  vector<float> v;
  for (int i = 0; i < 6375; i++) {
    for (int j = 0; j < 129; j++) {
      float f;
      fs >> f;
      v.emplace_back(f);
    }
  }
  cout << "Read " << v.size() << " floats" << endl;
}
hidden$ g++ -std=c++11 read-float.cpp -O3
hidden$ time ./a.out 
Read 822375 floats

real    0m0.287s
user    0m0.279s
sys     0m0.008s

hidden$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.8/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.4-2ubuntu1~14.04' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libmudflap --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04)

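【Comments】:

That is odd. Using ">>" to read from a stream into a float is the problem, and it apparently only happens on Windows. Reading from the stream into a string and then using atof solved it.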

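【Solution 3】:

Update: the discussion with @Mats in the comments concluded that locking overhead is unlikely to have anything to do with this, so we are back to square one in explaining why Visual C++'s library is so slow at parsing floats. Your sample test file looks like it mostly contains numbers whose magnitude is not far from 1.0, with nothing unusual going on. (According to Agner Fog's tables, Intel's FPUs in Sandybridge and later do not have a performance penalty for denormals anyway.)

As others have said, it is time to profile your code and find out which function is taking all the CPU time. Performance counters can also tell you whether branch mispredictions or cache misses are causing the problem.

Every call to cin >> dummy_f requires taking a lock to make sure another thread is not modifying the input buffer at the same time. If that is where the bottleneck is, reading 4 or 8 floats at once with scanf("%f%f%f%f", &dummy_array[0], &dummy_array[1], ...) would be more efficient. (scanf is not a great API for this either, since it needs the address of every array element as a function argument. Still, unrolling by using multiple conversions in a single scanf call is a small performance win.)

You are trying to work around this with a stringstream, which may or may not work. It is a local variable in the function, so if the compiler can see all the functions and inline them, it does not have to worry about locking: no other thread can have access to this variable.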
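
Short of a real profiler, a first step is to time each section separately. The helper below is only a sketch: `seconds_taken` is a made-up name, and it uses `std::chrono::steady_clock` for wall-clock time instead of the `clock()` calls in the question, since `clock()` reports CPU time on POSIX systems but elapsed wall-clock time in Microsoft's C runtime.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in seconds, measured with a monotonic clock.
template <typename F>
double seconds_taken(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Wrapping the file read and the parsing loop in separate lambdas, for example `seconds_taken([&] { string_stream >> dummy_f; })` around the relevant loop, shows which of the two dominates.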
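
To make the multiple-conversion idea concrete, here is a small sketch using `fscanf` on a `FILE*` instead of `scanf` on stdin. The function name `read_floats_batched` and the batch size of 4 are assumptions for illustration; whether this is actually faster on a given runtime would need to be measured.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical helper: reads floats four at a time with a single fscanf
// call per batch. fscanf returns the number of successful conversions,
// or EOF if input ends before the first one.
std::vector<float> read_floats_batched(std::FILE* f) {
    std::vector<float> values;
    float batch[4];
    for (;;) {
        const int n = std::fscanf(f, "%f%f%f%f",
                                  &batch[0], &batch[1], &batch[2], &batch[3]);
        if (n <= 0) break;                         // EOF or unparsable token
        values.insert(values.end(), batch, batch + n);
        if (n < 4) break;                          // partial final batch
    }
    return values;
}
```

The return value has to be checked on every call, because the last batch of a file will usually contain fewer than four values.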

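【Comments】:

So no locking is needed if you run it on Linux, and no locking is needed for the string reads on Windows either? Or are you just making up something that sounds good without actually reading the comments?
I would say that if it adds 5.8 seconds to the time it takes to read strings, with no other thread contending for the lock, it is a very inefficient lock.
@MatsPetersson: I am basically making things up based on how standard libraries work and are implemented. IDK if this is the right explanation for finding it slow on Windows and fast on Linux. IIRC, C stdio and C++ iostream both need to be thread-safe, so either they have to do locking, or the compiler has to be clever and do whole-program optimization to make sure there cannot be other threads or signal handlers.
BTW, I think this question indicates that C++11 streams are not thread-safe (it is about cout rather than cin, but I cannot imagine cin and cout being that different): stackoverflow.com/questions/6374264/…
I added a std::mutex mtx outside the outer loop and a std::lock_guard<std::mutex> lock(mtx) inside the inner loop. The total run time was 1 millisecond longer than before I added the extra code. That is locking and unlocking 771375 times in one millisecond. I somewhat doubt that Microsoft's locks are 5800 times slower than a Linux std::mutex.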
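
An experiment along the lines of the last comment can be sketched as follows. This is not the commenter's actual code: `time_uncontended_locks` is an invented name, and it isolates the lock/unlock cost in its own loop rather than adding it to the parsing loop. Results will vary by platform, so no timing is claimed here.

```cpp
#include <chrono>
#include <mutex>

// Hypothetical helper: locks and unlocks an uncontended std::mutex
// `iterations` times and returns the elapsed time in milliseconds.
double time_uncontended_locks(long iterations, long& counter) {
    std::mutex mtx;
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> lock(mtx);
        ++counter;  // trivial work so the loop has an observable effect
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling it with 771375 iterations matches the number of reads in the question's original loop (6375 x 121).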