Saving large data in a vector in C++: answers

【Question title】: Saving large data in vector c++
【Posted】: 2012-05-23 18:50:46
【Question description】:

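I have a large amount of data in a file. I need to read it and run some probability calculations on it, so I need to count the occurrences of each word in the whole file and then do further calculations on those counts. The file contains about a million and a half records, and each record has about 6 strings. I used a vector to hold this data, but the program crashes after saving about 8000 records. Is there a way to keep this vector on disk rather than in the program's memory? Alternatively, while searching I came across something called a symbol table, but I don't understand what it means or how to use it.

Is there any solution to this problem?

This is the main file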

#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <istream>

#include "Tuple.h"
#include "VerbPair.h"
using namespace std;

string filename = "verb-argument-tuples.txt";
vector<Tuple> mytuples;
vector<VerbPair> verbpairs;

vector<Tuple> readTupleFile(string filename)
{
    cout << "Started parsing the file of tuples..." << endl;
    vector<Tuple> mt;
    string temp;
    Tuple t;

    ifstream infile;
    infile.open(filename);
    while(!(infile.eof()))
    {
        getline(infile,temp);
        t.parseTuple(temp);
        mt.push_back(t);
    }

    infile.close();
    cout << "Done with reading tuples file..." << endl;
    return mt;
}

vector<VerbPair> getVerbPairs(vector<Tuple> mytuples)
{
    vector<VerbPair> pairs;
    bool flag = false;
    VerbPair temp;
    for(int i=0;i<mytuples.size();i++)
    {
        flag = false;
        for(int h=0;h<pairs.size();h++)
        {
            if (mytuples[i].verb.compare(pairs[h].verb) == 0)
            {
                pairs[h].freq += mytuples[i].count;
                flag =true;
                break;
            }
        }
        if(! flag)
        {
            temp.verb = mytuples[i].verb;
            temp.freq = mytuples[i].count;
            pairs.push_back(temp);
        }
    }
    return pairs;
}

int numOfLines(string filename)
{
    int numLines = 0;
    string j ="";
    ifstream infile;
    infile.open(filename);

    while(!infile.eof())
    {
        getline(infile,j);
        numLines++;
    }
    infile.close();
    return numLines;
}

void train(string filename)
{
    mytuples = readTupleFile(filename);
    verbpairs = getVerbPairs(mytuples);
}
void store(string filename)
{

}
void load(string filename)
{

}

int main()
{
    cout << "Started Application..." << endl;
    train(filename);
    cout << "Size of verb pairs is " << verbpairs.size() << endl;
}

Tuple.h

#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <istream>
using namespace std;

class Tuple
{
public:
    int count;
    string verb;
    string frame;
    vector<string> args;
private:
    int i;
    int h;
    string p;

public:
    void parseTuple(string s)
    {
        cout << "parsing.... " << s << endl;
        i=0;
        h=0;
        p="";
        while(s[i] != 32 && s[i]!= 9) //that means temp[i] is a number
        {
            h = h*10 + (s[i] - '0');
            i++;
        }
        this->count = h;
        i++;

        // loops for everything but not the space and tab
        while(s[i] != 32 && s[i]!= 9)
        {
            p +=s[i];
            i++;
        }
        this->verb = p;
        i++;

        p="";
        while(s[i] != 32 && s[i]!= 9)
        {
            p +=s[i];
            i++;
        }
        this->frame = p;
        i++;

        p="";
        while(i < s.length())
        {
            while(s[i] != 32 && s[i]!= 9 && i < s.length())
            {
                p += s[i];
                i++;
            }
            this->args.push_back(p);
            i++;
            p="";
        }
    }
};

and VerbPair.h

#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <istream>
using namespace std;

class VerbPair
{
public:
    string verb;
    int freq;
};

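【Question comments】:

This doesn't sound like a memory problem. Can you show the code, or maybe the error?
Don't use a vector for this. Use a deque or a list.
@DavidSchwartz: According to the charts Stroustrup showed in his talk (channel9.msdn.com/Events/GoingNative/GoingNative-2012/…), trees and pointer-based lists are evil in this situation. But you're right, any algorithms course will tell you to use a deque or a linked list.
The problem with using a vector is that it has to resize, and it needs contiguous memory. Interleaving resizes of the vector with allocations of the objects stored in it can cause an allocation to fail for lack of contiguous virtual memory. Vectors are not suitable for highly dynamic data structures.
@DavidSchwartz: That's simply not true. I know it's the "textbook" answer, but running experiments will show you that the complexity argument at best requires much larger data and at worst does not hold. If you look at the chart in Stroustrup's paper "Software Development for Infrastructure" (page 51, figure 1), you'll see that nothing beats std::vector for this kind of operation.

Tags: c++ vector symbol-table large-data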

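【Solution 1】:

Your code has a lot of shadowed variables. For example, you declare the filename variable globally and then, three lines later, use it locally as a function parameter. You do the same with the tuples vector and the verb pairs vector.

Perhaps some encapsulation would make your debugging task easier.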
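To make the shadowing point concrete, here is a minimal sketch. The helper name `whichFile` is made up for illustration; only the global `filename` comes from the question. Inside the function, the parameter hides the global of the same name, so it is easy to lose track of which one is being read.

```cpp
#include <cassert>
#include <string>

// Global, as in the question's main file.
std::string filename = "verb-argument-tuples.txt";

// Hypothetical helper: the parameter has the same name as the global
// and hides it for the whole body of the function.
std::string whichFile(std::string filename)
{
    return filename;  // refers to the parameter, not the global
}
```

Calling `whichFile("other.txt")` returns `"other.txt"` and leaves the global untouched. Giving the parameter a different name, or dropping the globals altogether, removes the ambiguity.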

Another style issue is functions like this one:

vector<VerbPair> getVerbPairs(vector<Tuple> mytuples)
{
    vector<VerbPair> pairs;
    bool flag = false;
    VerbPair temp;
    for(int i=0;i<mytuples.size();i++)
    {
        flag = false;
        for(int h=0;h<pairs.size();h++)
        {
            if (mytuples[i].verb.compare(pairs[h].verb) == 0)
            {
                pairs[h].freq += mytuples[i].count;
                flag =true;
                break;
            }
        }
        if(! flag)
        {
            temp.verb = mytuples[i].verb;
            temp.freq = mytuples[i].count;
            pairs.push_back(temp);
        }
    }
    return pairs;
}

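A couple of things make this hard to debug. The first is the shadowing, and the second is that you are not letting the compiler help you. Taking the parameter by const reference fixes the second: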

vector<VerbPair> getVerbPairs(const vector<Tuple>& mytuples)
{
    vector<VerbPair> pairs;
    bool flag = false;
    VerbPair temp;
    for(int i=0;i<mytuples.size();i++)
    {
        flag = false;
        for(int h=0;h<pairs.size();h++)
        {
            if (mytuples[i].verb.compare(pairs[h].verb) == 0)
            {
                pairs[h].freq += mytuples[i].count;
                flag =true;
                break;
            }
        }
        if(! flag)
        {
            temp.verb = mytuples[i].verb;
            temp.freq = mytuples[i].count;
            pairs.push_back(temp);
        }
    }
    return pairs;
}

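This way the compiler will tell you if you try to modify the mytuples vector. It also avoids copying the whole vector on every call.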
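As a smaller illustration of the same idea, the sketch below uses a made-up function `sumCounts` over plain integers. The commented-out line shows the kind of accidental write that the compiler would reject once the parameter is a const reference.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: reads the vector through a const reference,
// so no copy is made and the caller's data cannot be changed here.
int sumCounts(const std::vector<int>& counts)
{
    int total = 0;
    for (std::size_t i = 0; i < counts.size(); ++i)
    {
        total += counts[i];
        // counts[i] = 0;  // would not compile: counts is const
    }
    return total;
}
```

Using `std::size_t` for the index also avoids the signed/unsigned comparison that `int i < counts.size()` produces.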

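【Comments】:

【Solution 2】:

Can you try using the vector's reserve function? Since you probably know in advance that you have a large amount of data, you should call reserve before filling the vector.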
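A minimal sketch of what calling reserve looks like, assuming the number of records is known (or can be estimated) before reading. The helper `makeRecords` and the placeholder strings are illustrative only and do not come from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: builds n placeholder records after reserving
// room for all of them, so push_back never has to reallocate.
std::vector<std::string> makeRecords(std::size_t n)
{
    std::vector<std::string> records;
    records.reserve(n);  // capacity() is now at least n
    for (std::size_t i = 0; i < n; ++i)
        records.push_back("record " + std::to_string(i));
    return records;
}
```

reserve changes only the capacity, not the size, so the vector is still empty until elements are pushed. It saves repeated reallocation and copying, but it does not reduce the total memory the elements need.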

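Also, use a map in this case, since a map lets you count occurrences easily.

As for the crash, you will have to show us the code.

【Comments】:

【Solution 3】:

Since the data contains duplicates, why are you using a vector? Just use map<string,int>. Each time you encounter a word, increment the corresponding value in the map.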
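The counting idea can be sketched as follows. The function name `countWords` and its input type are assumptions made for this example; the question's own data would come from its Tuple objects.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: counts how often each word occurs.
std::map<std::string, int> countWords(const std::vector<std::string>& words)
{
    std::map<std::string, int> freq;
    for (const std::string& w : words)
        ++freq[w];  // operator[] value-initializes a new entry to 0 first
    return freq;
}
```

For the question's getVerbPairs, the same pattern would presumably become `freq[t.verb] += t.count` for each tuple `t`, which replaces the inner linear search over `pairs` with a logarithmic map lookup.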

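【Comments】:

While that is correct, it is not related to his problem. 8000 records should never crash std::vector, unless he is doing something strange, such as some kind of recursion gone wrong.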
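None of the answers above diagnoses the crash, so the following is only an observation on the posted code, not a confirmed cause. The read loop tests `eof()` before calling `getline`, so it runs one extra time on an empty string, and `parseTuple` indexes into the string without bounds checks. The single `Tuple t` is also reused for every line, and its `args` vector is never cleared, so each stored tuple carries all the arguments of the tuples before it. A sketch of a more defensive read loop, assuming whitespace-separated fields and using a simplified stand-in `Record` type in place of the question's Tuple:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Simplified stand-in for the question's Tuple (illustrative only).
struct Record
{
    int count = 0;
    std::string verb;
    std::string frame;
    std::vector<std::string> args;
};

// Reads one record per line and skips lines that do not parse.
std::vector<Record> readRecords(std::istream& in)
{
    std::vector<Record> out;
    std::string line;
    while (std::getline(in, line))  // loop ends when no line could be read
    {
        std::istringstream fields(line);
        Record r;  // fresh object per line, so args starts empty
        if (!(fields >> r.count >> r.verb >> r.frame))
            continue;  // blank or malformed line
        std::string arg;
        while (fields >> arg)
            r.args.push_back(arg);
        out.push_back(r);
    }
    return out;
}
```

Because the function takes a `std::istream&`, it works with an `std::ifstream` for the real file and with an `std::istringstream` for quick testing.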