【Question title】: Find connected components of big Graph using Boost
【Posted】: 2014-07-26 15:01:57
【Question】:

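I wrote code to find the connected components of a very large graph (80 million edges), but it does not work: it crashes when the number of edges read approaches 40 million.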

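#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/connected_components.hpp>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main(){
    using namespace boost;
    int node1,node2;
    typedef adjacency_list <vecS, vecS, undirectedS> Graph;
    Graph G;
    std::ifstream infile("pairs.txt");
    std::string line;
    // read one "node1 node2" pair per line and add it as an edge
    while (std::getline(infile,line))
    {
        std::istringstream iss(line);
        if (iss >> node1 >> node2)
            add_edge(node1, node2, G);
    }
    std::cout <<"writing file"<<std::endl;
    std::ofstream out;
    out.open("connected_component.txt");
    // component[v] receives the component index of vertex v
    std::vector<int> component(num_vertices(G));
    int num = connected_components(G, &component[0]);
    std::vector<int>::size_type i;
    for (i = 0; i != component.size(); ++i){
        out << i << "\t "<<component[i] <<std::endl;}
    out.close();
}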

Any idea how to do this with Boost? Or should I change my graph data type?

【Comments】:

  • How does it crash? Can you post a backtrace?
  • It is still running, but CPU usage is very low, and when I print the file index it stops at 40 million.

Tags: c++ boost graph bigdata


【Solution 1】:

With random graph data, I can run 40 million edges in about 37 seconds (peaking at 4.4GiB of memory according to Massif).

/tmp$ od -Anone -w4 -t u2 -v /dev/urandom | head -n 40000000 > pairs.txt
/tmp$ time ./test

Reading 40000000 done in 5543ms
Building graph done in 3425ms
Algorithm done in 8957ms
writing file
Writing done in 52ms

real    0m37.339s
user    0m36.078s
sys 0m1.202s

1. Memory allocation

Note, however, that I tweaked the graph type to use a vector for the edge list, so that the required capacity can be reserved up front:

typedef adjacency_list<listS, vecS, undirectedS, no_property, no_property, 
         no_property, vecS> Graph;

This

  • improves load performance by removing reallocations
  • reduces heap fragmentation
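
To illustrate the first point, here is a minimal sketch that counts how often a vector reallocates while edges are appended, with and without reserving first. It uses a plain std::vector of pairs as a stand-in for the graph's edge list; the helper name count_reallocations is hypothetical and not part of Boost.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: appends n edges to a vector and counts how many times
// its capacity changes, i.e. how many reallocations happen along the way.
// reserve_first plays the role of G.m_edges.reserve(n) in the answer's code.
std::size_t count_reallocations(std::size_t n, bool reserve_first)
{
    std::vector<std::pair<int, int>> edges;
    if (reserve_first)
        edges.reserve(n);

    std::size_t reallocations = 0;
    std::size_t last_capacity = edges.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        edges.emplace_back(static_cast<int>(i), static_cast<int>(i + 1));
        if (edges.capacity() != last_capacity) {
            // the vector grew: all elements were moved to a new buffer
            ++reallocations;
            last_capacity = edges.capacity();
        }
    }
    assert(edges.size() == n);
    return reallocations;
}
```

Calling count_reallocations(n, true) returns 0 for any n, while count_reallocations(n, false) returns a count that grows with n; how fast depends on the standard library's growth policy.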

2. Vertex id scaling

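Another important point is that the storage requirements scale with the number of vertices. More specifically, they scale with the domain of the vertex ids. For example, loading a file like this: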

1 7
2 7
5 6
4 9

will have vastly lower memory requirements than loading this one:

1 70000
2 70000
5 60000
4 90000

In fact, rerunning the benchmark above with exactly the same input, except that the first line is changed from

 47662 60203

to

 476624766 602036020

gives the following result:

Reading 40000000 done in 5485ms
tcmalloc: large alloc 14448869376 bytes == 0x7c0f2000 @  0x7f30f60aad9d 0x7f30f60caaa9 0x4023ab 0x4019d4 0x7f30f57d7de5 0x401e6a (nil)
Building graph done in 6754ms
tcmalloc: large alloc 2408144896 bytes == 0x49fe46000 @  0x7f30f60aad9d 0x7f30f60caaa9 0x401ced 0x7f30f57d7de5 0x401e6a (nil)
tcmalloc: large alloc 2408144896 bytes == 0x52ffd0000 @  0x7f30f60aad9d 0x7f30f60cb339 0x402e45 0x401d5e 0x7f30f57d7de5 0x401e6a (nil)
Algorithm done in 31644ms
writing file
Writing done in 75921ms

real    2m20.318s
user    1m30.224s
sys 0m49.821s
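
The allocation sizes in this log can be checked with a little arithmetic. The sketch below assumes that a vecS vertex list needs one slot per id from 0 up to the largest id, that each slot takes 24 bytes, and that each per-vertex int takes 4 bytes. The 24-byte figure and the 8 KiB rounding granule are inferred from the logged numbers, not taken from the original answer, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Slots needed when vertices are indexed directly by id: ids 0..max_id.
std::uint64_t vertex_slots(std::uint64_t max_id)
{
    return max_id + 1;
}

// Rounds bytes up to the next multiple of granule.
// ASSUMPTION: the allocator reports sizes rounded to some granule.
std::uint64_t round_up(std::uint64_t bytes, std::uint64_t granule)
{
    assert(granule > 0);
    return (bytes + granule - 1) / granule * granule;
}

// Bytes for slots entries of bytes_per_slot each, rounded to granule.
std::uint64_t reported_bytes(std::uint64_t slots,
                             std::uint64_t bytes_per_slot,
                             std::uint64_t granule)
{
    return round_up(slots * bytes_per_slot, granule);
}
```

With the largest id 602036020 there are 602036021 slots. reported_bytes(602036021, 24, 8192) evaluates to 14448869376 and reported_bytes(602036021, 4, 8192) to 2408144896, which match the sizes in the three tcmalloc lines. Almost all of that memory is spent on ids that never appear in the input.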

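As you can see, Google's malloc implementation (from gperftools) even warns about the unusually large allocations, and indeed the program runs much slower. (Oh, and the memory usage became so large that Massif no longer worked, but I have seen it peak at 23GiB in htop.)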
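
One possible way to avoid paying for unused ids is to remap the ids to a dense range 0..n-1 before building the graph. This is only a sketch of that idea and not part of the original answer; IdCompactor and compact_edges are hypothetical names, and the hash map has its own memory cost per distinct id.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: maps arbitrary, possibly sparse vertex ids to dense
// indices 0..n-1 in order of first appearance, and remembers the way back.
struct IdCompactor {
    std::unordered_map<int, int> to_dense;  // original id -> dense index
    std::vector<int> to_original;           // dense index -> original id

    int dense(int id)
    {
        auto it = to_dense.find(id);
        if (it != to_dense.end())
            return it->second;
        int index = static_cast<int>(to_original.size());
        to_dense.emplace(id, index);
        to_original.push_back(id);
        assert(to_dense.size() == to_original.size());
        return index;
    }
};

// Returns a copy of the edges in which every endpoint is a dense index.
std::vector<std::pair<int, int>>
compact_edges(const std::vector<std::pair<int, int>>& as_read, IdCompactor& ids)
{
    std::vector<std::pair<int, int>> dense_edges;
    dense_edges.reserve(as_read.size());
    for (const auto& edge : as_read) {
        int u = ids.dense(edge.first);
        int v = ids.dense(edge.second);
        dense_edges.emplace_back(u, v);
    }
    return dense_edges;
}
```

For the second sample file above this yields 7 vertices instead of 90001. The dense pairs would then be passed to add_edge, and ids.to_original[i] written to the output file in place of i.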

Full code

See it Live On Coliru, running on 4000 edges:

#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/connected_components.hpp>
#include <fstream>
#include <iostream>

#include <chrono>
#include <utility>
#include <vector>

using Clock = std::chrono::high_resolution_clock;

int main()
{
    using namespace boost;
    typedef adjacency_list<listS, vecS, undirectedS, no_property, no_property, no_property, vecS> Graph;
    Graph G;

    // read edges
    auto start = Clock::now();
    std::ifstream infile("pairs.txt", std::ios::binary);

    std::vector<std::pair<int, int> > as_read;

    int node1, node2;
    while (infile >> node1 >> node2)
        as_read.emplace_back(node1, node2);

    std::cout << "Reading " << as_read.size() << " done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // build graph
    G.m_edges.reserve(as_read.size());
    for(auto& pair : as_read)
        add_edge(pair.first,pair.second,G);

    std::cout << "Building graph done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // find connected components
    std::vector<int> component(num_vertices(G));
    int num = connected_components(G, &component[0]);

    std::cout << "Algorithm done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // write output
    std::cout <<"writing file"<<std::endl;

    std::ofstream out;
    out.open("connected_component.txt");
    for (size_t i = 0; i != component.size(); ++i) {
        out << i << "\t "<< component[i] << std::endl; 
    }

    out.close();
    std::cout << "Writing done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
}
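
As a follow-up to the program above, the component vector can be summarised by counting how many vertices fall into each component. This is a minimal sketch; component_sizes is a hypothetical helper that takes the filled component vector and the value returned by connected_components.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sizes[c] is the number of vertices labelled c.
// component is the vector filled by connected_components and num is
// its return value (the number of components).
std::vector<std::size_t> component_sizes(const std::vector<int>& component, int num)
{
    std::vector<std::size_t> sizes(static_cast<std::size_t>(num), 0);
    for (int c : component) {
        assert(c >= 0 && c < num);  // labels are expected to lie in [0, num)
        ++sizes[static_cast<std::size_t>(c)];
    }
    return sizes;
}
```

Because the vertex list is a vector indexed by id, every id between 0 and the largest id that never occurs in pairs.txt is still a vertex of the graph. Each of those is isolated, so it shows up as its own component of size 1 in both the output file and this summary.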
