Find connected components of a big graph using Boost

【Question title】: Find connected components of big Graph using Boost
【Posted】: 2014-07-26 15:01:57
【Question】:

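I wrote some code to find the connected components of a very large graph (80 million edges), but it doesn't work: it crashes when the number of edges gets close to 40 million.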

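#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/connected_components.hpp>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    using namespace boost;
    typedef adjacency_list<vecS, vecS, undirectedS> Graph;
    Graph G;

    // read one edge ("node1 node2") per line
    int node1, node2;
    std::ifstream infile("pairs.txt");
    std::string line;
    while (std::getline(infile, line)) {
        std::istringstream iss(line);
        if (iss >> node1 >> node2) // skip lines that do not parse
            add_edge(node1, node2, G);
    }

    // label every vertex with its component and write "vertex <TAB> component"
    std::cout << "writing file" << std::endl;
    std::ofstream out("connected_component.txt");
    std::vector<int> component(num_vertices(G));
    int num = connected_components(G, &component[0]);
    for (std::vector<int>::size_type i = 0; i != component.size(); ++i)
        out << i << "\t " << component[i] << std::endl;
    out.close();
}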

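Any idea how to do this with Boost? Or should I change my graph data type?

【Comments】:

How does it crash? Can you post a backtrace?
It is still running, but my CPU usage is very low, and when I print the file index it stops at 40 million

Tags: c++ boost graph bigdata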

【Solution 1】:

With random graph data, I can run 40 million edges in about 37 seconds (peaking at 4.4GiB of memory, according to Massif).

/tmp$ od -Anone -w4 -t u2 -v /dev/urandom | head -n 40000000 > pairs.txt
/tmp$ time ./test

Reading 40000000 done in 5543ms
Building graph done in 3425ms
Algorithm done in 8957ms
writing file
Writing done in 52ms

real    0m37.339s
user    0m36.078s
sys 0m1.202s

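1. Memory allocation

Note, however, that I tweaked it by using a vector for the edge list, so that the required capacity can be reserved up front: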

typedef adjacency_list<listS, vecS, undirectedS, no_property, no_property, 
         no_property, vecS> Graph;

This improves load performance by removing reallocations, and it reduces heap fragmentation.
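
To make the reallocation point concrete, here is a minimal sketch that uses a plain std::vector of pairs as a stand-in for the graph's edge vector. The helper name count_reallocations is made up for this illustration; it only counts how often the vector's capacity changes while edges are appended, with and without an up-front reserve.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Boost): counts how often the vector's
// capacity changes while n edges are appended. Every capacity change means
// a new allocation plus a move/copy of all elements stored so far.
std::size_t count_reallocations(std::size_t n, bool reserve_first)
{
    std::vector<std::pair<int, int>> edges;
    if (reserve_first)
        edges.reserve(n); // one up-front allocation

    std::size_t reallocations = 0;
    std::size_t capacity = edges.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        edges.emplace_back(static_cast<int>(i), static_cast<int>(i + 1));
        if (edges.capacity() != capacity) {
            ++reallocations;
            capacity = edges.capacity();
        }
    }
    assert(edges.size() == n);
    return reallocations;
}
```

With reserve_first set, the count is zero for any n. Without it, the count depends on the standard library's growth policy, so no specific number is claimed here.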

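2. Vertex id scaling

Another important point is that the storage requirements scale with the number of vertices. More specifically, they scale with the vertex domain: with vecS as the vertex container, add_edge grows the vertex set to cover the largest id that appears, so an input whose ids are all small has much lower memory requirements than an otherwise similar input that contains one very large id.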

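In fact, re-running the benchmark above with exactly the same input, but with only the first line changed from

 47662 60203

to

 476624766 602036020

gives the following result: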

Reading 40000000 done in 5485ms
tcmalloc: large alloc 14448869376 bytes == 0x7c0f2000 @  0x7f30f60aad9d 0x7f30f60caaa9 0x4023ab 0x4019d4 0x7f30f57d7de5 0x401e6a (nil)
Building graph done in 6754ms
tcmalloc: large alloc 2408144896 bytes == 0x49fe46000 @  0x7f30f60aad9d 0x7f30f60caaa9 0x401ced 0x7f30f57d7de5 0x401e6a (nil)
tcmalloc: large alloc 2408144896 bytes == 0x52ffd0000 @  0x7f30f60aad9d 0x7f30f60cb339 0x402e45 0x401d5e 0x7f30f57d7de5 0x401e6a (nil)
Algorithm done in 31644ms
writing file
Writing done in 75921ms

real    2m20.318s
user    1m30.224s
sys 0m49.821s

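As you can see, Google's malloc implementation (from gperftools) even warns about the unusually large allocations, and indeed the program runs much slower. (Oh, and the memory usage became so large that Massif was no longer usable, but I have seen it peak at 23GiB in htop.)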
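
One possible way around this, which is not part of the benchmark above, is to remap the ids to a dense range 0..k-1 before calling add_edge, so that the vertex count equals the number of distinct ids rather than the largest id plus one. The sketch below assumes the edges have already been read into a vector (like as_read in the full code); the names Edge and densify are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

// Hypothetical helper: rewrites the edges in place so that vertex ids become
// 0..k-1 in order of first appearance. Returns the table that maps each
// dense index back to the original id.
std::vector<int> densify(std::vector<Edge>& edges)
{
    std::unordered_map<int, int> dense_of; // original id -> dense index
    std::vector<int> original_of;          // dense index -> original id

    auto remap = [&](int id) {
        auto it = dense_of.find(id);
        if (it != dense_of.end())
            return it->second;
        int dense = static_cast<int>(original_of.size());
        dense_of.emplace(id, dense);
        original_of.push_back(id);
        return dense;
    };

    for (auto& e : edges) {
        e.first = remap(e.first);
        e.second = remap(e.second);
    }
    assert(dense_of.size() == original_of.size());
    return original_of;
}
```

After this step the graph can be built from the rewritten edges, and the output loop would print original_of[i] instead of i. The hash map costs memory of its own, so this is only likely to pay off when the ids are sparse.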

Full code

See it Live On Coliru, running on 4000 edges:

#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/connected_components.hpp>
#include <chrono>
#include <fstream>
#include <iostream>
#include <utility>
#include <vector>

using Clock = std::chrono::high_resolution_clock;

int main()
{
    using namespace boost;
    typedef adjacency_list<listS, vecS, undirectedS, no_property, no_property, no_property, vecS> Graph;
    Graph G;

    // read edges
    auto start = Clock::now();
    std::ifstream infile("pairs.txt", std::ios::binary);

    std::vector<std::pair<int, int> > as_read;

    int node1, node2;
    while (infile >> node1 >> node2)
        as_read.emplace_back(node1, node2);

    std::cout << "Reading " << as_read.size() << " done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // build graph
    G.m_edges.reserve(as_read.size());
    for(auto& pair : as_read)
        add_edge(pair.first,pair.second,G);

    std::cout << "Building graph done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // find connected components
    std::vector<int> component(num_vertices(G));
    int num = connected_components(G, &component[0]);

    std::cout << "Algorithm done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
    start = Clock::now();

    // write output
    std::cout <<"writing file"<<std::endl;

    std::ofstream out;
    out.open("connected_component.txt");
    for (size_t i = 0; i != component.size(); ++i) {
        out << i << "\t "<< component[i] << std::endl; 
    }

    out.close();
    std::cout << "Writing done in " << std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start).count() << "ms\n";
}

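【Comments】:

Added some memory profiling data for good measure
Can I use the edge_list type and find the connected components?
All that connected_components requires is "An undirected graph. The graph type must be a model of Vertex List Graph and Incidence Graph". An EdgeList by itself is out. You can, however, have a graph that models both concepts
I mean this link; according to your comments I would have to define two graphs
I know what you mean. Like I said, connected_components really does not support an EdgeListGraph (by itself). Unless you absolutely need an EdgeListGraph elsewhere, you do not need to create two graphs. Note that there are also Graph implementations that combine concepts (e.g. VertexAndEdgeListGraph)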
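
For readers who only have a bare list of edges: components can also be labelled without any adjacency structure by using a union-find (disjoint sets) pass over the edges. The sketch below is plain standard C++, not Boost's connected_components, and the names DisjointSets and components_from_edges are made up for this illustration. It assumes vertex ids lie in 0..n-1, so its memory still scales with the vertex domain.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Plain union-find with path halving (hypothetical type, not a Boost class).
struct DisjointSets {
    std::vector<std::size_t> parent;

    explicit DisjointSets(std::size_t n) : parent(n)
    {
        std::iota(parent.begin(), parent.end(), std::size_t{0}); // each vertex is its own root
    }

    std::size_t find(std::size_t v)
    {
        while (parent[v] != v) {
            parent[v] = parent[parent[v]]; // path halving: skip one level
            v = parent[v];
        }
        return v;
    }

    void unite(std::size_t a, std::size_t b)
    {
        a = find(a);
        b = find(b);
        if (a != b)
            parent[b] = a;
    }
};

using IndexEdge = std::pair<std::size_t, std::size_t>;

// Returns one label per vertex; labels are 0..k-1 in order of first appearance.
std::vector<int> components_from_edges(std::size_t n, const std::vector<IndexEdge>& edges)
{
    DisjointSets sets(n);
    for (const auto& e : edges) {
        assert(e.first < n && e.second < n);
        sets.unite(e.first, e.second);
    }

    std::vector<int> label_of_root(n, -1); // -1 means "root not labelled yet"
    std::vector<int> component(n);
    int next_label = 0;
    for (std::size_t v = 0; v < n; ++v) {
        std::size_t root = sets.find(v);
        if (label_of_root[root] < 0)
            label_of_root[root] = next_label++;
        component[v] = label_of_root[root];
    }
    return component;
}
```

The result has the same shape as the component vector in the answer's code, so the same output loop could be used. Boost ships its own disjoint-sets facilities as well; they are not used here.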