[Question Title]: Mapping MPI processes to particular nodes
[Posted]: 2013-01-02 22:06:39
[Question Description]:

I'm not sure whether this is an appropriate question to ask here, but I couldn't figure it out on my own. Suppose I have a cluster with 100 nodes, each node having 16 cores. I have an MPI application whose communication pattern is known in advance, and I also know the cluster topology (i.e., the hop distance between nodes). From this I have worked out a process-to-node mapping that reduces network contention, for example: the process-to-node mappings 10 -> 20 and 30 -> 90. How can I map the process with rank 10 onto node 20? Please help me.

[Question Discussion]:

    Tags: mpi openmpi


    [Solution 1]:

    A bit late to the party, but here is a C++ subroutine that gives you a communicator for each node and a master communicator (containing only the masters of the nodes), along with each rank's size and rank on its own node. It's clumsy, but unfortunately I haven't found a better way to do this yet. Luckily it only adds about 0.1 s of wall time. Maybe you or someone else will get some use out of it.

    #include <mpi.h>
    #include <iostream>
    #include <sstream>
    #include <string>
    
    #define MASTER 0
    
    using namespace std;
    
    /*
     * Make a communicator for each node and another for just
     * the masters of the nodes. Upon completion, every rank is
     * in a new node communicator, knows its size and its rank
     * within it, and knows the rank of its node's master in the
     * master communicator, which can be useful for indexing.
     */
    bool    CommByNode(MPI::Intracomm &NodeComm,
                    MPI::Intracomm &MasterComm,
                    int &NodeRank, int &MasterRank,
                    int &NodeSize, int &MasterSize,
                    string &NodeNameStr)
    {
        bool IsOk = true;
    
        int Rank = MPI::COMM_WORLD.Get_rank();
        int Size = MPI::COMM_WORLD.Get_size();
    
        /*
         * ======================================================================
         * What follows is my best attempt at creating a communicator
         * for each node in a job such that only the cores on that
         * node are in the node's communicator, and each core groups
         * itself and the node communicator is made using the Split() function.
         * The end of this (lengthy) process is indicated by another comment.
         * ======================================================================
         */
        char *NodeName, *NodeNameList;
        NodeName = new char [MPI::MAX_PROCESSOR_NAME];  //  MPI guarantees processor names fit in this many chars
        int NodeNameLen,
            *NodeNameCountVect,
            *NodeNameOffsetVect,
            NodeNameTotalLen = 0;
        //  Get the name and name character count of each core's node
        MPI::Get_processor_name(NodeName, NodeNameLen);
    
        //  Prepare a vector for character counts of node names
        if (Rank == MASTER)
            NodeNameCountVect = new int [Size];
    
        //  Gather node name lengths to master to prepare c-array
        MPI::COMM_WORLD.Gather(&NodeNameLen, 1, MPI::INT, NodeNameCountVect, 1, MPI::INT, MASTER);
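        //  (the receive-side arguments of Gather matter only at the root,
        //  so it is fine that NodeNameCountVect is unallocated elsewhere)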
    
        if (Rank == MASTER){
            //  Need character count information for navigating node name c-array
            NodeNameOffsetVect = new int [Size];
            NodeNameOffsetVect[0] = 0;
            NodeNameTotalLen = NodeNameCountVect[0];
    
            //  build offset vector and total char count for all node names
            for (int i = 1 ; i < Size ; ++i){
                NodeNameOffsetVect[i] = NodeNameCountVect[i-1] + NodeNameOffsetVect[i-1];
                NodeNameTotalLen += NodeNameCountVect[i];
            }
            //  char-array for all node names
            NodeNameList = new char [NodeNameTotalLen];
        }
    
        //  Gatherv node names to char-array in master
        MPI::COMM_WORLD.Gatherv(NodeName, NodeNameLen, MPI::CHAR, NodeNameList, NodeNameCountVect, NodeNameOffsetVect, MPI::CHAR, MASTER);
    
        string *FullStrList, *NodeStrList;
        //  Each core keeps its node's name in a str for later comparison
        stringstream ss;
        ss << NodeName;
        ss >> NodeNameStr;
    
        delete [] NodeName;    //  node name now stored in a string, so release the new[]-ed c-array
    
        int *NodeListLenVect, NumUniqueNodes = 0, NodeListCharLen = 0;
        string NodeListStr;
    
        if (Rank == MASTER){
            /*
             * Need to prepare a list of all unique node names, so first
             * need all node names (incl duplicates) as strings, then
             * can make a list of all unique node names.
             */
            FullStrList = new string [Size];    //  full list of node names, each will be checked
            NodeStrList = new string [Size];    //  list of unique node names, used for checking above list
            //  i loops over node names, j loops over characters for each node name.
            for (int i = 0 ; i < Size ; ++i){
                stringstream ss;
                for (int j = 0 ; j < NodeNameCountVect[i] ; ++j)
                    ss << NodeNameList[NodeNameOffsetVect[i] + j];  //  each char into the stringstream
                ss >> FullStrList[i];   //  stringstream into string for each node name
                ss.str(""); //  This and below clear the contents of the stringstream,
                ss.clear(); //  since the >> operator doesn't clear as it extracts
                //cout << FullStrList[i] << endl;   //  for testing
            }
            delete [] NodeNameList;    //  master is done with the full c-array (new[] pairs with delete[])
            bool IsUnique;  //  flag for breaking from for loop
            stringstream ss;    //  used for a full c-array of unique node names
            for (int i = 0 ; i < Size ; ++i){   //  Loop over EVERY name
                IsUnique = true;
                for (int j = 0 ; j < NumUniqueNodes ; ++j)
                    if (FullStrList[i].compare(NodeStrList[j]) == 0){   //  check against list of uniques
                        IsUnique = false;
                        break;
                    }
                if (IsUnique){
                    NodeStrList[NumUniqueNodes] = FullStrList[i];   //  add unique names so others can be checked against them
                    ss << NodeStrList[NumUniqueNodes].c_str();  //  build up a string of all unique names back-to-back
                    ++NumUniqueNodes;   //  keep a tally of number of unique nodes
                }
            }
            ss >> NodeListStr;  //  make a string of all unique node names
            NodeListCharLen = NodeListStr.size();   //  char length of all unique node names
            NodeListLenVect = new int [NumUniqueNodes]; //  list of unique node name lengths
            /*
             * Because Bcast simply duplicates the buffer of the Bcaster to all cores,
             * the buffer needs to be a char* so that the other cores can have a similar
             * buffer prepared to receive. This wouldn't work if we passed string.c_str()
             * as the buffer, because the receiving cores don't have string.c_str() to
             * receive into, and even if they did, c_str() is a method and can't be used
             * that way.
             */
            NodeNameList = new char [NodeListCharLen];  //  buffer that will later be Bcast to all ranks
            NodeListStr.copy(NodeNameList, NodeListCharLen);    //  copy the contents; reassigning the pointer to c_str() would leak the allocation above
            for (int i = 0 ; i < NumUniqueNodes ; ++i)  //  fill list of unique node name char lengths
                NodeListLenVect[i] = NodeStrList[i].size();
            /*for (int i = 0 ; i < NumUniqueNodes ; ++i)    //  debugging: print the unique node names, then abort
                cout << NodeStrList[i] << endl;
            MPI::COMM_WORLD.Abort(1);*/
            delete [] NodeStrList;  //  new[]-allocated arrays (including arrays of string)
            delete [] FullStrList;  //  must be released with delete[]
            delete [] NodeNameCountVect;
            delete [] NodeNameOffsetVect;
        }
        /*
         * Now we send the list of node names back to all cores
         * so they can group themselves appropriately.
         */
    
        //  Bcast the number of nodes in use
        MPI::COMM_WORLD.Bcast(&NumUniqueNodes, 1, MPI::INT, MASTER);
        //  Bcast the full length of all node names
        MPI::COMM_WORLD.Bcast(&NodeListCharLen, 1, MPI::INT, MASTER);
    
        //  prepare buffers for node name Bcast's
        if (Rank > MASTER){
            NodeListLenVect = new int [NumUniqueNodes];
            NodeNameList = new char [NodeListCharLen];
        }
    
        //  Lengths of node names for navigating c-string
        MPI::COMM_WORLD.Bcast(NodeListLenVect, NumUniqueNodes, MPI::INT, MASTER);
        //  The actual full list of unique node names
        MPI::COMM_WORLD.Bcast(NodeNameList, NodeListCharLen, MPI::CHAR, MASTER);
    
        /*
         * Similar to what master did before, each core (incl master)
         * needs to build an actual list of node names as strings so they
         * can compare the c++ way.
         */
        int Offset = 0;
        NodeStrList = new string[NumUniqueNodes];
        for (int i = 0 ; i < NumUniqueNodes ; ++i){
            stringstream ss;
            for (int j = 0 ; j < NodeListLenVect[i] ; ++j)
                ss << NodeNameList[Offset + j];
            ss >> NodeStrList[i];
            ss.str("");
            ss.clear();
            Offset += NodeListLenVect[i];
            //cout << NodeStrList[i] << endl;   //  for testing
        }
        //  Now since everyone has the same list, just check your node and find your group.
        int CommGroup = -1;
        for (int i = 0 ; i < NumUniqueNodes ; ++i)
            if (NodeNameStr.compare(NodeStrList[i]) == 0){
                CommGroup = i;
                break;
            }
        //  every rank allocated these with new[], so every rank releases them
        delete [] NodeListLenVect;
        delete [] NodeNameList;
        delete [] NodeStrList;
        //  In case process fails, error prints and job aborts.
        if (CommGroup < 0){
            cout << "**ERROR** Rank " << Rank << " didn't identify comm group correctly." << endl;
            IsOk = false;
        }
    
        /*
         * ======================================================================
         * The above method uses c++ strings wherever possible so that things
         * like node name comparisons can be done the c++ way. I'm sure there's
         * a better way to do this because that was way too many lines of code...
         * ======================================================================
         */
    
        //  Create node communicators
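        //  Split() places every rank that passed the same color (CommGroup)
        //  into the same new communicator; the key argument (0) keeps ranks
        //  in their relative COMM_WORLD order within it.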
        NodeComm = MPI::COMM_WORLD.Split(CommGroup, 0);
        NodeSize = NodeComm.Get_size();
        NodeRank = NodeComm.Get_rank();
    
        //  Group for master communicator
        int MasterGroup;
        if (NodeRank == MASTER)
            MasterGroup = 0;
        else
            MasterGroup = MPI_UNDEFINED;
    
        //  Create master communicator
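        //  Ranks that pass MPI_UNDEFINED as the color are placed in no new
        //  communicator and receive MPI::COMM_NULL back from Split().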
        MasterComm = MPI::COMM_WORLD.Split(MasterGroup, 0);
        MasterRank = -1;
        MasterSize = -1;
        if (MasterComm != MPI::COMM_NULL){
            MasterRank = MasterComm.Get_rank();
            MasterSize = MasterComm.Get_size();
        }
    
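        //  Tell every rank how many node masters there are, and have each
        //  node's master share its MasterComm rank with its own node.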
        MPI::COMM_WORLD.Bcast(&MasterSize, 1, MPI::INT, MASTER);
        NodeComm.Bcast(&MasterRank, 1, MPI::INT, MASTER);
    
        return IsOk;
    }
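
    For reference, a minimal sketch of how the subroutine above might be called (the setup here is assumed, not part of the original answer):

        MPI::Intracomm NodeComm, MasterComm;
        int NodeRank, MasterRank, NodeSize, MasterSize;
        string NodeName;

        MPI::Init();
        if (CommByNode(NodeComm, MasterComm, NodeRank, MasterRank,
                       NodeSize, MasterSize, NodeName)){
            //  node-local work goes through NodeComm; the node masters
            //  coordinate across nodes through MasterComm
        }
        MPI::Finalize();

    Note also that since MPI-3 the standard provides MPI_Comm_split_type, which builds the per-node communicator in a single call and makes most of the gathering and broadcasting above unnecessary (shown with the C API, since the C++ bindings were removed in MPI-3):

        MPI_Comm NodeCommC;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &NodeCommC);

    The master communicator can then be derived with an ordinary split, using the node rank as the color, just as the last part of the subroutine does.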
    

    [Discussion]:

      [Solution 2]:

      If you are not restricted by any kind of queueing system, you can control the rank-to-node mapping by creating your own machinefile.

      For example, if the file my_machine_file contains the following 1600 lines:

         node001
         node002
         node003
         ....
         node100
         node001
         node002
         node003
         ....
         node100
         ...
         [repeat 13 more times]
         ...
         node001
         node002
         node003
         ....
         node100
      

      it will correspond to the mapping

        0-> node001, 1 -> node002, ... 99 -> node100, 100 -> node001, ...
      

      and you would run your application as

        mpirun -machinefile my_machine_file -n 1600 my_app
      

      When your application needs fewer than 1600 processes, you can edit your machinefile accordingly.
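
      As a side note, Open MPI hostfiles also accept slot counts, which makes such a file more compact (a sketch, assuming the same host names):

         node001 slots=16
         node002 slots=16
         ....
         node100 slots=16

      With a file like this the default policy fills node001 completely before moving on to node002; passing --map-by node (or the older -bynode) to mpirun restores the round-robin placement shown above.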

      Keep in mind, though, that the cluster administrators may already have numbered the nodes with the interconnect topology in mind. Nevertheless, there are reports of significant performance improvements (around 10%-20%) from carefully exploiting the cluster topology (reference to follow).

      Note: starting an MPI program with mpirun is neither standardized nor portable. However, the question here is clearly about a specific compute cluster and a specific implementation (Open MPI), so a portable solution is not required.

      [Discussion]:

      • Thanks for the quick reply.
      • @srini Correct. All of those cores reside on the same node and cannot be distinguished with mpirun. The OS scheduler maps the processes to cores. Process affinity to cores is a separate issue
      • This may be beside the point here, but Open MPI does in fact allow mapping each individual rank to a specific core on a given node. This is achieved by passing a "rankfile" to mpirun with the -rf option.
      • @HristoIliev: I think you mean the Open MPI options -bycore and -bysocket. You can also bind processes to specific cores with the taskset or numactl commands.
      • @srini, rankfiles are more flexible than -bycore and -bysocket. In a rankfile one can specify that rank 0 should execute on host A bound to core 0, that rank 1 should execute on host B bound to core 3, and so on (see the sketch below). -bycore and -bysocket only tell Open MPI how to fill the available slots on each host; ranks are still laid out linearly across hosts (or round-robin, if so specified).
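
      Following up on the rankfile comments above, a minimal sketch using the numbering from the question (host names, core numbers, and the file name are hypothetical and depend on the actual cluster):

         rank 10=node020 slot=0
         rank 30=node090 slot=0

      launched with something like

         mpirun -rf my_rankfile -n 1600 my_app

      Each line pins one rank to a host and, optionally, to particular cores (slots) on that host.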