[Question Title]: Mapping MPI processes to particular nodes
[Posted]: 2013-01-02 22:06:39
[Question Description]:

I'm not sure whether this is an appropriate question to ask here, but I couldn't figure it out on my own. Suppose I have a cluster with 100 nodes, each node having 16 cores. I have an MPI application whose communication pattern is known in advance, and I also know the cluster topology (i.e., the hop distance between nodes). From this I have worked out a process-to-node mapping that reduces network contention, for example: the process-to-node mappings 10 -> 20 and 30 -> 90. How can I map the process with rank 10 onto node 20? Please help me.

[Question Discussion]:

    Tags: mpi openmpi


    [Solution 1]:

    A bit late to the party, but here is a C++ subroutine that gives you a communicator for each node and a master communicator (containing only the masters of the nodes), along with each rank's size and rank on its own node. It's clumsy, but unfortunately I haven't found a better way to do this yet. Luckily it only adds about 0.1 s of wall time. Maybe you or someone else will get some use out of it.

    #include <mpi.h>
    #include <iostream>
    #include <sstream>
    #include <string>
    
    #define MASTER 0
    
    using namespace std;
    
    /*
     * Make a communicator for each node and another for just
     * the masters of the nodes. Upon completion, every rank is
     * in a new node communicator, knows its size and its rank
     * within it, and knows the rank of its node's master in the
     * master communicator, which can be useful for indexing.
     */
    bool    CommByNode(MPI::Intracomm &NodeComm,
                    MPI::Intracomm &MasterComm,
                    int &NodeRank, int &MasterRank,
                    int &NodeSize, int &MasterSize,
                    string &NodeNameStr)
    {
        bool IsOk = true;
    
        int Rank = MPI::COMM_WORLD.Get_rank();
        int Size = MPI::COMM_WORLD.Get_size();
    
        /*
         * ======================================================================
         * What follows is my best attempt at creating a communicator
         * for each node in a job such that only the cores on that
         * node are in the node's communicator, and each core groups
         * itself and the node communicator is made using the Split() function.
         * The end of this (lengthy) process is indicated by another comment.
         * ======================================================================
         */
        char *NodeName, *NodeNameList;
        NodeName = new char [MPI::MAX_PROCESSOR_NAME];  //  MPI guarantees processor names fit in this many chars
        int NodeNameLen,
            *NodeNameCountVect,
            *NodeNameOffsetVect,
            NodeNameTotalLen = 0;
        //  Get the name and name character count of each core's node
        MPI::Get_processor_name(NodeName, NodeNameLen);
    
        //  Prepare a vector for character counts of node names
        if (Rank == MASTER)
            NodeNameCountVect = new int [Size];
    
        //  Gather node name lengths to master to prepare c-array
        MPI::COMM_WORLD.Gather(&NodeNameLen, 1, MPI::INT, NodeNameCountVect, 1, MPI::INT, MASTER);
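        //  (the receive-side arguments of Gather matter only at the root,
        //  so it is fine that NodeNameCountVect is unallocated elsewhere)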
    
        if (Rank == MASTER){
            //  Need character count information for navigating node name c-array
            NodeNameOffsetVect = new int [Size];
            NodeNameOffsetVect[0] = 0;
            NodeNameTotalLen = NodeNameCountVect[0];
    
            //  build offset vector and total char count for all node names
            for (int i = 1 ; i < Size ; ++i){
                NodeNameOffsetVect[i] = NodeNameCountVect[i-1] + NodeNameOffsetVect[i-1];
                NodeNameTotalLen += NodeNameCountVect[i];
            }
            //  char-array for all node names
            NodeNameList = new char [NodeNameTotalLen];
        }
    
        //  Gatherv node names to char-array in master
        MPI::COMM_WORLD.Gatherv(NodeName, NodeNameLen, MPI::CHAR, NodeNameList, NodeNameCountVect, NodeNameOffsetVect, MPI::CHAR, MASTER);
    
        string *FullStrList, *NodeStrList;
        //  Each core keeps its node's name in a str for later comparison
        stringstream ss;
        ss << NodeName;
        ss >> NodeNameStr;
    
        delete [] NodeName;    //  node name now stored in a string, so release the new[]-ed c-array
    
        int *NodeListLenVect, NumUniqueNodes = 0, NodeListCharLen = 0;
        string NodeListStr;
    
        if (Rank == MASTER){
            /*
             * Need to prepare a list of all unique node names, so first
             * need all node names (incl duplicates) as strings, then
             * can make a list of all unique node names.
             */
            FullStrList = new string [Size];    //  full list of node names, each will be checked
            NodeStrList = new string [Size];    //  list of unique node names, used for checking above list
            //  i loops over node names, j loops over characters for each node name.
            for (int i = 0 ; i < Size ; ++i){
                stringstream ss;
                for (int j = 0 ; j < NodeNameCountVect[i] ; ++j)
                    ss << NodeNameList[NodeNameOffsetVect[i] + j];  //  each char into the stringstream
                ss >> FullStrList[i];   //  stringstream into string for each node name
                ss.str(""); //  This and below clear the contents of the stringstream,
                ss.clear(); //  since the >> operator doesn't clear as it extracts
                //cout << FullStrList[i] << endl;   //  for testing
            }
            delete [] NodeNameList;    //  master is done with the full c-array (new[] pairs with delete[])
            bool IsUnique;  //  flag for breaking from for loop
            stringstream ss;    //  used for a full c-array of unique node names
            for (int i = 0 ; i < Size ; ++i){   //  Loop over EVERY name
                IsUnique = true;
                for (int j = 0 ; j < NumUniqueNodes ; ++j)
                    if (FullStrList[i].compare(NodeStrList[j]) == 0){   //  check against list of uniques
                        IsUnique = false;
                        break;
                    }
                if (IsUnique){
                    NodeStrList[NumUniqueNodes] = FullStrList[i];   //  add unique names so others can be checked against them
                    ss << NodeStrList[NumUniqueNodes].c_str();  //  build up a string of all unique names back-to-back
                    ++NumUniqueNodes;   //  keep a tally of number of unique nodes
                }
            }
            ss >> NodeListStr;  //  make a string of all unique node names
            NodeListCharLen = NodeListStr.size();   //  char length of all unique node names
            NodeListLenVect = new int [NumUniqueNodes]; //  list of unique node name lengths
            /*
             * Because Bcast simply duplicates the buffer of the Bcaster to all cores,
             * the buffer needs to be a char* so that the other cores can have a similar
             * buffer prepared to receive. This wouldn't work if we passed string.c_str()
             * as the buffer, because the receiving cores don't have string.c_str() to
             * receive into, and even if they did, c_str() is a method and can't be used
             * that way.
             */
            NodeNameList = new char [NodeListCharLen];  //  buffer that will later be Bcast to all ranks
            NodeListStr.copy(NodeNameList, NodeListCharLen);    //  copy the contents; reassigning the pointer to c_str() would leak the allocation above
            for (int i = 0 ; i < NumUniqueNodes ; ++i)  //  fill list of unique node name char lengths
                NodeListLenVect[i] = NodeStrList[i].size();
            /*for (int i = 0 ; i < NumUniqueNodes ; ++i)    //  debugging: print the unique node names, then abort
                cout << NodeStrList[i] << endl;
            MPI::COMM_WORLD.Abort(1);*/
            delete [] NodeStrList;  //  new[]-allocated arrays (including arrays of string)
            delete [] FullStrList;  //  must be released with delete[]
            delete [] NodeNameCountVect;
            delete [] NodeNameOffsetVect;
        }
        /*
         * Now we send the list of node names back to all cores
         * so they can group themselves appropriately.
         */
    
        //  Bcast the number of nodes in use
        MPI::COMM_WORLD.Bcast(&NumUniqueNodes, 1, MPI::INT, MASTER);
        //  Bcast the full length of all node names
        MPI::COMM_WORLD.Bcast(&NodeListCharLen, 1, MPI::INT, MASTER);
    
        //  prepare buffers for node name Bcast's
        if (Rank > MASTER){
            NodeListLenVect = new int [NumUniqueNodes];
            NodeNameList = new char [NodeListCharLen];
        }
    
        //  Lengths of node names for navigating c-string
        MPI::COMM_WORLD.Bcast(NodeListLenVect, NumUniqueNodes, MPI::INT, MASTER);
        //  The actual full list of unique node names
        MPI::COMM_WORLD.Bcast(NodeNameList, NodeListCharLen, MPI::CHAR, MASTER);
    
        /*
         * Similar to what master did before, each core (incl master)
         * needs to build an actual list of node names as strings so they
         * can compare the c++ way.
         */
        int Offset = 0;
        NodeStrList = new string[NumUniqueNodes];
        for (int i = 0 ; i < NumUniqueNodes ; ++i){
            stringstream ss;
            for (int j = 0 ; j < NodeListLenVect[i] ; ++j)
                ss << NodeNameList[Offset + j];
            ss >> NodeStrList[i];
            ss.str("");
            ss.clear();
            Offset += NodeListLenVect[i];
            //cout << NodeStrList[i] << endl;   //  for testing
        }
        //  Now since everyone has the same list, just check your node and find your group.
        int CommGroup = -1;
        for (int i = 0 ; i < NumUniqueNodes ; ++i)
            if (NodeNameStr.compare(NodeStrList[i]) == 0){
                CommGroup = i;
                break;
            }
        //  every rank allocated these with new[], so every rank releases them
        delete [] NodeListLenVect;
        delete [] NodeNameList;
        delete [] NodeStrList;
        //  In case process fails, error prints and job aborts.
        if (CommGroup < 0){
            cout << "**ERROR** Rank " << Rank << " didn't identify comm group correctly." << endl;
            IsOk = false;
        }
    
        /*
         * ======================================================================
         * The above method uses c++ strings wherever possible so that things
         * like node name comparisons can be done the c++ way. I'm sure there's
         * a better way to do this because that was way too many lines of code...
         * ======================================================================
         */
    
        //  Create node communicators
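        //  Split() places every rank that passed the same color (CommGroup)
        //  into the same new communicator; the key argument (0) keeps ranks
        //  in their relative COMM_WORLD order within it.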
        NodeComm = MPI::COMM_WORLD.Split(CommGroup, 0);
        NodeSize = NodeComm.Get_size();
        NodeRank = NodeComm.Get_rank();
    
        //  Group for master communicator
        int MasterGroup;
        if (NodeRank == MASTER)
            MasterGroup = 0;
        else
            MasterGroup = MPI_UNDEFINED;
    
        //  Create master communicator
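        //  Ranks that pass MPI_UNDEFINED as the color are placed in no new
        //  communicator and receive MPI::COMM_NULL back from Split().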
        MasterComm = MPI::COMM_WORLD.Split(MasterGroup, 0);
        MasterRank = -1;
        MasterSize = -1;
        if (MasterComm != MPI::COMM_NULL){
            MasterRank = MasterComm.Get_rank();
            MasterSize = MasterComm.Get_size();
        }
    
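        //  Tell every rank how many node masters there are, and have each
        //  node's master share its MasterComm rank with its own node.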
        MPI::COMM_WORLD.Bcast(&MasterSize, 1, MPI::INT, MASTER);
        NodeComm.Bcast(&MasterRank, 1, MPI::INT, MASTER);
    
        return IsOk;
    }
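
    For reference, a minimal sketch of how the subroutine above might be called (the setup here is assumed, not part of the original answer):

        MPI::Intracomm NodeComm, MasterComm;
        int NodeRank, MasterRank, NodeSize, MasterSize;
        string NodeName;

        MPI::Init();
        if (CommByNode(NodeComm, MasterComm, NodeRank, MasterRank,
                       NodeSize, MasterSize, NodeName)){
            //  node-local work goes through NodeComm; the node masters
            //  coordinate across nodes through MasterComm
        }
        MPI::Finalize();

    Note also that since MPI-3 the standard provides MPI_Comm_split_type, which builds the per-node communicator in a single call and makes most of the gathering and broadcasting above unnecessary (shown with the C API, since the C++ bindings were removed in MPI-3):

        MPI_Comm NodeCommC;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &NodeCommC);

    The master communicator can then be derived with an ordinary split, using the node rank as the color, just as the last part of the subroutine does.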
    

    [Discussion]:

      [Solution 2]:

      If you are not restricted by any kind of queueing system, you can control the rank-to-node mapping by creating your own machinefile.

      For example, if the file my_machine_file contains the following 1600 lines:

         node001
         node002
         node003
         ....
         node100
         node001
         node002
         node003
         ....
         node100
         ...
         [repeat 13 more times]
         ...
         node001
         node002
         node003
         ....
         node100
      

      it will correspond to the mapping

        0-> node001, 1 -> node002, ... 99 -> node100, 100 -> node001, ...
      

      and you would run your application as

        mpirun -machinefile my_machine_file -n 1600 my_app
      

      When your application needs fewer than 1600 processes, you can edit your machinefile accordingly.
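
      As a side note, Open MPI hostfiles also accept slot counts, which makes such a file more compact (a sketch, assuming the same host names):

         node001 slots=16
         node002 slots=16
         ....
         node100 slots=16

      With a file like this the default policy fills node001 completely before moving on to node002; passing --map-by node (or the older -bynode) to mpirun restores the round-robin placement shown above.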

      Keep in mind, though, that the cluster administrators may already have numbered the nodes with the interconnect topology in mind. Nevertheless, there are reports of significant performance improvements (around 10%-20%) from carefully exploiting the cluster topology (reference to follow).

      Note: starting an MPI program with mpirun is neither standardized nor portable. However, the question here is clearly about a specific compute cluster and a specific implementation (Open MPI), so a portable solution is not required.

      [Discussion]:

      • Thanks for the quick reply.
      • @srini Correct. All of those cores reside on the same node and cannot be distinguished with mpirun. The OS scheduler maps the processes to cores. Process affinity to cores is a separate issue
      • This may be beside the point here, but Open MPI does in fact allow mapping each individual rank to a specific core on a given node. This is achieved by passing a "rankfile" to mpirun with the -rf option.
      • @HristoIliev: I think you mean the Open MPI options -bycore and -bysocket. You can also bind processes to specific cores with the taskset or numactl commands.
      • @srini, rankfiles are more flexible than -bycore and -bysocket. In a rankfile one can specify that rank 0 should execute on host A bound to core 0, that rank 1 should execute on host B bound to core 3, and so on (see the sketch below). -bycore and -bysocket only tell Open MPI how to fill the available slots on each host; ranks are still laid out linearly across hosts (or round-robin, if so specified).
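
      Following up on the rankfile comments above, a minimal sketch using the numbering from the question (host names, core numbers, and the file name are hypothetical and depend on the actual cluster):

         rank 10=node020 slot=0
         rank 30=node090 slot=0

      launched with something like

         mpirun -rf my_rankfile -n 1600 my_app

      Each line pins one rank to a host and, optionally, to particular cores (slots) on that host.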