Matrix Multiplication in MPI: Answers

【Question title】: Matrix Multiplication in MPI
【Posted】: 2017-10-12 01:22:32
【Question】:

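I'm trying to write a simple matrix multiplication program with MPI that runs on 1, 2, 4, or 8 processors. My code works for 1 (in which case it just does ordinary matrix multiplication; with a single rank there is no work to split). It also works for 2 and 4 processors. However, when I use 8 processors (i.e. -n 8 on the command line when running the program), not every position in matrix c gets the correct value.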

Here are examples. If SIZE = 8 (i.e. a, b, and c are all 8x8 matrices), the resulting matrix is:

   8.00   8.00   8.00   8.00   8.00   8.00   8.00   8.00
   8.00   8.00   8.00   8.00   8.00   8.00   8.00   8.00
   8.00   8.00   8.00   8.00   8.00   8.00   8.00   8.00
   8.00   8.00   8.00   8.00   8.00   8.00   8.00   8.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
   0.00   0.00  16.00  16.00  16.00  16.00  16.00  16.00

If SIZE = 16:

  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00  16.00
  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00
  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00
  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00
  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00
  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00
  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00
   0.00   0.00   0.00   0.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00
  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00  32.00

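As you can see, zeros pop up in the lower left. Something rank 7 is doing causes those coordinates to become 0.

I've been staring myself blind at my code, and I feel I just need another pair of eyes on it. As far as I can tell, all the sends and receives work correctly, and every task gets the values it is supposed to get. In the tests I've done, no task ever actually sets any position in the c matrix to 0. I don't know why or how it happens, or what I can do to fix it.

Here is the code: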

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define SIZE 16 /*assumption: SIZE a multiple of number of nodes*/
#define FROM_MASTER 1/*setting a message type*/
#define FROM_WORKER 2/*setting a message type*/
#define DEBUG 1/*1 = debug on, 0 = debug off*/

MPI_Status status;

static double a[SIZE][SIZE];
static double b[SIZE][SIZE];
static double c[SIZE][SIZE];
static double b_to_trans[SIZE][SIZE];
static void init_matrix(void)
{
    int i, j;
    for (i = 0; i < SIZE; i++)
    {
        for (j = 0; j < SIZE; j++) {
            a[i][j] = 1.0;
            if(i >= SIZE/2) a[i][j] = 2.0;
            b_to_trans[i][j] = 1.0;
            if(j >= SIZE/2) b[i][j] = 2.0;
//          c[i][j] = 1.0;
        }
    }
}

static void print_matrix(void)
{
    int i, j;
    for(i = 0; i < SIZE; i++) {
        for(j = 0; j < SIZE; j++) {
            printf("%7.2f", c[i][j]);
        }
    printf("\n");
    }
}

static void transpose_matrix()
{
    int i, j;
    for(i = 0; i<SIZE; i++)
        for(j = 0; j<SIZE;j++)
            b[i][j] = b_to_trans[j][i];
}

int main(int argc, char **argv)
{
    int myrank, nproc;
    int rows; /*amount of work per node (rows per worker)*/
    int mtype; /*message type: send/recv between master and workers*/
    int dest, src, offseta, offsetb;
    int runthrough, runmod;
    double start_time, end_time;
    int i, j, k, l;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    rows = SIZE/nproc;
    mtype = FROM_MASTER;

    if (myrank == 0) {
        /*Initialization*/
        printf("SIZE = %d, number of nodes = %d\n", SIZE, nproc);
        init_matrix();
        transpose_matrix();
        start_time = MPI_Wtime();

        if(nproc == 1) { /*In case we only run on one processor, the master will simply do a regular matrix-matrix multiplacation.*/
            for(i = 0; i < SIZE; i++) {
                for(j = 0; j < SIZE; j++) {
                    for(k = 0; k < SIZE; k++)
                        c[i][j] = c[i][j] + a[i][k]*b[j][k];
                }
            }
            end_time = MPI_Wtime();
            if(DEBUG) /*Prints the resulting matrix c*/
                print_matrix();
            printf("Execution time on %2d nodes: %f\n", nproc, end_time-start_time);
        }
        else {

            for(l = 0; l < nproc; l++){
                offsetb = rows*l;
                offseta = rows;
                mtype = FROM_MASTER;

                for(dest = 1; dest < nproc; dest++){
                    MPI_Send(&offseta, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
                    MPI_Send(&offsetb, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
                    MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
                    MPI_Send(&a[offseta][0], rows*SIZE, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
                    MPI_Send(&b[offsetb][0], rows*SIZE, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
                    offseta += rows;
                    offsetb = (offsetb+rows)%SIZE;
                }

                offseta = rows;
                offsetb = rows*l;
                //printf("Rank: %d, offseta: %d, offsetb: %d\n", myrank, offseta, offsetb);
                //printf("Offseta: %d\n", offseta);
                //printf("Offsetb: %d\n", offsetb);
                for(i = 0; i < offseta; i++) {
                    for(j = offsetb; j < offsetb+rows; j++) {
                            for(k = 0; k < SIZE; k++){
                                c[i][j] = c[i][j] + a[i][k]*b[j][k];
                        }
                    }
                }
                mtype = FROM_WORKER;
                for(src = 1; src < nproc; src++){
                    MPI_Recv(&offseta, 1, MPI_INT, src, mtype, MPI_COMM_WORLD, &status);
                    MPI_Recv(&offsetb, 1, MPI_INT, src, mtype, MPI_COMM_WORLD, &status);
                    MPI_Recv(&rows, 1, MPI_INT, src, mtype, MPI_COMM_WORLD, &status);
                    for(i = 0; i < rows; i++) {
                        MPI_Recv(&c[offseta+i][offsetb], offseta, MPI_DOUBLE, src, mtype, MPI_COMM_WORLD, &status); /*returns answer c(1,1)*/
                    }
                }
            }


            end_time = MPI_Wtime();
            if(DEBUG) /*Prints the resulting matrix c*/
                print_matrix();
            printf("Execution time on %2d nodes: %f\n", nproc, end_time-start_time);
        }
    }
    else{
        if(nproc > 1) {
            for(l = 0; l < nproc; l++){
                mtype = FROM_MASTER;
                MPI_Recv(&offseta, 1, MPI_INT, 0, mtype, MPI_COMM_WORLD, &status);
                MPI_Recv(&offsetb, 1, MPI_INT, 0, mtype, MPI_COMM_WORLD, &status);
                MPI_Recv(&rows, 1, MPI_INT, 0, mtype, MPI_COMM_WORLD, &status);
                MPI_Recv(&a[offseta][0], rows*SIZE, MPI_DOUBLE, 0, mtype, MPI_COMM_WORLD, &status);
                MPI_Recv(&b[offsetb][0], rows*SIZE, MPI_DOUBLE, 0, mtype, MPI_COMM_WORLD, &status);

                for(i = offseta; i < offseta+rows; i++) {
                    for(j = offsetb; j < offsetb+rows; j++) {
                        for(k = 0; k < SIZE; k++){
                            c[i][j] = c[i][j] + a[i][k]*b[j][k];
                        }
                    }
                }

                mtype = FROM_WORKER;
                MPI_Send(&offseta, 1, MPI_INT, 0, mtype, MPI_COMM_WORLD);
                MPI_Send(&offsetb, 1, MPI_INT, 0, mtype, MPI_COMM_WORLD);
                MPI_Send(&rows, 1, MPI_INT, 0, mtype, MPI_COMM_WORLD);
                for(i = 0; i < rows; i++){
                    MPI_Send(&c[offseta+i][offsetb], offseta, MPI_DOUBLE, 0, mtype, MPI_COMM_WORLD);
                }
            }
        }
    }
    MPI_Finalize();
    return 0;
}

Any suggestions would be helpful. Thanks in advance.

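【Comments】:

Jens, you should also add your code here instead of putting it on pastebin (I'll do it for you this time), so that you don't get close votes.
I didn't know whether people would want the whole code posted here or whether they would complain that it floods the page. Either way, thank you very much.
Hi Jens, OK, I've looked at your code and mpirun definitely reproduces the problem. Rather than debugging yours, I'll redo a matrix multiplication solution with comments (it is your lower-left corner that isn't being collated).
I've managed to solve the problem, but you're very welcome to give it a try, if for nothing else than to share an alternative solution ^^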

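Tags: c mpi matrix-multiplication

【Solution 1】:

This isn't a definitive answer, but it should help you debug.

I ran a test by adding the following code where the master receives the final data from the workers. Out of all the output, I show only the important part. Note that j+count never exceeds SIZE unless the number of processors is 8. This matters because, past that point, you are writing outside the memory allocated for c.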

for(i = 0; i < rows; i++) {
    MPI_Recv(&c[offseta+i][offsetb], offseta, MPI_DOUBLE, src, mtype, MPI_COMM_WORLD, &status);
    // I added the following for debugging.            
    if (src == nproc-1)
    {
        printf("src = %i\n", src);
        printf("i = %i\n", offseta+i);
        printf("j = %i\n", offsetb);
        printf("count = %i\n", offseta);
    }
}

np = 2

src = 1
i = 15
j = 8
count = 8

np = 4

src = 3
i = 15
j = 4
count = 12

np = 8

src = 7
i = 15
j = 10
count = 14

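【Discussion】:

I wish I could say this helped in my case, but I don't see why it matters. offsetb and offseta (j and count in the printfs) are just integers. Their sum may be greater than SIZE, but I only ever write to c[offseta+i][offsetb]. I don't see how that would reach unallocated memory unless offseta+i or offsetb is larger than SIZE? This is my first time using C and MPI, so I may be missing something out of ignorance...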
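For example, the receive buffer starts at c[15][10], so there is room for at most 6 MPI_DOUBLE values (c[15][10] through c[15][15]). c[15][16], c[15][17], and so on are unallocated memory. Right? Note that you are receiving 14 MPI_DOUBLE values, not 6.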
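That was the last hint I needed, and I managed to solve it. Looking at it now, it seems obvious that passing offseta as the number of elements to send is wrong, since the value of offseta changes. I fixed it by changing my sends and receives to transfer rows elements instead, and now it works. Thank you very much ^^ I'll accept your answer and upvote it.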