【Posted】:2020-05-30 18:07:05
【Problem description】:
I am currently building the foundation of a control program to run across multiple Raspberry Pis, using every available core on each Pi. When I test my code on one of the nodes using all of its cores, it works fine, but using multiple nodes gives me a segmentation fault.
I have looked through all the similar questions asked in the past, but they all deal with code that also breaks on a single node.
Full code:
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdbool.h>
#include <time.h>
int main(int argc, char *argv[])
{
    FILE *input;
    char batLine[86]; //may need to be made larger if bat commands get longer
    char sentbatch[86];
    int currentTask;
    int numTasks, rank, rc, i;
    MPI_Status stat;
    bool exitFlag = false;

    //mpi stuff
    MPI_Init(&argc, &argv); //initialize mpi environment
    MPI_Comm_size(MPI_COMM_WORLD, &numTasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    //printf("Number of tasks: %d \n", numTasks);
    //printf("MPI task %d has started...\n", rank);

    if(argc != 2)
    {
        printf("Usage: batallocation *.bat");
        exit(1); //exit with 1 indicates a failure
    }

    //contains file name: argv[1]
    input = fopen(argv[1], "r");
    currentTask = 0;
    if(rank == 0)
    {
        while(1)
        {
            if(exitFlag)
                break; //allows to break out of while and for when no more lines exist
            char command[89] = "./";
            for(i = 0; i < 16; i++) //will need to be 16 for full testing
            {
                //fgets needs to be character count of longest line + 2 or it fails
                if(fgets(batLine, 86, input) != NULL)
                {
                    printf("preview:%s\n", batLine);
                    if(i == 0)
                    {
                        strcat(command, batLine);
                        printf("rank0 gets: %s\n", command);
                        //system(command);
                    }
                    else
                    {
                        //MPI_Send(buffer,count,type,dest,tag,comm)
                        MPI_Send(batLine, 85, MPI_CHAR, i, i, MPI_COMM_WORLD);
                        printf("sent rank%d: %s\n", i, batLine);
                    }
                }
                else
                {
                    exitFlag = true; //flag to break out of while loop
                    break;
                }
            }
            //need to receive data from other nodes here
            //put the data together in proper order
            //and only after that can the next sets be sent out
        }
    }
    else
    {
        char command[89] = "./";
        //MPI_Recv(buffer,count,type,source,tag,comm,status)
        MPI_Recv(sentbatch, 86, MPI_CHAR, 0, rank, MPI_COMM_WORLD, &stat);
        //using rank as flag makes it so only the wanted rank gets sent the data
        strcat(command, sentbatch); //adds needed ./ before batch data
        printf("rank=%d recieved data:%s", rank, sentbatch);
        //system(command); //should run batch line
    }
    fclose(input);
    MPI_Finalize();
    return(0);
}
Contents of the file being passed in:
LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-006.flx spec-56321-GAC099N59V1_sp01-006.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-008.flx spec-56321-GAC099N59V1_sp01-008.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-010.flx spec-56321-GAC099N59V1_sp01-010.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-013.flx spec-56321-GAC099N59V1_sp01-013.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-015.flx spec-56321-GAC099N59V1_sp01-015.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-018.flx spec-56321-GAC099N59V1_sp01-018.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-022.flx spec-56321-GAC099N59V1_sp01-022.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-023.flx spec-56321-GAC099N59V1_sp01-023.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-024.flx spec-56321-GAC099N59V1_sp01-024.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-025.flx spec-56321-GAC099N59V1_sp01-025.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-028.flx spec-56321-GAC099N59V1_sp01-028.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-029.flx spec-56321-GAC099N59V1_sp01-029.nor f
You will notice there are some things I have not done yet that will be in the final version; they are left commented out to make troubleshooting easier. Mainly because the LAMOST code is not fast and I do not want to wait for it to finish.
Command and output that works:
$mpiexec -N 4 --host 10.0.0.3 -oversubscribe batTest2 shortpass2.bat
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f
rank0 gets: ./LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
sent rank1: LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
sent rank2: LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
sent rank3: LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
rank=1 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
rank=3 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
rank=2 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
Shortpass2 is the same file but with only the first 4 lines. My code should in theory work with all 16 lines, but I will test it against the full file after fixing the current issue.
Command and output when running on multiple nodes:
$mpiexec -N 4 --host 10.0.0.3,10.0.0.4,10.0.0.5,10.0.0.6 -oversubscribe batTest2 shortpass.bat
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f
rank0 gets: ./LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
sent rank1: LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
rank=1 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
sent rank2: LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
rank=2 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
sent rank3: LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-006.flx spec-56321-GAC099N59V1_sp01-006.nor f
rank=3 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
sent rank4: LAMOSTv108 spec-56321-GAC099N59V1_sp01-006.flx spec-56321-GAC099N59V1_sp01-006.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-008.flx spec-56321-GAC099N59V1_sp01-008.nor f
rank=4 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-006.flx spec-56321-GAC099N59V1_sp01-006.nor f
[node2:27622] *** Process received signal ***
[node2:27622] Signal: Segmentation fault (11)
[node2:27622] Signal code: Address not mapped (1)
[node2:27622] Failing at address: (nil)
[node2:27622] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
corrupted double-linked list
Aborted
Sometimes it successfully makes it to rank 5 before aborting entirely, and there will be multiple instances of the same error message. Also, Open MPI was installed with multi-thread support, so that is not the issue. This is my first time using MPI, but this is not the first part of the overall project, and I have done a lot of research on MPI just to get this far.
I know it is not caused by my arrays, because then it would also break on node1. All of the Pis are identical, so it makes no sense for the arrays to be causing the segmentation fault. (Though I admit I have run into that more than once while working on other parts of this project, since I am more used to how Java and C# handle arrays.)
Edit: I checked whether I could run it across 4 cores from one of the other nodes, and it works fine and produces the same output as on node1. So that confirms it is not an array problem occurring only on the other nodes. Also added a line that was missing from the preview printout code.
Edit2: Per Gilles' suggestion: the code also works when running 16 tasks on one node. Here is the output:
$ mpiexec -N 16 --host 10.0.0.3 -oversubscribe batTest4 shortpass.bat
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f
rank0 gets: ./LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
sent rank1: LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
sent rank2: LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
sent rank3: LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-006.flx spec-56321-GAC099N59V1_sp01-006.nor f
sent rank4: LAMOSTv108 spec-56321-GAC099N59V1_sp01-006.flx spec-56321-GAC099N59V1_sp01-006.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-008.flx spec-56321-GAC099N59V1_sp01-008.nor f
sent rank5: LAMOSTv108 spec-56321-GAC099N59V1_sp01-008.flx spec-56321-GAC099N59V1_sp01-008.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-010.flx spec-56321-GAC099N59V1_sp01-010.nor f
sent rank6: LAMOSTv108 spec-56321-GAC099N59V1_sp01-010.flx spec-56321-GAC099N59V1_sp01-010.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-013.flx spec-56321-GAC099N59V1_sp01-013.nor f
sent rank7: LAMOSTv108 spec-56321-GAC099N59V1_sp01-013.flx spec-56321-GAC099N59V1_sp01-013.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-015.flx spec-56321-GAC099N59V1_sp01-015.nor f
sent rank8: LAMOSTv108 spec-56321-GAC099N59V1_sp01-015.flx spec-56321-GAC099N59V1_sp01-015.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-018.flx spec-56321-GAC099N59V1_sp01-018.nor f
sent rank9: LAMOSTv108 spec-56321-GAC099N59V1_sp01-018.flx spec-56321-GAC099N59V1_sp01-018.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-022.flx spec-56321-GAC099N59V1_sp01-022.nor f
sent rank10: LAMOSTv108 spec-56321-GAC099N59V1_sp01-022.flx spec-56321-GAC099N59V1_sp01-022.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-023.flx spec-56321-GAC099N59V1_sp01-023.nor f
sent rank11: LAMOSTv108 spec-56321-GAC099N59V1_sp01-023.flx spec-56321-GAC099N59V1_sp01-023.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-024.flx spec-56321-GAC099N59V1_sp01-024.nor f
sent rank12: LAMOSTv108 spec-56321-GAC099N59V1_sp01-024.flx spec-56321-GAC099N59V1_sp01-024.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-025.flx spec-56321-GAC099N59V1_sp01-025.nor f
sent rank13: LAMOSTv108 spec-56321-GAC099N59V1_sp01-025.flx spec-56321-GAC099N59V1_sp01-025.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-028.flx spec-56321-GAC099N59V1_sp01-028.nor f
sent rank14: LAMOSTv108 spec-56321-GAC099N59V1_sp01-028.flx spec-56321-GAC099N59V1_sp01-028.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-029.flx spec-56321-GAC099N59V1_sp01-029.nor f
sent rank15: LAMOSTv108 spec-56321-GAC099N59V1_sp01-029.flx spec-56321-GAC099N59V1_sp01-029.nor f
rank=3 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
rank=5 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-008.flx spec-56321-GAC099N59V1_sp01-008.nor f
rank=6 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-010.flx spec-56321-GAC099N59V1_sp01-010.nor f
rank=7 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-013.flx spec-56321-GAC099N59V1_sp01-013.nor f
rank=11 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-023.flx spec-56321-GAC099N59V1_sp01-023.nor f
rank=12 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-024.flx spec-56321-GAC099N59V1_sp01-024.nor f
rank=9 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-018.flx spec-56321-GAC099N59V1_sp01-018.nor f
rank=2 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
rank=4 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-006.flx spec-56321-GAC099N59V1_sp01-006.nor f
rank=8 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-015.flx spec-56321-GAC099N59V1_sp01-015.nor f
rank=10 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-022.flx spec-56321-GAC099N59V1_sp01-022.nor f
rank=15 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-029.flx spec-56321-GAC099N59V1_sp01-029.nor f
rank=1 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
rank=13 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-025.flx spec-56321-GAC099N59V1_sp01-025.nor f
rank=14 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-028.flx spec-56321-GAC099N59V1_sp01-028.nor f
【Discussion】:
-
Ranks 4 and above receive but are never sent anything. Try running 16 tasks on a single node and see what happens (you will need mpirun --oversubscribe ...).
-
When you run with mpiexec -N 4 ..., how do you get output such as rank=4 received data ...? In that case the ranks range from 0 to 3.
-
It seems mpiexec -N 4 --oversubscribe ... means 4 MPI tasks per node (at least for Open MPI v2.0.x). In any case, the for() MPI_Send() loop never sends to rank 4 or above (and there is no such thing as preview in the code), so the output clearly does not match the code you posted.
-
Gilles is correct. Using "-N" instead of -n lets it execute 4 tasks per node. As for running 16 tasks on one node, is that allowed? Each Pi only has 4 cores. As for the preview line, I must have forgotten to copy it over from Notepad. It should be just before "if (rank ==0)". It is a test to make sure fgets is working properly.
-
Gilles, I tried running 16 tasks on one node as you suggested in your earlier comments, and it worked fine. I will edit the output into my original question in case it is of any use.
Tags: c raspberry-pi mpi