【Question title】: Slurm not writing batch script log files
【Posted】: 2021-11-30 21:35:16
【Question】:

When I run the batch script

#!/bin/sh
#SBATCH -n 1
#SBATCH --output=serial_test_%j.log 
hostname
echo "test" > test.log

the file serial_test_142.log is not written, and neither is test.log. /var/log/slurm_JobComp lists

JobId=142 UserId=ebryer(1000) GroupId=ebryer(1000) Name=hw.sh JobState=FAILED Partition=n TimeLimit=UNLIMITED StartTime=2021-11-30T13:52:14 EndTime=2021-11-30T13:52:14 NodeList=n2 NodeCnt=1 ProcCnt=1 WorkDir=/home/ebryer ReservationName= Gres= Account= QOS= WcKey= Cluster=unknown SubmitTime=2021-11-30T13:52:12 EligibleTime=2021-11-30T13:52:12 DerivedExitCode=0:0 ExitCode=1:0 

and /var/log/slurmctld lists

[2021-11-30T13:52:12.678] _slurm_rpc_submit_batch_job: JobId=142 InitPrio=4294901740 usec=2069
[2021-11-30T13:52:14.894] sched: Allocate JobId=142 NodeList=n2 #CPUs=1 Partition=n
[2021-11-30T13:52:14.894] prolog_running_decr: Configuration for JobId=142 is complete
[2021-11-30T13:52:14.962] _job_complete: JobId=142 WEXITSTATUS 1
[2021-11-30T13:52:14.962] _job_complete: JobId=142 done

If I change the batch script line to #SBATCH --output=/tmp/serial_test_%j.log, the job's exit status is 0, as shown in /var/log/slurm_JobComp:

[2021-11-30T13:52:35.265] _slurm_rpc_submit_batch_job: JobId=143 InitPrio=4294901739 usec=2067
[2021-11-30T13:52:35.970] sched: Allocate JobId=143 NodeList=n2 #CPUs=1 Partition=n
[2021-11-30T13:52:35.971] prolog_running_decr: Configuration for JobId=143 is complete
[2021-11-30T13:52:36.247] _job_complete: JobId=143 WEXITSTATUS 0
[2021-11-30T13:52:36.247] _job_complete: JobId=143 done

scontrol shows success, JobState=COMPLETED Reason=None Dependency=(null), but there is no log file in /tmp and no test.log output. Can anyone explain why this happens?

【Comments】:

    Tags: slurm


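    【Solution 1】:

    /home was not mounted on the compute nodes, so the job's working directory (/home/ebryer) did not exist there and neither serial_test_142.log nor test.log could be created. When I pointed the log file at /tmp, it was written to /tmp on the compute node (n5:/tmp), not on the machine I submitted from, which I should have expected.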
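
    One way to catch this kind of problem early is to have the job itself report whether a directory exists and is writable on the node where it runs. The following is only a minimal sketch: check_dir is a hypothetical helper name (not part of Slurm), and it assumes a POSIX sh on the compute node.

```shell
#!/bin/sh
# check_dir is a hypothetical helper (not a Slurm command).
# It reports whether a directory exists and is writable on the host
# where this code runs, and returns non-zero if it is not usable.
check_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "not writable: $dir"
        return 1
    fi
    echo "ok: $dir"
}
```

    Placed at the top of a batch script and called as check_dir "$SLURM_SUBMIT_DIR" || exit 1, with --output pointing at a path that is known to exist on the node (such as /tmp), it would print "missing: /home/ebryer" in a situation like the one above. The resulting log still lands on the compute node, so it has to be read there.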

    【Comments】: