【Posted】:2022-10-18 22:20:16
【Question】:
When you use the --wait flag in a Slurm script, is it possible to display in real time how long it has been waiting?
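【Discussion】:
When sbatch is used with the --wait option, the command does not exit until the submitted job terminates.
sbatch has no other option for displaying the pending time.
However, while the job is still pending, you can open another session and run the following command to display the pending time (in seconds):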
squeue --Format=PendingTime -j <jobid> --noheader
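If you want to reuse that command from other scripts, it can be wrapped in a small function. The following is only a sketch: the name pending_seconds is hypothetical (it is not part of Slurm), and it assumes that squeue prints a bare number of seconds for PendingTime and prints nothing usable once the job is unknown to the queue.

```shell
#!/bin/sh
# pending_seconds is a hypothetical helper name, not a Slurm command.
# It prints the pending time of job $1 in seconds, or returns 1 when
# squeue gives no numeric answer (for example, the job is unknown).
pending_seconds()
{
    out=$(squeue --Format=PendingTime -j "$1" --noheader 2>/dev/null | tr -d ' ')
    case $out in
        ''|*[!0-9]*) return 1 ;;        # empty or non-numeric output
        *) printf '%s\n' "$out" ;;
    esac
}

# Usage (job ID 42 is a placeholder):
#   pending_seconds 42 || echo "job 42 is not in the queue"
```

Returning a non-zero status instead of printing an empty string lets the caller tell "no longer in the queue" apart from "pending for 0 seconds".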
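One-time display
If you only want to know how much time elapsed before the job was scheduled, you can add the following line to your batch script: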
echo "waited: $(squeue --Format=PendingTime -j $SLURM_JOB_ID --noheader | tr -d ' ')s"
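Note: the tr command is used here to remove the trailing spaces added by squeue.
Real-time counter
If you want to display the elapsed time in real time, you can drop the --wait option and use an sbatch wrapper instead, for example: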
#!/bin/sh
# Time before issuing another squeue command
# XXX: Ensure this is large enough to avoid flooding the Slurm controller
WAIT=20

# Convert seconds to days:hours:minutes:seconds format
seconds_to_days()
{
    printf '%dd:%dh:%dm:%ds\n' $(($1/86400)) $(($1%86400/3600)) $(($1%3600/60)) $(($1%60))
}

# Convert days-hours:minutes:seconds time format to seconds
# Note: "local" is not strictly POSIX, but dash, bash and busybox sh support it
squeue_time_to_seconds()
{
    local time=$(echo "$1" | tr -d ' ') # Removing spaces
    # Print input and return if the time format is not recognized
    echo "$time" | grep -q ':' ||
    {
        printf '%s' "$time"
        return
    }
    # Check if time contains hours, otherwise add 0 hour
    [ $(echo "$time" | awk -F: '{print NF-1}') -eq 2 ] || time="0:$time"
    # Check if time contains days, otherwise add 0 day
    echo "$time" | grep -q -e '-' || time="0-$time"
    # Parse and convert to seconds
    echo "$time" | tr '-' ':' |
        awk -F: '{ print ($1 * 86400) + ($2 * 3600) + ($3 * 60) + $4 }'
}

# Poll job counter with squeue
squeue_polling()
{
    local counter=$1
    local counter_description=$2
    local jobid=$3
    local prev_time="-${WAIT}"
    while true; do
        # If squeue fails, the output is empty and the check below returns
        elapsed_time=$(squeue --Format=$counter -j $jobid --noheader)
        elapsed_time=$(squeue_time_to_seconds "$elapsed_time")
        # Return in case no counter is found
        if [ -z "$elapsed_time" ]; then
            echo; return
        fi
        # Update one more time the counter if it is not progressing anymore
        if [ "$elapsed_time" -lt "$((prev_time + WAIT))" ]; then
            # ESC[2K clears the line, \r moves the cursor back to column 0
            printf '\033[2K\r%s: %s\n' "$counter_description" "$(seconds_to_days $prev_time)"
            return
        fi
        # Update the counter without calling squeue to release the pressure on
        # the Slurm controller
        for i in $(seq 1 $WAIT); do
            printf '\033[2K\r%s: %s' "$counter_description" "$(seconds_to_days $((elapsed_time + i)))"
            sleep 1
        done
        prev_time=$elapsed_time
    done
}

# Execute sbatch and display the output
OUTPUT=$(sbatch "$@")
STATUS=$?   # Save the status now: the echo below would overwrite $?
echo "$OUTPUT"
# Exit on error
if [ $STATUS -ne 0 ]; then
    exit $STATUS
fi
# Parse the job ID
JOBID=$(echo "$OUTPUT" | sed -En 's/Submitted batch job ([0-9]+)/\1/p')
# Display pending time until the job is scheduled
squeue_polling 'PendingTime' 'Pending time' $JOBID
# Display the time used by the allocation until the job is over
squeue_polling 'TimeUsed' 'Allocation time' $JOBID
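The wrapper behaves as if you had submitted the job with the --wait flag (that is, it returns when the job completes), and the pending time is updated in real time: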
./sbatch-wait <options> <batch script>
Submitted batch job 42
Pending time: 0d:0h:1m:0s
Allocation time: 0d:0h:1m:23s
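The time conversion used by the wrapper can also be written as a single awk program, which makes the arithmetic easier to check by hand. This is only a sketch under one assumption: the input has the days-hours:minutes:seconds shape that the wrapper's comments describe, with the days and hours parts optional. The function name to_seconds is hypothetical.

```shell
#!/bin/sh
# to_seconds is a hypothetical helper name.
# It converts [days-][hours:]minutes:seconds (or a bare number of
# seconds) to a total number of seconds.
to_seconds()
{
    printf '%s\n' "$1" | awk '{
        n = split($0, dh, "-")          # optional "days-" prefix
        days = 0; clock = dh[1]
        if (n == 2) { days = dh[1]; clock = dh[2] }
        m = split(clock, f, ":")        # 1 to 3 colon-separated fields
        secs = 0
        for (i = 1; i <= m; i++)        # base-60 accumulation
            secs = secs * 60 + f[i]
        print days * 86400 + secs
    }'
}

to_seconds 1-02:03:04   # 86400 + 2*3600 + 3*60 + 4 = 93784
to_seconds 5:07         # 5*60 + 7 = 307
```

Accumulating in base 60 avoids having to pad missing hour or day fields first, which is the step the wrapper performs with grep and string concatenation.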
【Discussion】:
A simple approach is to (ab)use the pv command, like this:
sbatch --wait ... | pv -t
It looks like this:
$ sbatch --wait --wrap "sleep 30" | pv -t
Submitted batch job 123456
0:00:42
The stopwatch stops once the job has finished.
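If pv is not installed, a similar stopwatch can be approximated in plain shell by running the command in the background and printing the elapsed time until it exits. This is a rough sketch, not an equivalent of pv: the function name run_with_stopwatch is hypothetical, the timer is written to stderr, and it assumes that date +%s is available (it is common but not required by POSIX).

```shell
#!/bin/sh
# run_with_stopwatch is a hypothetical helper name.
# It runs the given command in the background, shows the elapsed
# seconds on stderr once per second, and returns the command's status.
run_with_stopwatch()
{
    "$@" &
    pid=$!
    start=$(date +%s)
    # kill -0 sends no signal; it only tests whether the process exists
    while kill -0 "$pid" 2>/dev/null; do
        now=$(date +%s)
        printf '\rElapsed: %ds' "$((now - start))" >&2
        sleep 1
    done
    printf '\n' >&2
    wait "$pid"     # collect and return the exit status of the command
}

# Usage:
#   run_with_stopwatch sbatch --wait --wrap "sleep 30"
```

Because the timer goes to stderr, the "Submitted batch job" line printed by sbatch on stdout can still be captured or piped separately.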