【Question title】: Slurm wait option: show time waiting
【Posted】: 2022-10-18 22:20:16
【Question description】:

When you apply the --wait flag in a Slurm script, is it possible to display in real time how long it has been waiting?

【Question comments】:

    Tags: time slurm


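    【Solution 1】:

    When sbatch is used with the --wait option, the command does not exit until the submitted job terminates. sbatch has no other option for displaying the pending time.

    However, while the job is still pending, you can open another session and run the following command to display the pending time (in seconds):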

    squeue --Format=PendingTime -j <jobid> --noheader
    

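    One-time display

    If you only want to know how much time elapsed before the job was scheduled, you can add the following line to your batch script: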

    echo "waited: $(squeue --Format=PendingTime -j $SLURM_JOB_ID --noheader | tr -d ' ')s"
    

    Note: the tr command is used here to strip the whitespace padding that squeue adds to the field.
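
    As a minimal sketch of the same idea, the raw seconds can be turned into a more readable form inside the batch script. The helper name format_wait is hypothetical (it is not a Slurm command), and the snippet assumes it is placed after any #SBATCH directives:

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): format whole seconds as h:m:s.
format_wait() {
    printf '%dh:%dm:%ds\n' $(($1 / 3600)) $(($1 % 3600 / 60)) $(($1 % 60))
}

# Only query squeue when running inside a Slurm job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    pending=$(squeue --Format=PendingTime -j "$SLURM_JOB_ID" --noheader | tr -d ' ')
    case $pending in
        ''|*[!0-9]*) echo "waited: unknown" ;;   # squeue returned nothing numeric
        *)           echo "waited: $(format_wait "$pending")" ;;
    esac
fi
```

    The case guard is there so that an empty or non-numeric squeue result does not make printf fail.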

    Real-time counter

    If you want to display the elapsed time in real time, you can drop the --wait option and use an sbatch wrapper instead, for example:

    #!/bin/sh
    
    # Time before issuing another squeue command
    # XXX: Ensure this is large enough to avoid flooding the Slurm controller
    WAIT=20
    
    # Convert seconds to days:hours:minutes:seconds format
    seconds_to_days()
    {
        printf '%dd:%dh:%dm:%ds\n' $(($1/86400)) $(($1%86400/3600)) $(($1%3600/60)) $(($1%60))
    }
    
    # Convert days-hours:minutes:seconds time format to seconds
    squeue_time_to_seconds()
    {
        local time=$(echo $1 | tr -d ' ') # Removing spaces
    
        # Print the input unchanged and return if the time format is not recognized
        # (e.g. it is already a plain number of seconds)
        echo $time | grep -q ':' ||
        {
            printf '%s' "$time"
            return
        }
    
        # Check if time contains hours, otherwise add 0 hour
        [ $(echo $time | awk -F: '{print NF-1}') -eq 2 ] || time="0:$time"
    
        # Check if time contains days, otherwise add 0 day
        echo $time | grep -q '-' || time="0-$time"
    
        # Parse and convert to seconds
        echo $time | tr '-' ':' |
            awk -F: '{ print ($1 * 86400) + ($2 * 3600) + ($3 * 60) + $4 }'
    }
    
    # Poll job counter with squeue
    squeue_polling()
    {
        local counter=$1
        local counter_description=$2
        local jobid=$3
        local prev_time="-${WAIT}"
    
        while true; do
            elapsed_time=$(squeue --Format=$counter -j $jobid --noheader || exit $?)
            elapsed_time=$(squeue_time_to_seconds "$elapsed_time")
    
            # Return in case no counter is found
            if [ -z "$elapsed_time" ]; then
                echo; return
            fi
    
            # Update one more time the counter if it is not progressing anymore
            if [ "$elapsed_time" -lt "$((prev_time + WAIT ))" ]; then
                # Clear the line (ESC[2K), return to column 0, print the final value
                printf '\033[2K\r%s: %s\n' "$counter_description" "$(seconds_to_days $prev_time)"
                return
            fi
    
            # Update the counter without calling squeue to release the pressure on
            # the Slurm controller
            for i in $(seq 1 $WAIT); do
                # Clear the line (ESC[2K), return to column 0, redraw the counter
                printf '\033[2K\r%s: %s' "$counter_description" "$(seconds_to_days $(($elapsed_time + i)))"
                sleep 1
            done
            prev_time=$elapsed_time
        done
    }
    
    # Execute sbatch and display the output
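    OUTPUT=$(sbatch "$@")
    STATUS=$?   # Save sbatch's status before echo overwrites $?
    echo "$OUTPUT"
    
    # Exit on error, propagating sbatch's exit status
    if [ $STATUS -ne 0 ]; then
        exit $STATUS
    fi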
    
    # Parse the job ID
    JOBID=$(echo "$OUTPUT" | sed -rn 's/Submitted batch job ([0-9]+)/\1/p')
    
    # Display pending time until the job is scheduled
    squeue_polling 'PendingTime' 'Pending time' $JOBID
    
    # Display the time used by the allocation until the job is over
    squeue_polling 'TimeUsed' 'Allocation time' $JOBID
    

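    It behaves as if you had submitted the job with the --wait flag (i.e. it returns when the job completes), and the pending time is updated in real time: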

    ./sbatch-wait  <options> <batch script>
    Submitted batch job 42
    Pending time: 0d:0h:1m:0s
    Allocation time: 0d:0h:1m:23s
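
    The time conversion is the least obvious part of the wrapper, so here is a standalone sketch of it for experimenting outside Slurm. The function name to_seconds is hypothetical, and it assumes squeue reports times either as plain seconds or as [days-][hours:]minutes:seconds, which is what the wrapper above also assumes:

```shell
#!/bin/sh
# Hypothetical standalone helper: convert [days-][hours:]minutes:seconds
# into seconds. Input without a ':' is treated as plain seconds and echoed.
to_seconds() {
    t=$1
    case $t in
        *:*) ;;                              # has a clock part, keep parsing
        *)   printf '%s\n' "$t"; return ;;   # already plain seconds
    esac
    # Add a 0-day prefix when no day field is present
    case $t in *-*) ;; *) t="0-$t" ;; esac
    days=${t%%-*}
    clock=${t#*-}
    # Add a 0-hour prefix when only minutes:seconds are present
    case $clock in *:*:*) ;; *) clock="0:$clock" ;; esac
    # awk parses fields as decimal, so leading zeros such as "08" are safe
    printf '%s:%s\n' "$days" "$clock" |
        awk -F: '{ print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }'
}

to_seconds 1-02:03:04   # 1 day, 2 hours, 3 minutes, 4 seconds
to_seconds 03:04        # 3 minutes, 4 seconds
```

    The arithmetic is left to awk rather than $(( )) because shell arithmetic treats numbers with a leading zero as octal, which breaks on fields like 08 or 09.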
    

    【Comments】:

      【Solution 2】:

      A simple approach is to (ab)use the pv command, like this:

      sbatch --wait ... | pv -t
      

      It looks like this:

      $ sbatch --wait --wrap "sleep 30" | pv -t
      Submitted batch job 123456
      0:00:42
      

      The stopwatch stops once the job has finished.

      【Comments】:
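
      If pv is not installed, a similar stopwatch can be approximated in plain shell. This is only a sketch: the function name stopwatch is hypothetical, it assumes date +%s is available (true for GNU and BSD date, but not required by POSIX), and it writes the counter to stderr so that the command's own output stays on stdout:

```shell
#!/bin/sh
# Hypothetical helper: run a command in the background and print the
# elapsed seconds on stderr until it finishes. Returns the command's status.
stopwatch() {
    "$@" &
    pid=$!
    start=$(date +%s)
    # kill -0 only tests whether the process still exists
    while kill -0 "$pid" 2>/dev/null; do
        printf '\r%ds' $(( $(date +%s) - start )) >&2
        sleep 1
    done
    printf '\n' >&2
    wait "$pid"   # collect and return the command's exit status
}
```

      It could then be used as: stopwatch sbatch --wait --wrap "sleep 30". Unlike pv -t, the counter only refreshes once per second.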
