【Question title】: Airflow BashOperator returns exit code 0 even when the task failed with exit code 1
【Posted】: 2020-01-06 12:35:48
【Question】:

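I am trying to run a Spark job on Kubernetes from Airflow's BashOperator. I have configured callback_failure to call a function, but even when the Spark job fails with exit code 1, my task is always marked as successful and the failure callback is never invoked. Below are snippets from the Airflow log: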

[2020-01-03 13:22:46,730] {{bash_operator.py:128}} INFO - 20/01/03 13:22:46 INFO LoggingPodStatusWatcherImpl: Container final statuses:
[2020-01-03 13:22:46,730] {{bash_operator.py:128}} INFO - 
[2020-01-03 13:22:46,730] {{bash_operator.py:128}} INFO - 
[2020-01-03 13:22:46,730] {{bash_operator.py:128}} INFO -    Container name: spark-kubernetes-driver
[2020-01-03 13:22:46,730] {{bash_operator.py:128}} INFO -    Container image: XXXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/spark-py:XX_XX
[2020-01-03 13:22:46,730] {{bash_operator.py:128}} INFO -    Container state: Terminated
[2020-01-03 13:22:46,730] {{bash_operator.py:128}} INFO -    Exit code: 1
[2020-01-03 13:22:46,731] {{bash_operator.py:128}} INFO - 20/01/03 13:22:46 INFO Client: Application run_report_generator finished.
[2020-01-03 13:22:46,736] {{bash_operator.py:128}} INFO - 20/01/03 13:22:46 INFO ShutdownHookManager: Shutdown hook called
[2020-01-03 13:22:46,737] {{bash_operator.py:128}} INFO - 20/01/03 13:22:46 INFO ShutdownHookManager: Deleting directory /tmp/spark-adb99a7e-ce6c-49f6-8307-a17c28448043
[2020-01-03 13:22:46,761] {{bash_operator.py:132}} INFO - Command exited with return code 0
[2020-01-03 13:22:49,994] {{logging_mixin.py:95}} INFO - [2020-01-03 13:22:49,994] {{local_task_job.py:105}} INFO - Task exited with return code 0

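【Comments】:

  • It looks like your bash script returns 0 when the container fails (see github.com/apache/airflow/blob/master/airflow/operators/… for how BashOperator handles exit codes). The job is submitted successfully, but my guess is that the bash script does not check the job's result. Could you post your script and your BashOperator definition?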

Tags: apache-spark kubernetes amazon-eks airflow


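【Solution 1】:

You need to use set -e so that the bash command run by the BashOperator stops at the first command that returns a non-zero exit code and reports that failure.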
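
To illustrate the difference, here is a minimal sketch that runs two small scripts through `bash -c` from Python and compares their exit statuses. It assumes bash is available on the PATH; the helper name `run_bash` and the two scripts are made up for the demonstration and are not part of the original question.

```python
import subprocess


def run_bash(script: str) -> int:
    """Run a script with `bash -c` and return its exit status."""
    return subprocess.run(["bash", "-c", script], capture_output=True).returncode


# Without `set -e`, bash reports the status of the last command only,
# so the failing `false` is masked by the successful `echo`.
without_set_e = run_bash("false; echo 'still running'")

# With `set -e`, bash stops at the first failing command and returns its status.
with_set_e = run_bash("set -e; false; echo 'never reached'")

print(without_set_e, with_set_e)
```

Note that set -e only helps when the failing command itself returns a non-zero status. In the log above, the command as a whole exited with return code 0 even though the driver container reported exit code 1, so it may be that the submitting command itself returned 0, in which case set -e alone would not change the outcome.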

【Comments】:

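    【Solution 2】:

    You have to make sure that the final exit code is not 0.

    According to your log, you have:

    [2020-01-03 13:22:46,761] {{bash_operator.py:132}} INFO - Command exited with return code 0

    so the BashOperator treats the whole task as successful.

    The solution is to make the command exit explicitly with code 1 when the job fails.

    For example, in Python you could have: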

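        import sys

        condition_for_exiting = True  # placeholder: replace with your own failure check
        if condition_for_exiting:
            sys.exit(1)  # a non-zero exit status makes the BashOperator task fail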
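
One possible way to obtain such a condition, sketched under assumptions: the log in the question contains a line of the form "Exit code: 1" for the driver container, so a wrapper could scan the captured output of the submit command for that line. The function name `driver_exit_code`, the regular expression, and the shortened sample log below are hypothetical and only illustrate the idea; the actual output format should be checked against your own logs.

```python
import re
from typing import Optional

# Pattern modelled on the "Exit code: 1" line in the question's log (an assumption
# about the output format, not a documented interface).
EXIT_CODE_PATTERN = re.compile(r"Exit code:\s*(\d+)")


def driver_exit_code(log_text: str) -> Optional[int]:
    """Return the last container exit code reported in the log, or None if absent."""
    matches = EXIT_CODE_PATTERN.findall(log_text)
    return int(matches[-1]) if matches else None


# Shortened sample based on the log lines shown in the question.
sample_log = """\
Container name: spark-kubernetes-driver
Container state: Terminated
Exit code: 1
"""

code = driver_exit_code(sample_log)
print(code)
```

A wrapper script could then call sys.exit(code) whenever the returned code is non-zero, so that the BashOperator sees a failing command. How to treat a missing exit code (None) is a design choice; treating it as a failure is the more cautious option.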
    

    【Comments】: