【Question Title】: SageMaker timing out for Flask model deployment
【Posted】: 2021-07-30 00:35:52
【Question Description】:

Below is predict.py from the ECR container. The SageMaker endpoint gives a "Status: Failed" output after retrying for 10-12 minutes. Both the /ping and /invocations methods are available.

/opt/ml/code/predict.py
----------
import logging
import pickle

import flask

logger = logging.getLogger()
logger.setLevel(logging.INFO)
classpath =  <.pkl file> 
model = pickle.load(open(classpath, "rb"))


app = flask.Flask(__name__)
print(app)

@app.route("/ping", methods=["GET"])
def ping():
    """Determine if the container is working and healthy."""
    return flask.Response(response="Flask running", status=200, mimetype="application/json")

@app.route("/invocations", methods=["POST"])
def invocations():
    """Inference code."""
    return flask.Response(response="Invocation Completed", status=200,
                          mimetype="application/json")

I tried both adding and removing the snippet below; either way the endpoint ends up in Failed status.

 if __name__ == '__main__':
     app.run(host='0.0.0.0',port=5000)
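Note that SageMaker hosting sends its health checks and invocation requests to port 8080 inside the container, so binding to 5000 (or the 8000 shown in the gunicorn logs) will not pass the ping check. A hedged sketch of a serving command that binds the expected port (the module name `predict` and the app object name `app` are assumptions taken from the snippet above):

```shell
# Sketch: SageMaker expects the container's web server on 0.0.0.0:8080.
# "predict:app" assumes the Flask app object is named "app" in predict.py.
gunicorn --bind 0.0.0.0:8080 --workers 1 predict:app
```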

Error : 
"The primary container for production variant <modelname> did not pass the ping health check. Please check CloudWatch logs for this endpoint."


SageMaker endpoint CloudWatch logs:
[INFO] Starting gunicorn 20.1.0
[INFO] Listening at: http://0.0.0.0:8000 (1)
[INFO] Using worker: sync
[INFO] Booting worker with pid: 11

【Question Discussion】:

    Tags: python amazon-sagemaker amazon-ecr


    【Solution 1】:

    Your predictor file is meant to test whether the model has loaded in /ping, and whether you can run inference in /invocations. If you trained the model on SageMaker, you need to load it from the /opt/ml directory, as shown below.

    import os
    import pickle

    prefix = "/opt/ml/"
    model_path = os.path.join(prefix, "model")

    class ScoringService(object):
        model = None  # Where we keep the model when it's loaded

        @classmethod
        def get_model(cls):
            """Get the model object for this instance, loading it if it's not already loaded."""
            if cls.model is None:
                with open(os.path.join(model_path, "rf-model.pkl"), "rb") as inp:
                    cls.model = pickle.load(inp)
            return cls.model

        @classmethod
        def predict(cls, input):
            """For the input, do the predictions and return them.

            Args:
                input (a pandas dataframe): The data on which to do the predictions.
                    There will be one prediction per row in the dataframe.
            """
            rf = cls.get_model()
            return rf.predict(input)
    

    This class takes care of loading your model, which we can then verify in /ping.
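    As a self-contained illustration of this lazy-load-and-cache pattern (the temp directory and dummy pickle below stand in for /opt/ml/model and the real rf-model.pkl):

    ```python
    import os
    import pickle
    import tempfile

    # Hypothetical stand-in for /opt/ml/model: pickle a dummy "model" to a temp dir.
    model_dir = tempfile.mkdtemp()
    with open(os.path.join(model_dir, "rf-model.pkl"), "wb") as out:
        pickle.dump({"name": "dummy-model"}, out)

    class ScoringService(object):
        model = None  # cached after the first successful load

        @classmethod
        def get_model(cls):
            """Unpickle the model on first use and cache it on the class."""
            if cls.model is None:
                with open(os.path.join(model_dir, "rf-model.pkl"), "rb") as inp:
                    cls.model = pickle.load(inp)
            return cls.model

    first = ScoringService.get_model()
    second = ScoringService.get_model()
    print(first is second)  # True: the cached object is returned on the second call
    ```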

    import flask

    # The flask app for serving predictions
    app = flask.Flask(__name__)
    
    
    @app.route("/ping", methods=["GET"])
    def ping():
        """Determine if the container is working and healthy. In this sample container, we declare
        it healthy if we can load the model successfully."""
        health = ScoringService.get_model() is not None  # You can insert a health check here
    
        status = 200 if health else 404
        return flask.Response(response="\n", status=status, mimetype="application/json")
    

    SageMaker will use this route to test whether your model has loaded correctly. For /invocations, include the inference logic for whatever data format you pass to the model's predict function.
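    One way to exercise the health check locally, without starting a server, is Flask's built-in test client (a minimal sketch; the real handler would return 200 only when ScoringService.get_model() succeeds, as above):

    ```python
    import flask

    app = flask.Flask(__name__)

    @app.route("/ping", methods=["GET"])
    def ping():
        # Hard-coded healthy here to keep the sketch self-contained.
        return flask.Response(response="\n", status=200, mimetype="application/json")

    with app.test_client() as client:
        resp = client.get("/ping")
        print(resp.status_code)  # 200: SageMaker's ping would pass
    ```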

    import io

    import pandas as pd

    @app.route("/invocations", methods=["POST"])
    def transformation():
        data = None

        # Convert from CSV to pandas
        if flask.request.content_type == "text/csv":
            data = flask.request.data.decode("utf-8")
            s = io.StringIO(data)
            data = pd.read_csv(s, header=None)
        else:
            return flask.Response(
                response="This predictor only supports CSV data", status=415, mimetype="text/plain"
            )

        print("Invoked with {} records".format(data.shape[0]))

        # Do the prediction
        predictions = ScoringService.predict(data)

        # Convert from numpy back to CSV
        out = io.StringIO()
        pd.DataFrame({"results": predictions}).to_csv(out, header=False, index=False)
        result = out.getvalue()

        return flask.Response(response=result, status=200, mimetype="text/csv")
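    The CSV-in/CSV-out conversion above can be checked in isolation; this sketch replaces ScoringService.predict with a stand-in that sums each row (pandas assumed installed):

    ```python
    import io

    import pandas as pd

    payload = "1,2\n3,4\n"  # body of a text/csv request

    # CSV -> DataFrame, as in transformation()
    data = pd.read_csv(io.StringIO(payload), header=None)

    # Stand-in for ScoringService.predict(data): sum each row
    predictions = data.sum(axis=1)

    # DataFrame -> CSV, as in transformation()
    out = io.StringIO()
    pd.DataFrame({"results": predictions}).to_csv(out, header=False, index=False)
    print(out.getvalue())  # "3" and "7", one result per input row
    ```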
    

    Make sure your predictor.py is set up and configured as shown above so that SageMaker can correctly locate and retrieve your model.

    I work for AWS and my opinions are my own.

    【Discussion】:
