Getting error on ML-Engine predict but local predict works fine

【Question title】: Getting error on ML-Engine predict but local predict works fine
【Posted】: 2017-08-30 06:24:51
【Question description】:

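I searched here extensively but unfortunately could not find an answer.

I am running TensorFlow 1.3 on my local machine (installed via pip on macOS) and created a model using the provided "ssd_mobilenet_v1_coco" checkpoints.

I managed to train both locally and on ML-Engine (runtime 1.2), and successfully deployed my SavedModel to ML-Engine.

Local prediction (command below) works fine and I get the model results: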

gcloud ml-engine local predict --model-dir=... --json-instances=request.json

 FILE request.json: {"inputs": [[[242, 240, 239], [242, 240, 239], [242, 240, 239], [242, 240, 239], [242, 240, 23]]]}
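
For readers building such a file programmatically, here is a minimal sketch. It assumes, as in the request above, that the model's serving signature takes a single input named "inputs" holding rows of [R, G, B] pixel values; the helper name and the tiny example image are hypothetical. `--json-instances` expects one JSON object per line, so the instance is serialized without line breaks.

```python
import json


def make_instance_line(pixels):
    """Serialize one image (rows of [R, G, B] pixels) as a single-line JSON instance."""
    # The "inputs" key is taken from the request shown above; it is assumed
    # to match the input name of the deployed model's serving signature.
    return json.dumps({"inputs": pixels})


# Hypothetical 1x2 image; a real request would carry the full image.
example_pixels = [[[242, 240, 239], [242, 240, 239]]]

if __name__ == "__main__":
    # One JSON object per line: each line is one prediction instance.
    with open("request.json", "w") as f:
        f.write(make_instance_line(example_pixels) + "\n")
```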

However, after deploying the model, when I try to run a remote prediction on ML-Engine with the following command:

gcloud ml-engine predict --model "testModel" --json-instances request.json  # same JSON file as before

I get this error:

{
  "error": "Prediction failed: Exception during model execution: AbortionError(code=StatusCode.INVALID_ARGUMENT, details=\"NodeDef mentions attr 'data_format' not in Op<name=DepthwiseConv2dNative; signature=input:T, filter:T -> output:T; attr=T:type,allowed=[DT_FLOAT, DT_DOUBLE]; attr=strides:list(int); attr=padding:string,allowed=[\"SAME\", \"VALID\"]>; NodeDef: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise = DepthwiseConv2dNative[T=DT_FLOAT, _output_shapes=[[-1,150,150,32]], data_format=\"NHWC\", padding=\"SAME\", strides=[1, 1, 1, 1], _device=\"/job:localhost/replica:0/task:0/cpu:0\"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Relu6, FeatureExtractor/MobilenetV1/Conv2d_1_depthwise/depthwise_weights/read)\n\t [[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise = DepthwiseConv2dNative[T=DT_FLOAT, _output_shapes=[[-1,150,150,32]], data_format=\"NHWC\", padding=\"SAME\", strides=[1, 1, 1, 1], _device=\"/job:localhost/replica:0/task:0/cpu:0\"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Relu6, FeatureExtractor/MobilenetV1/Conv2d_1_depthwise/depthwise_weights/read)]]\")"
}

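I saw something similar here: https://github.com/tensorflow/models/issues/1581

It concerns a problem with the "data_format" attribute. Unfortunately, I could not use that solution because I am already on TensorFlow 1.3.

It also looks like it might be a MobilenetV1 problem: https://github.com/tensorflow/models/issues/2153

Any ideas?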

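【Comments】:

How did you train locally and successfully deploy the SavedModel to ML-Engine? This seems to imply that you trained with TensorFlow 1.3 and are then predicting with version 1.2.
Hi George! Thanks for your comment! I did use TF 1.3 for training, so maybe that's it. But how can I use 1.2 for prediction? Can I set that in the gcloud tool or the web interface?
You could train the model locally with TF 1.2 instead of your current TF 1.3.
Thanks again for the comments, George! In the end, my team and I decided to serve predictions with TensorFlow Serving on a dedicated server. So far, the same model that fails on ML-Engine works fine there. But I hope people with similar problems find this thread and try your suggestion. I am also disappointed at how hard it is to get support from Google (through GCP) =(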

Tags: tensorflow google-prediction

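【Solution 1】:

I had a similar issue. It is caused by a mismatch between the TensorFlow versions used for training and for inference: the exported graph contains an attribute (here 'data_format' on DepthwiseConv2dNative) that the older serving runtime does not know. I solved it by using TensorFlow 1.4 for both training and inference.

Please refer to this answer.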
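
A quick way to catch this kind of mismatch before deploying is to compare the two versions explicitly. The sketch below is illustrative only: the function names are hypothetical, and it assumes that an ML-Engine runtime version corresponds to the major.minor part of the TensorFlow version it ships (as the runtime list in Solution 2 suggests).

```python
def major_minor(version):
    """Return the 'major.minor' prefix of a version string, e.g. '1.4.1' -> '1.4'."""
    return ".".join(version.split(".")[:2])


def versions_match(training_tf_version, serving_runtime_version):
    """True if training and serving use the same major.minor version.

    Assumption: the serving runtime version names the TensorFlow
    major.minor release it runs.
    """
    return major_minor(training_tf_version) == major_minor(serving_runtime_version)


# The situation from the question: trained with 1.3, served on runtime 1.2.
if not versions_match("1.3.0", "1.2"):
    print("Training and serving versions differ; expect NodeDef attr errors.")
```

With TensorFlow installed, the training version can be read from `tf.__version__`.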

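【Discussion】:

Thanks a lot! For the project I am working on we decided not to use GCP-ML, but I will definitely check it out. Since I believe your answer should resolve the issue, I am marking it as solved. I am glad the TF team fixed this =D
Is this issue resolved in TensorFlow 1.9? I tried predicting on CloudML and still get the same error.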

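【Solution 2】:

If you are wondering how to make sure your model version runs the TensorFlow version you need, first see the model versions list page.

You need to know which ML-Engine version supports the TensorFlow version you need. At the time of writing:

ML version 1.4 supports TensorFlow 1.4.0 and 1.4.1
ML version 1.2 supports TensorFlow 1.2.0 and
ML version 1.0 supports TensorFlow 1.0.1

Now that you know which version you need, create a new version of your model like this: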

gcloud ml-engine versions create <version name> \
--model=<Name of the model> \
--origin=<Model bucket link. It starts with gs://...> \
--runtime-version=1.4

In my case, I needed to predict with TensorFlow 1.4.1, so I used runtime version 1.4.
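
If the deployment is scripted, the runtime version can be derived from the training version instead of typed by hand. This is only a sketch: the helper name and the example values are hypothetical, it uses only the flags shown in the command above, and it assumes the runtime version is the major.minor prefix of the TensorFlow version (true for the 1.4.1 case described here).

```python
import shlex


def build_versions_create_cmd(version_name, model, origin, tf_version):
    """Build the argument list for `gcloud ml-engine versions create`.

    Assumes the runtime version is the major.minor prefix of the
    TensorFlow version used for training (e.g. "1.4.1" -> "1.4").
    """
    runtime = ".".join(tf_version.split(".")[:2])
    return [
        "gcloud", "ml-engine", "versions", "create", version_name,
        "--model=" + model,
        "--origin=" + origin,
        "--runtime-version=" + runtime,
    ]


# Hypothetical values for illustration only.
cmd = build_versions_create_cmd("v1", "testModel", "gs://my-bucket/saved_model", "1.4.1")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where gcloud is installed and authenticated.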

See the official MNIST tutorial page and the ML Versioning page.
