Google Cloud ML exited with a non-zero status of 245 when training

【Question title】: Google Cloud ML exited with a non-zero status of 245 when training
【Posted】: 2017-04-28 23:40:31
【Question description】:

I tried to train my model on Google Cloud ML using this sample code:

import keras
from keras import optimizers
from keras import losses
from keras import metrics
from keras.models import Model, Sequential
from keras.layers import Dense, Lambda, RepeatVector, TimeDistributed
import numpy as np

def test():
    model = Sequential()
    model.add(Dense(2, input_shape=(3,)))
    model.add(RepeatVector(3))
    model.add(TimeDistributed(Dense(3)))
    model.compile(loss=losses.MSE,
                  optimizer=optimizers.RMSprop(lr=0.0001),
                  metrics=[metrics.categorical_accuracy],
                  sample_weight_mode='temporal')
    x = np.random.random((1, 3))
    y = np.random.random((1, 3, 3))
    model.train_on_batch(x, y)

if __name__ == '__main__':
    test()

I got this error:

The replica master 0 exited with a non-zero status of 245. Termination reason: Error.

The detailed error output is large, so I pasted it here in pastebin.

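【Comments】:

In console.google.com, open the hamburger menu, select "ML Engine > Jobs" and click on your job. Scroll to the bottom. What does your RAM usage look like? Could you have OOMed?
For this particular job it says "there is no data for this chart". But for my other job, which is more complex and fails with the same error, the memory usage is 0.0359.
The log output indicates that you hit a segmentation fault. Did you specify which TensorFlow version to use in your Cloud ML job?
@JeremyLewi No, I didn't specify a version. I just tried running the job on the test code again, and it works now. I will try my main project later.
Possibly your old project was defaulting to an older runtime version that contains an older version of numpy, with which we occasionally saw these segfaults.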
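For readers wondering how the status 245 relates to a segmentation fault: the log quoted in Solution 1 reports exit status -11, and Python's subprocess module uses a negative return code -N to mean the child was terminated by signal N. Signal 11 is SIGSEGV on Linux, and -11 wrapped into a single byte is 245. That the service reports the wrapped value is an inference from those two numbers, not something the thread states. The helper names below are hypothetical, a minimal sketch only:

```python
import signal


def describe_return_code(returncode):
    """Describe a subprocess-style return code.

    A negative value -N means the child was terminated by signal N.
    """
    if returncode < 0:
        sig = signal.Signals(-returncode)
        return "killed by signal {} ({})".format(-returncode, sig.name)
    return "exited with status {}".format(returncode)


def as_unsigned_byte(returncode):
    """Wrap a return code into the 0..255 range of a process exit status."""
    return returncode % 256


print(describe_return_code(-11))  # on Linux: killed by signal 11 (SIGSEGV)
print(as_unsigned_byte(-11))      # 245
```

This only decodes the numbers. It does not say why the process crashed, which in this thread turned out to depend on the runtime and TensorFlow version.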

Tags: machine-learning tensorflow google-cloud-platform google-cloud-ml google-cloud-ml-engine

【Solution 1】:

Note this output:

Module raised an exception for failing to call a subprocess Command '['python', '-m', u'trainer.test', '--job-dir', u'gs://my_test_bucket_keras/s_27_100630']' returned non-zero exit status -11.

I guess Google Cloud runs your code with an extra argument called --job-dir. So maybe you could try adding the following to your sample code?

import ...
import argparse

def test():
    model = Sequential()
    model.add(Dense(2, input_shape=(3,)))
    model.add(RepeatVector(3))
    model.add(TimeDistributed(Dense(3)))
    model.compile(loss=losses.MSE,
                  optimizer=optimizers.RMSprop(lr=0.0001),
                  metrics=[metrics.categorical_accuracy],
                  sample_weight_mode='temporal')
    x = np.random.random((1, 3))
    y = np.random.random((1, 3, 3))
    model.train_on_batch(x, y)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Input Arguments
    parser.add_argument(
      '--job-dir',
      help='GCS location to write checkpoints and export models',
      required=True
    )
    args = parser.parse_args()
    arguments = args.__dict__

    test()
    # test(**arguments) # or if you want to use this job_dir parameter in your code

I'm not 100% sure this will work, but I think it is worth a try. I also have a post here that does something similar; maybe you can take a look at that as well.
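One detail worth checking if you go the test(**arguments) route: argparse stores --job-dir under the key job_dir, so test() has to accept a job_dir parameter or the call fails with a TypeError. Below is a minimal sketch of that wiring. It assumes you want to tolerate any other flags the service might pass, hence parse_known_args. The bucket path and the helper name parse_job_args are made up for illustration, and the training body is left out:

```python
import argparse


def parse_job_args(argv=None):
    """Parse --job-dir and ignore any flags this script does not know about."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--job-dir',
        help='GCS location to write checkpoints and export models',
        required=True
    )
    # parse_known_args returns (namespace, leftover_args) instead of
    # erroring out on unrecognized flags.
    args, _unknown = parser.parse_known_args(argv)
    return vars(args)


def test(job_dir=None):
    """Placeholder for the training function; it only echoes job_dir here."""
    return job_dir


# Hypothetical command line, similar in shape to the one in the log above.
arguments = parse_job_args(['--job-dir', 'gs://example-bucket/example-job'])
print(arguments)          # {'job_dir': 'gs://example-bucket/example-job'}
print(test(**arguments))  # gs://example-bucket/example-job
```

Passing argv=None makes argparse read sys.argv, which is what you want in the real entry point.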

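【Discussion】:

Thanks. I actually followed that tutorial when I started using Google ML, and it worked then. But it looks like there is nothing wrong with the code.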

【Solution 2】:

Problem solved. All I had to do was use tensorflow 1.1.0 instead of the default 1.0.1.

【Discussion】:

How did you change the tensorflow version?
@BadgerCat Just add tensorflow==1.1.0 to install_requires in setup.py.
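For anyone unsure what that looks like, here is a rough sketch of such a setup.py. Only the tensorflow==1.1.0 pin comes from the answer above. The package name trainer is taken from the trainer.test module in the log, and the version and description are placeholders, so adjust them to your project:

```python
from setuptools import find_packages, setup

# Pin TensorFlow as described in the answer. The project's other
# dependencies would be listed here too.
REQUIRED_PACKAGES = ['tensorflow==1.1.0']

setup(
    name='trainer',                      # assumed from the 'trainer.test' module name
    version='0.1',                       # placeholder
    packages=find_packages(),
    install_requires=REQUIRED_PACKAGES,
    description='Training package'       # placeholder
)
```

The file goes in the root of the package you submit, next to the trainer directory.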