【Question title】: Jupyter: The kernel appears to have died. It will restart automatically. (Keras related)
【Posted】: 2020-05-13 21:11:02
【Question】:

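I am trying to train ResNet50, but whatever I do it fails because the Jupyter notebook kernel dies (The kernel appears to have died. It will restart automatically) the moment training starts (Epoch 1/100). I have a GeForce GTX 1060 Ti, and when I run nvidia-smi during training (which only lasts about a second), I see only about 80 MB more memory allocated than before. Then the kernel dies, as if it tries and fails.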

Here are the requirements:

pandas==0.25.1
numpy==1.17.2
opencv-python==4.1.1.26
scikit-image==0.15.0
scikit-learn==0.21.3
tensorflow-gpu==1.14.0
Keras==2.2.5
matplotlib==3.1.1
Pillow==6.1.0
albumentations==0.3.2
tqdm==4.35.0
jupyter

I satisfy all of them. Here is how I set up the training session:

config = tf.ConfigProto()
config.gpu_options.allow_growth = False
config.gpu_options.per_process_gpu_memory_fraction = 0.9
sess = tf.Session(config=config) 
keras.backend.set_session(sess)

keras.__version__
os.environ["CUDA_VISIBLE_DEVICES"] = '0' #yes, this is the ID of my GPU.

# create the FCN model
model_mobilenet = ResNet50(input_shape=(1024, 1024, 3), include_top=False) # use the Resnet
model_x8_output = Conv2D(128, (1, 1), activation='relu')(model_mobilenet.layers[-95].output)
model_x8_output = UpSampling2D(size=(8, 8))(model_x8_output)
model_x8_output = Conv2D(3, (3, 3), padding='same', activation='sigmoid')(model_x8_output)
MODEL_x8 = Model(inputs=model_mobilenet.input, outputs=model_x8_output)

MODEL_x8.compile(loss='binary_crossentropy', optimizer=Adam(lr=1e-3), metrics=[jaccard_distance])

MODEL_x8.fit_generator(train_generator, steps_per_epoch=300, epochs=100, verbose=1, validation_data=val_generator, validation_steps=10)

Epoch 1/100
  1/300 [..............................] - ETA: 1:01:59 - loss: 0.7193 - jaccard_distance: 0.1125

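I have tried:

  • setting config.gpu_options.allow_growth to True
  • setting config.gpu_options.per_process_gpu_memory_fraction to other arbitrary values, e.g. 0.1
  • commenting out the line: #os.environ["CUDA_VISIBLE_DEVICES"] = '0'

None of them worked. I would appreciate constructive answers.

Thanks in advance.

Edit: I have now tried running it as a script (instead of as a notebook), and when the TensorFlow session line is reached, the terminal prints the following: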

2020-01-28 13:44:55.756819: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757047: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757313: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757526: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757736: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757940: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.808416: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-01-28 13:44:55.808444: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...

This is strange, because I do not have CUDA 10 but 9.0, so it should not even be asking for those libraries. Is my TensorFlow version wrong?
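The log above shows the dynamic loader failing to open the CUDA 10.0 runtime libraries. As a rough way to reproduce that check outside TensorFlow, here is a minimal sketch that uses only the standard-library `ctypes` module. The helper name `check_cuda_libs` is hypothetical, and the library names are simply copied from the log.

```python
import ctypes

# Library names copied from the log output above.
CUDA_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
    "libcudnn.so.7",
]


def check_cuda_libs(names=CUDA_LIBS):
    """Return {library name: True if the dynamic loader can open it}."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # same kind of lookup the loader performs
            results[name] = True
        except OSError:
            results[name] = False
    return results


if __name__ == "__main__":
    for name, ok in check_cuda_libs().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If the `.so.10.0` names are missing while the equivalent `.so.9.0` names load, the installed TensorFlow build probably expects a different CUDA release than the one on the machine. The lookup honours `LD_LIBRARY_PATH`, so it should be run in the same shell environment as the training script.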

【Comments】:

    Tags: python tensorflow keras deep-learning jupyter-notebook


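    【Solution 1】:

    This is most likely because there is not enough memory to hold the data and the model. Your input images are also 1024x1024. I suggest you try training with smaller images, such as 256 or even 128, to see whether it works at all.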
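To give a rough sense of why the input size matters, the sketch below estimates the memory taken by one float32 activation tensor. It assumes the first ResNet50 convolution halves the spatial resolution and outputs 64 channels. The helper `activation_mib` is hypothetical, and the figures ignore every other layer, the gradients, and cuDNN workspace memory.

```python
def activation_mib(height, width, channels, batch_size=1, bytes_per_value=4):
    """Size in MiB of one activation tensor (float32 by default)."""
    return batch_size * height * width * channels * bytes_per_value / 2**20


for side in (1024, 256, 128):
    # First ResNet50 convolution: stride 2, 64 output channels (assumed).
    mib = activation_mib(side // 2, side // 2, 64)
    print(f"input {side}x{side}: {mib:.0f} MiB per image for the first conv output")
```

Activation memory grows with the square of the image side, so a 256x256 input needs about 16 times less memory per layer than a 1024x1024 input.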

    Also, is your GPU detected by TF?
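One way to answer that question is to ask TensorFlow which devices it has registered. The sketch below uses `device_lib.list_local_devices()` from `tensorflow.python.client`; the wrapper `list_gpu_devices` is a hypothetical helper that returns `None` when TensorFlow cannot be imported.

```python
def list_gpu_devices():
    """Return the names of GPU devices TensorFlow can see, or None if
    TensorFlow is not importable in this environment."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == "GPU"]


if __name__ == "__main__":
    gpus = list_gpu_devices()
    if gpus is None:
        print("TensorFlow is not installed")
    elif gpus:
        print("GPUs visible to TensorFlow:", gpus)
    else:
        print("TensorFlow is running on CPU only")
```

An empty list means TensorFlow fell back to the CPU, which is consistent with the "Skipping registering GPU devices" warning in the question.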

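    【Comments】:

      【Solution 2】:

      OK, got it.

      The problem was that my tensorflow-gpu version (1.14) was not compatible with my CUDA version (9.0). I had to install a version lower than 1.13. But that was not the only problem. My cuDNN version (705) was also an issue, so I had to downgrade tensorflow-gpu all the way to 1.9.0.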
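The version pairing described here can be written down as a small lookup. This is a sketch only: the table is transcribed from TensorFlow's published "tested build configurations" for Linux GPU builds as best I recall them and should be checked against the official page, it ignores the cuDNN requirement, and `required_cuda` is a hypothetical helper.

```python
# (first TF release, last TF release, CUDA version of the official GPU build).
# Assumed values; verify against TensorFlow's tested build configurations.
TESTED_CUDA = [
    ((1, 13), (1, 15), "10.0"),
    ((1, 5), (1, 12), "9.0"),
]


def required_cuda(tf_version):
    """Return the CUDA version a tensorflow-gpu 1.x release was built
    against, or None if the release is outside the table."""
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    for low, high, cuda in TESTED_CUDA:
        if low <= (major, minor) <= high:
            return cuda
    return None


for version in ("1.14.0", "1.12.0", "1.9.0"):
    print(version, "->", required_cuda(version))
```

With CUDA 9.0 installed, only releases in the second row can load their GPU libraries, which matches the downgrade described above.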

      Everything works now.

      【Comments】:

      • If you prefer, you can keep TF-gpu 1.14 and update your CUDA version to 10.0.