[Title]: Google Colab keeps crashing when trying to run Keras Tuner
[Posted]: 2021-03-14 10:22:35
[Description]:

This is my first machine learning project, using a dataset I created myself.

Unfortunately, Google Colab keeps crashing. It seems to be related to Keras Tuner, but I'm not sure.

It actually worked for a while, but now it crashes immediately when I run it.

Edit: Colab crashes when I run tuner.search.

The logs (read from bottom to top):

Dec 2, 2020, 12:53:12 PM    WARNING 
WARNING:root:kernel e615fcc9-5bdc-44af-ad35-ee2a772f131f restarted
Dec 2, 2020, 12:53:12 PM    INFO    KernelRestarter: 
restarting kernel (1/5), keep random ports
Dec 2, 2020, 12:53:11 PM    WARNING 2020-12-02 11:53:11.006902: 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] 
Created TensorFlow device 
(/job:localhost/replica:0/task:0/device:GPU:0 with 10630 MB memory) 
-> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, 
compute capability: 3.7)
Dec 2, 2020, 12:53:11 PM    WARNING 2020-12-02 11:53:11.006032: 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] 
successful NUMA node read from SysFS had negative value (-1), 
but there must be at least one NUMA node, so returning NUMA node zero
Dec 2, 2020, 12:53:11 PM    WARNING 2020-12-02 11:53:11.004903: 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] 
successful NUMA node read from SysFS had negative value (-1), 
but there must be at least one NUMA node, so returning NUMA node zero
Dec 2, 2020, 12:53:11 PM    WARNING 2020-12-02 11:53:11.004580: 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
Dec 2, 2020, 12:53:11 PM    WARNING 2020-12-02 11:53:11.004559: 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
Dec 2, 2020, 12:53:11 PM    WARNING 2020-12-02 11:53:11.004497: 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] 
Device interconnect StreamExecutor with strength 1 edge matrix:
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.529441: 
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] 
Successfully opened dynamic library libcudart.so.10.1
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.529298: 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] 
Adding visible gpu devices: 0
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.528166: 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] 
successful NUMA node read from SysFS had negative value (-1), 
but there must be at least one NUMA node, so returning NUMA node zero
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.526440: 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] 
successful NUMA node read from SysFS had negative value (-1), 
but there must be at least one NUMA node, so returning NUMA node zero
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.526344: 
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] 
Successfully opened dynamic library libcudnn.so.7
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.526305: 
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] 
Successfully opened dynamic library libcusparse.so.10
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.526268: 
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] 
Successfully opened dynamic library libcusolver.so.10
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.526227: 
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] 
Successfully opened dynamic library libcurand.so.10
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.526186: 
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] 
Successfully opened dynamic library libcufft.so.10
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.526125: 
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] 
Successfully opened dynamic library libcublas.so.10
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.525706: 
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] 
Successfully opened dynamic library libcudart.so.10.1
Dec 2, 2020, 12:53:10 PM    WARNING coreClock: 0.8235GHz coreCount: 
13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
Dec 2, 2020, 12:53:10 PM    WARNING pciBusID: 0000:00:04.0 name: 
Tesla K80 computeCapability: 3.7
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.525625: 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] 
Found device 0 with properties:
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.524630: 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] 
successful NUMA node read from SysFS had negative value (-1)
, but there must be at least one NUMA node, so returning NUMA node zero
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.523938: 
I tensorflow/compiler/xla/service/service.cc:176] 
StreamExecutor device (0): Tesla K80, Compute Capability 3.7
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.523902: 
I tensorflow/compiler/xla/service/service.cc:168] 
XLA service 0x7a39500 initialized for platform CUDA 
(this does not guarantee that XLA will be used). Devices:
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.522755: 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] 
successful NUMA node read from SysFS had negative value (-1), 
but there must be at least one NUMA node, so returning NUMA node zero
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.467341: 
I tensorflow/compiler/xla/service/service.cc:176] 
StreamExecutor device (0): Host, Default Version
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.467308: 
I tensorflow/compiler/xla/service/service.cc:168] 
XLA service 0x2383480 initialized for platform Host 
(this does not guarantee that XLA will be used). Devices:
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.466693: 
I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] 
CPU Frequency: 2300000000 Hz

My code:

import tensorflow as tf
import kerastuner
from tensorflow import keras
from kerastuner.tuners import RandomSearch
from kerastuner.engine.hypermodel import HyperModel
from kerastuner.engine.hyperparameters import HyperParameters
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.losses import sparse_categorical_crossentropy

!unzip -q /content/paintings.zip

data_dir = "/content/paintings"

# These three lines of code are only here because I read somewhere
# that they would help solve the problem, but they don't.
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
for device in gpu_devices:
    tf.config.experimental.set_memory_growth(device, True)

num_classes = 50
nb_epochs = 10
batch_size = 16
img_height = 128
img_width = 128

train_datagen = ImageDataGenerator(rescale=1./255,
    validation_split=0.2) 

train_generator = train_datagen.flow_from_directory(
    data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    shuffle = True,
    class_mode="sparse",
    subset='training') 

validation_generator = train_datagen.flow_from_directory(
    data_dir, 
    target_size=(img_height, img_width),
    batch_size=batch_size,
    shuffle = True,
    class_mode="sparse",
    subset='validation') 

hp = HyperParameters()
hp.Choice('learning_rate', [0.005, 1e-4])
hp.Int('num_layers_conv', 1, 5)
hp.Int('num_layers_dense', 1, 3)
hp.Int('dense_n',
        min_value=0,
        max_value=500,
        step=50)
hp.Choice(
        'activation',
        values=['relu', 'tanh'],
        default='relu')
hp.Float('dropout',
          min_value=0.0,
          max_value=0.5,
          default=0.25,
          step=0.05)

def build_model(hp):
    model = keras.Sequential()

    for i in range(hp.get('num_layers_conv')): 
        model.add(layers.Conv2D
            (filters=hp.Int('filters_' + str(i), 0, 512, step=32),
            kernel_size=hp.Int('kernel_size_' + str(i), 3, 5), padding="same", 
            activation=hp.get('activation')))

    model.add(layers.MaxPooling2D(pool_size=(2,2)))
  
    model.add(layers.Conv2D(32, kernel_size=(3, 3), activation='relu'))
    
    model.add(layers.MaxPooling2D(pool_size=(2,2)))

    model.add(layers.Flatten())
    
    for i in range(hp.get('num_layers_dense')): 
        model.add(layers.Dense(units=hp.get('dense_n'), 
        activation=hp.get('activation')))
        model.add(layers.BatchNormalization())
        model.add(layers.Dropout(rate=hp.get('dropout')))

    model.add(layers.Dense(num_classes, activation='softmax'))
    
    model.compile(
        optimizer=keras.optimizers.Adam(hp.get('learning_rate')),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'])
    
    return model

tuner = RandomSearch(
    build_model,
    max_trials=100,
    executions_per_trial=1,
    hyperparameters=hp,
    directory = "output",
    project_name = "ArtNet",
    objective='val_accuracy')

tuner.search(train_generator,
             epochs=10,
             validation_data=validation_generator)

Any help would be greatly appreciated!

[Discussion]:

    Tags: tensorflow keras deep-learning google-colaboratory keras-tuner


    [Solution 1]:

    This may be because you have multiple Colab tabs open and are running out of RAM. Use a single tab and run the process there. Use the code below to check how much RAM you have and how much starting the process requires. Let me know if this works.

    # memory footprint support libraries/code
    !ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
    !pip install gputil
    !pip install psutil
    !pip install humanize
    import os
    import psutil
    import humanize
    import GPUtil as GPU

    GPUs = GPU.getGPUs()
    # XXX: there is only one GPU on Colab, and even that isn't guaranteed
    gpu = GPUs[0]

    def printm():
        process = psutil.Process(os.getpid())
        print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available),
              " | Proc size: " + humanize.naturalsize(process.memory_info().rss))
        print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(
            gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil * 100, gpu.memoryTotal))

    printm()
    

    [Comments]:

    • Yes, I only have one tab open. Running the code above beforehand gives this output: Gen RAM Free: 12.6 GB | Proc size: 388.7 MB | GPU RAM Free: 15069MB | Used: 10MB | Util 0% | Total 15079MB... So should I put printm() after tuner.search, since that is what starts the process? I tried that, but it crashed and I got no output from printm().
    • Do your TensorFlow and CUDA versions match? If they don't, that mismatch can also cause crashes.
    • Now that you mention it, I really think that could be the problem, because a while ago I installed a different version of the CUDA toolkit and cuDNN. I've since uninstalled it, but that didn't help. A few weeks ago I even restored my system to an earlier point, but that didn't help either. Do you have any other ideas?
    • I'm not talking about another version, I'm talking about a matching version. The TensorFlow version you use has to support that CUDA and cuDNN version.
    • It looks like CUDA 10.1 comes with the TensorFlow version you are using. Try 1.15.4.
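    The version-matching advice above can be checked directly in a Colab cell: run `import tensorflow as tf; print(tf.__version__)` and `!nvcc --version`, then compare the pair against TensorFlow's tested build configurations. As a minimal sketch, the hypothetical helper below (not from the original post) parses the CUDA release number out of `nvcc --version`-style output; the sample string is illustrative.

    ```python
    def parse_nvcc_version(nvcc_output: str) -> str:
        """Extract the CUDA release (e.g. '10.1') from `nvcc --version` output."""
        for line in nvcc_output.splitlines():
            if "release" in line:
                # line looks like: "Cuda compilation tools, release 10.1, V10.1.243"
                return line.split("release")[1].split(",")[0].strip()
        return "unknown"

    # Sample shaped like what `!nvcc --version` prints on a Colab VM:
    sample = "Cuda compilation tools, release 10.1, V10.1.243"
    print(parse_nvcc_version(sample))  # -> 10.1
    ```

    If the parsed CUDA release and `tf.__version__` are not a supported pair, reinstalling a matching TensorFlow wheel is the usual fix.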