【Question Title】: How to allocate memory properly in Keras [Memory allocation error]
【Posted】: 2020-06-03 19:09:40
【Question Description】:

So I'm trying to solve Kaggle's melanoma competition, but when I try to run a simple Keras conv model I keep getting this error:

Resource exhausted: OOM when allocating tensor with shape[20,128,1022,1022] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

At first I tried with 33k images and about 6 layers (I had little idea this kind of error could happen). Then I figured that, since all these images are 1024x1024, I would reduce the layers and inner units to ease the computation, but the problem persisted and I couldn't even get through the first epoch.
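The shape in that error message already tells the story: a single float32 activation of shape [20, 128, 1022, 1022] (the conv1 output for a batch of 20) needs about 10 GiB on its own, while the GPU listed below has 3.82 GiB. A quick back-of-the-envelope check (my own arithmetic, not part of the question):

```python
# Size of one float32 activation tensor of shape [20, 128, 1022, 1022]
batch, channels, h, w = 20, 128, 1022, 1022
bytes_per_float32 = 4

size_bytes = batch * channels * h * w * bytes_per_float32
size_gib = size_bytes / 2**30
print(f"{size_gib:.2f} GiB")  # prints "9.96 GiB" -- for this one tensor alone
```

So the very first conv layer's output already exceeds the card's memory by roughly a factor of 2.6, before weights, gradients, or any other activations are counted.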

Then I created a new directory with only 600 training images and 200 validation images (how could the problem still persist??). As it kept happening, I started to suspect the problem might be my machine's configuration. I'm on Ubuntu 20, and I checked that my GPU is actually being used; in fact, every time I run the code the terminal prints at startup:

Using TensorFlow backend.
2020-06-03 13:48:35.960461: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-03 13:48:35.994757: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-03 13:48:35.995140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 computeCapability: 7.5
coreClock: 1.56GHz coreCount: 16 deviceMemorySize: 3.82GiB deviceMemoryBandwidth: 119.24GiB/s

I don't know whether something is wrong with my configuration; I had no problems with simpler tasks like digit recognition and cats vs. dogs using 10k+ images (those images were 28x28 and 256x256)...

So far my code looks like this (after changing the model to ease the computation):

from keras import models
from keras.layers import Input, Dense, Conv2D, Flatten, MaxPool2D

# Layers

input_layer = Input(shape = (1024,1024,3),dtype = 'float32')
conv1 = Conv2D(128,(3,3),activation = 'relu')(input_layer)
maxpool1 = MaxPool2D((2,2))(conv1)
conv2 = Conv2D(128,(3,3),activation = 'relu',dtype = 'float32')(maxpool1)
# All layers below, down to the next comment, are commented out to take load off the GPU
# maxpool2 = MaxPool2D((2,2))(conv2)
# conv3 = Conv2D(128,(3,3),activation = 'relu',dtype = 'float32')(maxpool2)
# maxpool3 = MaxPool2D((2,2))(conv3)
# conv4 = Conv2D(256,(3,3),activation = 'relu',dtype = 'float32')(maxpool3)
# maxpool4 = MaxPool2D((2,2))(conv4)
# conv5 = Conv2D(256,(3,3),activation = 'relu',dtype = 'float32')(maxpool4)
# maxpool5 = MaxPool2D((2,2))(conv5)
# conv6 = Conv2D(256,(3,3),activation = 'relu',dtype = 'float32')(maxpool5)
# Here the commenting ends
flatten = Flatten()(conv2)
dense1 = Dense(64,activation = 'relu')(flatten)
output_layer = Dense(1, activation='sigmoid')(dense1)

# Generating the model

model = models.Model(inputs = input_layer, outputs = output_layer)

from keras import optimizers

model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=1e-4),
              metrics=['acc'])

from keras.preprocessing.image import ImageDataGenerator

# All images will be rescaled by 1./255
train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        # This is the target directory
        train_dir,
        # All images will be resized to 1024x1024
        target_size=(1024, 1024),
        batch_size=20,
        # Since we use binary_crossentropy loss, we need binary labels
        class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
        valid_dir,
        target_size=(1024, 1024),
        batch_size=20,
        class_mode='binary')


history = model.fit_generator(
      train_generator,
      steps_per_epoch=30,
      epochs=30,
      validation_data=validation_generator,
      validation_steps=10)
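Incidentally, truncating the model at conv2 makes the memory pressure worse rather than better: flattening the 509x509x128 activation feeds about 33 million values into Dense(64), which means a weight matrix of over two billion parameters (roughly 8 GB in float32, just for that one layer's weights). A quick check of that count (my own arithmetic, assuming the 3x3 'valid' convolutions and 2x2 pooling used above):

```python
# Spatial size after conv1 (3x3, valid), maxpool1 (2x2), conv2 (3x3, valid)
size = 1024
size -= 2        # conv1: 1024 -> 1022
size //= 2       # maxpool1: 1022 -> 511
size -= 2        # conv2: 511 -> 509

flatten_units = 128 * size * size          # channels * H * W
dense_params = flatten_units * 64 + 64     # Dense(64) weights + biases
print(flatten_units, dense_params)         # prints "33162368 2122391616"
```

More pooling before the Flatten, or a GlobalAveragePooling layer, would shrink this drastically.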

Any ideas or suggestions are welcome, thank you very much for your time!

【Comments】:

  • You should downscale the images; 1024x1024 is too large. Something like 256x256 is more reasonable.
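The effect of that downscaling is dramatic: flow_from_directory resizes on the fly, so passing target_size=(256, 256) instead of (1024, 1024) is enough, and the conv2 activation that needed ~10 GiB shrinks to well under a quarter of a GiB. A quick sketch of the arithmetic (my own, assuming the same 3x3 'valid' convolutions and 2x2 pooling as in the question's model):

```python
# Spatial size after conv1 (3x3, valid), maxpool1 (2x2), conv2 (3x3, valid)
# when the generator resizes inputs to 256x256 instead of 1024x1024.
size = 256
size -= 2        # conv1: 256 -> 254
size //= 2       # maxpool1: 254 -> 127
size -= 2        # conv2: 127 -> 125

# float32 activation for a batch of 20 with 128 channels
tensor_bytes = 20 * 128 * size * size * 4
print(f"{tensor_bytes / 2**30:.3f} GiB")  # prints "0.149 GiB"
```

That is a ~67x reduction for the activation memory, which comfortably fits a 3.82 GiB card.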

Tags: python machine-learning keras deep-learning data-science


【Solution 1】:

You could try using a TPU: https://www.kaggle.com/product-feedback/129828

# detect and init the TPU
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)

# instantiate a distribution strategy
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)

# instantiating the model in the strategy scope creates the model on the TPU
with tpu_strategy.scope():
    model = tf.keras.Sequential( … ) # define your model normally
    model.compile( … )

# train model normally
model.fit(training_dataset, epochs=EPOCHS, steps_per_epoch=…)

But 1024x1024 is too large for Kaggle anyway.
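Independently of the TPU route, on a local GPU it can also help to let TensorFlow allocate GPU memory incrementally instead of reserving nearly all of it at startup. This is a TF 2.x configuration sketch of my own suggestion, not part of the original answer, and it cannot help when a single tensor is simply larger than the card's 3.82 GiB:

```python
import tensorflow as tf

# Ask TensorFlow to grow GPU memory on demand rather than
# pre-allocating almost all of it when the first op runs.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```

This must run before any GPU op is executed; combined with smaller input images it often avoids spurious OOMs caused by up-front allocation.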

【Comments】:
