How to solve "OOM when allocating tensor with shape[XXX]" in TensorFlow (when training a GCN)

【Question title】: How to solve "OOM when allocating tensor with shape[XXX]" in tensorflow (when training a GCN)
【Posted】: 2021-04-20 11:31:07
【Question description】:

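So... I have checked a few posts about this issue (there are probably many more that I have not checked, but I think it is reasonable to ask for help at this point), and I have not found a solution that fits my situation.

This OOM error message always appears (without a single exception) in the second round of a k-fold training loop, whatever the number of folds, and also when I re-run the training code after a first run. So this may be related to this post: A previous stackoverflow question for OOM linked with tf.nn.embedding_lookup(), but I am not sure which function causes my problem.

My NN is a GCN with two graph convolutional layers, and I am running the code on a server with several 10 GB Nvidia P102-100 GPUs. I have set batch_size to 1, but nothing changed. I am also using Jupyter Notebook rather than running a Python script from the command line, because on the command line I cannot even get through one round... By the way, does anyone know why some code runs without problems in Jupyter while hitting OOM on the command line? It seems a bit strange to me.

UPDATE: After replacing Flatten() with GlobalMaxPool(), the error disappeared and I can run the code smoothly. However, if I add one more GC layer, the error appears in the first round. So I guess the core issue is still there...

UPDATE2: I tried replacing tf.Tensor with tf.SparseTensor. It worked, but it did not help. I also tried the mirrored strategy mentioned in ML_Engine's answer, but it looks like one of the GPUs is occupied far more than the others and OOM still appears. Perhaps it is a form of "data parallelism" and, since I set batch_size to 1, it cannot solve my problem?

Code (adapted from GCNG):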

from keras import Input, Model
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
from keras.regularizers import l2
import tensorflow as tf
#from spektral.datasets import mnist
from spektral.layers import GraphConv
from spektral.layers.ops import sp_matrix_to_sp_tensor
from spektral.utils import normalized_laplacian
from keras.utils import plot_model
from sklearn import metrics
import numpy as np
import gc
import os  # needed below for os.path.isdir / os.makedirs

l2_reg = 5e-7  # Regularization rate for l2
learning_rate = 1*1e-6  # Learning rate for the Adam optimizer
batch_size = 1  # Batch size
epochs = 1 # Number of training epochs
es_patience = 50  # Patience for early stopping

# DATA IMPORTING & PREPROCESSING OMITTED

# this part of adjacency matrix calculation is not important...
fltr = self_connection_normalized_adjacency(adj)
test = fltr.toarray()
t = tf.convert_to_tensor(test)
A_in = Input(tensor=t)
del fltr, test, t
gc.collect()


# Here comes the issue.

for test_indel in range(1,11):

    # SEVERAL LINES OMITTED (get X_train, y_train, X_val, y_val, X_test, y_test)
    
    # Build model
    N = X_train.shape[-2]  # Number of nodes in the graphs
    F = X_train.shape[-1]  # Node features dimensionality
    n_out = y_train.shape[-1]  # Dimension of the target
    X_in = Input(shape=(N, F))
    graph_conv = GraphConv(32,activation='elu',kernel_regularizer=l2(l2_reg),use_bias=True)([X_in, A_in])
    graph_conv = GraphConv(32,activation='elu',kernel_regularizer=l2(l2_reg),use_bias=True)([graph_conv, A_in])
    flatten = Flatten()(graph_conv)
    fc = Dense(512, activation='relu')(flatten)
    output = Dense(n_out, activation='sigmoid')(fc)
    model = Model(inputs=[X_in, A_in], outputs=output)
    optimizer = Adam(lr=learning_rate)
    model.compile(optimizer=optimizer,loss='binary_crossentropy',metrics=['acc'])
    model.summary()

    save_dir = current_path+'/'+str(test_indel)+'_self_connection_Ycv_LR_as_nega_rg_5-7_lr_1-6_e'+str(epochs)
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)
    early_stopping = EarlyStopping(monitor='val_acc', patience=es_patience, verbose=0, mode='auto')
    checkpoint1 = ModelCheckpoint(filepath=save_dir + '/weights.{epoch:02d}-{val_loss:.2f}.hdf5', monitor='val_loss',verbose=1, save_best_only=False, save_weights_only=False, mode='auto', period=1)
    checkpoint2 = ModelCheckpoint(filepath=save_dir + '/weights.hdf5', monitor='val_acc', verbose=1,save_best_only=True, mode='auto', period=1)
    callbacks = [checkpoint2, early_stopping]

    # Train model
    validation_data = (X_val, y_val)
    print('batch size = '+str(batch_size))
    history = model.fit(X_train,y_train,batch_size=batch_size,validation_data=validation_data,epochs=epochs,callbacks=callbacks)

    # Prediction and write-file code omitted
    del X_in, X_data_train,Y_data_train,gene_pair_index_train,count_setx_train,X_data_test, Y_data_test,gene_pair_index_test,trainX_index,validation_index,train_index, X_train, y_train, X_val, y_val, X_test, y_test, validation_data, graph_conv, flatten, fc, output, model, optimizer, history 
    gc.collect()

Model summary:

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_2 (InputLayer)            (None, 13129, 2)     0                                            
__________________________________________________________________________________________________
input_1 (InputLayer)            (13129, 13129)       0                                            
__________________________________________________________________________________________________
graph_conv_1 (GraphConv)        (None, 13129, 32)    96          input_2[0][0]                    
                                                                 input_1[0][0]                    
__________________________________________________________________________________________________
graph_conv_2 (GraphConv)        (None, 13129, 32)    1056        graph_conv_1[0][0]               
                                                                 input_1[0][0]                    
__________________________________________________________________________________________________
flatten_1 (Flatten)             (None, 420128)       0           graph_conv_2[0][0]               
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 512)          215106048   flatten_1[0][0]                  
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 1)            513         dense_1[0][0]                    
==================================================================================================
Total params: 215,107,713
Trainable params: 215,107,713
Non-trainable params: 0
__________________________________________________________________________________________________
batch size = 1
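
For scale, the input_1 shape in this summary corresponds to a dense 13129 x 13129 adjacency matrix. The back-of-the-envelope sketch below estimates its footprint. The element sizes and the average node degree used for the sparse comparison are assumptions, since the post states neither the dtype nor the number of edges; the helper names are illustrative only.

```python
N_NODES = 13129  # number of nodes, from the model summary
MIB = 2 ** 20


def dense_bytes(n_nodes, bytes_per_element):
    """Bytes needed to store a dense n_nodes x n_nodes matrix."""
    return n_nodes * n_nodes * bytes_per_element


def coo_bytes(n_nodes, avg_degree, value_bytes=4, index_bytes=8):
    """Bytes for a coordinate-format sparse matrix.

    Each stored entry holds one value plus a (row, col) pair of
    int64 indices, which is how tf.SparseTensor lays out its data.
    """
    nnz = n_nodes * avg_degree
    return nnz * (value_bytes + 2 * index_bytes)


# Hypothetical average degree, chosen only to illustrate the scale.
ASSUMED_AVG_DEGREE = 10

print("dense float32: %.1f MiB" % (dense_bytes(N_NODES, 4) / MIB))
print("dense float64: %.1f MiB" % (dense_bytes(N_NODES, 8) / MIB))
print("sparse (assumed degree %d): %.1f MiB"
      % (ASSUMED_AVG_DEGREE, coo_bytes(N_NODES, ASSUMED_AVG_DEGREE) / MIB))
```

Under these assumptions the dense matrix takes several hundred MiB while the sparse form takes a few MiB, so switching to tf.SparseTensor saves memory without touching the tensor of shape [420128,512] named in the error message.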

Error message (note that this message never appears in the first round after restarting the kernel and clearing the output):

Train on 2953 samples, validate on 739 samples
Epoch 1/1
---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-5-943385df49dc> in <module>()
     62     mem = psutil.virtual_memory()
     63     print("current mem " + str(round(mem.percent))+'%')
---> 64     history = model.fit(X_train,y_train,batch_size=batch_size,validation_data=validation_data,epochs=epochs,callbacks=callbacks)
     65     mem = psutil.virtual_memory()
     66     print("current mem " + str(round(mem.percent))+'%')

/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
   1237                                         steps_per_epoch=steps_per_epoch,
   1238                                         validation_steps=validation_steps,
-> 1239                                         validation_freq=validation_freq)
   1240 
   1241     def evaluate(self,

/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/keras/engine/training_arrays.py in fit_loop(model, fit_function, fit_inputs, out_labels, batch_size, epochs, verbose, callbacks, val_function, val_inputs, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq)
    194                     ins_batch[i] = ins_batch[i].toarray()
    195 
--> 196                 outs = fit_function(ins_batch)
    197                 outs = to_list(outs)
    198                 for l, o in zip(out_labels, outs):

/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/tensorflow/python/keras/backend.py in __call__(self, inputs)
   3290 
   3291     fetched = self._callable_fn(*array_vals,
-> 3292                                 run_metadata=self.run_metadata)
   3293     self._call_fetch_callbacks(fetched[-len(self._fetches):])
   3294     output_structure = nest.pack_sequence_as(

/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
   1456         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1457                                                self._handle, args,
-> 1458                                                run_metadata_ptr)
   1459         if run_metadata:
   1460           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[420128,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node training_1/Adam/mul_23}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[metrics_1/acc/Identity/_323]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[420128,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node training_1/Adam/mul_23}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

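【Comments】:

215 million parameters is huge. Google BERT typically has around 100M parameters, and that takes some serious compute to train from scratch. Try reducing the Dense units from 512. Start with a smaller number, say 32, and increase from there until you find a balance.
@ML_Engine Thanks for your comment! Honestly, I was also shocked by the number of parameters at first. The number of nodes (the number of cells, 13129) is fixed and hard to change, because the biological data I am working with needs sufficient resolution and also needs to cover the whole region. But I can actually get through one round (for a fixed test_indel) with batch_size as high as 32... That is why I think the problem may be some data duplication produced by functions inside the loop, rather than the parameters.
The final results can be produced by manually setting test_indel to each value from 1 to 10 and running it (restarting and clearing the output in between). I am considering replacing Flatten() with GlobalMaxPool(), since the latter seems to be more common in modern CNNs and needs fewer parameters.
Ah, I just saw that you have access to multiple GPUs. Maybe try using a distribution strategy to make sure TensorFlow is using all of your GPUs. I will add some code in an answer to demonstrate.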
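
The parameter-count concern raised in these comments can be checked with quick arithmetic. The sketch below uses only the shapes printed in the model summary; the helper name dense_params is illustrative, and the pooled variant assumes a global max pool over the node axis that leaves just the 32 channels.

```python
# Shapes copied from the model summary in the question.
N_NODES = 13129      # nodes per graph
GC_CHANNELS = 32     # output channels of the last GraphConv layer
DENSE_UNITS = 512    # units in the first Dense layer
BYTES_PER_FLOAT32 = 4


def dense_params(n_inputs, n_units):
    """Parameter count of a fully connected layer: kernel plus bias."""
    return n_inputs * n_units + n_units


# Flatten() feeds every node's channels into the Dense layer.
flatten_width = N_NODES * GC_CHANNELS
params_after_flatten = dense_params(flatten_width, DENSE_UNITS)

# Assumption: a global max pool over the node axis keeps only the channels.
params_after_pooling = dense_params(GC_CHANNELS, DENSE_UNITS)

# One float32 tensor of shape [420128, 512], the shape in the OOM message.
kernel_mib = flatten_width * DENSE_UNITS * BYTES_PER_FLOAT32 / 2 ** 20

print("Dense input width after Flatten:", flatten_width)
print("Dense params after Flatten:     ", params_after_flatten)
print("Dense params after pooling:     ", params_after_pooling)
print("One [420128, 512] float32 tensor: %.0f MiB" % kernel_mib)
```

The first two numbers match flatten_1 and dense_1 in the summary. Adam also keeps per-weight slot variables, and its update step creates temporaries of the same shape (the failing node is training_1/Adam/mul_23), so several tensors of this size have to fit on the GPU at once.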

Tags: tensorflow keras graph neural-network conv-neural-network

【Solution 1】:

You can use a distribution strategy in TensorFlow to make sure your multi-GPU setup is being used properly:

import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    for test_indel in range(1,11):
        ...  # <etc>: rest of the loop body from the question

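See the documentation here.

Mirrored strategy is used for synchronous distributed training across multiple GPUs on a single server, which sounds like the setup you are using. There is also a more intuitive explanation in this blog.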
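
One caveat, given as a rough sketch rather than a measurement: mirrored training is data-parallel, so every GPU holds a full copy of the model variables and only the batch is split across replicas. The estimate below reuses the total parameter count from the model summary. The figure of four float32 copies per parameter (weights, gradients and two Adam slots) is an assumption, and per_replica is an illustrative helper, not a TensorFlow function.

```python
import math

TOTAL_PARAMS = 215107713  # "Total params" from the model summary
BYTES_PER_FLOAT32 = 4
STATE_COPIES = 4          # assumption: weights + gradients + 2 Adam slots


def per_replica(global_batch, n_replicas):
    """Return (largest per-replica batch, bytes of model state per replica).

    Under data parallelism the batch is divided among replicas,
    but the model state is replicated in full on each one.
    """
    batch = math.ceil(global_batch / n_replicas)
    state_bytes = TOTAL_PARAMS * BYTES_PER_FLOAT32 * STATE_COPIES
    return batch, state_bytes


for n_gpus in (1, 2, 6):
    batch, state = per_replica(global_batch=1, n_replicas=n_gpus)
    print("%d GPU(s): per-replica batch %d, model state %.2f GiB each"
          % (n_gpus, batch, state / 2 ** 30))
```

With a global batch size of 1 there is nothing left to split, so under these assumptions adding GPUs does not lower the memory needed on the GPU that processes the sample.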

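Also, you could try using mixed precision, which can free up a significant amount of memory by changing the floating-point type of the parameters in the model.

【Discussion】:

Thanks! I will test this approach today or tomorrow.
You're welcome. If it helps, please "accept" my answer!
Hi, I can't accept it at the moment, because it turns out I still have some related errors to fix. I tried the mirrored strategy, but OOM still pops up. It looks like TF is making use of the other available GPUs (the only error messages are about XLA_GPUs; sample message: INFO:tensorflow:Device is available but not used by distribute strategy: /device:XLA_GPU:1). I don't know why this happens... I checked GPU memory usage with nvidia-smi while running the code in Jupyter Notebook, but it looks like only GPU:1 is being used (memory usage: 10056 MiB / 10156 MiB).
(Continuing the last comment) More precisely, after using mirrored_strategy, one of my GPUs (GPU:1) takes up most of the memory, while the other 5 GPUs, although used by the process, still have about 99% of their memory free (memory usage: 147 MiB / 10156 MiB). However, Jupyter Notebook normally behaves this way with multiple available GPUs (one fully used, the others with only a small fraction of their memory occupied), so I am not sure whether this shows that the mirrored strategy is working or not working at all.
What is the output of print(tf.config.list_physical_devices('GPU'))? Try adding it at the top of your script.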