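【Question title】: GPU runs out of memory when training on a large dataset
【Posted】: 2021-11-14 03:22:09
【Question】:

I am using a Transformer network for machine translation. During training, the GPU runs out of memory on a large dataset, while the same model trains fine on a small one.

This is the self-attention part; the error occurs while the matrices are being computed.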

import tensorflow as tf

class SelfAttention(tf.keras.layers.Layer):
    def __init__(self, embed_size, head):
        super(SelfAttention, self).__init__()
        self.head = head
        self.embed_size = embed_size
        self.head_dim = embed_size // head

        assert (self.head_dim * head == embed_size), 'size of head_dim is not matching'

        self.query = tf.keras.layers.Dense(self.head_dim, activation='linear', use_bias=False)
        self.value = tf.keras.layers.Dense(self.head_dim, activation='linear', use_bias=False)
        self.key = tf.keras.layers.Dense(self.head_dim, activation='linear', use_bias=False)
        self.fc_layer = tf.keras.layers.Dense(self.embed_size, activation='linear')

    def call(self, value, key, query, mask):
        # Number of training examples
        N = query.shape[0]
        query_len, value_len, key_len = query.shape[1], value.shape[1], key.shape[1]

        # Reshape according to the number of examples and words
        query = tf.reshape(query, (N, query_len, self.head, self.head_dim))
        value = tf.reshape(value, (N, value_len, self.head, self.head_dim))
        key = tf.reshape(key, (N, key_len, self.head, self.head_dim))

        query = self.query(query)
        value = self.value(value)
        key = self.key(key)

        # energy shape: (N, head, query_len, key_len) try to imagine the shape in mind
        energy = tf.einsum("nqhd, nkhd->nhqk", query, key)

        if mask is not None:
            # Test the mask itself, so that scores that happen to be exactly 0 are not masked by accident
            energy = tf.where(tf.equal(mask, 0), -1e20, energy)

        attention = tf.keras.activations.softmax(energy, axis=3)

        # attention shape: (N, head, query_len, key_len)
        # value shape:(N, value_len, head, head_dim)
        # output: (N, query_len, head, head_dim)
        output = tf.reshape(tf.einsum("nhql, nlhd->nqhd", attention, value), (N, query_len, self.head*self.head_dim))

        # Final linear projection (fc_layer was defined in __init__ but never applied)
        output = self.fc_layer(output)

        return output

The error is:

2021-09-20 11:51:49.615495: I tensorflow/core/common_runtime/bfc_allocator.cc:1036] 1 Chunks of size 35477760 totalling 33.83MiB
2021-09-20 11:51:49.615502: I tensorflow/core/common_runtime/bfc_allocator.cc:1036] 1 Chunks of size 40866304 totalling 38.97MiB
2021-09-20 11:51:49.615509: I tensorflow/core/common_runtime/bfc_allocator.cc:1036] 1 Chunks of size 47409664 totalling 45.21MiB
2021-09-20 11:51:49.615516: I tensorflow/core/common_runtime/bfc_allocator.cc:1036] 1 Chunks of size 47547136 totalling 45.34MiB

/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)
   6860   message = e.message + (" name: " + name if name is not None else "")
   6861   # pylint: disable=protected-access
-> 6862   six.raise_from(core._status_to_exception(e.code, message), None)
   6863   # pylint: enable=protected-access
   6864 

/opt/conda/lib/python3.7/site-packages/six.py in raise_from(value, from_value)

ResourceExhaustedError: OOM when allocating tensor with shape[32,334,25335] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:BiasAdd]
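
It can help to size the tensor named in the error by hand. The sketch below assumes the shape reads as (batch, sequence length, vocabulary) and that "float" means 4-byte float32; both are readings of the message, not something the log states. The `BiasAdd` op also hints that the allocation may come from a layer with a bias (for example a final projection onto the vocabulary) rather than from the attention einsum, but that is a guess.

```python
# Shape reported by the ResourceExhaustedError above.
batch_size, seq_len, vocab_size = 32, 334, 25335
bytes_per_float32 = 4  # assumption: "type float" means float32

n_elements = batch_size * seq_len * vocab_size
n_bytes = n_elements * bytes_per_float32

print(f"{n_elements:,} elements")             # 270,780,480 elements
print(f"{n_bytes / 2**30:.2f} GiB per copy")  # 1.01 GiB per copy
```

A single copy is already about 1 GiB, and training usually needs more than one tensor of this size (for instance for the gradients), so shrinking any of the three dimensions reduces the requirement proportionally.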

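What should I do?

【Comments】:

    Tags: tensorflow keras deep-learning nlp


    【Solution 1】:

    You can use a generator so that only part of the dataset is loaded into memory at a time, and train your model on those pieces.

    Below is a simple generator example for image classification; you will need to adapt it to your NLP use case: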

    
    import numpy as np
    from tensorflow import keras

    class DataGenerator(keras.utils.Sequence):
        'Generates data for Keras'
        def __init__(self, list_IDs, labels, batch_size=32, dim=(32,32,32), n_channels=1,
                     n_classes=10, shuffle=True):
            'Initialization'
            self.dim = dim
            self.batch_size = batch_size
            self.labels = labels
            self.list_IDs = list_IDs
            self.n_channels = n_channels
            self.n_classes = n_classes
            self.shuffle = shuffle
            self.on_epoch_end()
    
        def __len__(self):
            'Denotes the number of batches per epoch'
            return int(np.floor(len(self.list_IDs) / self.batch_size))
    
        def __getitem__(self, index):
            'Generate one batch of data'
            # Generate indexes of the batch
            indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
    
            # Find list of IDs
            list_IDs_temp = [self.list_IDs[k] for k in indexes]
    
            # Generate data
            X, y = self.__data_generation(list_IDs_temp)
    
            return X, y
    
        def on_epoch_end(self):
            'Updates indexes after each epoch'
            self.indexes = np.arange(len(self.list_IDs))
            if self.shuffle:
                np.random.shuffle(self.indexes)
    
        def __data_generation(self, list_IDs_temp):
            'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
            # Initialization
            X = np.empty((self.batch_size, *self.dim, self.n_channels))
            y = np.empty((self.batch_size), dtype=int)
    
            # Generate data
            for i, ID in enumerate(list_IDs_temp):
                # Store sample
                X[i,] = np.load('data/' + ID + '.npy')
    
                # Store class
                y[i] = self.labels[ID]
    
            return X, keras.utils.to_categorical(y, num_classes=self.n_classes)
    

    Then pass it to .fit:

    
    params = {'dim': (32,32,32),
              'batch_size': 64,
              'n_classes': 6,
              'n_channels': 1,
              'shuffle': True}
    
    # Datasets
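    partition = {'train': [], 'validation': []}  # fill in: lists of sample IDs
    labels = {}  # fill in: mapping from sample ID to integer class label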
    
    # Generators
    training_generator = DataGenerator(partition['train'], labels, **params)
    validation_generator = DataGenerator(partition['validation'], labels, **params)
    
    # Model.fit accepts a keras.utils.Sequence directly; fit_generator is deprecated in TF 2.x
    model.fit(training_generator,
              validation_data=validation_generator)
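
For the NLP case, the same idea might look like the sketch below. It assumes the sentences are already tokenized and padded into integer arrays; the class name `TranslationBatches` and the arrays `src` and `tgt` are hypothetical, and how the batches are fed to the model (for example shifted decoder inputs) depends on a training setup the question does not show.

```python
import numpy as np
from tensorflow import keras


class TranslationBatches(keras.utils.Sequence):
    """Yields (source_ids, target_ids) batches from padded token-id arrays."""

    def __init__(self, source_ids, target_ids, batch_size=8, shuffle=True):
        super().__init__()
        self.source_ids = source_ids
        self.target_ids = target_ids
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        # Number of batches per epoch; the last, possibly smaller, batch is kept.
        return int(np.ceil(len(self.source_ids) / self.batch_size))

    def __getitem__(self, index):
        # Select the rows that belong to this batch.
        idx = self.indexes[index * self.batch_size:(index + 1) * self.batch_size]
        return self.source_ids[idx], self.target_ids[idx]

    def on_epoch_end(self):
        # Reshuffle the row order after every epoch.
        self.indexes = np.arange(len(self.source_ids))
        if self.shuffle:
            np.random.shuffle(self.indexes)


# Toy data standing in for real tokenized sentences (hypothetical sizes).
src = np.random.randint(1, 25335, size=(100, 334))
tgt = np.random.randint(1, 25335, size=(100, 334))

batches = TranslationBatches(src, tgt, batch_size=8)
x, y = batches[0]
print(len(batches), x.shape, y.shape)
```

Only one batch is materialized per step, so `batch_size` directly controls how large each activation tensor on the GPU becomes.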
    
    

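    【Discussion】:

    • I am already using tf.data.Dataset.from_tensor_slices(train, label) to handle the dataset.
    • How much GPU memory do you have? If you are already using a method that loads only part of the dataset into memory, try reducing the batch size. It looks like you are using a batch size of 32. Reducing the sentence length (currently 334?) and the vocabulary size (currently 25335?) would also help.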
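
A minimal sketch of the batch-size and sentence-length suggestions with `tf.data` follows. `MAX_LEN`, `BATCH_SIZE`, and the random `train` and `label` tensors are assumptions for illustration, not values from the question. Note that `from_tensor_slices` takes the features and labels as one tuple. Truncation discards the tail of long sentences; filtering them out instead is an alternative.

```python
import tensorflow as tf

MAX_LEN = 128    # assumed cap on sentence length (the error shows 334)
BATCH_SIZE = 8   # assumed smaller batch size (the error shows 32)

# Toy stand-ins for padded source/target token-id matrices.
train = tf.random.uniform((100, 334), maxval=25335, dtype=tf.int32)
label = tf.random.uniform((100, 334), maxval=25335, dtype=tf.int32)


def truncate(src, tgt):
    # Keep only the first MAX_LEN tokens of each sentence.
    return src[:MAX_LEN], tgt[:MAX_LEN]


dataset = (
    tf.data.Dataset.from_tensor_slices((train, label))  # one tuple, not two arguments
    .map(truncate)
    .shuffle(1000)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)

src_batch, tgt_batch = next(iter(dataset))
print(src_batch.shape, tgt_batch.shape)
```

With these assumed values, a float32 tensor of shape [8, 128, 25335] takes roughly 99 MiB, against roughly 1 GiB for the [32, 334, 25335] shape in the error message.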