【Posted】: 2020-10-09 14:16:12
【Question】:
I am running the code below to fine-tune a BERT Base Cased model in Google Colab. Sometimes the code runs fine on the first try, with no errors. Other times the same code, with the same data, fails with a "CUDA out of memory" error. Previously, restarting the runtime, or leaving the notebook, coming back, doing a factory runtime reset and re-running the code, would let it complete without errors. Just now I tried restarting and retrying 5 times, and got the error every single time.
The problem doesn't seem to be the particular combination of data and code I'm using, since it sometimes works fine. So it looks like something to do with the Google Colab runtime.
Does anyone know why this happens, why it is intermittent, and/or what I can do about it?
I am using Huggingface's transformers library and PyTorch.
The code cell that causes the error:
%%time
# train the model
# (defaultdict, torch, the model, data loaders, loss_fn, optimizer, scheduler
#  and the set lengths are defined in earlier cells of the notebook)

history = defaultdict(list)

for epoch in range(EPOCHS):

    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)

    # one training pass over the training set
    train_acc, train_loss = train_epoch(
        model,
        train_data_loader,
        loss_fn,
        optimizer,
        device,
        scheduler,
        train_set_length
    )

    print(f'Train loss {train_loss} accuracy {train_acc}')

    # evaluate on the dev set
    dev_acc, dev_loss = eval_model(
        model,
        dev_data_loader,
        loss_fn,
        device,
        evaluation_set_length
    )

    print(f'Dev loss {dev_loss} accuracy {dev_acc}')

    history['train_acc'].append(train_acc)
    history['train_loss'].append(train_loss)
    history['dev_acc'].append(dev_acc)
    history['dev_loss'].append(dev_loss)

    # checkpoint the weights after every epoch
    model_filename = f'model_{epoch}_state.bin'
    torch.save(model.state_dict(), model_filename)
The full error:
RuntimeError Traceback (most recent call last)
<ipython-input-29-a13774d7aa75> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', "\nhistory = defaultdict(list)\n\nfor epoch in range(EPOCHS):\n\n print(f'Epoch {epoch + 1}/{EPOCHS}')\n print('-' * 10)\n\n train_acc, train_loss = train_epoch(\n model,\n train_data_loader, \n loss_fn, \n optimizer, \n device, \n scheduler, \n train_set_length\n )\n\n print(f'Train loss {train_loss} accuracy {train_acc}')\n\n dev_acc, dev_loss = eval_model(\n model,\n dev_data_loader,\n loss_fn, \n device, \n evaluation_set_length\n )\n\n print(f'Dev loss {dev_loss} accuracy {dev_acc}')\n\n history['train_acc'].append(train_acc)\n history['train_loss'].append(train_loss)\n history['dev_acc'].append(dev_acc)\n history['dev_loss'].append(dev_loss)\n \n model_filename = f'model_{epoch}_state.bin'\n torch.save(model.state_dict(), model_filename)")
15 frames
<decorator-gen-60> in time(self, line, cell, local_ns)
<timed exec> in <module>()
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask)
234 # Take the dot product between "query" and "key" to get the raw attention scores.
235 attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
--> 236 attention_scores = attention_scores / math.sqrt(self.attention_head_size)
237 if attention_mask is not None:
238 # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 7.43 GiB total capacity; 5.42 GiB already allocated; 8.94 MiB free; 5.79 GiB reserved in total by PyTorch)
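For reference, the three figures that message reports (total capacity, already allocated, reserved in total by PyTorch) can be read back directly from PyTorch. A minimal sketch, assuming the same Colab GPU runtime and PyTorch 1.4 or newer (older versions call memory_reserved "memory_cached"):

import torch

device = torch.device('cuda:0')

# total memory on the card (the "7.43 GiB total capacity" in the error)
total = torch.cuda.get_device_properties(device).total_memory

# memory currently held by live tensors ("already allocated")
allocated = torch.cuda.memory_allocated(device)

# memory held by PyTorch's caching allocator ("reserved in total by PyTorch")
reserved = torch.cuda.memory_reserved(device)

print(f'total {total / 1024**3:.2f} GiB | '
      f'allocated {allocated / 1024**3:.2f} GiB | '
      f'reserved {reserved / 1024**3:.2f} GiB')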
【Comments】:
-
That means you are trying to put too much data on the GPU at once. Reducing the batch size may help, or, if that's an option for you, you may need a GPU with more memory.
-
But I'm feeding exactly the same data every time, and sometimes it errors and sometimes it doesn't, so it doesn't seem to be the size of the input data. I'm wondering whether the session keeps some data from the previous session even after restarting/terminating it and reconnecting?
-
There are several similar posts, like one, another one, ..., that all discuss this. Reducing the batch_size seems to be the common fix, but you can also dig in and try to free memory that is no longer needed (see the sketches after these comments).
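On the batch-size suggestion: the usual workaround is to shrink the per-step batch and, if the effective batch size matters for training quality, compensate with gradient accumulation. A minimal sketch; train_dataset, the batch keys, BATCH_SIZE and ACCUM_STEPS are illustrative placeholders and not from the question, whose own train_epoch presumably does something equivalent internally:

from torch.utils.data import DataLoader

BATCH_SIZE = 8     # e.g. halved from 16; smaller batches need less GPU memory per step
ACCUM_STEPS = 2    # accumulate gradients so the effective batch size stays at 16

train_data_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

model.train()
optimizer.zero_grad()
for step, batch in enumerate(train_data_loader):
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['labels'].to(device)

    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    # scale the loss so the accumulated gradient matches a full-size batch
    loss = loss_fn(outputs, labels) / ACCUM_STEPS
    loss.backward()

    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()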
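And on freeing unused memory: within a single runtime, references left over from an earlier attempt (the model, optimizer, cached batches) keep GPU memory allocated until they are garbage-collected. A minimal sketch of what is sometimes tried before re-running training; it only helps if the old objects are in fact still referenced somewhere in the notebook, and it does not explain differences between freshly reset runtimes:

import gc
import torch

# drop references to anything large from the previous attempt, if they exist
for name in ('model', 'optimizer', 'scheduler', 'outputs', 'loss'):
    if name in globals():
        del globals()[name]

gc.collect()                 # let Python reclaim the now-unreferenced objects
torch.cuda.empty_cache()     # return cached blocks from PyTorch's allocator to the driver

print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())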
Tags: python machine-learning pytorch google-colaboratory