【Question Title】: Loss Function in Multi-GPU Training (PyTorch)
【Posted】: 2020-04-15 02:25:48
【Question】:

I'm using PyTorch and BERT to train a model. Everything runs fine on a single GPU, but when I try to use multiple GPUs I get this error:

ValueError                                Traceback (most recent call last)
<ipython-input-168-507223f9879c> in <module>()
     92         # single value; the `.item()` function just returns the Python value
     93         # from the tensor.
---> 94         total_loss += loss.item()
     95 
     96         # Perform a backward pass to calculate the gradients.

ValueError: only one element tensors can be converted to Python scalars

Can someone help me figure out what I'm missing and how I should fix it?

Here is my training code:

import random
import time

import numpy as np
import torch

seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
loss_values = []
for epoch_i in range(0, epochs):

    t0 = time.time()
    total_loss = 0
    for step, batch in enumerate(train_dataloader):
        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)

        b_input_ids = batch[0].to(device).long() 
        b_input_mask = batch[1].to(device).long()
        b_labels = batch[2].to(device).long()
        model.zero_grad()        
        outputs = model(b_input_ids, 
                    token_type_ids=None, 
                    attention_mask=b_input_mask, 
                    labels=b_labels)

        loss = outputs[0]
        total_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
    avg_train_loss = total_loss / len(train_dataloader)            
    loss_values.append(avg_train_loss)
    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(format_time(time.time() - t0)))

Here is my model code:

import torch.nn as nn
from transformers import BertForSequenceClassification, AdamW, BertConfig
model_to_parallel = BertForSequenceClassification.from_pretrained(
    "./bert_cache.zip", 
    num_labels = 2, 
    output_attentions = False,
    output_hidden_states = False,
)
model = nn.DataParallel(model_to_parallel,  device_ids=[0,1,2,3]) 
model.to(device) 

【Question Discussion】:

    Tags: python pytorch


    【Solution 1】:

    After `loss = outputs[0]`, `loss` is a multi-element tensor: `nn.DataParallel` gathers one loss value per GPU, so the tensor's size equals the number of devices, and calling `.item()` on it raises the `ValueError` above.

    Average the per-GPU losses first: `loss = outputs[0].mean()`.
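    The behavior can be reproduced without any GPUs: a minimal sketch, where the 4-element tensor below simulates the per-replica losses that `nn.DataParallel` would gather from 4 devices.

    ```python
    import torch

    # DataParallel gathers one loss per replica, so on 4 GPUs `outputs[0]`
    # is a 1-D tensor with 4 elements instead of a scalar. The values here
    # are made up for illustration.
    per_gpu_loss = torch.tensor([0.5, 1.5, 1.0, 1.0])

    # Calling .item() on a multi-element tensor raises the error from the question:
    try:
        per_gpu_loss.item()
    except ValueError as e:
        print(e)  # only one element tensors can be converted to Python scalars

    # Averaging first reduces it to a 0-dim tensor that .item() accepts:
    loss = per_gpu_loss.mean()
    print(loss.item())  # 1.0
    ```

    Taking the mean (rather than the sum) keeps the reported loss on the same scale as the single-GPU run, so `avg_train_loss` stays comparable.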

    【Discussion】:
