Loading a PyTorch DataParallel model with map_location (answers)

【Question title】: PyTorch DataParallel model load with map_location
【Posted】: 2021-09-17 20:37:44
【Question】:

Saving the model:

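import torch

net = Net()  # Net: the asker's own nn.Module subclass (definition not shown)
model = torch.nn.DataParallel(net)
############################
# Training
############################

# Saves the entire DataParallel object, not just the weights
torch.save(model, './model_shear_pre.pkl')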

Loading the model:

net = Net()
model = torch.nn.DataParallel(net, device_ids=[0, 1])
model = torch.load(
    './model_shear_finish.pkl',
    map_location={'cuda:0': 'cuda:0', 'cuda:1': 'cuda:0', 'cuda:2': 'cuda:1', 'cuda:3': 'cuda:1'},
)

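The problem is that I trained on a machine with 4 GPUs, and after saving the model I want to test it on a new machine that has only 2 GPUs.

After loading the saved model, I expected the model's device_ids to be [0,1], but it is still [0,1,2,3], the old setting. Is something wrong with how I save or load?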

【Comments】:

Tags: pytorch

【Solution 1】:

You should save the weights rather than the whole model.

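import torch

net = Net()
model = torch.nn.DataParallel(net)
############################
# Training
############################

# Save the weights of the wrapped network (model.module), so the keys carry no
# "module." prefix and load directly into a plain Net
torch.save(model.module.state_dict(), './model_shear_pre.pkl')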

Then load the weights onto the CPU before moving the model to the GPUs:

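net = Net()

# Load onto the CPU first so the device layout of the training machine does not matter
weights = torch.load('./model_shear_finish.pkl', map_location='cpu')
net.load_state_dict(weights)

# DataParallel expects the parameters on device_ids[0] before the forward pass
model = torch.nn.DataParallel(net.to('cuda:0'), device_ids=[0, 1])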
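
A checkpoint that was written from the DataParallel wrapper itself (model.state_dict() rather than model.module.state_dict()) has every key prefixed with "module.", and load_state_dict on a plain Net then reports missing and unexpected keys. Below is a minimal sketch of stripping that prefix. strip_module_prefix is a hypothetical helper name, and plain floats stand in for tensors so that the snippet runs without PyTorch. Only the keys are touched, so the same function should apply to a dict returned by torch.load.

```python
from collections import OrderedDict

PREFIX = "module."  # prefix that nn.DataParallel adds to every state_dict key


def strip_module_prefix(state_dict):
    """Return a copy of state_dict with a leading "module." removed from each key.

    Hypothetical helper, not part of PyTorch. Keys without the prefix are kept
    unchanged, so it is safe to call on checkpoints saved either way.
    """
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(PREFIX):] if key.startswith(PREFIX) else key
        cleaned[new_key] = value
    return cleaned


# Stand-in for torch.load(path, map_location='cpu'); the values would be tensors.
saved = OrderedDict([("module.fc.weight", 0.5), ("module.fc.bias", 0.1)])
weights = strip_module_prefix(saved)
print(list(weights.keys()))  # ['fc.weight', 'fc.bias']
```

With a real checkpoint, the result would be passed to net.load_state_dict(weights) before wrapping net in DataParallel.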

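However, if you already have a trained model that was saved whole rather than as weights only, this may also work (on PyTorch 2.6 and later, torch.load additionally needs weights_only=False to unpickle a whole model):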

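# If the file holds a DataParallel wrapper (as in the question), unwrap it via .module
loaded = torch.load('./model_shear_finish.pkl', map_location='cpu')
net = loaded.module if isinstance(loaded, torch.nn.DataParallel) else loaded
model = torch.nn.DataParallel(net.to('cuda:0'), device_ids=[0, 1])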

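I would still recommend saving only the weights. Saving and loading the whole model can really trip you up, because the model class has to be importable in exactly the same way at load time as it was at save time, which is often tricky to guarantee. For example: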

train.py

import torch
from nets import Net

net = Net()
torch.save(net, './model_shear_finish.pkl')

inference.py

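import torch

# The pickle records the class as "nets.Net", so loading only succeeds if a
# module named nets that defines Net can still be imported.

# this won't work if nets.py was moved or renamed after training
# (models/nets.py is a hypothetical new location)
from models.nets import Net
torch.load('./model_shear_finish.pkl', map_location='cpu')  # ModuleNotFoundError: No module named 'nets'

# this will work: nets.py is importable under its original name
from nets import Net
torch.load('./model_shear_finish.pkl', map_location='cpu')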
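
The import sensitivity comes from pickle, which torch.save uses by default: a class is stored as a reference (module name plus class name), not as code. The sketch below shows this with the standard-library class fractions.Fraction standing in for a model class, an assumption made so that the snippet runs without PyTorch or a nets.py file. class_reference is a hypothetical helper name.

```python
import pickle
import pickletools
from fractions import Fraction


def class_reference(obj):
    """Return the string arguments recorded in the pickle of obj, in order."""
    payload = pickle.dumps(obj, protocol=4)
    return [arg for _opcode, arg, _pos in pickletools.genops(payload)
            if isinstance(arg, str)]


# Pickling the class stores only where to import it from, not its source code.
names = class_reference(Fraction)
print(names)  # ['fractions', 'Fraction']

# Unpickling imports that module and looks the name up again.
restored = pickle.loads(pickle.dumps(Fraction, protocol=4))
print(restored is Fraction)  # True
```

For a model saved from train.py the recorded pair would be nets and Net, which is why a module named nets has to be importable when torch.load runs.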

【Comments】: