Fine-tuning a model's classifier layer with a new label

【Question title】: Fine-tuning model's classifier layer with new label
【Posted】: 2021-07-13 10:43:53
【Question】:

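I want to further fine-tune an already fine-tuned BertForSequenceClassification model on a new dataset that contains just 1 additional label the model has not seen before.

In other words, I want to add 1 new label to the set of labels the model can currently classify correctly.

Moreover, I do not want the classifier weights to be randomly initialized. I would like to keep them intact and update them according to the dataset examples, while increasing the size of the classifier layer by 1.

The dataset used for further fine-tuning could look like this: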

sentence,label
intent example 1,new_label
intent example 2,new_label
...
intent example 10,new_label

My model's current classifier layer looks like this:

Linear(in_features=768, out_features=135, bias=True)

How can I achieve this?
Is this even a good approach?

【Question comments】:

Tags: pytorch huggingface-transformers

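【Solution 1】:

You can extend the weight and bias of your model's classification layer with new values. Have a look at the commented example below: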

#This is the section that loads your model
#I will just use a pretrained model from the Hub for this example
import torch
from torch import nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("jpcorb20/toxic-detector-distilroberta")
model = AutoModelForSequenceClassification.from_pretrained("jpcorb20/toxic-detector-distilroberta")
#We check the output of one sample so we can compare it later with the extended layer
#and verify that we kept the previously learnt "knowledge"
f = tokenizer("This is an example", return_tensors='pt')
print(model(**f).logits)

#Now we need to find out the name of the linear layer you want to extend
#The layers on top of distilroberta are wrapped inside a classifier section
#This name can differ for your model because the model author chooses it freely
#If needed, inspect model.named_parameters() to find the classification layer
print(model.classifier)

#The output shows us that the classification layer is called `out_proj`
#We can now extend the weights by creating a new tensor that consists of the
#old weights and a randomly initialized tensor for the new label 
model.classifier.out_proj.weight = nn.Parameter(torch.cat((model.classifier.out_proj.weight, torch.randn(1,768)),0))

#We do the same for the bias:
model.classifier.out_proj.bias = nn.Parameter(torch.cat((model.classifier.out_proj.bias, torch.randn(1)),0))

#and be happy when we compare the output with our expectation 
print(model(**f).logits)

Output:

tensor([[-7.3604, -9.4899, -8.4170, -9.7688, -8.4067, -9.3895]],
       grad_fn=<AddmmBackward>)
RobertaClassificationHead(
  (dense): Linear(in_features=768, out_features=768, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
  (out_proj): Linear(in_features=768, out_features=6, bias=True)
)
tensor([[-7.3604, -9.4899, -8.4170, -9.7688, -8.4067, -9.3895,  2.2124]],
       grad_fn=<AddmmBackward>)
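
The same extension can also be written as a small reusable helper. The following is a minimal sketch, not part of the original answer: `extend_linear` and `init_std` are hypothetical names, and a plain `nn.Linear(768, 6)` stands in for `model.classifier.out_proj` so that the snippet runs without downloading a model. It builds a new layer so that `out_features` matches the enlarged weight, and it initializes the new row with small values instead of `torch.randn`.

```python
import torch
from torch import nn


def extend_linear(old: nn.Linear, extra: int = 1, init_std: float = 0.02) -> nn.Linear:
    """Hypothetical helper: return a layer with `extra` more outputs than `old`.

    The rows learnt by `old` are copied unchanged; only the new rows are initialized.
    """
    new = nn.Linear(old.in_features, old.out_features + extra, bias=old.bias is not None)
    new = new.to(device=old.weight.device, dtype=old.weight.dtype)
    with torch.no_grad():
        # Keep the previously learnt weights for the existing labels.
        new.weight[: old.out_features] = old.weight
        # Small random values for the new label (init_std is an assumed default).
        new.weight[old.out_features:].normal_(mean=0.0, std=init_std)
        if old.bias is not None:
            new.bias[: old.out_features] = old.bias
            new.bias[old.out_features:].zero_()
    return new


# Stand-in for model.classifier.out_proj from the example above.
old_head = nn.Linear(768, 6)
new_head = extend_linear(old_head)

# The logits of the existing labels should be unchanged by the extension.
x = torch.randn(2, 768)
old_logits = old_head(x)
new_logits = new_head(x)
print(new_head)
print(torch.allclose(new_logits[:, :6], old_logits, atol=1e-4))

# With the model from the answer, the swap would presumably look like this
# (assumption: the head lives at model.classifier.out_proj, as printed above):
# model.classifier.out_proj = extend_linear(model.classifier.out_proj)
# model.num_labels = model.config.num_labels = model.classifier.out_proj.out_features
```

The last two commented lines are an assumption about how the model tracks its label count. As far as I know, the sequence classification models in transformers use `num_labels` when computing the loss from `labels`, so it is probably worth keeping it (and `config.id2label` / `config.label2id`) in sync before fine-tuning. Whether a small initialization suits your data is something to verify empirically.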

【Comments】:

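Thank you for helping me again. I was able to extend the classification layer and verify that the weights are intact. However, with the randomly initialized new weight/bias pair, the overall accuracy of the model drops significantly (and randomly). During fine-tuning of such a model on a dataset like the one shown above, the general classification ability seems to be strongly affected, which I did not expect, since the model was previously fine-tuned on thousands of examples and here we only have a couple. Could you point me to some areas I should look into more closely?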
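@coso I am not surprised by that. When you check the results of the model before this fine-tuning, your sentences will probably all get labeled with the new class. The linear layer applies a simple transformation y=xA^T+b, and afterwards you apply something like argmax to select the class of your sentence. While the weights of the other classes are pretty good after your earlier fine-tuning, those of the newly introduced class are not, so it either overlaps the other classes or is never predicted at all.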
@coso If your main goal is to save some time, you can try freezing all layers except the classification head and fine-tuning this model.
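There is a discussion about freezing layers when fine-tuning transformer-based models, and the conclusion seems to be that freezing isn't a good idea.
@RameshArvind That freezing is not a good idea is not what sgugger meant. He said that Transformers are usually trained in full to get the best results, but that does not mean freezing some layers is a bad idea. For example, training BERT on MRPC without freezing takes 2:36 minutes and reaches 88%, while training only the classification head takes 0:53 minutes and already reaches 81%.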
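
The suggestion in the comments to freeze everything except the classification head can be sketched as follows. This is an illustration under stated assumptions, not code from the thread: `TinyClassifier` is a hypothetical toy module standing in for the real model, and the only thing it shares with the model above is that its head is an attribute named `classifier`. For a Hugging Face model the same loop over `named_parameters()` should apply, provided the head's parameter names start with `classifier`.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Hypothetical toy stand-in for an encoder with a `classifier` head."""

    def __init__(self, hidden: int = 16, num_labels: int = 7):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(8, hidden), nn.Tanh())
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, x):
        return self.classifier(self.encoder(x))


model = TinyClassifier()

# Freeze every parameter that does not belong to the classification head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier")

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)

# Hand only the trainable parameters to the optimizer.
optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-3)

# One training step on random data to check that only the head changes.
encoder_before = model.encoder[0].weight.detach().clone()
head_before = model.classifier.weight.detach().clone()
x = torch.randn(4, 8)
y = torch.randint(0, 7, (4,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

print(torch.equal(model.encoder[0].weight, encoder_before))  # encoder untouched
print(torch.equal(model.classifier.weight, head_before))     # head updated
```

Freezing trades some accuracy for speed, as the timings quoted above suggest, so it is a choice to evaluate on your own data rather than a general recommendation.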