How does BertForSequenceClassification classify on the CLS vector? Answer

【Question title】: How does BertForSequenceClassification classify on the CLS vector?
【Posted】: 2020-11-07 16:28:17
【Question description】:

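Background:

Following up on this question: when using BERT to classify sequences, the model uses the "[CLS]" token to represent the classification task. According to the paper:

The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.

Looking at the Hugging Face repository, their BertForSequenceClassification uses the BertPooler module: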

class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output

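We can see that they take the first token (CLS) and use it as the representation of the whole sentence. Specifically, they compute hidden_states[:, 0], which looks a lot like taking the first element from each state rather than taking the first token's hidden state?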
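The indexing can be checked on a dummy tensor. This is only a sketch, assuming the encoder output has shape (batch_size, sequence_length, hidden_size), as the last hidden state of BertModel does; the sizes and variable names other than hidden_states and first_token_tensor are made up for illustration.

```python
import torch

# Illustrative sizes (assumptions, not BERT's real dimensions).
batch_size, seq_len, hidden_size = 2, 4, 3

# Dummy encoder output: one hidden vector per token, per sequence.
hidden_states = torch.arange(
    batch_size * seq_len * hidden_size, dtype=torch.float32
).reshape(batch_size, seq_len, hidden_size)

# Same slice as in BertPooler: index 0 along the token axis.
# The trailing hidden axis is kept whole, so this is hidden_states[:, 0, :].
first_token_tensor = hidden_states[:, 0]
print(first_token_tensor.shape)  # (batch_size, hidden_size)

# Taking the first element of every token's hidden vector would
# instead index the last axis:
first_feature = hidden_states[:, :, 0]
print(first_feature.shape)  # (batch_size, seq_len)
```

Under that shape assumption, hidden_states[:, 0] returns the full hidden vector of the first token ([CLS]) for every sequence in the batch, not the first element of each state.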

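My question:

What I don't understand is how they encode the information from the whole sentence into this token. Is the CLS token a regular token with its own embedding vector that "learns" the sentence-level representation? Why can't we just take the average of the hidden states (the output of the encoder) and use that to classify?

EDIT: After thinking about it a bit: since we use the CLS token's hidden state to predict, is the CLS token embedding being trained on the classification task, because it is the token used for classification (and therefore the major contributor to the error that gets propagated to its weights)?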

【Question comments】:

Tags: python transformer huggingface-transformers bert-language-model

【Solution 1】:

Is the CLS token a regular token with its own embedding vector that "learns" the sentence-level representation?

Yes:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Look up the vocabulary id of the [CLS] token
clsToken = tokenizer.convert_tokens_to_ids('[CLS]')
print(clsToken)
# or
print(tokenizer.cls_token, tokenizer.cls_token_id)

# [CLS] has its own row in the input embedding matrix, like any other token
print(model.get_input_embeddings()(torch.tensor(clsToken)))

Output:

101
[CLS] 101
tensor([ 1.3630e-02, -2.6490e-02, -2.3503e-02, -7.7876e-03,  8.5892e-03,
        -7.6645e-03, -9.8808e-03,  6.0184e-03,  4.6921e-03, -3.0984e-02,
         1.8883e-02, -6.0093e-03, -1.6652e-02,  1.1684e-02, -3.6245e-02,
         ...
         5.4162e-03, -3.0037e-02,  8.6773e-03, -1.7942e-03,  6.6826e-03,
        -1.1929e-02, -1.4076e-02,  1.6709e-02,  1.6860e-03, -3.3842e-03,
         8.6805e-03,  7.1340e-03,  1.5147e-02], grad_fn=<EmbeddingBackward>)

You can get a list of all the other special tokens of your model with:

print(tokenizer.all_special_tokens)

Output:

['[CLS]', '[UNK]', '[PAD]', '[SEP]', '[MASK]']

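What I don't understand is how they encode the information from the whole sentence into this token?

and

Since we use the CLS token's hidden state to predict, is the CLS token embedding being trained on the classification task, because it is the token used for classification (and therefore the major contributor to the error that gets propagated to its weights)?

Also yes. Through self-attention, the [CLS] position attends to every other token in each layer, so its final hidden state can aggregate information from the whole sequence. As you have already stated in your question, BertForSequenceClassification uses the BertPooler to train a linear layer on top of Bert: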

#outputs contains the output of BertModel and the second element is the pooler output
pooled_output = outputs[1]

pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)

#...loss calculation based on logits and the given labels
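The elided loss step and the gradient flow can be sketched in isolation. This is a minimal illustration, not the library's code: it assumes single-label classification with a cross-entropy loss, replaces the pooler output with a random tensor, and uses made-up sizes and labels.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not BERT's real dimensions).
hidden_size, num_labels, batch_size = 8, 3, 4

# Stand-in for the pooler output (outputs[1] in the snippet above).
# requires_grad=True lets us inspect the gradient that reaches it.
pooled_output = torch.randn(batch_size, hidden_size, requires_grad=True)
labels = torch.tensor([0, 2, 1, 0])  # hypothetical class ids

dropout = nn.Dropout(p=0.1)
classifier = nn.Linear(hidden_size, num_labels)

# Same two steps as above: dropout, then a linear layer giving the logits.
logits = classifier(dropout(pooled_output))

# Assumed loss: cross-entropy over the class logits.
loss = nn.CrossEntropyLoss()(logits.view(-1, num_labels), labels.view(-1))
loss.backward()

print(logits.shape)              # (batch_size, num_labels)
print(pooled_output.grad.shape)  # (batch_size, hidden_size)
```

The only input to the classifier is the pooled [CLS] vector, so the whole classification error reaches the encoder through that one position. In the real model the gradient continues from there into the encoder layers and the embeddings.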

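Why can't we just take the average of the hidden states (the output of the encoder) and use that to classify?

I can't really answer that in general, but why do you think this would be easier or better than a linear layer? You would still need to train the hidden layers to produce outputs whose average maps to your class, so you would also need an "average layer" to be the major contributor to your loss. In general, if you can show that it leads to better results than the current approach, nobody will reject it.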
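For comparison, the averaging alternative from the question might look like the sketch below. It is an assumption-laden illustration, not something BertForSequenceClassification does: the hidden states are random, the attention mask is made up, and padding positions are excluded from the mean.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes (assumptions, not BERT's real dimensions).
batch_size, seq_len, hidden_size = 2, 5, 4

# Stand-in for the encoder output, shape (batch_size, seq_len, hidden_size).
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

# Hypothetical attention mask: 1 = real token, 0 = padding.
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])

# Broadcast the mask over the hidden axis so padded positions contribute zero.
mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
summed = (last_hidden_state * mask).sum(dim=1)

# Divide by the number of real tokens per sequence (clamped to avoid 0/0).
counts = mask.sum(dim=1).clamp(min=1.0)
mean_pooled = summed / counts

print(mean_pooled.shape)  # (batch_size, hidden_size)
```

mean_pooled has the same shape as the pooled [CLS] vector, so it could be passed to the same dropout and linear classifier. Whether it performs better would have to be measured on the task at hand.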

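【Comments】:

Thanks, that makes complete sense! I don't think it is easier or better than using a linear layer. I was just curious about other ways to build a sentence-wide (or, in my case, image-wide) representation.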