Weights of pre-trained BERT model not initialized

[Question title]: Weights of pre-trained BERT model not initialized
[Posted]: 2021-06-08 05:26:36
[Question]:

I'm using the Language Interpretability Toolkit (LIT) to load and analyze a BERT model that I pre-trained on an NER task.

However, when I start the LIT script and pass it the path to my pre-trained model, it fails to initialize the weights and tells me:

    modeling_utils.py:648] loading weights file bert_remote/examples/token-classification/Data/Models/results_21_03_04_cleaned_annotations/04.03._8_16_5e-5_cleaned_annotations/04-03-2021 (15.22.23)/pytorch_model.bin
    modeling_utils.py:739] Weights of BertForTokenClassification not initialized from pretrained model: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
    modeling_utils.py:745] Weights from pretrained model not used in BertForTokenClassification: ['bert.embeddings.position_ids']

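It then simply falls back to the bert-base-german-cased version of BERT, which of course doesn't have my custom labels and therefore can't predict anything. I think it might have something to do with PyTorch, but I can't find the error.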

In case it is relevant, here is how I load my dataset in CoNLL 2003 format (a modification of the data loader script linked in the original post):

    def __init__(self):

        # Read CoNLL test files

        self._examples = []

        data_path = "lit_remote/lit_nlp/examples/datasets/NER_Data"
        with open(os.path.join(data_path, "test.txt"), "r", encoding="utf-8") as f:
            lines = f.readlines()

        for line in lines[:2000]:
            if line != "\n":
                token, label = line.split(" ")
                self._examples.append({
                    'token': token,
                    'label': label,
                })
            else:
                self._examples.append({
                    'token': "\n",
                    'label': "O"
                })

    def spec(self):
        return {
            'token': lit_types.Tokens(),
            'label': lit_types.SequenceTags(align="token"),
        }
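For comparison, here is a minimal sketch of reading the same kind of file into one example per sentence instead of one example per token. It assumes each non-blank line holds a token followed by its label, and that a blank line ends a sentence. The helper name `read_conll` and the sample lines are hypothetical, and whether LIT needs sentence-level examples for this spec is an assumption rather than something stated in the post.

```python
def read_conll(lines):
    """Group 'token label' lines into sentences; a blank line ends a sentence."""
    examples, tokens, labels = [], [], []
    for line in lines:
        line = line.strip()  # also drops the trailing newline from the label
        if not line:
            if tokens:
                examples.append({"token": tokens, "label": labels})
                tokens, labels = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # first column: the token
        labels.append(parts[-1])  # last column: the tag
    if tokens:  # flush the last sentence if the file has no trailing blank line
        examples.append({"token": tokens, "label": labels})
    return examples


# Hypothetical sample data, not taken from the asker's files.
sample = ["Angela B-PER\n", "Merkel I-PER\n", "\n", "Berlin B-LOC\n"]
examples = read_conll(sample)
```

Stripping each line before splitting also keeps the newline character out of the label strings.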

This is how I initialize the model and start the LIT server (a modification of the simple_pytorch_demo.py script linked in the original post):

    def __init__(self, model_name_or_path):
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(
            model_name_or_path)
        model_config = transformers.AutoConfig.from_pretrained(
            model_name_or_path,
            num_labels=15,  # FIXME CHANGE
            output_hidden_states=True,
            output_attentions=True,
        )
        # This is just a regular PyTorch model.
        self.model = _from_pretrained(
            transformers.AutoModelForTokenClassification,
            model_name_or_path,
            config=model_config)
        self.model.eval()

## Some omitted snippets here

    def input_spec(self) -> lit_types.Spec:
        return {
            "token": lit_types.Tokens(),
            "label": lit_types.SequenceTags(align="token")
        }

    def output_spec(self) -> lit_types.Spec:
        return {
            "tokens": lit_types.Tokens(),
            "probas": lit_types.MulticlassPreds(parent="label", vocab=self.LABELS),
            "cls_emb": lit_types.Embeddings()
        }

[Question comments]:

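Where does the script tell you that it is not using your model? These warning messages are not really a cause for concern. Please add those messages, as well as the definition of the _from_pretrained method.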

Tags: tensorflow nlp pytorch bert-language-model huggingface-transformers

[Solution 1]:

This actually seems to be expected behaviour. The HuggingFace team writes in the documentation of the GPT models:

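"This will issue a warning about some of the pretrained weights not being used and some weights being randomly initialized. That's because we are throwing away the pretraining head of the BERT model to replace it with a classification head which is randomly initialized."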

So this does not seem to be a problem for fine-tuning. In the use case I described above, it works fine despite the warning.
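To see what the two warning lines are reporting, it can help to compare the parameter names the model expects with the names stored in the checkpoint. The sketch below does this with plain Python sets. The helper `diff_keys` is hypothetical, and the key lists are an assumed, heavily shortened stand-in for a real checkpoint: the pooler and position_ids names come from the log in the question, while the `classifier.*` entries are assumed to be present in the fine-tuned checkpoint.

```python
def diff_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) parameter names, both sorted."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)     # expected by the model, absent from the file
    unexpected = sorted(checkpoint_keys - model_keys)  # stored in the file, not used by the model
    return missing, unexpected


# Assumed, shortened key lists for illustration only.
model_keys = [
    "bert.pooler.dense.weight",
    "bert.pooler.dense.bias",
    "classifier.weight",
    "classifier.bias",
]
checkpoint_keys = [
    "bert.embeddings.position_ids",
    "classifier.weight",
    "classifier.bias",
]

missing, unexpected = diff_keys(model_keys, checkpoint_keys)
print("not initialized from checkpoint:", missing)
print("not used by the model:", unexpected)
```

Under these assumptions, the classification head appears in neither list, which would mean it was loaded from the checkpoint. The log in the question likewise lists only the pooler weights as not initialized, and a token-classification head does not need the pooler.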

[Comments]: