BERT multi-class text classification in Google Colab: answers

【Question title】: BERT Multi-class text classification in Google Colab
【Posted】: 2019-11-05 14:29:37
【Question description】:

I am working with a set of social media comments (including YouTube links) as input features and Myers-Briggs personality profiles as the target labels:

    type    posts
0   INFJ    'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1   ENTP    'I'm finding the lack of me in these posts ver...
2   INTP    'Good one _____ https://www.youtube.com/wat...
3   INTJ    'Dear INTP, I enjoyed our conversation the o...
4   ENTJ    'You're fired.|||That's another silly misconce...

But from what I have found, BERT expects the DataFrame to be in this format:

    a   label   posts
0   a   8   'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1   a   3   'I'm finding the lack of me in these posts ver...
2   a   11  'Good one _____ https://www.youtube.com/wat...
3   a   10  'Dear INTP, I enjoyed our conversation the o...
4   a   2   'You're fired.|||That's another silly misconce...
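
The integer labels in this example are consistent with numbering the 16 type codes in alphabetical order (ENFJ = 0 through ISTP = 15). Below is a minimal sketch of that encoding, assuming alphabetical ordering is what produced the labels shown above. `ALL_TYPES`, `TYPE_TO_LABEL` and `encode_label` are illustrative names, not part of any BERT library.

```python
from itertools import product

# All 16 four-letter type codes, sorted alphabetically (ENFJ ... ISTP).
ALL_TYPES = sorted("".join(letters) for letters in product("EI", "NS", "FT", "JP"))

# Assumption: each type's integer label is its position in the sorted list.
TYPE_TO_LABEL = {type_code: index for index, type_code in enumerate(ALL_TYPES)}


def encode_label(type_code):
    """Return the integer label for a type code such as 'INFJ'."""
    return TYPE_TO_LABEL[type_code.strip().upper()]
```

Under this assumption, `encode_label("INFJ")` gives 8 and `encode_label("ENTP")` gives 3, which matches the rows above.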

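The resulting output must be predictions on a test set of comments, split into four columns, one for each personality dimension, where, for example, 'Mind' = 1 is the label for Extrovert. Basically, a type such as INFJ is split into 'Mind', 'Energy', 'Nature' and 'Tactics', like this: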

    type    post    Mind    Energy  Nature  Tactics
0   INFJ    'url-web    0   1   0   1
1   INFJ    url-web 0   1   0   1
2   INFJ    enfp and intj moments url-web sportscenter n... 0   1   0   1
3   INFJ    What has been the most life-changing experienc...   0   1   0   1
4   INFJ    url-web url-web On repeat for most of today.    0   1   0   1
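
The split into four binary columns can be sketched in plain Python. Only 'Mind' = 1 for Extrovert is stated in the question; the other three letter-to-bit mappings are assumptions inferred from the INFJ rows (0, 1, 0, 1), and `DIMENSIONS` and `split_type` are hypothetical names.

```python
# Each entry is (column name, letter that maps to 1).
DIMENSIONS = (
    ("Mind", "E"),     # stated in the question: 1 = Extrovert, 0 = Introvert
    ("Energy", "N"),   # assumption: 1 = iNtuitive, 0 = Sensing
    ("Nature", "T"),   # assumption: 1 = Thinking, 0 = Feeling
    ("Tactics", "J"),  # assumption: 1 = Judging, 0 = Perceiving
)


def split_type(type_code):
    """Map a four-letter type code to a dict of four 0/1 values."""
    code = type_code.strip().upper()
    if len(code) != 4:
        raise ValueError("expected a four-letter type code, got %r" % type_code)
    return {
        name: int(code[position] == positive_letter)
        for position, (name, positive_letter) in enumerate(DIMENSIONS)
    }
```

With pandas, something like `df["type"].apply(split_type).apply(pd.Series)` should expand these dicts into the four columns.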

I have installed pytorch-pretrained-bert using:

!pip install pytorch-pretrained-bert

I have imported the model and tried to tokenize the 'posts' column using:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokenized_train = tokenizer.tokenize(train)

But I get this error:

TypeError: ord() expected a character, but string of length 5 found

I tried this based on the pytorch-pretrained-bert GitHub repo and a YouTube video.
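
A likely cause of this error, assuming `train` is the whole DataFrame: `tokenize` expects a single string and walks over it character by character, while iterating over a DataFrame yields its column names, so a five-character name such as 'posts' or 'label' would reach `ord()`. A minimal sketch of tokenizing each post separately follows. `tokenize_posts` is a hypothetical helper, and `str.split` stands in for the real tokenizer so the example runs without downloading BERT.

```python
def tokenize_posts(posts, tokenize):
    """Apply `tokenize` to each post separately; return one token list per post."""
    return [tokenize(str(post)) for post in posts]


# Stand-in data and tokenizer for illustration only.
demo_posts = ["You're fired.", "Good one"]
demo_tokens = tokenize_posts(demo_posts, str.split)
```

With the objects from the question, the call would be `tokenize_posts(train["posts"], tokenizer.tokenize)`. BERT base models accept at most 512 tokens per input, so long posts would need truncating before being fed to the model.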

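I am a data science intern with no deep learning experience at all. I simply want to experiment with the BERT model in the simplest way possible to predict the multi-class classification output, so that I can compare the results with the simpler text classification models we are currently working on. I am working in Google Colab, and the resulting output should be a .csv file.

I understand that this is a complex model and that all the documentation and examples around it are complex (fine-tuning layers, etc.), but any help with a simple implementation (if there is such a thing) for a beginner data scientist with minimal software engineering experience would be much appreciated.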

【Question discussion】:

Tags: python pytorch data-science google-colaboratory bert-language-model

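【Solution 1】:

I suggest you start with a simple BERT classification task, for example by following this excellent tutorial: https://mccormickml.com/2019/07/22/BERT-fine-tuning/

You can then move on to multi-label classification with: https://medium.com/huggingface/multi-label-text-classification-using-bert-the-mighty-transformer-69714fa3fb3d

Only then would I recommend trying your task on your own dataset.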

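【Discussion】:

While these links may answer the question, it is better to include the essential parts of the answer here and provide the links for reference. Link-only answers can become invalid if the linked page changes.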

【Solution 2】:

"Simpler" is a subjective term. Assuming you are open to using TensorFlow and keras-bert, you can do multi-class text classification with BERT as follows:

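import keras
from keras_bert import load_trained_model_from_checkpoint
from keras_radam import RAdam

# config_path, checkpoint_path, SEQ_LEN, LR, EPOCHS, BATCH_SIZE, train_x and
# train_y are assumed to be defined earlier (see the linked tutorial).
n_classes = 20

# Load the pretrained BERT checkpoint with all layers trainable for fine-tuning
model = load_trained_model_from_checkpoint(
  config_path,
  checkpoint_path,
  training=True,
  trainable=True,
  seq_len=SEQ_LEN,
)

# Add dense layer for classification
inputs = model.inputs[:2]  # token ids and segment ids
dense = model.get_layer('NSP-Dense').output
outputs = keras.layers.Dense(units=n_classes, activation='softmax')(dense)
model = keras.models.Model(inputs, outputs)

# Sparse loss: train_y holds integer class ids, not one-hot vectors
model.compile(
    RAdam(lr=LR),
    loss='sparse_categorical_crossentropy',
    metrics=['sparse_categorical_accuracy'],
)

history = model.fit(
    train_x,
    train_y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_split=0.20,
    shuffle=True,
)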
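
Because the final layer is a softmax over `n_classes`, `model.predict` returns one row of class probabilities per input, and the predicted class is the index of the largest value in each row. A minimal sketch in plain Python, using made-up probabilities and a hypothetical helper name `predicted_classes`:

```python
def predicted_classes(probabilities):
    """Return the index of the largest probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


# Made-up probabilities for two inputs and three classes, for illustration only.
demo_probs = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
]
demo_labels = predicted_classes(demo_probs)
```

On the NumPy array that Keras returns, `probs.argmax(axis=-1)` does the same thing in one call.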

Here is a link to the full tutorial, which includes a Google Colab GPU implementation: Multi-class text classification using BERT on 20 Newsgroup Dataset with Fine Tuning

Check it out! https://pysnacks.com/machine-learning/bert-text-classification-with-fine-tuning/#multi-class-text-classification-using-bert

【Discussion】: