【Question Title】: How to train TensorFlow's pre-trained BERT on the MLM task? (Use the pre-trained model only in TensorFlow)
【Posted】: 2022-02-16 02:50:14
【Question Description】:

Before anyone suggests PyTorch or anything else: I am specifically looking for TensorFlow + pre-trained model + MLM task only. I know there are plenty of blog posts for PyTorch, and plenty about fine-tuning (classification) in TensorFlow.

Coming to the problem: I have language data that is English + LaTeX, where the text can represent anything from physics, chemistry, maths or biology, and a typical example can look like this: Link to OCR image

"Find the value of function x in the equation: \n \\( f(x)=\\left\\{\\begin{array}{ll}x^{2} & \\text { if } x<0 \\\\ 2 x & \\text { if } x \\geq 0\\end{array}\\right. \\)"

So my language model needs to understand \geq \\begin array \eng \left \right as well as English, which is why I need to first run MLM training on pre-trained BERT / SciBERT so that it covers both. So I went digging for tutorials online:

  1. MLM training on Tensorflow BUT from Scratch; I need pre-trained
  2. MLM on pre-trained but in Pytorch; I need Tensorflow
  3. Fine Tuning with Keras; It is for classification but I want MLM

I already have a fine-tuned classification model; part of the code is below:

import numpy as np
import tensorflow as tf
import transformers

# maxlen, classes, X_train and y_train are defined elsewhere in my script
tokenizer = transformers.BertTokenizer.from_pretrained('bert-large-uncased')

def regular_encode(texts, tokenizer, maxlen=maxlen):
    # Tokenize a batch of texts into fixed-length arrays of input ids
    enc_di = tokenizer.batch_encode_plus(texts, return_token_type_ids=False,
                                         padding='max_length', max_length=maxlen,
                                         truncation=True)
    return np.array(enc_di['input_ids'])

Xtrain_encoded = regular_encode(X_train.astype('str'), tokenizer, maxlen=maxlen)
ytrain_encoded = tf.keras.utils.to_categorical(y_train, num_classes=classes, dtype='int32')

def build_model(transformer, loss='categorical_crossentropy', max_len=maxlen, dense=512, drop1=0.3, drop2=0.3):
    input_word_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    sequence_output = transformer(input_word_ids)[0]
    cls_token = sequence_output[:, 0, :]   # representation of the [CLS] token

    # Fine-tuning head on top of the transformer
    x = tf.keras.layers.Dropout(drop1)(cls_token)
    x = tf.keras.layers.Dense(dense, activation='relu')(x)
    x = tf.keras.layers.Dropout(drop2)(x)
    out = tf.keras.layers.Dense(classes, activation='softmax')(x)
    model = tf.keras.Model(inputs=input_word_ids, outputs=out)
    model.compile(optimizer='adam', loss=loss, metrics=['accuracy'])
    return model

The only useful thing I could find was this from HuggingFace:

Thanks to the tight interoperability between TensorFlow and PyTorch models, you can even save the model and then reload it as a PyTorch model (or vice versa)

from transformers import AutoModelForSequenceClassification

model.save_pretrained("my_imdb_model")
pytorch_model = AutoModelForSequenceClassification.from_pretrained("my_imdb_model", from_tf=True)
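
Going by that, the reverse direction (which is what I would actually need) would presumably look something like this; my_mlm_model here is just a hypothetical directory holding an MLM checkpoint trained and saved in PyTorch:

from transformers import TFAutoModelForMaskedLM

# Hypothetical: reload a PyTorch-trained MLM checkpoint as a TensorFlow model
tf_mlm_model = TFAutoModelForMaskedLM.from_pretrained("my_mlm_model", from_pt=True)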

So maybe I could train the MLM in PyTorch and then load it as the TensorFlow model for fine-tuning on classification? Is there any other way?

【Question Discussion】:

    Tags: tensorflow keras deep-learning nlp huggingface-transformers


    【Solution 1】:

    After some digging I think I found something. I don't know whether it will work, but it uses transformers + TensorFlow + XLM, and I think it should work with BERT as well.

    import tensorflow as tf
    import transformers
    from transformers import TFAutoModelWithLMHead, AutoTokenizer  # newer transformers versions: TFAutoModelForMaskedLM

    # strategy (a tf.distribute strategy), LR and global_batch_size are set up
    # earlier in the notebook this code comes from
    PRETRAINED_MODEL = 'jplu/tf-xlm-roberta-large'

    def create_mlm_model_and_optimizer():
        # Create the pre-trained model with its LM head, plus an optimizer,
        # inside the distribution strategy's scope
        with strategy.scope():
            model = TFAutoModelWithLMHead.from_pretrained(PRETRAINED_MODEL)
            optimizer = tf.keras.optimizers.Adam(learning_rate=LR)
        return model, optimizer
    
    
    mlm_model, optimizer = create_mlm_model_and_optimizer()
    
    
    
    
    def define_mlm_loss_and_metrics():
        with strategy.scope():
            mlm_loss_object = masked_sparse_categorical_crossentropy
    
            def compute_mlm_loss(labels, predictions):
                per_example_loss = mlm_loss_object(labels, predictions)
                loss = tf.nn.compute_average_loss(
                    per_example_loss, global_batch_size = global_batch_size)
                return loss
    
            train_mlm_loss_metric = tf.keras.metrics.Mean()
            
        return compute_mlm_loss, train_mlm_loss_metric
    
    
    def masked_sparse_categorical_crossentropy(y_true, y_pred):
        # Positions labelled -1 (tokens that were not masked) are excluded
        # from the loss; only the masked positions are predicted and scored.
        y_true_masked = tf.boolean_mask(y_true, tf.not_equal(y_true, -1))
        y_pred_masked = tf.boolean_mask(y_pred, tf.not_equal(y_true, -1))
        loss = tf.keras.losses.sparse_categorical_crossentropy(y_true_masked,
                                                               y_pred_masked,
                                                               from_logits=True)
        return loss
    
                
                
    def train_mlm(train_dist_dataset, total_steps=2000, evaluate_every=200):
        step = 0
        ### Training loop ###
        for tensor in train_dist_dataset:
            distributed_mlm_train_step(tensor) 
            step+=1
    
            if (step % evaluate_every == 0):   
                ### Print train metrics ###  
                train_metric = train_mlm_loss_metric.result().numpy()
                print("Step %d, train loss: %.2f" % (step, train_metric))     
    
                ### Reset  metrics ###
                train_mlm_loss_metric.reset_states()
                
            if step  == total_steps:
                break
    
    
    @tf.function
    def distributed_mlm_train_step(data):
        # Run one training step on every replica of the distribution strategy.
        # strategy.experimental_run_v2 was renamed to strategy.run in TF 2.2+.
        strategy.run(mlm_train_step, args=(data,))
    
    
    @tf.function
    def mlm_train_step(inputs):
        features, labels = inputs
    
        with tf.GradientTape() as tape:
            predictions = mlm_model(features, training=True)[0]
            loss = compute_mlm_loss(labels, predictions)
    
        gradients = tape.gradient(loss, mlm_model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, mlm_model.trainable_variables))
    
        train_mlm_loss_metric.update_state(loss)
        
    
    compute_mlm_loss, train_mlm_loss_metric = define_mlm_loss_and_metrics()
    

    Now train it with train_mlm(train_dist_dataset, TOTAL_STEPS, EVALUATE_EVERY)

    The code above is from this notebook, and you need to set up everything else exactly as given there (the distribution strategy, constants such as LR and global_batch_size, and the masked dataset train_dist_dataset).
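
    For context, that dataset yields (features, labels) batches in which labels are -1 at every position that should be ignored, which is exactly what masked_sparse_categorical_crossentropy above filters on. A rough sketch of that masking step (my own simplification, not the notebook's exact code), assuming a NumPy array of input ids and the tokenizer's mask token id:

    import numpy as np

    def mask_tokens(input_ids, mask_token_id, mask_prob=0.15):
        # Randomly choose ~15% of positions to predict; real implementations
        # also skip padding/special tokens and keep/replace some tokens.
        labels = np.copy(input_ids)
        mask = np.random.rand(*input_ids.shape) < mask_prob
        labels[~mask] = -1                                    # ignored by the masked loss
        masked_input_ids = np.where(mask, mask_token_id, input_ids)
        return masked_input_ids, labels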

    The author says at the end:

    This fine-tuned model can be loaded just like the original model to build a classification model from it
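
    In other words (my reading of it, as a rough sketch rather than the notebook's code): save the MLM-trained model with save_pretrained, then load it back as a plain TF backbone and plug it into a classification head such as build_model from the question; "my_mlm_finetuned" is just a hypothetical directory name:

    from transformers import TFAutoModel

    mlm_model.save_pretrained("my_mlm_finetuned")               # after train_mlm(...) has run
    backbone = TFAutoModel.from_pretrained("my_mlm_finetuned")  # encoder only, LM head dropped
    classifier = build_model(backbone)                          # classification head from the question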

    【Discussion】:
