【Question Title】: How to preprocess a dataset for BERT model implemented in Tensorflow 2.x?
【Posted on】: 2021-05-08 20:52:53
【Question Description】:

Overview

I have a dataset for a classification problem. It has two columns: one holds the sentences, the other the labels (10 labels in total). I am trying to transform this dataset so I can feed it into a BERT classification model implemented in Tensorflow 2.x. However, I cannot preprocess the dataset correctly into the PrefetchDataset the model expects as input.

What I did

  • Balanced and shuffled the dataframe (18708 rows per label)
  • Dataframe shape: (187080, 2)
  • Used from sklearn.model_selection import train_test_split to split the dataframe (a minimal sketch follows this list)
  • 80% training data, 20% test data
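
A minimal sketch of that preparation, assuming a balanced, shuffled dataframe df whose columns are named sentences and labels (names taken from the description above; random_state is added only for reproducibility):

from sklearn.model_selection import train_test_split

# 80/20 split; stratifying on the labels keeps the 10 classes balanced in both sets
X_train, X_test, y_train, y_test = train_test_split(
    df['sentences'].values, df['labels'].values,
    test_size=0.2, stratify=df['labels'], random_state=42)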

Training data:

X_train

array(['i hate megavideo  stupid time limits',
       'wow this class got wild quick  functions are a butt',
       'got in trouble no cell phone or computer for a you later twitter',
       ...,
       'we lied down around am rose a few hours later party still going lt',
       'i wanna miley cyrus on brazil  i love u my diva miley rocks',
       'i know i hate it i want my dj danger bck'], dtype=object)

y_train

array(['unfriendly', 'unfriendly', 'unfriendly', ..., 'pos_hp',
       'friendly', 'friendly'], dtype=object)

BERT preprocessing of the Xy_dataset

AUTOTUNE = tf.data.AUTOTUNE # autotune the buffer_size: optional = 1

train_Xy_slices = tf.data.Dataset.from_tensor_slices(tensors=(X_train, y_train))
dataset_train_Xy = train_Xy_slices.batch(batch_size=32)
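
The output below shows a PrefetchDataset rather than a BatchDataset, so presumably a prefetch call like the one from the later snippet was applied as well:

dataset_train_Xy = dataset_train_Xy.prefetch(buffer_size=AUTOTUNE)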

Output

dataset_train_Xy
<PrefetchDataset shapes: ((None,), (None,)), types: (tf.string, tf.string)>


for i in dataset_train_Xy:
    print(i)
(
<tf.Tensor: shape=(32,), dtype=string, numpy=
array([b'some of us had to work al day',
       ...
       b'feels claudia cazacus free falling feat audrey gallagher amp thomas bronzwaers look ahead are the best trance offerings this summer'], dtype=object)>,
 
<tf.Tensor: shape=(32,), dtype=string, numpy=
array([b'interested', b'uninterested', b'happy', b'friendly', b'neg_hp',
       ...
       b'friendly', b'insecure', b'pos_hp', b'interested', b'happy'],
      dtype=object)>
)

Expected output (example)

dataset_train_Xy
<PrefetchDataset shapes: ({input_word_ids: (None, 128), input_mask: (None, 128), input_type_ids: (None, 128)}, (None,)), types: ({input_word_ids: tf.int32, input_mask: tf.int32, input_type_ids: tf.int32}, tf.int64)>

Observations / Question:

I know I need to tokenize X_train and y_train, but I get an error when I try to tokenize:

AUTOTUNE = tf.data.AUTOTUNE # autotune the buffer_size: optional = 1

train_Xy_slices = tf.data.Dataset.from_tensor_slices(tensors=(X_train, y_train))
dataset_train_Xy = train_Xy_slices.batch(batch_size=batch_size) # 32

print(type(dataset_train_Xy))

# Tokenize the text to word pieces.
bert_preprocess = hub.load(tfhub_handle_preprocess)
tokenizer = hub.KerasLayer(bert_preprocess.tokenize, name='tokenizer')

dataset_train_Xy = dataset_train_Xy.map(lambda ex: (tokenizer(ex), ex[1])) #    print(i[1]) # correspond to labels
dataset_train_Xy = dataset_train_Xy.prefetch(buffer_size=AUTOTUNE)

Traceback

<class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-69-8e486f7b671b> in <module>()
     14 tokenizer = hub.KerasLayer(bert_preprocess.tokenize, name='tokenizer')
     15 
---> 16 dataset_train_Xy = dataset_train_Xy.map(lambda ex: (tokenizer(ex), ex[1])) #    print(i[1]) #labels
     17 dataset_train_Xy = dataset_train_Xy.prefetch(buffer_size=AUTOTUNE)

10 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/autograph/impl/api.py in wrapper(*args, **kwargs)
    668       except Exception as e:  # pylint:disable=broad-except
    669         if hasattr(e, 'ag_error_metadata'):
--> 670           raise e.ag_error_metadata.to_exception(e)
    671         else:
    672           raise

TypeError: in user code:


    TypeError: <lambda>() takes 1 positional argument but 2 were given

【Question Comments】:

    Tags: python tensorflow tokenize bert-language-model


    【Solution 1】:

    Working sample BERT model

    # import the necessary modules
    import tensorflow as tf
    import tensorflow_hub as hub
    import tensorflow_text as text  # required: without it the TF Hub preprocessing model errors out (see the discussion below)
    import pandas as pd
    
    data = {'input' :['i hate megavideo  stupid time limits',
           'wow this class got wild quick  functions are a butt',
           'got in trouble no cell phone or computer for a you later twitter',
           'we lied down around am rose a few hours later party still going lt',
           'i wanna miley cyrus on brazil  i love u my diva miley rocks',
           'i know i hate it i want my dj danger bck'],
            'label' : ['unfriendly', 'unfriendly', 'unfriendly', 'unfriendly',
           'friendly', 'friendly']}
            
    df = pd.DataFrame(data)
    
    # binary target: 1 if the label is 'friendly', else 0
    df['category'] = df['label'].apply(lambda x: 1 if x == 'friendly' else 0)
    
    from sklearn.model_selection import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(df['input'],df['category'], stratify=df['category'])
    
    bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
    bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")
    
    def get_sentence_embeding(sentences):
        # raw strings -> TF Hub preprocessing -> BERT encoder; keep the pooled sentence-level output
        preprocessed_text = bert_preprocess(sentences)
        return bert_encoder(preprocessed_text)['pooled_output']
    
    # sanity check: embed two example sentences
    get_sentence_embeding([
        "we lied down around am rose", 
        "i hate it i want my dj"]
    )
    
    #Build model
    # Bert layers
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessed_text = bert_preprocess(text_input)
    outputs = bert_encoder(preprocessed_text)
    
    # Neural network layers
    l = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
    l = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l)
    
    # Use inputs and outputs to construct a final model
    model = tf.keras.Model(inputs=[text_input], outputs = [l])
    
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    
    model.fit(X_train, y_train, epochs=10)
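
    The sample above feeds raw strings straight into the model, so it never needs the tf.data mapping step from the question. If you do want the PrefetchDataset of (features dict, label) pairs shown in the expected output, note the actual cause of the TypeError: Dataset.map unpacks each (text, label) element into two positional arguments, so a one-argument lambda fails. Below is a sketch under stated assumptions: the label vocabulary is partly invented for illustration, and tf.keras.layers.StringLookup requires TF 2.6+ (older versions expose it under tf.keras.layers.experimental.preprocessing).

    import tensorflow as tf
    import tensorflow_hub as hub
    import tensorflow_text as text  # registers the custom ops the preprocessing model uses

    AUTOTUNE = tf.data.AUTOTUNE

    # hypothetical vocabulary -- substitute the dataset's actual 10 labels
    label_names = ['unfriendly', 'friendly', 'pos_hp', 'neg_hp', 'happy',
                   'interested', 'uninterested', 'insecure', 'sad', 'angry']
    label_lookup = tf.keras.layers.StringLookup(
        vocabulary=label_names, num_oov_indices=0)  # string label -> int64 id

    # the full preprocessing model tokenizes AND packs the inputs into
    # input_word_ids / input_mask / input_type_ids, each padded to length 128
    bert_preprocess = hub.KerasLayer(
        "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")

    train_Xy_slices = tf.data.Dataset.from_tensor_slices((X_train, y_train))
    dataset_train_Xy = (
        train_Xy_slices
        .batch(32)
        # each element is a (text, label) pair, so map unpacks it into TWO
        # arguments -- the original one-argument lambda is what raised the TypeError
        .map(lambda text_batch, label_batch:
                 (bert_preprocess(text_batch), label_lookup(label_batch)),
             num_parallel_calls=AUTOTUNE)
        .prefetch(buffer_size=AUTOTUNE))

    print(dataset_train_Xy)
    # <PrefetchDataset shapes: ({input_word_ids: (None, 128), input_mask: (None, 128),
    #   input_type_ids: (None, 128)}, (None,)), types: ({... tf.int32}, tf.int64)>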
    

    【Discussion】:

    • It raises an error if you do not install tensorflow-text and import tensorflow_text as text
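
    A minimal setup for that, assuming a notebook-style environment:

    !pip install -q tensorflow-text
    import tensorflow_text as text  # the import alone registers the ops; the name itself is rarely used directly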