[Posted]: 2021-05-08 20:52:53
[Problem description]:
Overview
I have a dataset for a classification problem. It has two columns: one is sentences and the other is labels (10 labels in total). I am trying to convert this dataset so it can be fed into a BERT classification model built with TensorFlow 2.x. However, I cannot preprocess the dataset correctly into the PrefetchDataset I need as input.
What have I done?
- Balanced and shuffled the DataFrame (18708 rows per label)
- DataFrame shape: (187080, 2)
- Used from sklearn.model_selection import train_test_split to split the DataFrame: 80% training data, 20% test data
Training data:
X_train
array(['i hate megavideo stupid time limits',
'wow this class got wild quick functions are a butt',
'got in trouble no cell phone or computer for a you later twitter',
...,
'we lied down around am rose a few hours later party still going lt',
'i wanna miley cyrus on brazil i love u my diva miley rocks',
'i know i hate it i want my dj danger bck'], dtype=object)
y_train
array(['unfriendly', 'unfriendly', 'unfriendly', ..., 'pos_hp',
'friendly', 'friendly'], dtype=object)
BERT preprocessing of the Xy_dataset
AUTOTUNE = tf.data.AUTOTUNE # autotune the buffer_size: optional = 1
train_Xy_slices = tf.data.Dataset.from_tensor_slices(tensors=(X_train, y_train))
dataset_train_Xy = train_Xy_slices.batch(batch_size=32)
Output
dataset_train_Xy
<PrefetchDataset shapes: ((None,), (None,)), types: (tf.string, tf.string)>
for i in dataset_train_Xy:
print(i)
(
<tf.Tensor: shape=(32,), dtype=string, numpy=
array([b'some of us had to work al day',
...
b'feels claudia cazacus free falling feat audrey gallagher amp thomas bronzwaers look ahead are the best trance offerings this summer'], dtype=object)>,
<tf.Tensor: shape=(32,), dtype=string, numpy=
array([b'interested', b'uninterested', b'happy', b'friendly', b'neg_hp',
...
b'friendly', b'insecure', b'pos_hp', b'interested', b'happy'],
dtype=object)>
)
Expected output (example)
dataset_train_Xy
<PrefetchDataset shapes: ({input_word_ids: (None, 128), input_mask: (None, 128), input_type_ids: (None, 128)}, (None,)), types: ({input_word_ids: tf.int32, input_mask: tf.int32, input_type_ids: tf.int32}, tf.int64)>
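Note that in this expected output the labels are tf.int64, so the string labels also have to be mapped to integer ids at some point. A minimal plain-Python sketch of that mapping, using a few label names from the arrays above (the full 10-label vocabulary is not shown in the question):

```python
# Build an integer id for each distinct label name.
labels = ['unfriendly', 'unfriendly', 'pos_hp', 'friendly', 'friendly']
vocab = sorted(set(labels))                      # stable label -> id order
label_to_id = {name: i for i, name in enumerate(vocab)}
y_ids = [label_to_id[name] for name in labels]
print(vocab)   # ['friendly', 'pos_hp', 'unfriendly']
print(y_ids)   # [2, 2, 1, 0, 0]
```

In TensorFlow this could be done with a tf.keras.layers.StringLookup layer, or simply by encoding y_train to integers before calling from_tensor_slices.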
Observations / Question:
I know I need to tokenize X_train and y_train, but I get an error when I try to tokenize:
AUTOTUNE = tf.data.AUTOTUNE # autotune the buffer_size: optional = 1
train_Xy_slices = tf.data.Dataset.from_tensor_slices(tensors=(X_train, y_train))
dataset_train_Xy = train_Xy_slices.batch(batch_size=batch_size) # 32
print(type(dataset_train_Xy))
# Tokenize the text to word pieces.
bert_preprocess = hub.load(tfhub_handle_preprocess)
tokenizer = hub.KerasLayer(bert_preprocess.tokenize, name='tokenizer')
dataset_train_Xy = dataset_train_Xy.map(lambda ex: (tokenizer(ex), ex[1])) # print(i[1]) # correspond to labels
dataset_train_Xy = dataset_train_Xy.prefetch(buffer_size=AUTOTUNE)
Traceback
<class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-69-8e486f7b671b> in <module>()
14 tokenizer = hub.KerasLayer(bert_preprocess.tokenize, name='tokenizer')
15
---> 16 dataset_train_Xy = dataset_train_Xy.map(lambda ex: (tokenizer(ex), ex[1])) # print(i[1]) #labels
17 dataset_train_Xy = dataset_train_Xy.prefetch(buffer_size=AUTOTUNE)
10 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/autograph/impl/api.py in wrapper(*args, **kwargs)
668 except Exception as e: # pylint:disable=broad-except
669 if hasattr(e, 'ag_error_metadata'):
--> 670 raise e.ag_error_metadata.to_exception(e)
671 else:
672 raise
TypeError: in user code:
TypeError: <lambda>() takes 1 positional argument but 2 were given
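The TypeError comes from how tf.data.Dataset.map handles tuple-structured elements: for a dataset of (features, labels) pairs, the mapped function is called with two positional arguments, not one tuple. A minimal pure-Python sketch of that calling convention (no TensorFlow needed):

```python
def apply_map(dataset, fn):
    # Mimic Dataset.map: tuple elements become separate positional args.
    return [fn(*elem) if isinstance(elem, tuple) else fn(elem) for elem in dataset]

batches = [("text a", "label a"), ("text b", "label b")]

# One-argument lambda fails, reproducing the error from the traceback:
try:
    apply_map(batches, lambda ex: (ex.upper(), ex))
except TypeError as e:
    print(e)  # reproduces: "takes 1 positional argument but 2 were given"

# Two-argument lambda works:
print(apply_map(batches, lambda x, y: (x.upper(), y)))
```

Under that assumption, the fix for the snippet above would be a two-argument lambda, e.g. dataset_train_Xy.map(lambda text, label: (tokenizer(text), label)).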
[Discussion]:
Tags: python tensorflow tokenize bert-language-model