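[Posted]: 2021-12-31 16:39:57
[Question]:
I am trying to evaluate a set of transformer models one after another on the same dataset, to check which one performs best.
The model list looks like this: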
MODELS = [
    ('xlm-mlm-enfr-1024', "XLMModel"),
    ('distilbert-base-cased', "DistilBertModel"),
    ('bert-base-uncased', "BertModel"),
    ('roberta-base', "RobertaModel"),
    ("cardiffnlp/twitter-roberta-base-sentiment", "RobertaSentTW"),
    ('xlnet-base-cased', "XLNetModel"),
    # ('ctrl', "CTRLModel"),
    ('transfo-xl-wt103', "TransfoXLModel"),
    ('bert-base-cased', "BertModelUncased"),
    ('xlm-roberta-base', "XLMRobertaModel"),
    ('openai-gpt', "OpenAIGPTModel"),
    ('gpt2', "GPT2Model"),
]
All of them work fine until the 'ctrl' model, which returns this error:
Asking to pad but the tokenizer does not have a padding token. Please select a token to use as 'pad_token' '(tokenizer.pad_token = tokenizer.eos_token e.g.)' or add a new pad token via 'tokenizer.add_special_tokens({'pad_token': '[PAD]'})'.
It is raised while tokenizing the sentences of my dataset.
The tokenization code is:
import numpy as np
from transformers import AutoTokenizer, TFAutoModel

# df (a DataFrame with 'tweet' and 'label' columns) and MAX_LEN are defined earlier
SEQ_LEN = MAX_LEN  # (50)
for pretrained_weights, model_name in MODELS:
    print("***************** INICIANDO ", model_name, ", weights ", pretrained_weights, "********* ")
    print("carganzo el tokenizador ()")
    tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)
    print("creando el modelo preentrenado")
    transformer_model = TFAutoModel.from_pretrained(pretrained_weights)
    print("aplicando el tokenizador al dataset")

    # apply the tokenizer
    def tokenize(sentence):
        tokens = tokenizer.encode_plus(sentence, max_length=MAX_LEN,
                                       truncation=True, padding='max_length',
                                       add_special_tokens=True, return_attention_mask=True,
                                       return_token_type_ids=False, return_tensors='tf')
        return tokens['input_ids'], tokens['attention_mask']

    # initialize two arrays for input tensors
    Xids = np.zeros((len(df), SEQ_LEN))
    Xmask = np.zeros((len(df), SEQ_LEN))
    for i, sentence in enumerate(df['tweet']):
        Xids[i, :], Xmask[i, :] = tokenize(sentence)
        if i % 10000 == 0:
            print(i)  # do this so we can see some progress

    arr = df['label'].values  # take label column in df as array
    labels = np.zeros((arr.size, arr.max() + 1))  # initialize empty (all zero) label array
    labels[np.arange(arr.size), arr] = 1  # add ones in indices where we have a value
I have tried defining the padding token the way the error message suggests, but then I get this error:
could not broadcast input array from shape (3,) into shape (50,)
on the line
Xids[i, :], Xmask[i, :] = tokenize(sentence)
I have also tried this solution, but it did not work either.
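For reference, the two options named in the error message can be combined into one small helper. This is only a sketch: ensure_pad_token is a hypothetical name, not part of the transformers library, and it assumes the tokenizer exposes pad_token, eos_token and add_special_tokens as the error message implies.

```python
def ensure_pad_token(tokenizer):
    """Give the tokenizer a pad token if it has none.

    Returns True when a brand-new token was added to the vocabulary,
    False when nothing changed or an existing token was reused.
    """
    if tokenizer.pad_token is not None:
        return False  # already has a pad token, nothing to do
    if tokenizer.eos_token is not None:
        # first option from the error message: reuse the end-of-sequence token
        tokenizer.pad_token = tokenizer.eos_token
        return False
    # second option from the error message: register a new '[PAD]' token
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    return True


# Intended use inside the loop (not executed here, it needs the downloaded model):
#     if ensure_pad_token(tokenizer):
#         # the vocabulary grew, so the embedding matrix should grow with it
#         transformer_model.resize_token_embeddings(len(tokenizer))
```

When the helper returns True the vocabulary has one more entry than the pretrained embedding matrix, which is why the commented usage resizes the embeddings before the model sees the new id.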
If you have read this far, thank you.
Any help is appreciated.
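[Comments]:
- could not broadcast input array from shape (3,) into shape (50,) means that the tensor returned from tokenize has shape 3, while Xids reserves space for a tensor of shape 50. The shapes do not match. When you do return tokens['input_ids'], tokens['attention_mask'], make sure both tensors have shape SEQ_LEN; if they do not, pad them with zeros or truncate them. Since you are using TensorFlow (return_tensors='tf'), look for a way to do this in TensorFlow. I only know PyTorch.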
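The pad-or-truncate step suggested in this comment can be sketched without either framework. This is a minimal illustration, not the commenter's code: pad_or_truncate is a hypothetical helper, it assumes the token ids arrive as a flat sequence of ints (for example from a tokenizer call without return_tensors), and the pad id of 0 and the sample ids are placeholders.

```python
def pad_or_truncate(ids, seq_len, pad_id=0):
    """Return (input_ids, attention_mask), each a list of exactly seq_len ints."""
    ids = list(ids)[:seq_len]              # cut sequences that are too long
    n_pad = seq_len - len(ids)             # positions that still need filling
    mask = [1] * len(ids) + [0] * n_pad    # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, mask


# A 3-token sequence is stretched to the requested length of 5.
ids, mask = pad_or_truncate([101, 7592, 102], seq_len=5)
print(ids)   # [101, 7592, 102, 0, 0]
print(mask)  # [1, 1, 1, 0, 0]
```

Both returned lists always have length seq_len, so each can be assigned to a row such as Xids[i, :] or Xmask[i, :] without a broadcasting error.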
Tags: python tensorflow pytorch tokenize huggingface-transformers