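TRANSFORMERS: Asking to pad but the tokenizer does not have a padding token

【Question title】: TRANSFORMERS: Asking to pad but the tokenizer does not have a padding token
【Posted】: 2021-12-31 16:39:57
【Question】:

I am trying to evaluate a set of Transformer models one after another on the same dataset, to check which one performs best.

The list of models is: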

MODELS = [
    ('xlm-mlm-enfr-1024', "XLMModel"),
    ('distilbert-base-cased', "DistilBertModel"),
    ('bert-base-uncased', "BertModel"),
    ('roberta-base', "RobertaModel"),
    ("cardiffnlp/twitter-roberta-base-sentiment", "RobertaSentTW"),
    ('xlnet-base-cased', "XLNetModel"),
    # ('ctrl', "CTRLModel"),
    ('transfo-xl-wt103', "TransfoXLModel"),
    ('bert-base-cased', "BertModelUncased"),
    ('xlm-roberta-base', "XLMRobertaModel"),
    ('openai-gpt', "OpenAIGPTModel"),
    ('gpt2', "GPT2Model"),
]

They all work fine up to the 'ctrl' model, which returns this error:

Asking to pad but the tokenizer does not have a padding token. Please select a token to use as 'pad_token' '(tokenizer.pad_token = tokenizer.eos_token e.g.)' or add a new pad token via 'tokenizer.add_special_tokens({'pad_token': '[PAD]'})'.

The error is raised while tokenizing the sentences of my dataset.

The tokenization code is:

import numpy as np
from transformers import AutoTokenizer, TFAutoModel

# MAX_LEN (50) and df are assumed to be defined earlier in the script
SEQ_LEN = MAX_LEN  # 50

for pretrained_weights, model_name in MODELS:

    print("***************** INICIANDO ", model_name, ", weights ", pretrained_weights, "********* ")
    print("carganzo el tokenizador ()")
    tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)
    print("creando el modelo preentrenado")
    transformer_model = TFAutoModel.from_pretrained(pretrained_weights)
    print("aplicando el tokenizador al dataset")

    # Apply the tokenizer: returns input ids and attention mask, each of shape (1, MAX_LEN)
    def tokenize(sentence):
        tokens = tokenizer.encode_plus(sentence, max_length=MAX_LEN,
                                       truncation=True, padding='max_length',
                                       add_special_tokens=True, return_attention_mask=True,
                                       return_token_type_ids=False, return_tensors='tf')
        return tokens['input_ids'], tokens['attention_mask']

    # initialize two arrays for input tensors
    Xids = np.zeros((len(df), SEQ_LEN))
    Xmask = np.zeros((len(df), SEQ_LEN))

    for i, sentence in enumerate(df['tweet']):
        Xids[i, :], Xmask[i, :] = tokenize(sentence)
        if i % 10000 == 0:
            print(i)  # do this so we can see some progress

    arr = df['label'].values  # take label column in df as array

    labels = np.zeros((arr.size, arr.max() + 1))  # initialize empty (all zero) label array
    labels[np.arange(arr.size), arr] = 1  # add ones in indices where we have a value

I tried defining the padding token the way the error message suggests, but then I get this error:

could not broadcast input array from shape (3,) into shape (50,)

on the line

Xids[i, :], Xmask[i, :] = tokenize(sentence)

I also tried this solution, but it did not work either.

If you have read this far, thank you.

Any help is appreciated.

【Comments】:

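could not broadcast input array from shape (3,) into shape (50,) means that the tensor returned by tokenize has shape 3, while Xids reserves room for a row of shape 50. The shapes do not match. When you do return tokens['input_ids'], tokens['attention_mask'], make sure both tensors have shape SEQ_LEN; if they do not, pad them with zeros or truncate them. Since you are using TensorFlow (return_tensors='tf'), look for a way to do this in TensorFlow. I only know PyTorch.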
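The padding or truncating step described in the comment above can be sketched in NumPy, which is also what Xids and Xmask are made of. This is only an illustration: pad_or_truncate is a hypothetical helper, not part of transformers or TensorFlow, and the token ids below are made-up sample values. A TensorFlow tensor can be converted with .numpy() before being passed in.

```python
import numpy as np

def pad_or_truncate(ids, seq_len, pad_value=0):
    """Hypothetical helper: return a 1-D array with exactly `seq_len` entries.

    Longer inputs are cut on the right; shorter ones are right-padded
    with `pad_value`.
    """
    ids = np.asarray(ids).reshape(-1)[:seq_len]  # flatten (1, n) -> (n,), then cut
    out = np.full(seq_len, pad_value, dtype=ids.dtype)
    out[:len(ids)] = ids
    return out

SEQ_LEN = 50
Xids = np.zeros((2, SEQ_LEN))

# Shape (1, 3), like an encoding that came back unpadded (sample values)
short = np.array([[101, 7592, 102]])
Xids[0, :] = pad_or_truncate(short, SEQ_LEN)

# Shape (1, 80), longer than SEQ_LEN, so it gets truncated
long = np.arange(80).reshape(1, 80)
Xids[1, :] = pad_or_truncate(long, SEQ_LEN)
```

For the input ids, pad_value would normally be tokenizer.pad_token_id rather than 0; for the attention mask, 0 is the usual padding value.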

Tags: python tensorflow pytorch tokenize huggingface-transformers

【Solution 1】:

You can add a [PAD] token with the add_special_tokens API.

tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
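One caveat: for tokenizers that ship without a padding token, '[PAD]' is a brand-new vocabulary entry, so the model's embedding matrix usually has to be resized to cover the new token id. A minimal sketch of that step follows, assuming a hypothetical helper named ensure_pad_token (it is not part of transformers). It relies only on add_special_tokens returning the number of tokens added, on len(tokenizer), and on the model's resize_token_embeddings method.

```python
def ensure_pad_token(tokenizer, model=None, pad_token="[PAD]"):
    """Hypothetical helper: give `tokenizer` a padding token if it has none.

    Returns the number of tokens added to the vocabulary (0 if the
    tokenizer already had a padding token).
    """
    if tokenizer.pad_token is not None:
        return 0
    num_added = tokenizer.add_special_tokens({"pad_token": pad_token})
    if model is not None and num_added > 0:
        # The new token id has no embedding row yet, so grow the matrix
        model.resize_token_embeddings(len(tokenizer))
    return num_added

# Usage sketch (needs the transformers package and a model download):
#   tokenizer = AutoTokenizer.from_pretrained("gpt2")
#   transformer_model = TFAutoModel.from_pretrained("gpt2")
#   ensure_pad_token(tokenizer, transformer_model)
```

The alternative named in the error message, tokenizer.pad_token = tokenizer.eos_token, reuses an existing token and therefore does not change the vocabulary size.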

【Discussion】:

Did the answer work for you?