[Posted]: 2021-11-29 16:30:54
[Question]:
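I am using code copied from this page. I have downloaded the BERT model to my local system and am computing sentence embeddings with it.
I have about 500,000 sentences that need embeddings, and it is taking a very long time.
- Is there a way to speed this up?
- Would sending a batch of sentences, rather than one sentence at a time, help?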
#!pip install transformers
import torch
import transformers
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )
# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

corpa=["i am a boy","i live in a city"]
storage=[]#list to store all embeddings
for text in corpa:
    # Add the special tokens.
    marked_text = "[CLS] " + text + " [SEP]"
    # Split the sentence into tokens.
    tokenized_text = tokenizer.tokenize(marked_text)
    # Map the token strings to their vocabulary indices.
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [1] * len(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])
    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers.
    with torch.no_grad():
        outputs = model(tokens_tensor, segments_tensors)
    # Evaluating the model will return a different number of objects based on
    # how it's configured in the `from_pretrained` call earlier. In this case,
    # because we set `output_hidden_states = True`, the third item will be the
    # hidden states from all layers. See the documentation for more details:
    # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
    hidden_states = outputs[2]
    # `hidden_states` holds 13 tensors (embedding output + 12 layers),
    # each with shape [1 x seq_len x 768].
    # `token_vecs` is a tensor with shape [seq_len x 768]
    token_vecs = hidden_states[-2][0]
    # Calculate the average of all token vectors in the sentence.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    storage.append((text,sentence_embedding))
###### Update 1
I modified my code based on the answer provided. It still does not do full batch processing:
#!pip install transformers
import torch
import transformers
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )
# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)
storage=[]#list to store all embeddings
for i,text in enumerate(encoded_inputs['input_ids']):
    tokens_tensor = torch.tensor([encoded_inputs['input_ids'][i]])
    segments_tensors = torch.tensor([encoded_inputs['attention_mask'][i]])
    print (tokens_tensor)
    print (segments_tensors)
    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers.
    with torch.no_grad():
        outputs = model(tokens_tensor, segments_tensors)
    # Evaluating the model will return a different number of objects based on
    # how it's configured in the `from_pretrained` call earlier. In this case,
    # because we set `output_hidden_states = True`, the third item will be the
    # hidden states from all layers. See the documentation for more details:
    # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
    hidden_states = outputs[2]
    # `hidden_states` holds 13 tensors (embedding output + 12 layers),
    # each with shape [1 x seq_len x 768].
    # `token_vecs` is a tensor with shape [seq_len x 768]
    token_vecs = hidden_states[-2][0]
    # Calculate the average of all token vectors in the sentence.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    print (sentence_embedding[:10])
    storage.append((text,sentence_embedding))
I could replace the first two lines inside the for loop with the lines below, but they only work when all sentences have the same length after tokenization:
tokens_tensor = torch.tensor([encoded_inputs['input_ids']])
segments_tensors = torch.tensor([encoded_inputs['attention_mask']])
Even then, outputs = model(tokens_tensor, segments_tensors) fails.
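The two symptoms described above can be reproduced with plain tensors, without loading BERT. The sketch below uses made-up token ids (an assumption, not real tokenizer output) to show that wrapping the whole batch in an extra pair of brackets yields a 3-D tensor, whereas a BERT forward pass expects 2-D input of shape [batch, seq_len], and that token lists of unequal length cannot be turned into a tensor until they are padded. This is a likely explanation for the failure, not a confirmed diagnosis.

```python
import torch

# Hypothetical token ids for two sentences of equal tokenized length.
input_ids = [[101, 7592, 102], [101, 2088, 102]]

wrapped = torch.tensor([input_ids])  # extra brackets add a leading axis
direct = torch.tensor(input_ids)     # one row per sentence

print(wrapped.shape)  # torch.Size([1, 2, 3])
print(direct.shape)   # torch.Size([2, 3])

# Lists of different lengths cannot be stacked into one tensor as-is.
ragged = [[101, 7592, 102], [101, 2088, 2003, 102]]
try:
    torch.tensor(ragged)
    ragged_ok = True
except ValueError:
    ragged_ok = False
print(ragged_ok)  # False
```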
How can I do full batch processing in this case?
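One possible approach, sketched under stated assumptions rather than offered as the definitive answer: let the tokenizer pad each batch with `tokenizer(batch, padding=True, truncation=True, return_tensors="pt")`, run the whole batch through the model in one call, and average the second-to-last layer while ignoring padded positions by means of the attention mask. The helper names `masked_mean` and `embed_in_batches` and the batch size of 64 are hypothetical choices for illustration. The `tokenizer` and `model` arguments are assumed to be the objects created in the question, with the model loaded using `output_hidden_states=True`, already in eval mode, and already on `device`.

```python
import torch


def masked_mean(token_vecs, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_vecs:     float tensor [batch, seq_len, hidden]
    attention_mask: int tensor   [batch, seq_len]; 1 = real token, 0 = padding
    """
    mask = attention_mask.unsqueeze(-1).to(token_vecs.dtype)  # [batch, seq_len, 1]
    summed = (token_vecs * mask).sum(dim=1)                   # [batch, hidden]
    counts = mask.sum(dim=1).clamp(min=1)                     # [batch, 1]
    return summed / counts


def embed_in_batches(sentences, tokenizer, model, batch_size=64, device="cpu"):
    """Return a [len(sentences), hidden] tensor of sentence embeddings.

    Hypothetical helper; assumes `model` is already on `device` and in eval mode.
    """
    chunks = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        # Pad to the longest sentence in this batch and get 2-D tensors back.
        enc = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
        enc = {name: tensor.to(device) for name, tensor in enc.items()}
        with torch.no_grad():
            outputs = model(**enc)
        # As in the question, the third output item holds all hidden states;
        # [-2] selects the second-to-last layer, shape [batch, seq_len, hidden].
        token_vecs = outputs[2][-2]
        chunks.append(masked_mean(token_vecs, enc["attention_mask"]).cpu())
    return torch.cat(chunks, dim=0)
```

A call such as `embeddings = embed_in_batches(batch_sentences, tokenizer, model)` would then return one row per sentence. Without the mask, a plain `torch.mean` over the sequence axis would also average the padding vectors, so shorter sentences in a batch would get different embeddings than when processed alone. If a GPU is available, moving the model there with `model.to("cuda")` and passing `device="cuda"` is likely to help at least as much as batching.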
[Discussion]:
Tags: python nlp huggingface-transformers bert-language-model huggingface-tokenizers