【Posted】: 2020-07-05 13:03:28
【Question】:
I can't fine-tune CamemBERT with the fast-bert library: I get the error message below when creating the LMDataBunch. Does anyone know how to fix this? Thanks.
PS: the logger is initialized with logging.getLogger().
```python
databunch_lm = BertLMDataBunch.from_raw_corpus(
    data_dir=DATA_PATH,
    text_list=all_texts,
    tokenizer='camembert-base',
    batch_size_per_gpu=16,
    max_seq_length=512,
    multi_gpu=False,
    model_type='camembert-base',
    logger=logger)
```
```
07/05/2020 14:50:31 - INFO - transformers.tokenization_utils_base - loading file
https://s3.amazonaws.com/models.huggingface.co/bert/camembert-base-sentencepiece.bpe.model from cache at C:\Users\Nawel/.cache\torch\hub\transformers\3715e3a4a2de48834619b2a6f48979e13ddff5cabfb1f3409db689f9ce3bb98f.28d30f926f545047fc59da64289371eef0fbdc0764ce9ec56f808a646fcfec59
07/05/2020 14:50:31 - INFO - root - Creating features from dataset file C:\Users\Desktop\Stage\Camembert\data\lm_train.txt
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-136-5e7363fcd4d6> in <module>
      7     multi_gpu=False,
      8     model_type='camembert-base',
----> 9     logger=logger)

~\anaconda3\lib\site-packages\fast_bert\data_lm.py in from_raw_corpus(data_dir, text_list, tokenizer, batch_size_per_gpu, max_seq_length, multi_gpu, test_size, model_type, logger, clear_cache, no_cache)
    198             logger=logger,
    199             clear_cache=clear_cache,
--> 200             no_cache=no_cache,
    201         )
    202 

~\anaconda3\lib\site-packages\fast_bert\data_lm.py in __init__(self, data_dir, tokenizer, train_file, val_file, batch_size_per_gpu, max_seq_length, multi_gpu, model_type, logger, clear_cache, no_cache)
    270             cached_features_file,
    271             self.logger,
--> 272             block_size=self.tokenizer.max_len_single_sentence,
    273         )
    274 

~\anaconda3\lib\site-packages\fast_bert\data_lm.py in __init__(self, tokenizer, file_path, cache_path, logger, block_size)
    131         self.examples = []
    132         with open(file_path, encoding="utf-8") as f:
--> 133             text = f.read()
    134 
    135         tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

~\anaconda3\lib\codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 63: invalid continuation byte
```
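For context, the traceback shows fast-bert opening lm_train.txt with `encoding="utf-8"`, and byte 0xe9 is 'é' in Latin-1/cp1252, so the training file was most likely written in one of those encodings rather than UTF-8. A minimal sketch of re-encoding the file to UTF-8 before building the databunch (the path and the assumed source encoding are illustrative, not part of the fast-bert API):

```python
from pathlib import Path

def reencode_to_utf8(path, source_encoding="latin-1"):
    """Read a file in its presumed original encoding and rewrite it as UTF-8."""
    p = Path(path)
    text = p.read_text(encoding=source_encoding)
    p.write_text(text, encoding="utf-8")

# Hypothetical path; adjust to wherever DATA_PATH points.
train_file = Path("data/lm_train.txt")
if train_file.exists():
    reencode_to_utf8(train_file)
```

Whether `latin-1` is the right source encoding is an assumption; if the file came from elsewhere, a tool such as chardet can help guess it first.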
【Discussion】:
Tags: python nlp bert-language-model