【Question Title】: BertLMDataBunch.from_raw_corpus UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 49: invalid continuation byte
【Posted】: 2020-07-05 13:03:28
【Question】:

I can't fine-tune Camembert with the fast-bert library; I get the error message below when creating the LMDataBunch. Does anyone know how to fix this? Thanks.

PS: the logger is initialized with logging.getLogger()

    databunch_lm = BertLMDataBunch.from_raw_corpus(
                    data_dir=DATA_PATH,
                    text_list=all_texts,
                    tokenizer='camembert-base',
                    batch_size_per_gpu=16,
                    max_seq_length=512,
                    multi_gpu=False,
                    model_type='camembert-base',
                    logger=logger)


07/05/2020 14:50:31 - INFO - transformers.tokenization_utils_base -   loading file 
https://s3.amazonaws.com/models.huggingface.co/bert/camembert-base-sentencepiece.bpe.model from cache at C:\Users\Nawel/.cache\torch\hub\transformers\3715e3a4a2de48834619b2a6f48979e13ddff5cabfb1f3409db689f9ce3bb98f.28d30f926f545047fc59da64289371eef0fbdc0764ce9ec56f808a646fcfec59
07/05/2020 14:50:31 - INFO - root -   Creating features from dataset file C:\Users\Desktop\Stage\Camembert\data\lm_train.txt
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-136-5e7363fcd4d6> in <module>
      7                     multi_gpu=False,
      8                     model_type='camembert-base',
----> 9                     logger=logger)

~\anaconda3\lib\site-packages\fast_bert\data_lm.py in from_raw_corpus(data_dir, text_list, tokenizer, batch_size_per_gpu, max_seq_length, multi_gpu, test_size, model_type, logger, clear_cache, no_cache)
    198             logger=logger,
    199             clear_cache=clear_cache,
--> 200             no_cache=no_cache,
    201         )
    202 

~\anaconda3\lib\site-packages\fast_bert\data_lm.py in __init__(self, data_dir, tokenizer, train_file, val_file, batch_size_per_gpu, max_seq_length, multi_gpu, model_type, logger, clear_cache, no_cache)
    270                 cached_features_file,
    271                 self.logger,
--> 272                 block_size=self.tokenizer.max_len_single_sentence,
    273             )
    274 

~\anaconda3\lib\site-packages\fast_bert\data_lm.py in __init__(self, tokenizer, file_path, cache_path, logger, block_size)
    131             self.examples = []
    132             with open(file_path, encoding="utf-8") as f:
--> 133                 text = f.read()
    134 
    135             tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

~\anaconda3\lib\codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 63: invalid continuation byte
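The failing decode can be reproduced without fast-bert at all: byte 0xe9 is 'é' in Latin-1/Windows-1252 (common for French text such as a CamemBERT corpus), but on its own it is not valid UTF-8, which encodes 'é' as the two bytes 0xc3 0xa9. A minimal sketch:

```python
# 0xe9 is 'é' in Latin-1, but not a valid standalone byte in UTF-8.
data = "été".encode("latin-1")  # b'\xe9t\xe9'

try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # same "invalid continuation byte" error as in the traceback

# UTF-8 represents 'é' as a two-byte sequence instead:
print("été".encode("utf-8"))  # b'\xc3\xa9t\xc3\xa9'
```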

【Comments】:

    Tags: python nlp bert-language-model


    【Solution 1】:

    I'm going to close this; I just needed to change the file's encoding to UTF-8.
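    Since fast-bert opens the corpus with `encoding="utf-8"`, re-encoding the training file once before building the DataBunch is enough. A minimal sketch, assuming the original file is Windows-1252/Latin-1 (the hypothetical `reencode_to_utf8` helper and demo filename are not from the original post):

```python
from pathlib import Path

def reencode_to_utf8(path, source_encoding="cp1252"):
    """Read a text file with the given source encoding and rewrite it as UTF-8."""
    p = Path(path)
    text = p.read_text(encoding=source_encoding)
    p.write_text(text, encoding="utf-8")

# Demo: simulate a French corpus file saved in Windows-1252, then fix it.
sample = Path("lm_train_demo.txt")
sample.write_bytes("été déjà".encode("cp1252"))
reencode_to_utf8(sample)
print(sample.read_text(encoding="utf-8"))  # now decodes cleanly as UTF-8
```

    If the source encoding is unknown, opening the file with `errors="replace"` or inspecting the bytes around the reported position can help identify it before converting.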

    【Discussion】:
