【Question Title】: BertLMDataBunch.from_raw_corpus UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 49: invalid continuation byte
【Posted】: 2020-07-05 13:03:28
【Question】:

I can't fine-tune Camembert with the fast-bert library; I get the error message below when creating the LMDataBunch. Does anyone know how to fix this? Thanks.

PS: the logger is initialized with logging.getLogger()

    databunch_lm = BertLMDataBunch.from_raw_corpus(
                    data_dir=DATA_PATH,
                    text_list=all_texts,
                    tokenizer='camembert-base',
                    batch_size_per_gpu=16,
                    max_seq_length=512,
                    multi_gpu=False,
                    model_type='camembert-base',
                    logger=logger)


07/05/2020 14:50:31 - INFO - transformers.tokenization_utils_base -   loading file 
https://s3.amazonaws.com/models.huggingface.co/bert/camembert-base-sentencepiece.bpe.model from cache at C:\Users\Nawel/.cache\torch\hub\transformers\3715e3a4a2de48834619b2a6f48979e13ddff5cabfb1f3409db689f9ce3bb98f.28d30f926f545047fc59da64289371eef0fbdc0764ce9ec56f808a646fcfec59
07/05/2020 14:50:31 - INFO - root -   Creating features from dataset file C:\Users\Desktop\Stage\Camembert\data\lm_train.txt
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-136-5e7363fcd4d6> in <module>
      7                     multi_gpu=False,
      8                     model_type='camembert-base',
----> 9                     logger=logger)

~\anaconda3\lib\site-packages\fast_bert\data_lm.py in from_raw_corpus(data_dir, text_list, tokenizer, batch_size_per_gpu, max_seq_length, multi_gpu, test_size, model_type, logger, clear_cache, no_cache)
    198             logger=logger,
    199             clear_cache=clear_cache,
--> 200             no_cache=no_cache,
    201         )
    202 

~\anaconda3\lib\site-packages\fast_bert\data_lm.py in __init__(self, data_dir, tokenizer, train_file, val_file, batch_size_per_gpu, max_seq_length, multi_gpu, model_type, logger, clear_cache, no_cache)
    270                 cached_features_file,
    271                 self.logger,
--> 272                 block_size=self.tokenizer.max_len_single_sentence,
    273             )
    274 

~\anaconda3\lib\site-packages\fast_bert\data_lm.py in __init__(self, tokenizer, file_path, cache_path, logger, block_size)
    131             self.examples = []
    132             with open(file_path, encoding="utf-8") as f:
--> 133                 text = f.read()
    134 
    135             tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

~\anaconda3\lib\codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 63: invalid continuation byte
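The failing decode can be reproduced without fast-bert at all: byte 0xe9 is 'é' in Latin-1/Windows-1252 (common for French text such as a CamemBERT corpus), but on its own it is not valid UTF-8, which encodes 'é' as the two bytes 0xc3 0xa9. A minimal sketch:

```python
# 0xe9 is 'é' in Latin-1, but not a valid standalone byte in UTF-8.
data = "été".encode("latin-1")  # b'\xe9t\xe9'

try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # same "invalid continuation byte" error as in the traceback

# UTF-8 represents 'é' as a two-byte sequence instead:
print("été".encode("utf-8"))  # b'\xc3\xa9t\xc3\xa9'
```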

【Comments】:

    Tags: python nlp bert-language-model


    【Solution 1】:

    I'm going to close this; I just needed to change the file's encoding to UTF-8.
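    Since fast-bert opens the corpus with `encoding="utf-8"`, re-encoding the training file once before building the DataBunch is enough. A minimal sketch, assuming the original file is Windows-1252/Latin-1 (the hypothetical `reencode_to_utf8` helper and demo filename are not from the original post):

```python
from pathlib import Path

def reencode_to_utf8(path, source_encoding="cp1252"):
    """Read a text file with the given source encoding and rewrite it as UTF-8."""
    p = Path(path)
    text = p.read_text(encoding=source_encoding)
    p.write_text(text, encoding="utf-8")

# Demo: simulate a French corpus file saved in Windows-1252, then fix it.
sample = Path("lm_train_demo.txt")
sample.write_bytes("été déjà".encode("cp1252"))
reencode_to_utf8(sample)
print(sample.read_text(encoding="utf-8"))  # now decodes cleanly as UTF-8
```

    If the source encoding is unknown, opening the file with `errors="replace"` or inspecting the bytes around the reported position can help identify it before converting.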

    【Discussion】:
