[Question title]: scikit-learn CountVectorizer UnicodeDecodeError
[Posted]: 2017-02-25 21:18:56
[Question description]:

I have the following code snippet, in which I am trying to list term frequencies. Here first_text and second_text are .tex documents:

from sklearn.feature_extraction.text import CountVectorizer
training_documents = (first_text, second_text)
vectorizer = CountVectorizer()
vectorizer.fit_transform(training_documents)
# The fitted term-to-index mapping is stored in vocabulary_ (trailing underscore)
print "Vocabulary:", vectorizer.vocabulary_

When I run the script, I get the following:

File "test.py", line 19, in <module>
    vectorizer.fit_transform(training_documents)
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
    for feature in analyze(doc):
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 115, in decode
    doc = doc.decode(self.encoding, self.decode_error)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 200086: invalid start byte

How can I fix this?

Thanks.

[Question discussion]:

    Tags: python scikit-learn


    [Solution 1]:

    If you can figure out which encoding your documents use (perhaps they are latin-1), you can pass it to CountVectorizer:

    vectorizer = CountVectorizer(encoding='latin-1')
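To see why the encoding matters: the traceback shows that the vectorizer calls doc.decode(self.encoding, self.decode_error) on each document, and the default encoding is utf-8. The following is a minimal sketch using only the standard library. The sample bytes and the try_decode helper are hypothetical; only the byte 0xa2 comes from the error message. Keep in mind that latin-1 maps every possible byte to a character, so a successful latin-1 decode does not prove the files were actually written in latin-1.

```python
# Hypothetical sample: the byte 0xa2 from the traceback, embedded in ASCII text.
raw = b"price: 5\xa2"


def try_decode(data, encoding):
    """Return the decoded text, or None if `data` is not valid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xa2 is a UTF-8 continuation byte, so it cannot start a character.
utf8_result = try_decode(raw, "utf-8")

# In latin-1 every byte is a valid character; 0xa2 is the cent sign (U+00A2).
latin1_result = try_decode(raw, "latin-1")

print(repr(utf8_result))
print(repr(latin1_result))
```

If the decoded text looks wrong (for example, odd symbols where accented letters should be), the real encoding is probably something else, such as cp1252.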
    

    Otherwise, you can have the vectorizer drop the bytes that cannot be decoded:

    vectorizer = CountVectorizer(decode_error='ignore')
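The decode_error value is passed straight through as the error handler of the decode call seen in the traceback, and CountVectorizer accepts 'strict' (the default), 'ignore', and 'replace'. The sketch below shows what each handler does to a single bad byte, using plain bytes.decode rather than scikit-learn. The sample bytes are hypothetical.

```python
# Hypothetical sample: one byte (0xa2) that is not valid UTF-8.
raw = b"price: 5\xa2"

# 'strict' raises, which is the behaviour reported in the question.
strict_failed = False
try:
    raw.decode("utf-8", "strict")
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' silently drops the undecodable byte and keeps the rest.
ignored = raw.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD (the replacement character) for the bad byte.
replaced = raw.decode("utf-8", "replace")

print(strict_failed)
print(repr(ignored))
print(repr(replaced))
```

Both 'ignore' and 'replace' lose information, so they are best treated as a fallback when the true encoding cannot be determined.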
    

    [Discussion]:
